Re: IndexSorter optimizer

2006-01-02 Thread Andrzej Bialecki
Doug Cutting wrote: I have committed this, along with the LuceneQueryOptimizer changes. I could only find one place where I was using numDocs() instead of maxDoc(). Right, I confused two bugs from different files - the other bug still exists in the committed version of the LuceneQueryOpti

NullPointerException (new as of Dec 31st)

2006-01-02 Thread Rod Taylor
During a fetch I have recently started getting these (pretty consistently). task_r_5m9ybr 0.15 reduce > copy > java.lang.NullPointerException at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:991) at java.lang.Float.parseFloat(Float.java:394) at org.apache.nutch.parse
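
The top two frames of the trace above are consistent with Float.parseFloat being handed a null string. A minimal sketch of a defensive parse follows; the class SafeFloatParse and the helper parseOrDefault are hypothetical names for illustration, not Nutch code, and the fallback value is an assumption.

```java
class SafeFloatParse {
    // Hypothetical helper (not part of Nutch): returns a fallback value
    // instead of throwing when the input is null or not a valid float.
    static float parseOrDefault(String value, float fallback) {
        if (value == null) {
            // Float.parseFloat(null) would throw NullPointerException here.
            return fallback;
        }
        try {
            return Float.parseFloat(value.trim());
        } catch (NumberFormatException e) {
            return fallback;
        }
    }

    public static void main(String[] args) {
        System.out.println(parseOrDefault(null, 1.0f));
        System.out.println(parseOrDefault("0.25", 1.0f));
    }
}
```

Whether a silent fallback is appropriate depends on where the missing value comes from; logging the offending record may be preferable while the cause is still unknown.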

[jira] Created: (NUTCH-161) Plain text parser should use parser.character.encoding.default property for fall back encoding

2006-01-02 Thread KuroSaka TeruHiko (JIRA)
Plain text parser should use parser.character.encoding.default property for fall back encoding -- Key: NUTCH-161 URL: http://issues.apache.org/jira/browse/NUTCH-161 Project: Nutch

Re: IndexSorter optimizer

2006-01-02 Thread Doug Cutting
Andrzej Bialecki wrote: Sounds like tf/idf might be de-emphasized in scoring. Perhaps NutchSimilarity.tf() should use log() instead of sqrt() when field==content? I don't think it's that simple, the OPIC score is what determined this behaviour, and it doesn't correspond at all to tf/idf, but
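
To illustrate the sqrt() versus log() suggestion quoted above, here is a small sketch comparing the two damping functions. The class TfDamping is hypothetical and is not the NutchSimilarity code; the "1 + log(freq)" form is an assumption, since the message only says "log()".

```java
class TfDamping {
    // Square-root damping of the raw term frequency.
    static float tfSqrt(float freq) {
        return (float) Math.sqrt(freq);
    }

    // Logarithmic damping (assumed form: 1 + ln(freq), and 0 for freq <= 0).
    static float tfLog(float freq) {
        return freq > 0 ? 1.0f + (float) Math.log(freq) : 0.0f;
    }

    public static void main(String[] args) {
        float[] freqs = {1f, 10f, 100f, 1000f};
        for (float f : freqs) {
            System.out.println(f + " sqrt=" + tfSqrt(f) + " log=" + tfLog(f));
        }
    }
}
```

For large frequencies the logarithm grows far more slowly than the square root, so repeated terms in the content field would contribute less to the final score relative to the boost.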

Re: IndexSorter optimizer

2006-01-02 Thread Andrzej Bialecki
Doug Cutting wrote: Andrzej Bialecki wrote: Using the original index, it was possible for pages with high tf/idf of a term, but with a low "boost" value (the OPIC score), to outrank pages with high "boost" but lower tf/idf of a term. This phenomenon leads quite often to results that are perc

Re: IndexSorter optimizer

2006-01-02 Thread Doug Cutting
Andrzej Bialecki wrote: Using the original index, it was possible for pages with high tf/idf of a term, but with a low "boost" value (the OPIC score), to outrank pages with high "boost" but lower tf/idf of a term. This phenomenon leads quite often to results that are perceived as "junk", e.g. p

[jira] Commented: (NUTCH-138) non-Latin-1 characters cannot be submitted for search

2006-01-02 Thread KuroSaka TeruHiko (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-138?page=comments#action_12361552 ] KuroSaka TeruHiko commented on NUTCH-138: - Sorry, my oversight, useBodyEncodingForURI did not work as I expected. Setting URIEncoding is the only way. I'll write this

Re: svn commit: r359822 - in /lucene/nutch/trunk: bin/ conf/ src/java/org/apache/nutch/crawl/ src/java/org/apache/nutch/fetcher/ src/java/org/apache/nutch/indexer/ src/java/org/apache/nutch/parse/ src

2006-01-02 Thread Andrzej Bialecki
Doug Cutting wrote: [EMAIL PROTECTED] wrote: Now users can select their own page signature implementation, possibly with better properties than the old one. Two implementations are provided: * MD5Signature: backward-compatible with the old schema. * TextProfileSignature: an example implemen

[jira] Commented: (NUTCH-138) non-Latin-1 characters cannot be submitted for search

2006-01-02 Thread Piotr Kosiorowski (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-138?page=comments#action_12361549 ] Piotr Kosiorowski commented on NUTCH-138: - BTW - just create a user for yourself in the Nutch Wiki and you should be able to add a new page with information without problems.

[jira] Closed: (NUTCH-138) non-Latin-1 characters cannot be submitted for search

2006-01-02 Thread Piotr Kosiorowski (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-138?page=all ] Piotr Kosiorowski closed NUTCH-138: --- Resolution: Invalid Setting URIEncoding in tomcat config file fixes the problem. > non-Latin-1 characters cannot be submitted for search > -
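
The effect of the URI encoding setting discussed in this issue can be reproduced outside any servlet container. The sketch below is an illustration only: the class UriDecodingDemo and its helper are hypothetical, and it uses java.net.URLDecoder as a stand-in for the container's query-string decoding, which is an assumption about how the mismatch arises.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

class UriDecodingDemo {
    // Hypothetical helper: decodes a percent-encoded query value with the given charset.
    static String decode(String encoded, String charset) {
        try {
            return URLDecoder.decode(encoded, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    // Hypothetical helper: percent-encodes a value as UTF-8, as a browser typically would.
    static String encodeUtf8(String raw) {
        try {
            return URLEncoder.encode(raw, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String query = "\u65e5\u672c\u8a9e"; // three non-Latin-1 characters
        String encoded = encodeUtf8(query);
        // Decoding the UTF-8 bytes as ISO-8859-1 yields one char per byte.
        System.out.println(decode(encoded, "ISO-8859-1").length());
        // Decoding with the matching charset restores the original query.
        System.out.println(decode(encoded, "UTF-8").equals(query));
    }
}
```

Decoding with the wrong charset does not fail; it silently produces a different string, which is why the search appears to accept the query but finds nothing.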

[jira] Commented: (NUTCH-138) non-Latin-1 characters cannot be submitted for search

2006-01-02 Thread KuroSaka TeruHiko (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-138?page=comments#action_12361546 ] KuroSaka TeruHiko commented on NUTCH-138: - You are right. With this Tomcat config, UTF-8 characters can be passed. What also works is having: useBodyEncodingForURI="true"

Re: IndexSorter optimizer

2006-01-02 Thread Doug Cutting
Andrzej Bialecki wrote: I'm happy to report that further tests performed on a larger index seem to show that the overall impact of the IndexSorter is definitely positive: performance improvements are significant, and the overall quality of results seems at least comparable, if not actually bett

Re: [bug?] PRC called emthod require parameter

2006-01-02 Thread Doug Cutting
Stefan Groschupf wrote: I also note this line in client.java public Writable[] call(Writable[] params, InetSocketAddress[] addresses) throws IOException { if (params.length == 0) return new Writable[0]; Do I understand correctly that in case the remote method does not need any paramet

Re: Bug in DeleteDuplicates.java ?

2006-01-02 Thread Doug Cutting
Andrzej Bialecki wrote: Gal Nitzan wrote: this function throws IOException. Why? public long getPos() throws IOException { return (doc*INDEX_LENGTH)/maxDoc; } It should be throwing ArithmeticException The IOException is required by the API of RecordReader. W
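
The quoted getPos() divides by maxDoc, so an empty index (maxDoc == 0) raises ArithmeticException. A minimal sketch of a guarded version follows. The class PositionSketch is hypothetical, the value of INDEX_LENGTH is an arbitrary illustrative assumption, and returning 0 for an empty index is an assumed policy, not necessarily the fix that was committed.

```java
class PositionSketch {
    // Assumed illustrative value; the real constant is not shown in the thread.
    static final long INDEX_LENGTH = 1000L;

    // Same arithmetic as the quoted getPos(), with a guard for an empty index.
    static long getPos(long doc, long maxDoc) {
        if (maxDoc == 0) {
            return 0L; // assumed policy: report position 0 when there are no documents
        }
        return (doc * INDEX_LENGTH) / maxDoc;
    }

    public static void main(String[] args) {
        System.out.println(getPos(5, 10));
        System.out.println(getPos(0, 0));
    }
}
```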

Re: java.io.IOException: Job failed

2006-01-02 Thread Doug Cutting
Gal Nitzan wrote: I am using trunk. while trying to crawl I get the following: [ ...] 050825 100222 task_m_ns3ehv Error running child 050825 100222 task_m_ns3ehv java.lang.ArithmeticException: / by zero 050825 100222 task_m_ns3ehv at org.apache.nutch.indexer.DeleteDuplicates $1.getPos(De

Re: svn commit: r359822 - in /lucene/nutch/trunk: bin/ conf/ src/java/org/apache/nutch/crawl/ src/java/org/apache/nutch/fetcher/ src/java/org/apache/nutch/indexer/ src/java/org/apache/nutch/parse/ src

2006-01-02 Thread Doug Cutting
[EMAIL PROTECTED] wrote: Now users can select their own page signature implementation, possibly with better properties than the old one. Two implementations are provided: * MD5Signature: backward-compatible with the old schema. * TextProfileSignature: an example implementation of a signature,

[jira] Commented: (NUTCH-159) Specify temp/working directory for crawl

2006-01-02 Thread byron miller (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-159?page=comments#action_12361545 ] byron miller commented on NUTCH-159: While it's from the mapred trunk, it is a non ndfs/local instance only. Mapred.temp.dir was left at its defaults.. (which didn't exis

[jira] Commented: (NUTCH-159) Specify temp/working directory for crawl

2006-01-02 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-159?page=comments#action_12361541 ] Doug Cutting commented on NUTCH-159: mapred.local.dir is the thing to set. If that fails, then there is a bug. What did you have this set to? > Specify temp/working dire

Re: Trunk is broken

2006-01-02 Thread Thomas Jaeger
Hi Andrzej, Gal Nitzan wrote: >> It seems that Trunk is now broken... >> DmozParser seems to be broken, too. Its package declaration is still org.apache.nutch.crawl instead of org.apache.nutch.tools. TJ

Re: Trunk is broken

2006-01-02 Thread Thomas Jaeger
Hi Andrzej, Gal Nitzan wrote: > It seems that Trunk is now broken... > DmozParser seems to be broken, too. Its package declaration is still org.apache.nutch.crawl instead of org.apache.nutch.tools. TJ

Re: Mega-cleanup in trunk/

2006-01-02 Thread Andrzej Bialecki
Piotr Kosiorowski wrote: Andrzej Bialecki wrote: Hi, I just committed a large patch to clean up the trunk/ of obsolete and broken classes remaining from the 0.7.x development line. Please test that things still work as they should ... Hi, I am not sure what is wrong but a lot of JUnit test

[jira] Commented: (NUTCH-138) non-Latin-1 characters cannot be submitted for search

2006-01-02 Thread Piotr Kosiorowski (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-138?page=comments#action_12361520 ] Piotr Kosiorowski commented on NUTCH-138: - I am not sure but I would suspect it is a problem of bad tomcat configuration. To handle special characters in query urls one