[Nutch-dev] Should the data size in version 0.8 be much smaller than in version 0.7?

2006-01-11 Thread Rafi Iz
Hi, I am running a few cycles of fetching on Nutch 0.8 and I notice that the data size is much smaller than the data size I got in version 0.7 (running the same cycle at about the same time from different machines): about 5G after the third cycle, starting with about 72000 URLs. All the processes e

[Nutch-dev] [jira] Commented: (NUTCH-171) Bring back multiple segment support for Generate / Update

2006-01-11 Thread Rod Taylor (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-171?page=comments#action_12362508 ] Rod Taylor commented on NUTCH-171: Overhead of generate/update versus fetch is the big one. A smaller segment size fits easily into memory reducing the amount of disk accesse

[Nutch-dev] [jira] Commented: (NUTCH-171) Bring back multiple segment support for Generate / Update

2006-01-11 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-171?page=comments#action_12362507 ] Doug Cutting commented on NUTCH-171: I'd like to hear more about why you want multiple segments, what's motivating this patch. The 0.7 -numFetchers parameter was designed

[Nutch-dev] Bug - Freezes if the last line in the url file does not finish with EOL symbol

2006-01-11 Thread Mike Alulin
The crawler freezes if the last line in the url file does not finish with EOL symbol. System info: OS: Windows XP JDK: 1.5 Nutch build date: 2006-01-06
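The report above does not show the code that reads the URL file, so the following is only a minimal sketch of a line reader that tolerates a final line without an EOL. It is not Nutch's injector code: the class name `UrlListReader` and the method `readUrls` are hypothetical, and only standard `java.io` calls are used.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Nutch.
public class UrlListReader {

    /** Reads one URL per line; the final line need not end with an EOL. */
    public static List<String> readUrls(Reader source) throws IOException {
        List<String> urls = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            // readLine() returns null only at end of stream, so a last line
            // without a trailing newline is still returned exactly once.
            while ((line = reader.readLine()) != null) {
                String trimmed = line.trim();
                if (!trimmed.isEmpty()) {
                    urls.add(trimmed); // skip blank lines
                }
            }
        }
        return urls;
    }

    public static void main(String[] args) throws IOException {
        // The second URL deliberately has no trailing newline.
        String input = "http://a.example/\nhttp://b.example/";
        System.out.println(readUrls(new StringReader(input)));
    }
}
```

The loop terminates on end of stream rather than on seeing an EOL character, which is the property a reader needs to avoid hanging on such a file.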

[Nutch-dev] [jira] Updated: (NUTCH-171) Bring back multiple segment support for Generate / Update

2006-01-11 Thread Rod Taylor (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-171?page=all ] Rod Taylor updated NUTCH-171: Attachment: multi_segment.patch. Perhaps -numFetchers should be renamed to -numSegments? > Bring back multiple segment support for Generate / Update

[Nutch-dev] [jira] Created: (NUTCH-171) Bring back multiple segment support for Generate / Update

2006-01-11 Thread Rod Taylor (JIRA)
Bring back multiple segment support for Generate / Update. Key: NUTCH-171; URL: http://issues.apache.org/jira/browse/NUTCH-171; Project: Nutch; Type: Improvement; Versions: 0.8-dev; Reporter: Rod Taylor

[Nutch-dev] Re: weird fetcher behavior

2006-01-11 Thread Florent Gluck
Thanks for your answers Doug, it makes more sense now. I'm still puzzled about why the number of DB_fetched changes so much when using different numbers for the map/reduce task settings. I'm gonna inspect the logs and see if I can track down what's going on. Also, I tried to use protocol-http rather

[Nutch-dev] Re: Problem with latest SVN during reduce phase

2006-01-11 Thread Dominik Friedrich
I got this exception a lot, too. I haven't tested the patch by Andrzej yet, but instead I just put the doc.add() lines in the indexer reduce function in a try-catch block. This way the indexing finishes even with a null value and I can see which documents haven't been indexed in the log file.
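The workaround described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the Nutch indexer: `Doc` is a hypothetical stand-in for a Lucene document that rejects null values, and `SafeIndexing`, `indexAll`, and the field names `url` and `segment` are invented for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; none of these names come from Nutch or Lucene.
public class SafeIndexing {

    /** Stand-in document type that rejects null values, like the reported error. */
    public static class Doc {
        final Map<String, String> fields = new LinkedHashMap<>();

        void add(String name, String value) {
            if (value == null) {
                throw new NullPointerException("value cannot be null");
            }
            fields.put(name, value);
        }
    }

    /** Builds one Doc per record and returns the URLs that had to be skipped. */
    public static List<String> indexAll(Map<String, String> segmentByUrl, List<Doc> out) {
        List<String> skipped = new ArrayList<>();
        for (Map.Entry<String, String> e : segmentByUrl.entrySet()) {
            Doc doc = new Doc();
            try {
                doc.add("url", e.getKey());
                doc.add("segment", e.getValue()); // may be null
                out.add(doc);
            } catch (NullPointerException ex) {
                // Log and continue so one bad record does not abort the whole job.
                System.err.println("Skipping " + e.getKey() + ": " + ex.getMessage());
                skipped.add(e.getKey());
            }
        }
        return skipped;
    }

    public static void main(String[] args) {
        Map<String, String> records = new LinkedHashMap<>();
        records.put("http://a.example/", "20060107130328");
        records.put("http://b.example/", null); // missing segment name
        List<Doc> indexed = new ArrayList<>();
        System.out.println("skipped: " + indexAll(records, indexed));
    }
}
```

Catching the exception per document trades completeness for progress: the job finishes and the log names the skipped documents, but the cause of the null value remains unfixed.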

[Nutch-dev] Re: Problem with latest SVN during reduce phase

2006-01-11 Thread Andrzej Bialecki
Byron Miller wrote: Pulled today's build and got above error. No problems running out of disk space or anything like that. This is a single instance, local file systems. You need a patch that I circulated a couple of days ago, about copying the segment name and score from content.metadata

[Nutch-dev] Re: Crawl and parse exceptions

2006-01-11 Thread Matt Zytaruk
Unfortunately, the logs have since been overwritten by nutch so I can't check them, but I am pretty sure those are actually the messages from the task tracker log on the remote machine. If I am remembering correctly, all that was shown on the master was a short exception saying the child failed

[Nutch-dev] Problem with latest SVN during reduce phase

2006-01-11 Thread Byron Miller
060111 103432 reduce > reduce 060111 103432 Optimizing index. 060111 103433 closing > reduce 060111 103434 closing > reduce 060111 103435 closing > reduce java.lang.NullPointerException: value cannot be null at org.apache.lucene.document.Field.<init>(Field.java:469) at org.apache.lucene.do

[Nutch-dev] [jira] Commented: (NUTCH-170) Crash with multiple temp directories

2006-01-11 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-170?page=comments#action_12362482 ] Doug Cutting commented on NUTCH-170: I have successfully used mapred.local.dir with multiple values on many occasions. Can you please try to distill this to an easy to repro

[Nutch-dev] Re: Crawl and parse exceptions

2006-01-11 Thread Doug Cutting
Matt Zytaruk wrote: Exception in thread "main" java.io.IOException: Not a file: /user/nutch/segments/20060107130328/parse_data/part-0/data at org.apache.nutch.ipc.Client.call(Client.java:294) This is an error returned from an RPC call. There should be more details about this in a

[Nutch-dev] Re: weird fetcher behavior

2006-01-11 Thread Doug Cutting
Florent Gluck wrote: When I inject 25000 urls and fetch them (depth = 1) and do a readdb -stats, I get: 060110 171347 Statistics for CrawlDb: crawldb 060110 171347 TOTAL urls: 27939 060110 171347 avg score:1.011 060110 171347 max score:8.883 060110 171347 min score:1

[Nutch-dev] [jira] Resolved: (NUTCH-151) CommandRunner can hang after the main thread exec is finished and has inefficient busy loop

2006-01-11 Thread Jerome Charron (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-151?page=all ] Jerome Charron resolved NUTCH-151: Resolution: Fixed. Changes committed: http://svn.apache.org/viewcvs.cgi?rev=368060&view=rev Thanks Paul. > CommandRunner can hang after the main thread

[Nutch-dev] weird fetcher behavior

2006-01-11 Thread Florent Gluck
Hi, I'm running nutch trunk as of today. I have 3 slaves and a master. I'm using mapred.map.tasks=20 and mapred.reduce.tasks=4. There is something I'm really confused about. When I inject 25000 urls and fetch them (depth = 1) and do a readdb -stats, I get: 060110 171347 Statistics for CrawlDb:

[Nutch-dev] PluginManifestParser should be NutchConfigurable

2006-01-11 Thread Jack Tang
Hi, I think it is reasonable that PluginManifestParser should implement the NutchConfigurable interface. As the NutchConfigurable interface describes, PluginManifestParser needs NutchConf. /Jack -- Keep Discovering ... ... http://www.jroller.com/page/jmars

[Nutch-dev] [jira] Commented: (NUTCH-169) remove static NutchConf

2006-01-11 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-169?page=comments#action_12362447 ] Stefan Groschupf commented on NUTCH-169: >I wonder what is the performance impact of this patch - in many places, where >previously we used the static methods on classe

[Nutch-dev] [jira] Commented: (NUTCH-169) remove static NutchConf

2006-01-11 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-169?page=comments#action_12362438 ] Andrzej Bialecki commented on NUTCH-169: Overall, good work! I have some comments regarding the details: * I wonder what is the performance impact of this patch - in