Hi,
I am running a few cycles of fetching on Nutch 0.8 and I notice that the data
size is much smaller than what I got with version 0.7 (running the same
cycle at about the same time from different machines): about 5 GB after the
third cycle, starting with about 72000 URLs.
All the processes e
[
http://issues.apache.org/jira/browse/NUTCH-171?page=comments#action_12362508 ]
Rod Taylor commented on NUTCH-171:
--
Overhead of generate/update versus fetch is the big one. A smaller segment size
fits easily into memory, reducing the amount of disk accesses
[
http://issues.apache.org/jira/browse/NUTCH-171?page=comments#action_12362507 ]
Doug Cutting commented on NUTCH-171:
I'd like to hear more about why you want multiple segments, what's motivating
this patch. The 0.7 -numFetchers parameter was designed
The crawler freezes if the last line in the URL file does not end with an EOL
character.
System info:
OS: Windows XP
JDK: 1.5
Nutch build date: 2006-01-06
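This is the kind of bug that goes away when the seed file is read line-by-line with a reader that tolerates a missing final EOL. A minimal sketch, assuming a one-URL-per-line seed file (the class and method names here are illustrative, not Nutch's actual reader):

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class UrlFileReader {
    // Reads one URL per line. BufferedReader.readLine() returns the last
    // line even when the file does not end with an EOL character, which
    // avoids the hang described above.
    public static List<String> readUrls(BufferedReader in) throws Exception {
        List<String> urls = new ArrayList<>();
        String line;
        while ((line = in.readLine()) != null) {
            line = line.trim();
            if (!line.isEmpty()) {
                urls.add(line);
            }
        }
        return urls;
    }

    public static void main(String[] args) throws Exception {
        // Note: no trailing newline after the last URL.
        String seed = "http://example.com/a\nhttp://example.com/b";
        List<String> urls = readUrls(new BufferedReader(new StringReader(seed)));
        System.out.println(urls.size()); // prints 2
    }
}
```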
[ http://issues.apache.org/jira/browse/NUTCH-171?page=all ]
Rod Taylor updated NUTCH-171:
-
Attachment: multi_segment.patch
Perhaps -numFetchers should be renamed to -numSegments ?
> Bring back multiple segment support for Generate / Update
> ---
Bring back multiple segment support for Generate / Update
-
Key: NUTCH-171
URL: http://issues.apache.org/jira/browse/NUTCH-171
Project: Nutch
Type: Improvement
Versions: 0.8-dev
Reporter: Rod Taylor
Thanks for your answers, Doug; it makes more sense now.
I'm still puzzled about why the number of DB_fetched entries changes so much
when using different numbers for the map/reduce task settings.
I'm going to inspect the logs and see if I can track down what's going on.
Also, I tried to use protocol-http rather
I got this exception a lot, too. I haven't tested the patch by Andrzej
yet, but instead I just put the doc.add() lines in the indexer reduce
function in a try-catch block. This way the indexing finishes even with
a null value, and I can see which documents haven't been indexed in the
log file.
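A minimal sketch of that workaround, using a stand-in Document class rather than Lucene's real one (the names here are hypothetical; the real change would wrap the actual doc.add() calls inside the indexer's reduce method):

```java
import java.util.HashMap;
import java.util.Map;

public class SafeIndexer {
    // Hypothetical stand-in for Lucene's Document, for illustration only:
    // its add() rejects null values just as Lucene's Field constructor does.
    static class Document {
        final Map<String, String> fields = new HashMap<>();
        void add(String name, String value) {
            if (value == null) {
                throw new NullPointerException("value cannot be null");
            }
            fields.put(name, value);
        }
    }

    // Wraps the add() calls in try-catch so one bad document does not
    // abort the whole reduce; the failure is logged and indexing goes on.
    public static boolean indexDocument(Document doc, String url, String title) {
        try {
            doc.add("url", url);
            doc.add("title", title);
            return true;
        } catch (NullPointerException e) {
            System.err.println("skipping document " + url + ": " + e.getMessage());
            return false;
        }
    }
}
```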
Byron Miller wrote:
Pulled today's build and got the above error. No problems
running out of disk space or anything like that. This
is a single instance, local file systems.
You need a patch that I circulated a couple of days ago, about copying
the segment name and score from content.metadata
Unfortunately, the logs have since been overwritten by nutch so I can't
check them, but I am pretty sure those are actually the messages from
the task tracker log on the remote machine. If I am remembering
correctly, all that was shown on the master was a short exception saying
the child failed
060111 103432 reduce > reduce
060111 103432 Optimizing index.
060111 103433 closing > reduce
060111 103434 closing > reduce
060111 103435 closing > reduce
java.lang.NullPointerException: value cannot be null
at org.apache.lucene.document.Field.<init>(Field.java:469)
at org.apache.lucene.do
[
http://issues.apache.org/jira/browse/NUTCH-170?page=comments#action_12362482 ]
Doug Cutting commented on NUTCH-170:
I have successfully used mapred.local.dir with multiple values on many occasions.
Can you please try to distill this to an easy-to-repro
Matt Zytaruk wrote:
Exception in thread "main" java.io.IOException: Not a file:
/user/nutch/segments/20060107130328/parse_data/part-0/data
at org.apache.nutch.ipc.Client.call(Client.java:294)
This is an error returned from an RPC call. There should be more
details about this in a
Florent Gluck wrote:
When I inject 25000 urls and fetch them (depth = 1) and do a readdb
-stats, I get:
060110 171347 Statistics for CrawlDb: crawldb
060110 171347 TOTAL urls: 27939
060110 171347 avg score:1.011
060110 171347 max score:8.883
060110 171347 min score:1
[ http://issues.apache.org/jira/browse/NUTCH-151?page=all ]
Jerome Charron resolved NUTCH-151:
--
Resolution: Fixed
Changes committed : http://svn.apache.org/viewcvs.cgi?rev=368060&view=rev
Thanks Paul.
> CommandRunner can hang after the main thread
Hi,
I'm running nutch trunk as of today. I have 3 slaves and a master. I'm
using mapred.map.tasks=20 and mapred.reduce.tasks=4.
There is something I'm really confused about.
When I inject 25000 urls and fetch them (depth = 1) and do a readdb
-stats, I get:
060110 171347 Statistics for CrawlDb:
Hi
I think it is reasonable that PluginManifestParser should implement the
NutchConfigurable interface. As the NutchConfigurable interface
describes, PluginManifestParser needs a NutchConf.
/Jack
--
Keep Discovering ... ...
http://www.jroller.com/page/jmars
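For what that suggestion would look like, here is a hedged sketch with stand-in types (the real NutchConf and NutchConfigurable live in Nutch itself; their exact shapes are assumed here, not copied from the source):

```java
import java.util.Properties;

public class ConfigurableParserSketch {
    // Stand-in for Nutch's NutchConf (assumed shape, illustration only).
    static class NutchConf {
        private final Properties props = new Properties();
        void set(String name, String value) { props.setProperty(name, value); }
        String get(String name, String defaultValue) {
            return props.getProperty(name, defaultValue);
        }
    }

    // Stand-in for the NutchConfigurable interface (assumed shape).
    interface NutchConfigurable {
        void setConf(NutchConf conf);
        NutchConf getConf();
    }

    // Sketch of PluginManifestParser receiving its configuration through
    // the interface instead of reaching for a static NutchConf.
    static class PluginManifestParser implements NutchConfigurable {
        private NutchConf conf;
        public void setConf(NutchConf conf) { this.conf = conf; }
        public NutchConf getConf() { return conf; }

        // Reads a setting from the injected configuration; the property
        // name "plugin.folders" is used here only as an example.
        String pluginFolder() {
            return conf.get("plugin.folders", "plugins");
        }
    }
}
```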
[
http://issues.apache.org/jira/browse/NUTCH-169?page=comments#action_12362447 ]
Stefan Groschupf commented on NUTCH-169:
>I wonder what is the performance impact of this patch - in many places, where
>previously we used the static methods on classe
[
http://issues.apache.org/jira/browse/NUTCH-169?page=comments#action_12362438 ]
Andrzej Bialecki commented on NUTCH-169:
-
Overall, good work! I have some comments regarding the details:
* I wonder what is the performance impact of this patch - in