[Nutch-dev] [jira] Commented: (NUTCH-171) Bring back multiple segment support for Generate / Update

Rod Taylor (JIRA) Thu, 30 Mar 2006 15:10:53 -0800

    [ http://issues.apache.org/jira/browse/NUTCH-171?page=comments#action_12372588 ]


Rod Taylor commented on NUTCH-171:
----------------------------------

"One thing that's needed is the ability to mark urls as "being fetched", which 
was in 0.7 but has not yet made it into 0.8. In addition, we need to be able to 
prioritize jobs."

Agreed. Ideally I could specify a maximum of X fetch map tasks to be executed 
simultaneously.

This would allow other work to happen in the background and, along with the 
per-task bandwidth limiter patch, it would cap fetching at a specific amount 
of bandwidth.
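
To illustrate the point about a task cap combined with a per-task bandwidth 
limit, here is a minimal sketch in Python. It is not Hadoop or Nutch code: the 
function names, the thread pool, and the kbps unit are all assumptions made 
for illustration. It shows that with at most X tasks running and each task 
capped, total bandwidth is bounded by X times the per-task cap.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def aggregate_bandwidth_cap(max_fetch_tasks, per_task_limit_kbps):
    # Upper bound on total bandwidth: X concurrent tasks, each capped per task.
    return max_fetch_tasks * per_task_limit_kbps


def run_with_task_limit(task_ids, max_fetch_tasks,
                        work=lambda task_id: time.sleep(0.01)):
    """Run work(task_id) for every id, never more than max_fetch_tasks at once.

    Returns the completed ids (in input order) and the peak number of tasks
    that were observed running at the same time.
    """
    lock = threading.Lock()
    running = 0
    peak = 0

    def wrapped(task_id):
        nonlocal running, peak
        with lock:
            running += 1
            peak = max(peak, running)
        try:
            work(task_id)  # stand-in for a fetch map task (hypothetical)
        finally:
            with lock:
                running -= 1
        return task_id

    # The pool size is the hard limit on simultaneous tasks.
    with ThreadPoolExecutor(max_workers=max_fetch_tasks) as pool:
        done = list(pool.map(wrapped, task_ids))
    return done, peak
```

For example, run_with_task_limit(range(10), 3) never runs more than 3 tasks at 
once, and aggregate_bandwidth_cap(3, 500) gives the matching ceiling of 1500 
kbps.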



"Ideally crawling should work something like:
1. generate segment 1
2. start fetching segment 1
3. generate segment 2;
4. wait for segment 1 fetch to complete
5. start fetching segment 2;
6. update db with output from fetch 1
7. generate segment 3;
8. wait for segment 2 fetch to complete"

This could work, but with 1 billion URLs in the database, generate and update 
both take a significant amount of time. I would hate to see what it will be 
like with more than that.

Generating 20 segments of 10M URLs each is almost as fast as generating 1 
segment of 10M URLs. A single 200M URL segment is unwieldy from an error 
management perspective. I actually prefer 1M URL segments.


Ditto for updatedb. Updating 20 segments of 10M URLs in size is pretty much as 
fast as dealing with a single 10M segment.


Ideally, in my eyes:

1) Generate a batch of segments (a few days' worth of fetching) -- Xa
2) Fetch Xa/2 segments (literally run 2 of these at once -- have Hadoop limit 
the number of simultaneous MAP jobs)
3) UpdateDB for Xa/2 segments
4) Generate a new batch of segments -- Xb
5) Fetch Xa/2 (second half of first set) and Xb/2 (first half of second set)
6) Single UpdateDB for segments Xa/2 and Xb/2
7) All of Xa have been completed. Complete the job on these (merge into 1 unit 
and index or whatever else needs to be done)
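
The seven steps above can be sketched as a small schedule generator. This is 
only an illustration in Python: the step names ("generate", "fetch", 
"updatedb", "finish"), the function names, and the segment labels are 
hypothetical, and nothing here calls Nutch. Like the list above, the sketch 
stops after the last batch is generated, so the second half of that batch 
stays pending.

```python
def split_halves(segments):
    # First half gets the extra segment when the count is odd.
    mid = (len(segments) + 1) // 2
    return segments[:mid], segments[mid:]


def rolling_schedule(batches):
    """Build the overlapping generate / fetch / updatedb plan.

    batches maps a batch name (e.g. "Xa") to its list of segment names,
    in the order the batches are generated.
    """
    steps = []
    pending_name, pending_half = None, []
    for name, segments in batches.items():
        first, second = split_halves(segments)
        steps.append(("generate", list(segments)))
        # Fetch what is left of the previous batch plus half of the new one.
        to_fetch = pending_half + first
        steps.append(("fetch", to_fetch))
        # One updatedb covers everything fetched in this round.
        steps.append(("updatedb", to_fetch))
        if pending_name is not None:
            # The previous batch is now fully fetched: merge, index, etc.
            steps.append(("finish", list(batches[pending_name])))
        pending_name, pending_half = name, second
    return steps
```

With two batches of four segments each, this yields exactly seven steps, in 
the same order as the list above.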

Hadoop should also be given the ability to limit the number of jobs of a single 
type: MapFetch -> X, ReduceFetch -> Y, MapGenerate -> Z, etc. AND give a 
priority based on job type. MapFetch is more important than ReduceFetch, which 
is more important than pretty much anything else.
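
As a rough sketch of what such a policy could look like, assuming a scheduler 
that knows how many tasks of each type are queued and running: the Python 
below picks the next task type to start. The function names and the dict-based 
interface are hypothetical and do not correspond to any Hadoop API; only the 
job type names and their relative priority come from the paragraph above.

```python
import math

# Highest priority first; any type not listed ranks below all of these.
PRIORITY_ORDER = ["MapFetch", "ReduceFetch"]


def priority_rank(job_type):
    if job_type in PRIORITY_ORDER:
        return PRIORITY_ORDER.index(job_type)
    return len(PRIORITY_ORDER)


def pick_next_task(queued, running, limits):
    """Return the job type to start next, or None if nothing may start.

    queued:  job type -> number of tasks waiting
    running: job type -> number of tasks currently executing
    limits:  job type -> maximum simultaneous tasks (missing = unlimited)
    """
    candidates = [
        job_type
        for job_type, waiting in queued.items()
        if waiting > 0 and running.get(job_type, 0) < limits.get(job_type, math.inf)
    ]
    if not candidates:
        return None
    # Best rank wins; the name is only a tie-breaker for determinism.
    return min(candidates, key=lambda t: (priority_rank(t), t))
```

For instance, if MapFetch is already at its limit, the next slot goes to a 
queued ReduceFetch task rather than to MapGenerate.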


> Bring back multiple segment support for Generate / Update
> ---------------------------------------------------------
>
>          Key: NUTCH-171
>          URL: http://issues.apache.org/jira/browse/NUTCH-171
>      Project: Nutch
>         Type: Improvement
>     Versions: 0.8-dev
>     Reporter: Rod Taylor
>     Priority: Minor
>  Attachments: multi_segment.patch
>
> We find it convenient to be able to run generate once for -topN 300M and have 
> multiple independent segments to work with (lower overhead) -- then run 
> update on all segments which succeeded simultaneously.
> This reactivates -numFetchers and fixes updatedb to handle multiple provided 
> segments again.
> Radu Mateescu wrote the attached patch for us with the below description 
> (lightly edited):
> The implementation of -numFetchers in 0.8 improperly plays with the number of 
> reduce tasks in order to generate a given number of fetch lists. Basically, 
> what it does is this: before the second reduce (map-reduce is applied twice 
> for generate), it sets the number of reduce tasks to numFetchers and ideally, 
> because each reduce will create a file like part-00000, part-00001, etc in 
> the ndfs, we'll end up with the desired number of fetch lists. But this 
> behaviour is incorrect for the following reasons:
> 1. the number of reduce tasks is orthogonal to the number of segments 
> somebody wants to create. The number of reduce tasks should be chosen based 
> on the physical topology rather than the number of segments someone might 
> want in ndfs
> 2. if in nutch-site.xml you specify a value for mapred.reduce.tasks property, 
> the numFetchers seems to be ignored
>  
> Therefore, I changed this behaviour to work like this: 
>  - generate will create numFetchers segments
>  - each reduce task will write in all segments (assuming there are enough 
> values to be written) in a round-robin fashion
> The end results for 3 reduce tasks and 2 segments will look like this :
>  
> /opt/nutch/bin>./nutch ndfs -ls segments
> 060111 122227 parsing file:/opt/nutch/conf/nutch-default.xml
> 060111 122228 parsing file:/opt/nutch/conf/nutch-site.xml
> 060111 122228 Client connection to 192.168.0.1:5466: starting
> 060111 122228 No FS indicated, using default:master:5466
> Found 2 items
> /user/root/segments/20060111122144-0    <dir>
> /user/root/segments/20060111122144-1    <dir>
>  
> /opt/nutch/bin>./nutch ndfs -ls segments/20060111122144-0/crawl_generate
> 060111 122317 parsing file:/opt/nutch/conf/nutch-default.xml
> 060111 122317 parsing file:/opt/nutch/conf/nutch-site.xml
> 060111 122318 No FS indicated, using default:master:5466
> 060111 122318 Client connection to 192.168.0.1:5466: starting
> Found 3 items
> /user/root/segments/20060111122144-0/crawl_generate/part-00000  1276
> /user/root/segments/20060111122144-0/crawl_generate/part-00001  1289
> /user/root/segments/20060111122144-0/crawl_generate/part-00002  1858
>  
> /opt/nutch/bin>./nutch ndfs -ls segments/20060111122144-1/crawl_generate
> 060111 122333 parsing file:/opt/nutch/conf/nutch-default.xml
> 060111 122334 parsing file:/opt/nutch/conf/nutch-site.xml
> 060111 122334 Client connection to 192.168.0.1:5466: starting
> 060111 122334 No FS indicated, using default:master:5466
> Found 3 items
> /user/root/segments/20060111122144-1/crawl_generate/part-00000  1207
> /user/root/segments/20060111122144-1/crawl_generate/part-00001  1236
> /user/root/segments/20060111122144-1/crawl_generate/part-00002  1841
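
The round-robin layout described in the quoted patch notes can be modelled in 
a few lines of Python. This is a sketch of the idea only, not the code in 
multi_segment.patch: the function name and the in-memory dict standing in for 
ndfs are assumptions. Each reduce task spreads its entries over numFetchers 
segments and writes to its own part file within each segment.

```python
def distribute_round_robin(reduce_outputs, num_fetchers, timestamp):
    """Map each reduce task's URLs onto segment part files.

    reduce_outputs: one list of URLs per reduce task
    num_fetchers:   number of segments to create
    timestamp:      shared prefix of the segment directory names
    Returns a dict of path -> list of URLs written to that path.
    """
    layout = {}
    for task_index, urls in enumerate(reduce_outputs):
        part = f"part-{task_index:05d}"  # one part file per reduce task
        for i, url in enumerate(urls):
            # Rotate through the segments so each one gets a share.
            segment = f"{timestamp}-{i % num_fetchers}"
            path = f"segments/{segment}/crawl_generate/{part}"
            layout.setdefault(path, []).append(url)
    return layout
```

With 3 reduce tasks and 2 segments, and at least 2 URLs per task, this gives 
3 part files in each of the 2 segment directories, the same shape as the 
listing above.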

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



