[ https://issues.apache.org/jira/browse/NUTCH-171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrzej Bialecki closed NUTCH-171.
-----------------------------------

       Resolution: Won't Fix
    Fix Version/s: 1.0.0
         Assignee: Andrzej Bialecki

> Bring back multiple segment support for Generate / Update
> ---------------------------------------------------------
>
>                 Key: NUTCH-171
>                 URL: https://issues.apache.org/jira/browse/NUTCH-171
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 0.8
>            Reporter: Rod Taylor
>            Assignee: Andrzej Bialecki
>            Priority: Minor
>             Fix For: 1.0.0
>
>         Attachments: multi_segment.patch
>
>
> We find it convenient to be able to run generate once for -topN 300M and have
> multiple independent segments to work with (lower overhead) -- then run
> update on all segments which succeeded simultaneously.
> This reactivates -numFetchers and fixes updatedb to handle multiple provided
> segments again.
> Radu Mateescu wrote the attached patch for us with the below description
> (lightly edited):
> The implementation of -numFetchers in 0.8 improperly plays with the number of
> reduce tasks in order to generate a given number of fetch lists. Basically,
> what it does is this: before the second reduce (map-reduce is applied twice
> for generate), it sets the number of reduce tasks to numFetchers and ideally,
> because each reduce will create a file like part-00000, part-00001, etc. in
> the ndfs, we'll end up with the desired number of fetch lists. But this
> behaviour is incorrect for the following reasons:
> 1. the number of reduce tasks is orthogonal to the number of segments
> somebody wants to create. The number of reduce tasks should be chosen based
> on the physical topology rather than the number of segments someone might
> want in ndfs
> 2. if in nutch-site.xml you specify a value for the mapred.reduce.tasks
> property, numFetchers seems to be ignored
>
> Therefore, I changed this behaviour to work like this:
> - generate will create numFetchers segments
> - each reduce task will write in all segments (assuming there are enough
> values to be written) in a round-robin fashion
> The end results for 3 reduce tasks and 2 segments will look like this:
>
> /opt/nutch/bin>./nutch ndfs -ls segments
> 060111 122227 parsing file:/opt/nutch/conf/nutch-default.xml
> 060111 122228 parsing file:/opt/nutch/conf/nutch-site.xml
> 060111 122228 Client connection to 192.168.0.1:5466: starting
> 060111 122228 No FS indicated, using default:master:5466
> Found 2 items
> /user/root/segments/20060111122144-0 <dir>
> /user/root/segments/20060111122144-1 <dir>
>
> /opt/nutch/bin>./nutch ndfs -ls segments/20060111122144-0/crawl_generate
> 060111 122317 parsing file:/opt/nutch/conf/nutch-default.xml
> 060111 122317 parsing file:/opt/nutch/conf/nutch-site.xml
> 060111 122318 No FS indicated, using default:master:5466
> 060111 122318 Client connection to 192.168.0.1:5466: starting
> Found 3 items
> /user/root/segments/20060111122144-0/crawl_generate/part-00000 1276
> /user/root/segments/20060111122144-0/crawl_generate/part-00001 1289
> /user/root/segments/20060111122144-0/crawl_generate/part-00002 1858
>
> /opt/nutch/bin>./nutch ndfs -ls segments/20060111122144-1/crawl_generate
> 060111 122333 parsing file:/opt/nutch/conf/nutch-default.xml
> 060111 122334 parsing file:/opt/nutch/conf/nutch-site.xml
> 060111 122334 Client connection to 192.168.0.1:5466: starting
> 060111 122334 No FS indicated, using default:master:5466
> Found 3 items
> /user/root/segments/20060111122144-1/crawl_generate/part-00000 1207
> /user/root/segments/20060111122144-1/crawl_generate/part-00001 1236
> /user/root/segments/20060111122144-1/crawl_generate/part-00002 1841

-- This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
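
For readers who want to picture the round-robin behaviour described in the patch notes, here is a minimal sketch in Python. It is not the code from multi_segment.patch (which targets Nutch's Java code base); the function name assign_round_robin and the list-of-lists input are hypothetical, chosen only for illustration. It assumes each reduce task holds an ordered list of URLs and writes entry i to segment i mod numFetchers, under a part file named after the task.

```python
from collections import defaultdict


def assign_round_robin(reduce_outputs, num_segments):
    """Spread each reduce task's entries over all segments, round-robin.

    reduce_outputs: one list of URLs per reduce task (hypothetical input shape).
    num_segments:   plays the role of -numFetchers in the description.
    Returns {segment_index: {part_file_name: [urls]}}.
    """
    segments = {s: defaultdict(list) for s in range(num_segments)}
    for task_id, urls in enumerate(reduce_outputs):
        # Each reduce task writes its own part file inside every segment.
        part = f"part-{task_id:05d}"
        for i, url in enumerate(urls):
            segments[i % num_segments][part].append(url)
    return {s: dict(parts) for s, parts in segments.items()}


if __name__ == "__main__":
    # 3 reduce tasks, 2 segments, mirroring the listing in the description.
    outputs = [["a1", "a2", "a3"], ["b1", "b2"], ["c1", "c2", "c3", "c4"]]
    for seg, parts in assign_round_robin(outputs, 2).items():
        print(seg, parts)
```

With three reduce tasks and two segments, every segment ends up with part-00000 through part-00002, provided each task has at least two entries to write. The number of segments is therefore independent of the number of reduce tasks, which is the point made in reason 1 of the description.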