Hello,
I read Doug's MapReduce document with great interest. I am keen to get the MapReduce implementation of Nutch working, as our current approach (a custom WebDB implementation) does not scale. I have not yet had a chance to dig through the Nutch MapReduce code, so some of my comments may be invalid or simply naive, but I will try to provide some feedback.
1) As I understand it, there will be no duplication of data in WebDB: there would be no files sorted by MD5. Correct?
2) In the current format description there is no "score" field in PageDB. We will probably need one if we are going to use PageRank by default. Alternatively, we could define only the basic fields in PageDB and allow other fields to be added, as suggested earlier in this thread. Then the PageRank score could be added as an extension to this structure. It would be great if such a mechanism were provided, because we would be able to add other "scores" for a page, e.g. a classification score, which might be useful for people who implement domain-specific search engines.
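To make the extension idea concrete, here is a rough Java sketch of what I have in mind. The class name PageEntry, its fields, and the score names are all hypothetical; nothing here is taken from the actual Nutch MapReduce code.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical PageDB entry: one basic field plus optional named scores. */
class PageEntry {
    private final String url;                                   // basic field
    private final Map<String, Float> scores = new HashMap<>();  // extension fields

    PageEntry(String url) {
        this.url = url;
    }

    String getUrl() {
        return url;
    }

    /** Attach or overwrite a named score, e.g. "pagerank" or "classification". */
    void setScore(String name, float value) {
        scores.put(name, value);
    }

    /** Return the named score, or defaultValue if that score was never set. */
    float getScore(String name, float defaultValue) {
        Float value = scores.get(name);
        return value == null ? defaultValue : value;
    }
}
```

With something like this, a PageRank job would call setScore("pagerank", ...) and a domain-specific classifier could store its own score under a different name without changing the basic record.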
3) In the LinkDB description, the order of elements in the key differs from the one used in the processing steps (5): <destSegment, destURL> versus <destURL, destSegment>.
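The reason I care about the order is that, for a sorted composite key, the first element decides how entries are grouped. A small Java sketch, where the LinkKey class and the assumption that a segment is identified by an integer are mine and not taken from the document:

```java
import java.util.Comparator;

/** Hypothetical LinkDB key; only the two field names come from the description. */
class LinkKey {
    final int destSegment;   // assumption: segments are identified by a number
    final String destURL;

    LinkKey(int destSegment, String destURL) {
        this.destSegment = destSegment;
        this.destURL = destURL;
    }

    /** Order for <destSegment, destURL>: all links into one segment are adjacent. */
    static final Comparator<LinkKey> SEGMENT_FIRST =
        Comparator.comparingInt((LinkKey k) -> k.destSegment)
                  .thenComparing(k -> k.destURL);

    /** Order for <destURL, destSegment>: all entries for one URL are adjacent. */
    static final Comparator<LinkKey> URL_FIRST =
        Comparator.comparing((LinkKey k) -> k.destURL)
                  .thenComparingInt(k -> k.destSegment);
}
```

The same pair of keys can sort in opposite directions under the two comparators, so the description and the processing steps should agree on one order.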
4) In the example directory structure there are several "parts" in PageDB.
I understand these part-* files are temporary files for MapReduce tasks, used for partitioning the data. Am I correct?
If so, once fetching of segment 0 is finished, would the structure look like this:
segment/0/fetchIn/part-0
         /fetchOut/reduceddata
         /content/reduceddata
where the reduced data is sorted by URL?
If these part-* files are not temporary, how would the data be split among them? Sequentially?
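To spell out the alternatives I am asking about, here is a short Java sketch of two ways a record could be assigned to a part-* file. The class and method names are hypothetical, and I do not know which scheme (if either) the MapReduce code actually uses.

```java
/** Hypothetical sketch of two ways to assign a record to one of N part-* files. */
class PartitionSketch {

    /** Hash partitioning: the same URL always maps to the same part. */
    static int byHash(String url, int numParts) {
        // Mask the sign bit so the result is never negative.
        return (url.hashCode() & Integer.MAX_VALUE) % numParts;
    }

    /**
     * Sequential partitioning: records are cut into contiguous blocks.
     * Requires totalRecords > 0 and 0 <= recordIndex < totalRecords.
     */
    static int bySequence(long recordIndex, long totalRecords, int numParts) {
        long perPart = (totalRecords + numParts - 1) / numParts; // ceiling division
        return (int) (recordIndex / perPart);
    }

    /** File name for a part number, following the part-* naming in the example. */
    static String partName(int part) {
        return "part-" + part;
    }
}
```

With hash partitioning each part can be sorted by URL independently, but the parts together are not globally sorted; with sequential splitting of sorted data they would be.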
I am planning to invest some time in reading the MapReduce code, and if I can be of any help later I will be available.
Regards,
Piotr
_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers
