Hello,

I read Doug's MapReduce document with great interest. I am very keen
to get the MapReduce implementation of Nutch working, as our current
approach (we have a custom WebDB implementation) does not scale. I have
not yet had a chance to dig through the Nutch MapReduce code, so some of
my comments may be invalid or simply naive, but I will try to provide
some feedback.

1) As I understand it, we will have no concept of duplicated data in
WebDB - there will be no files sorted by MD5. Correct?

2) The current format description has no "score" field in PageDB.
We will probably need one if we are going to use PageRank by default.
Alternatively, we could define only the basic fields in PageDB and allow
other fields to be added, as suggested earlier in this thread. The PageRank
score could then be added as an extension to this structure. Such a
mechanism would be great to have, because we could then add other
"scores" for a page, e.g. a classification score, which might be useful
for people who implement domain-specific search engines.
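
To make the idea concrete, here is a minimal sketch of a record with a few basic fields plus an open-ended set of named scores. It is written in Python only for brevity; PageRecord, its fields (url, segment, scores) and the score() helper are hypothetical names of mine, not anything taken from Doug's document or the Nutch code.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class PageRecord:
    """Hypothetical PageDB entry: basic fields plus optional named scores."""
    url: str
    segment: int
    # Extension fields keyed by name, e.g. "pagerank" or "classification".
    scores: Dict[str, float] = field(default_factory=dict)

    def score(self, name: str, default: float = 0.0) -> float:
        # Return the named score, or the default if that extension is absent.
        return self.scores.get(name, default)


page = PageRecord(url="http://example.com/", segment=0)
page.scores["pagerank"] = 0.37        # added by a link-analysis step
page.scores["classification"] = 0.9   # added by a domain-specific classifier
```

The point is only that code which does not know about a given score can ignore it, while the basic fields stay the same for everybody.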

3) In the LinkDB description, the order of elements in the key differs
from the one used in processing steps (5): <destSegment, destURL> versus
<destURL, destSegment>.
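
The reason I mention it: if the records are sorted by key, the two orderings give different sort orders. A small sketch in Python, with made-up segment numbers and URLs of my own:

```python
# Each link key is modelled here as a (destSegment, destURL) tuple.
links = [
    (1, "http://b.example/"),
    (0, "http://c.example/"),
    (1, "http://a.example/"),
]

# Key order <destSegment, destURL>: grouped by segment first.
by_segment_then_url = sorted(links)

# Key order <destURL, destSegment>: grouped by URL first.
by_url_then_segment = sorted(links, key=lambda k: (k[1], k[0]))
```

With the first ordering all entries of one segment are adjacent; with the second they are interleaved by URL.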

4) In the example directory structure there are several "parts" in PageDB.
As I understand it, these part-* files are temporary files for MapReduce tasks, used for partitioning the data. Am I correct?
If so, once fetching of segment 0 has finished, would the structure look like this:
segment/0/fetchIn/part-0
         /fetchOut/reduceddata
         /content/reduceddata
with the reduced data sorted by URL?


If these part-* files are not temporary, how would the data be split among them? Sequentially?
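
To illustrate what I mean by the two possibilities, here is a rough sketch in Python of a hash-based split versus a sequential (sorted, contiguous) split. I am not claiming either is what the Nutch MapReduce code does; the function names and the use of CRC32 as the hash are my own assumptions for the example.

```python
import zlib
from typing import Dict, List


def hash_partition(urls: List[str], num_parts: int) -> Dict[int, List[str]]:
    """Assign each URL to a part by a stable hash of the URL."""
    parts: Dict[int, List[str]] = {i: [] for i in range(num_parts)}
    for url in urls:
        # CRC32 is used only because it is stable across runs.
        parts[zlib.crc32(url.encode("utf-8")) % num_parts].append(url)
    return parts


def sequential_partition(urls: List[str], num_parts: int) -> Dict[int, List[str]]:
    """Sort the URLs, then cut the sorted run into contiguous chunks."""
    ordered = sorted(urls)
    size = -(-len(ordered) // num_parts)  # ceiling division
    return {i: ordered[i * size:(i + 1) * size] for i in range(num_parts)}
```

In the sequential case, reading part-0, part-1, ... in order gives data that is globally sorted by URL; in the hash case each part would have to be sorted on its own and merged.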


I am planning to invest some time in reading the MapReduce code, and if I can be of any help later, I will be available.


Regards
Piotr

