Hi, I am having a problem implementing an "unsort" for a crawler in map/reduce.

I have a list of URLs waiting to be fetched. They need to be reordered so that URLs from the same domain are as far apart as possible.

The idea is to do:
 map URL -> domain, URL

 test.com, http://www.test.com/page1.html
 test.com, http://www.test.com/page2.html
 test.com, http://www.test.com/page3.html
 test2.com, http://www.test2.com/page1.html
 test2.com, http://www.test2.com/page2.html
 test2.com, http://www.test2.com/page3.html
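
To make the map step concrete, here is a rough sketch in plain Python (not real Hadoop code). The function name map_url is made up, and stripping a leading "www." to get the domain is only an assumption that happens to fit the examples above.

```python
from urllib.parse import urlparse

def map_url(url):
    """Map step: emit (domain, URL) for one input URL."""
    host = urlparse(url).hostname or ""
    # Assumption: the "domain" is the host name without a leading "www."
    if host.startswith("www."):
        host = host[len("www."):]
    return host, url

urls = [
    "http://www.test.com/page1.html",
    "http://www.test2.com/page1.html",
]
for url in urls:
    print(map_url(url))
```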

 reduce test.com, <list> -> priority, URL

10, http://www.test.com/page1.html
 9, http://www.test.com/page2.html
 8, http://www.test.com/page3.html
10, http://www.test2.com/page1.html
 9, http://www.test2.com/page2.html
 8, http://www.test2.com/page3.html
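
The reduce step I have in mind would look roughly like this, again as a plain Python sketch. The name reduce_domain and the starting priority of 10 are only illustrative; with more than 10 URLs per domain the priority would simply go to zero and below.

```python
def reduce_domain(domain, urls, top_priority=10):
    """Reduce step: give each URL of one domain a decreasing priority."""
    # The first URL gets top_priority, the next one top_priority - 1, and so on.
    return [(top_priority - i, url) for i, url in enumerate(urls)]

pairs = reduce_domain("test.com", [
    "http://www.test.com/page1.html",
    "http://www.test.com/page2.html",
    "http://www.test.com/page3.html",
])
for priority, url in pairs:
    print(priority, url)
```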


Now I need to order the output by key (highest priority first):

10, http://www.test.com/page1.html
10, http://www.test2.com/page1.html
 9, http://www.test.com/page2.html
 9, http://www.test2.com/page2.html
 8, http://www.test.com/page3.html
 8, http://www.test2.com/page3.html

and write the list of URLs in this order to output files, e.g. the first 50k URLs to file1, the next 50k to file2, and so on.
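
Outside of map/reduce, the result I am after would look roughly like this in plain Python. The function names are made up, the tie-break on URL is only there to make the order deterministic, and the chunk size is tiny just for the demonstration. What I do not know is how to get the same effect with mapred.

```python
from itertools import islice

def sort_by_priority(pairs):
    """Order (priority, URL) pairs by priority, highest first."""
    # Ties are broken by URL so the result is deterministic.
    return sorted(pairs, key=lambda p: (-p[0], p[1]))

def split_into_chunks(items, chunk_size):
    """Yield consecutive lists of at most chunk_size items."""
    it = iter(items)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk

pairs = [
    (10, "http://www.test.com/page1.html"),
    (9, "http://www.test.com/page2.html"),
    (10, "http://www.test2.com/page1.html"),
    (9, "http://www.test2.com/page2.html"),
]
ordered = [url for _, url in sort_by_priority(pairs)]
# In the real job chunk_size would be 50000; each chunk would go to its own file.
for n, chunk in enumerate(split_into_chunks(ordered, 2), start=1):
    print(f"file{n}:", chunk)
```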

Can you give me an idea of how to sort using mapred, and how to process the sorted data and split it into files?
