Hi, I am having a problem implementing an unsort for a crawler in map/reduce.
I have a list of URLs waiting to be fetched; they need to be reordered for
maximum distance between URLs from the same domain.
The idea is to do
map URL -> domain, URL
test.com, http://www.test.com/page1.html
test.com, http://www.test.com/page2.html
test.com, http://www.test.com/page3.html
test2.com, http://www.test2.com/page1.html
test2.com, http://www.test2.com/page2.html
test2.com, http://www.test2.com/page3.html
reduce test.com, <list> -> priority, URL
10, http://www.test.com/page1.html
9, http://www.test.com/page2.html
8, http://www.test.com/page3.html
10, http://www.test2.com/page1.html
9, http://www.test2.com/page2.html
8, http://www.test2.com/page3.html
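
To make the map and reduce steps above concrete, here is a rough sketch in
plain Python (it only simulates the two phases in memory, it is not real
map/reduce code). The names domain_of, map_phase, reduce_phase and run are
made up for illustration, the "www." stripping is a crude guess at how the
domain would be extracted, and the starting priority of 10 is simply taken
from the example.

```python
from collections import defaultdict
from urllib.parse import urlparse


def domain_of(url):
    # Assumption: the domain is the host name without a leading "www.".
    host = urlparse(url).netloc
    return host[4:] if host.startswith("www.") else host


def map_phase(urls):
    # map: URL -> (domain, URL)
    for url in urls:
        yield domain_of(url), url


def reduce_phase(domain, urls, top=10):
    # reduce: (domain, <list>) -> (priority, URL)
    # The first URL of a domain gets `top`, the next top - 1, and so on.
    for offset, url in enumerate(urls):
        yield top - offset, url


def run(urls):
    # Stand-in for the shuffle step: group the mapped URLs by domain.
    groups = defaultdict(list)
    for domain, url in map_phase(urls):
        groups[domain].append(url)
    out = []
    for domain, domain_urls in groups.items():
        out.extend(reduce_phase(domain, domain_urls))
    return out
```

With more than 10 URLs per domain the priorities in this sketch simply go to
zero and below, which should not matter as long as only the ordering is used.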
Now I need to order the output by key (priority, highest first)
10, http://www.test.com/page1.html
10, http://www.test2.com/page1.html
9, http://www.test.com/page2.html
9, http://www.test2.com/page2.html
8, http://www.test.com/page3.html
8, http://www.test2.com/page3.html
and write the list of URLs in this order to output files, e.g. the first
50k URLs to file1, the next 50k to file2, and so on.
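
The result I am after can be sketched in plain Python as follows. This is
only an in-memory illustration of the desired ordering and splitting, not a
map/reduce solution; sort_and_split, write_chunks and the "file" prefix are
hypothetical names, and the chunk size of 50,000 comes from the description
above.

```python
def sort_and_split(pairs, chunk_size=50_000):
    # pairs: iterable of (priority, url) tuples, as produced by the reduce step.
    # Sort by priority, highest first; sorted() is stable, so URLs with equal
    # priority keep their input order.
    ordered = sorted(pairs, key=lambda pair: -pair[0])
    urls = [url for _, url in ordered]
    # Cut the ordered list into consecutive chunks of at most chunk_size URLs.
    return [urls[i:i + chunk_size] for i in range(0, len(urls), chunk_size)]


def write_chunks(chunks, prefix="file"):
    # Write chunk 1 to file1, chunk 2 to file2, ... with one URL per line.
    names = []
    for number, chunk in enumerate(chunks, start=1):
        name = f"{prefix}{number}"
        with open(name, "w") as handle:
            handle.write("\n".join(chunk) + "\n")
        names.append(name)
    return names
```

For example, write_chunks(sort_and_split(pairs)) would produce file1, file2
and so on in the current directory, each holding up to 50k URLs.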
Can you give me an idea of how to sort using mapred, and how to process
the sorted data and split it into files?