Why not do something very simple: use the MD5 of the URL as the key you sort by. This scales very easily and gives a highly randomized order. It may not achieve the optimal maximum distance, but it certainly gives a very good distribution and is very easy to build.
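A minimal local sketch of this idea (in an actual MapReduce job you would simply emit the MD5 digest as the map output key and let the shuffle sort do the rest; the sample URLs below are taken from the question):

```python
import hashlib

def md5_sort_key(url: str) -> str:
    """Use the MD5 digest of the URL as the sort key.

    Sorting by the hash scatters URLs from the same domain across
    the whole output, approximating the desired "unsort".
    """
    return hashlib.md5(url.encode("utf-8")).hexdigest()

urls = [
    "http://www.test.com/page1.html",
    "http://www.test.com/page2.html",
    "http://www.test.com/page3.html",
    "http://www.test2.com/page1.html",
    "http://www.test2.com/page2.html",
    "http://www.test2.com/page3.html",
]

# Same set of URLs, now in hash order rather than domain order.
shuffled = sorted(urls, key=md5_sort_key)
```

Because MD5 is deterministic, the ordering is stable across runs, which also makes the job re-runnable without changing the output.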
Niels Basjes

2011/10/25 Radim Kolar <h...@sendmail.cz>

> Hi, I am having a problem implementing an "unsort" for a crawler in
> map/reduce.
>
> I have a list of URLs waiting to be fetched; they need to be reordered
> for maximum distance between URLs from one domain.
>
> The idea is to do
> map URL -> domain, URL
>
> test.com, http://www.test.com/page1.html
> test.com, http://www.test.com/page2.html
> test.com, http://www.test.com/page3.html
> test2.com, http://www.test2.com/page1.html
> test2.com, http://www.test2.com/page2.html
> test2.com, http://www.test2.com/page3.html
>
> reduce test.com, <list> -> priority, URL
>
> 10, http://www.test.com/page1.html
> 9, http://www.test.com/page2.html
> 8, http://www.test.com/page3.html
> 10, http://www.test2.com/page1.html
> 9, http://www.test2.com/page2.html
> 8, http://www.test2.com/page3.html
>
> Now I need to order the output by key:
>
> 10, http://www.test.com/page1.html
> 10, http://www.test2.com/page1.html
> 9, http://www.test.com/page2.html
> 9, http://www.test2.com/page2.html
> 8, http://www.test.com/page3.html
> 8, http://www.test2.com/page3.html
>
> and write the list of URLs in this order to output files, e.g. 50k URLs
> to file1, the next 50k to file2, and so on.
>
> Can you give me an idea how to sort using mapred, and how to process the
> sorted data and split it into files?

--
Best regards / Met vriendelijke groeten,

Niels Basjes
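For reference, the scheme described in the question can be sketched locally in Python. This is only a stand-in for the real MapReduce job: the starting priority (10) and the chunk size are illustrative values, the grouping key here is the full hostname (`www.test.com`) rather than the bare domain, and in Hadoop the descending priority order would require a custom sort comparator on the intermediate key:

```python
from itertools import groupby
from urllib.parse import urlparse

# Hypothetical starting priority, matching the example in the question;
# any value at least as large as the longest per-domain list works.
START_PRIORITY = 10

def map_phase(urls):
    """map: URL -> (domain, URL), grouping key first."""
    return [(urlparse(u).netloc, u) for u in urls]

def reduce_phase(pairs):
    """reduce: per domain, assign priorities counting down from START_PRIORITY."""
    out = []
    pairs = sorted(pairs)  # mimics the shuffle/sort grouping by domain
    for _domain, group in groupby(pairs, key=lambda p: p[0]):
        for i, (_, url) in enumerate(group):
            out.append((START_PRIORITY - i, url))
    return out

def ordered_output(urls, chunk_size):
    """Sort by priority descending and split into fixed-size chunks
    (standing in for the 50k-URL output files)."""
    records = sorted(reduce_phase(map_phase(urls)),
                     key=lambda rec: -rec[0])
    return [records[i:i + chunk_size]
            for i in range(0, len(records), chunk_size)]

urls = [
    "http://www.test.com/page1.html",
    "http://www.test.com/page2.html",
    "http://www.test.com/page3.html",
    "http://www.test2.com/page1.html",
    "http://www.test2.com/page2.html",
    "http://www.test2.com/page3.html",
]

chunks = ordered_output(urls, chunk_size=2)
```

Because Python's sort is stable, URLs that share a priority keep their domain-sorted order, which reproduces the interleaved output shown in the question (page1 of each domain first, then page2, and so on).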