Why not do something very simple: use the MD5 of the URL as the key you
sort by.
This scales easily and gives a highly randomized order.
Maybe not the optimal maximum distance, but certainly a very good
distribution and very easy to build.
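A minimal local sketch of that idea (plain Python with only hashlib; in a real Hadoop job the shuffle phase would do the sorting by this key):

```python
import hashlib

def md5_key(url):
    # Hex digest of the URL; behaves like a uniform pseudo-random sort key,
    # so URLs from the same domain end up scattered across the output.
    return hashlib.md5(url.encode("utf-8")).hexdigest()

urls = [
    "http://www.test.com/page1.html",
    "http://www.test.com/page2.html",
    "http://www.test.com/page3.html",
    "http://www.test2.com/page1.html",
    "http://www.test2.com/page2.html",
    "http://www.test2.com/page3.html",
]

# Simulate map -> shuffle: emit (md5, url) pairs and sort by the key.
shuffled = sorted(urls, key=md5_key)
```

The order is deterministic (same input always sorts the same way) but statistically random with respect to the domain, which is what spreads out same-domain URLs.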

Niels Basjes

2011/10/25 Radim Kolar <h...@sendmail.cz>

> Hi, I am having a problem implementing an "unsort" for a crawler in map/reduce.
>
> I have a list of URLs waiting to be fetched; they need to be reordered for
> maximum distance between URLs from the same domain.
>
> The idea is to do:
>  map URL -> domain, URL
>
>  test.com, http://www.test.com/page1.html
>  test.com, http://www.test.com/page2.html
>  test.com, http://www.test.com/page3.html
>  test2.com, http://www.test2.com/page1.html
>  test2.com, http://www.test2.com/page2.html
>  test2.com, http://www.test2.com/page3.html
>
>  reduce test.com, <list> -> priority, URL
>
> 10, http://www.test.com/page1.html
>  9, http://www.test.com/page2.html
>  8, http://www.test.com/page3.html
> 10, http://www.test2.com/page1.html
>  9, http://www.test2.com/page2.html
>  8, http://www.test2.com/page3.html
>
>
> Now I need to order the output by key:
>
> 10, http://www.test.com/page1.html
> 10, http://www.test2.com/page1.html
>  9, http://www.test.com/page2.html
>  9, http://www.test2.com/page2.html
>  8, http://www.test.com/page3.html
>  8, http://www.test2.com/page3.html
>
> and write the list of URLs in this order to the output files, e.g. the first
> 50k URLs to file1, the next 50k to file2, and so on.
>
> Can you give me an idea how to sort this using mapred, and how to process the
> sorted data and split it into files?
>
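For reference, the priority scheme from the quoted question can be simulated locally in plain Python (this is a local sketch, not actual Hadoop code; the countdown from 10 assumes at most about ten URLs per domain, as in the example). Sorting by descending priority interleaves the domains exactly as in the desired output, and the ordered list can then be sliced into fixed-size files:

```python
from collections import defaultdict
from urllib.parse import urlparse

urls = [
    "http://www.test.com/page1.html",
    "http://www.test.com/page2.html",
    "http://www.test.com/page3.html",
    "http://www.test2.com/page1.html",
    "http://www.test2.com/page2.html",
    "http://www.test2.com/page3.html",
]

# Map phase: key each URL by its domain.
by_domain = defaultdict(list)
for url in urls:
    by_domain[urlparse(url).netloc].append(url)

# Reduce phase: assign a descending priority within each domain.
prioritized = []
for domain, domain_urls in by_domain.items():
    for i, url in enumerate(domain_urls):
        prioritized.append((10 - i, url))

# Global sort by priority (descending, stable) -> domains interleave.
prioritized.sort(key=lambda pair: -pair[0])

# Split the ordered URLs into fixed-size output files (50k in the question;
# a tiny chunk size here just for the demo).
chunk = 2
ordered = [url for _, url in prioritized]
files = [ordered[i:i + chunk] for i in range(0, len(ordered), chunk)]
```

In real Hadoop this global ordering would need a total-order partitioner (or a single reducer), since each reducer only sorts its own partition.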



-- 
Best regards / Met vriendelijke groeten,

Niels Basjes
