Hi John,

This was my experience, too. If I've interpreted the source code correctly, the time in 
merging is spent on sorting, which is required because the segments are assumed to be 
"random" and may contain duplicate URLs. Sorting groups the entries for each URL 
together, which makes it possible to choose the one to include in the merge result.
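To illustrate (a toy sketch, not Nutch code): once records are sorted by URL, all duplicates of a URL become adjacent, so picking a single entry per URL is a single linear pass. The record layout and the keep-the-first policy here are just assumptions for the example.

import java.util.*;

public class DedupAfterSort {
    // records are {url, payload} pairs, assumed already sorted by url
    static List<String[]> dedup(List<String[]> sortedRecords) {
        List<String[]> out = new ArrayList<>();
        String lastUrl = null;
        for (String[] rec : sortedRecords) {
            if (!rec[0].equals(lastUrl)) { // first entry of a new URL group
                out.add(rec);              // keep one entry per URL (here: the first)
                lastUrl = rec[0];
            }
        }
        return out;
    }
}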

I think that if a simplified version of merge were available that assumed all segments come from the same crawl and are in a "good", completed state, the merge would be very fast, because no sorting would be required. It would be very useful, too, because this "simple" case seems to be exactly what people need.

The output from the segment merger has to be sorted anyway - most segment data is stored in Hadoop MapFiles, and a MapFile requires its keys to be sorted in ascending order.
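For reference, a minimal sketch of that constraint using the old-style Hadoop MapFile API (the file name and values are made up for the example):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class MapFileOrderDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.getLocal(conf);
        MapFile.Writer writer =
            new MapFile.Writer(conf, fs, "demo.map", Text.class, Text.class);
        writer.append(new Text("http://a.example/"), new Text("page A"));
        writer.append(new Text("http://b.example/"), new Text("page B"));
        // Appending "http://a.example/" again here would throw an IOException -
        // MapFile.Writer enforces ascending key order on append().
        writer.close();
    }
}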

We could perhaps speed this up by using the same approach as the raw iterators inside Hadoop core, i.e. open all input parts, and then advance only the pointers of the readers that currently hold the smallest key - this way we get an on-the-fly merge-sort, provided the individual parts are already sorted (which we know they are).

This would be similar to the solution implemented here: https://issues.apache.org/jira/browse/HADOOP-2834
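A minimal sketch of that on-the-fly merge-sort, with plain in-memory iterators standing in for the part readers (real code would wrap MapFile.Reader instances); the heap always yields the reader holding the smallest current key:

import java.util.*;

public class KWayMerge {
    /** Merge already-sorted parts by always advancing the reader with the smallest key. */
    public static List<String> merge(List<Iterator<String>> parts) {
        // heap orders (currentKey, partIndex) pairs by key
        PriorityQueue<Map.Entry<String, Integer>> heap =
            new PriorityQueue<>(Map.Entry.comparingByKey());
        for (int i = 0; i < parts.size(); i++) {
            if (parts.get(i).hasNext()) {
                heap.add(new AbstractMap.SimpleEntry<>(parts.get(i).next(), i));
            }
        }
        List<String> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            Map.Entry<String, Integer> smallest = heap.poll();
            out.add(smallest.getKey());
            Iterator<String> it = parts.get(smallest.getValue());
            if (it.hasNext()) { // advance only the reader we just consumed from
                heap.add(new AbstractMap.SimpleEntry<>(it.next(), smallest.getValue()));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Iterator<String>> parts = Arrays.asList(
            Arrays.asList("a", "d", "f").iterator(),
            Arrays.asList("b", "c", "e").iterator());
        System.out.println(merge(parts)); // [a, b, c, d, e, f]
    }
}

Each poll/advance step is O(log k) for k parts, so the merge streams through the inputs without ever buffering or re-sorting them.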

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
