[email protected] wrote:
Hi John,
This was my experience, too. If I've interpreted the source code correctly, the time in
merging is spent on sorting, which is required because the segments are assumed to be
"random", possibly containing duplicated URLs. The sort process groups URLs
together and makes it possible to choose which one to include in the merge result.
I think that if a simplified version of merge were available that assumed all segments
come from the same crawl and are in a "good", completed state, this merge would be
very fast, because no sorting would be required.
It would be very useful, too, because this "simple" case seems to be what people need.
The output from segment merger has to be sorted anyway - most segment
data is stored in Hadoop MapFile-s, and MapFile requires the keys to be
sorted in ascending order.
We could perhaps speed this up by using the same approach as the raw
iterators (inside Hadoop core) use, i.e. open all input parts, and
then repeatedly advance the pointer in the reader that currently has the
smallest key - this way we get an on-the-fly merge-sort, IFF the individual
parts are already sorted (which we know they are).
This would be similar to the solution implemented here:
https://issues.apache.org/jira/browse/HADOOP-2834
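As a rough illustration (not the actual Hadoop code), the pointer-advancing
idea amounts to a generic k-way merge over already-sorted iterators, where a
priority queue always yields the reader holding the smallest current key.
The class and method names below are hypothetical:

```java
import java.util.*;

// Sketch of an on-the-fly merge-sort: each input part is an iterator over
// already-sorted keys; the priority queue keeps one (key, iterator) entry
// per part, ordered by key, so polling it repeatedly produces sorted output
// without a separate sort phase.
public class KWayMerge {
    public static List<String> merge(List<Iterator<String>> parts) {
        // Each queue entry holds { current key, the iterator it came from }.
        PriorityQueue<Object[]> pq = new PriorityQueue<>(
            Comparator.comparing((Object[] e) -> (String) e[0]));
        for (Iterator<String> it : parts) {
            if (it.hasNext()) pq.add(new Object[] { it.next(), it });
        }
        List<String> merged = new ArrayList<>();
        while (!pq.isEmpty()) {
            Object[] e = pq.poll();
            merged.add((String) e[0]);
            @SuppressWarnings("unchecked")
            Iterator<String> it = (Iterator<String>) e[1];
            // Advance only the reader we just consumed from.
            if (it.hasNext()) pq.add(new Object[] { it.next(), it });
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Iterator<String>> parts = Arrays.asList(
            Arrays.asList("a", "c", "e").iterator(),
            Arrays.asList("b", "c", "f").iterator());
        System.out.println(merge(parts)); // prints [a, b, c, c, e, f]
    }
}
```

Note this is linear in the total number of records (plus a log factor for the
queue), versus the full sort the current merger performs; duplicate keys from
different parts simply come out adjacent, so dedup policy can still be applied
on the fly.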
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com