Dennis Kubes wrote:
All,
We have two versions of an index splitter. The first version would run
an indexing job and then, using the completed index as input, read the
number of documents in the index and take a requested split size. From
this it used a custom index InputFormat to create splits according to
document id. We would run a job that mapped out index urls as keys and
documents, with their ids, wrapped in a SerializableWritable object as
the values. Then, inside a second job that used the index as input, a
MapRunner would read the other supporting databases (linkdb, segments)
and map all objects as ObjectWritables. Finally, on the reduce we had
custom Output and OutputFormat classes that took all of the objects and
wrote out the databases and indexes into each split.
There was a problem with this first approach, though: writing out an
index from a previously serialized document loses any fields that are
not stored (which is most of them). So we went with a second approach.
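
To make the stored-field problem concrete, here is a minimal sketch,
assuming Lucene 2.x-era APIs; the class and field names are made up and
this is not the splitter code itself:

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.store.RAMDirectory;

  // Demo of why re-indexing from a previously built index loses data:
  // IndexReader.document() returns only STORED fields, so anything that
  // was indexed but not stored (e.g. page content) is simply gone.
  public class StoredFieldLoss {
    public static void main(String[] args) throws Exception {
      RAMDirectory dir = new RAMDirectory();
      IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
      Document doc = new Document();
      doc.add(new Field("url", "http://example.com/",
          Field.Store.YES, Field.Index.UN_TOKENIZED));
      doc.add(new Field("content", "the full page text ...",
          Field.Store.NO, Field.Index.TOKENIZED));   // indexed, not stored
      writer.addDocument(doc);
      writer.close();

      IndexReader reader = IndexReader.open(dir);
      Document copy = reader.document(0);
      System.out.println(copy.get("url"));      // http://example.com/
      System.out.println(copy.get("content"));  // null - the field is lost
      reader.close();
    }
  }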
Yes, that's the nature of the beast - I sort of hoped that you had
implemented a true splitter, one that directly splits term posting lists
according to doc id. This is possible, and doesn't require using stored
fields - the only problem being that someone well acquainted with Lucene
internals needs to write it .. ;)
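
For what it's worth, here is a rough outline of what such a splitter
could look like, assuming Lucene 2.x-era APIs: wrap the reader so that
every doc outside a doc-id range appears deleted, then let
IndexWriter.addIndexes() compact the survivors into a new sub-index. A
complete version would also have to filter skipTo() and the bulk read()
methods; the class and variable names here are made up.

  import java.io.IOException;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.index.FilterIndexReader;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.index.TermDocs;
  import org.apache.lucene.index.TermPositions;

  // Presents only docs in [start, end) to the merger; everything else
  // looks deleted, so no stored fields are needed to build the split.
  public class DocRangeReader extends FilterIndexReader {

    private final int start, end;

    public DocRangeReader(IndexReader in, int start, int end) {
      super(in);
      this.start = start;
      this.end = end;
    }

    private boolean outside(int doc) {
      return doc < start || doc >= end;
    }

    public boolean isDeleted(int doc) {
      return outside(doc) || in.isDeleted(doc);
    }

    public boolean hasDeletions() {
      return true;
    }

    public int numDocs() {
      // assumes the source index has no real deletions of its own
      return Math.min(end, in.maxDoc()) - start;
    }

    // Posting lists must hide the "deleted" docs too, otherwise the
    // merge would still copy their postings into the split.
    public TermDocs termDocs() throws IOException {
      return new RangeTermDocs(in.termDocs());
    }

    public TermPositions termPositions() throws IOException {
      return new RangeTermPositions(in.termPositions());
    }

    private class RangeTermDocs extends FilterTermDocs {
      RangeTermDocs(TermDocs td) { super(td); }
      public boolean next() throws IOException {
        while (super.next()) {
          if (!outside(doc())) return true;
        }
        return false;
      }
    }

    private class RangeTermPositions extends FilterTermPositions {
      RangeTermPositions(TermPositions tp) { super(tp); }
      public boolean next() throws IOException {
        while (super.next()) {
          if (!outside(doc())) return true;
        }
        return false;
      }
    }

    // Usage sketch: write docs [0, 50000) of the source index into a split.
    public static void main(String[] args) throws Exception {
      IndexReader reader = IndexReader.open(args[0]);
      IndexWriter writer = new IndexWriter(args[1], new StandardAnalyzer(), true);
      writer.addIndexes(new IndexReader[] { new DocRangeReader(reader, 0, 50000) });
      writer.close();
      reader.close();
    }
  }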
The second approach takes a number of splits and runs an indexing job
on the fly. It calls the indexing and scoring filters. It uses the
linkdb, crawldb, and segments as input. As it indexes, it also splits
the databases and indexes across the number of reduce tasks, so that
the final output is multiple splits, each holding a part of the index
and its supporting databases. Each of the databases holds only the
information for the urls that are in its part of the index. These
parts can then be pushed to separate search servers. This type of
splitting works well, but you can NOT define a specific number of
documents or urls per split, and sometimes one split will have a lot
more urls than another if you
Why? It depends on the partitioner, or the output format that you are
using. E.g. in SegmentMerger I implemented an OutputFormat that produces
several segments simultaneously, writing to their respective parts
depending on a metadata value. This value may be used to switch
sequentially between output "slices" so that urls are spread evenly
across the "slices".
are indexing some sites that have a lot of pages (e.g. wikipedia or cnn
archives). This is currently how our system works. We fetch, invert
links, run through some other processes, and then index and split on
the fly. Then we use Python scripts to pull each split directly from
the DFS to each search server and then start the search servers.
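
On the uneven splits: if the job is partitioning by host, one simple
alternative (hypothetical, not existing Nutch code) is to partition on a
hash of the whole url, so that pages from a huge site spread across all
reduces while every record for a given url still ends up in the same
reduce. A sketch against the old mapred API:

  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.io.Writable;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.Partitioner;

  // Spreads urls over reduces by hashing the full url; deterministic per
  // key, so the index entry and its db records still meet in one reduce.
  public class UrlHashPartitioner implements Partitioner<Text, Writable> {
    public void configure(JobConf job) {}
    public int getPartition(Text url, Writable value, int numReduceTasks) {
      return (url.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
  }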
We are still working on the splitter, because the ideal approach would
let us specify a number of documents per split as well as group by
different keys, not just url. I would be happy to share the current
code, but it is highly integrated, so I would need to pull it out of
our code base first. It would be best if I could send it to someone,
say Andrzej, to take a look at first.
Unless I'm missing something, SegmentMerger.SegmentOutputFormat should
satisfy these requirements; you would just need to modify the job
implementation ...
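
To illustrate the idea (this is not the actual
SegmentMerger.SegmentOutputFormat, just a hypothetical outline against
the old mapred API): the RecordWriter keeps one SequenceFile per output
slice and routes each record by a slice id that the reducer has prefixed
onto the key, which is what lets you spread urls evenly or cap the number
of documents per slice.

  import java.io.IOException;
  import java.util.HashMap;
  import java.util.Map;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.ObjectWritable;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.FileOutputFormat;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.RecordWriter;
  import org.apache.hadoop.mapred.Reporter;
  import org.apache.hadoop.util.Progressable;

  // Writes each record to output/slice-<id>/<part>, where <id> is a slice
  // id the reducer put in front of the key as "<id>\t<url>".
  public class SliceOutputFormat extends FileOutputFormat<Text, ObjectWritable> {

    public RecordWriter<Text, ObjectWritable> getRecordWriter(
        final FileSystem fs, final JobConf job, final String name,
        final Progressable progress) throws IOException {

      // a production version would write under the task's work output path
      final Path dir = FileOutputFormat.getOutputPath(job);
      final Map<String, SequenceFile.Writer> slices =
          new HashMap<String, SequenceFile.Writer>();

      return new RecordWriter<Text, ObjectWritable>() {
        public void write(Text key, ObjectWritable value) throws IOException {
          String[] parts = key.toString().split("\t", 2);
          SequenceFile.Writer out = slices.get(parts[0]);
          if (out == null) {
            out = SequenceFile.createWriter(fs, job,
                new Path(new Path(dir, "slice-" + parts[0]), name),
                Text.class, ObjectWritable.class);
            slices.put(parts[0], out);
          }
          out.append(new Text(parts[1]), value);
        }
        public void close(Reporter reporter) throws IOException {
          for (SequenceFile.Writer out : slices.values()) {
            out.close();
          }
        }
      };
    }
  }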
--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com