Dennis Kubes wrote:
All,
We have two versions of an index splitter. The first version would run
an indexing job and then, using the completed index as input, read the
number of documents in the index and take a requested split size. From
this it used a custom index InputFormat to create splits according to
document id. We would run a job that mapped out index urls as keys and
documents, with their ids, wrapped in a SerializableWritable object as
the values. Then, inside a second job that used the index as input, a
MapRunner would read the other supporting databases (linkdb, segments)
and map all objects as ObjectWritables. Finally, on the reduce we had
custom Output and OutputFormat classes that took all of the objects and
wrote out the databases and indexes into each split.
There was a problem with this first approach, though: writing out an
index from a previously serialized document loses any fields that are
not stored (which is most of them). So we went with a second approach.
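
To make the stored-field problem concrete, here is a minimal sketch,
assuming Lucene 2.x-era APIs; the class and field names are made up and
this is not the splitter code itself:

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.store.RAMDirectory;

  // Demo of why re-indexing from a previously built index loses data:
  // IndexReader.document() returns only STORED fields, so anything that
  // was indexed but not stored (e.g. page content) is simply gone.
  public class StoredFieldLoss {
    public static void main(String[] args) throws Exception {
      RAMDirectory dir = new RAMDirectory();
      IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
      Document doc = new Document();
      doc.add(new Field("url", "http://example.com/",
          Field.Store.YES, Field.Index.UN_TOKENIZED));
      doc.add(new Field("content", "the full page text ...",
          Field.Store.NO, Field.Index.TOKENIZED));   // indexed, not stored
      writer.addDocument(doc);
      writer.close();

      IndexReader reader = IndexReader.open(dir);
      Document copy = reader.document(0);
      System.out.println(copy.get("url"));      // http://example.com/
      System.out.println(copy.get("content"));  // null - the field is lost
      reader.close();
    }
  }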
Yes, that's the nature of the beast - I sort of hoped that you had
implemented a true splitter, one that directly splits term posting lists
according to doc id. This is possible, and doesn't require using stored
fields - the only problem being that someone well acquainted with Lucene
internals needs to write it .. ;)
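
For what it's worth, here is a rough outline of what such a splitter
could look like, assuming Lucene 2.x-era APIs: wrap the reader so that
every doc outside a doc-id range appears deleted, then let
IndexWriter.addIndexes() compact the survivors into a new sub-index. A
complete version would also have to filter skipTo() and the bulk read()
methods; the class and variable names here are made up.

  import java.io.IOException;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.index.FilterIndexReader;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.index.TermDocs;
  import org.apache.lucene.index.TermPositions;

  // Presents only docs in [start, end) to the merger; everything else
  // looks deleted, so no stored fields are needed to build the split.
  public class DocRangeReader extends FilterIndexReader {

    private final int start, end;

    public DocRangeReader(IndexReader in, int start, int end) {
      super(in);
      this.start = start;
      this.end = end;
    }

    private boolean outside(int doc) {
      return doc < start || doc >= end;
    }

    public boolean isDeleted(int doc) {
      return outside(doc) || in.isDeleted(doc);
    }

    public boolean hasDeletions() {
      return true;
    }

    public int numDocs() {
      // assumes the source index has no real deletions of its own
      return Math.min(end, in.maxDoc()) - start;
    }

    // Posting lists must hide the "deleted" docs too, otherwise the
    // merge would still copy their postings into the split.
    public TermDocs termDocs() throws IOException {
      return new RangeTermDocs(in.termDocs());
    }

    public TermPositions termPositions() throws IOException {
      return new RangeTermPositions(in.termPositions());
    }

    private class RangeTermDocs extends FilterTermDocs {
      RangeTermDocs(TermDocs td) { super(td); }
      public boolean next() throws IOException {
        while (super.next()) {
          if (!outside(doc())) return true;
        }
        return false;
      }
    }

    private class RangeTermPositions extends FilterTermPositions {
      RangeTermPositions(TermPositions tp) { super(tp); }
      public boolean next() throws IOException {
        while (super.next()) {
          if (!outside(doc())) return true;
        }
        return false;
      }
    }

    // Usage sketch: write docs [0, 50000) of the source index into a split.
    public static void main(String[] args) throws Exception {
      IndexReader reader = IndexReader.open(args[0]);
      IndexWriter writer = new IndexWriter(args[1], new StandardAnalyzer(), true);
      writer.addIndexes(new IndexReader[] { new DocRangeReader(reader, 0, 50000) });
      writer.close();
      reader.close();
    }
  }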
The second approach takes a number of splits and runs an indexing job
on the fly. It calls the indexing and scoring filters. It uses the
linkdb, crawldb, and segments as input. As it indexes, it also splits
the databases and indexes across the number of reduce tasks, so that
the final output is multiple splits, each holding a part of the index
and its supporting databases. Each of the databases holds only the
information for the urls that are in its part of the index. These
parts can then be pushed to separate search servers. This type of
splitting works well, but you can NOT define a specific number of
documents or urls per split, and sometimes one split will have a lot
more urls than another if you
Why? It depends on the partitioner, or the output format that you are
using. E.g. in SegmentMerger I implemented an OutputFormat that produces
several segments simultaneously, writing to their respective parts
depending on a metadata value. This value may be used to switch
sequentially between output "slices" so that urls are spread evenly
across the "slices".
are indexing some sites that have a lot of pages (e.g. wikipedia or cnn
archives). This is currently how our system works. We fetch, invert
links, run through some other processes, and then index and split on
the fly. Then we use Python scripts to pull each split directly from
the DFS to each search server and then start the search servers.
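
On the uneven splits: if the job is partitioning by host, one simple
alternative (hypothetical, not existing Nutch code) is to partition on a
hash of the whole url, so that pages from a huge site spread across all
reduces while every record for a given url still ends up in the same
reduce. A sketch against the old mapred API:

  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.io.Writable;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.Partitioner;

  // Spreads urls over reduces by hashing the full url; deterministic per
  // key, so the index entry and its db records still meet in one reduce.
  public class UrlHashPartitioner implements Partitioner<Text, Writable> {
    public void configure(JobConf job) {}
    public int getPartition(Text url, Writable value, int numReduceTasks) {
      return (url.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
  }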
We are still working on the splitter, because the ideal approach would
let us specify a number of documents per split as well as group by
different keys, not just url. I would be happy to share the current
code, but it is highly integrated, so I would need to pull it out of
our code base first. It would be best if I could send it to someone,
say Andrzej, to take a look at first.
Unless I'm missing something, SegmentMerger.SegmentOutputFormat should
satisfy these requirements; you would just need to modify the job
implementation ...
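
To illustrate the idea (this is not the actual
SegmentMerger.SegmentOutputFormat, just a hypothetical outline against
the old mapred API): the RecordWriter keeps one SequenceFile per output
slice and routes each record by a slice id that the reducer has prefixed
onto the key, which is what lets you spread urls evenly or cap the number
of documents per slice.

  import java.io.IOException;
  import java.util.HashMap;
  import java.util.Map;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.ObjectWritable;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.FileOutputFormat;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.RecordWriter;
  import org.apache.hadoop.mapred.Reporter;
  import org.apache.hadoop.util.Progressable;

  // Writes each record to output/slice-<id>/<part>, where <id> is a slice
  // id the reducer put in front of the key as "<id>\t<url>".
  public class SliceOutputFormat extends FileOutputFormat<Text, ObjectWritable> {

    public RecordWriter<Text, ObjectWritable> getRecordWriter(
        final FileSystem fs, final JobConf job, final String name,
        final Progressable progress) throws IOException {

      // a production version would write under the task's work output path
      final Path dir = FileOutputFormat.getOutputPath(job);
      final Map<String, SequenceFile.Writer> slices =
          new HashMap<String, SequenceFile.Writer>();

      return new RecordWriter<Text, ObjectWritable>() {
        public void write(Text key, ObjectWritable value) throws IOException {
          String[] parts = key.toString().split("\t", 2);
          SequenceFile.Writer out = slices.get(parts[0]);
          if (out == null) {
            out = SequenceFile.createWriter(fs, job,
                new Path(new Path(dir, "slice-" + parts[0]), name),
                Text.class, ObjectWritable.class);
            slices.put(parts[0], out);
          }
          out.append(new Text(parts[1]), value);
        }
        public void close(Reporter reporter) throws IOException {
          for (SequenceFile.Writer out : slices.values()) {
            out.close();
          }
        }
      };
    }
  }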
--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com