Dennis,

You should consider splitting based on a function that will give you a more
uniform distribution (e.g., MD5(url)). That way, you should see much less
variation in the number of documents per partition.
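Something along these lines could be a starting point (an untested sketch in
plain Java rather than any particular Hadoop Partitioner interface; the class
and method names are just placeholders, not anything in Nutch):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Md5UrlPartitioner {

  /**
   * Maps a url to one of numPartitions buckets using the leading bytes of
   * its MD5 digest, so documents spread roughly uniformly across splits.
   */
  public static int partitionFor(String url, int numPartitions) {
    try {
      MessageDigest md5 = MessageDigest.getInstance("MD5");
      byte[] d = md5.digest(url.getBytes(StandardCharsets.UTF_8));
      // Pack the first four digest bytes into an int, force it non-negative,
      // then take it modulo the number of partitions (i.e. reduce tasks).
      int hash = ((d[0] & 0xff) << 24) | ((d[1] & 0xff) << 16)
               | ((d[2] & 0xff) << 8)  |  (d[3] & 0xff);
      return (hash & Integer.MAX_VALUE) % numPartitions;
    } catch (NoSuchAlgorithmException e) {
      // MD5 is available on standard JREs, so this should not happen.
      throw new RuntimeException(e);
    }
  }
}

You would call this from your partitioner with numPartitions set to the
number of reduce tasks, i.e. the number of index splits.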
Chad

On 12/4/06 3:29 PM, "Dennis Kubes" <[EMAIL PROTECTED]> wrote:

> All,
>
> We have two versions of a type of index splitter. The first version
> would run an indexing job and then, using the completed index as input,
> would read the number of documents in the index and take a requested
> split size. From this it used a custom index input format to create
> splits according to document id. We would run a job that would map out
> index urls as keys and documents with their ids wrapped in a
> SerializableWritable object as the values. Then, inside a second job
> using the index as input, we would have a MapRunner that would read the
> other supporting databases (linkdb, segments) and map all objects as
> ObjectWritables. Then on the reduce we had a custom Output and
> OutputFormat that took all of the objects and wrote out the databases
> and indexes into each split.
>
> There was a problem with this first approach, though, in that writing
> out an index from a previously serialized document would lose any
> fields that are not stored (which is most of them). So we went with a
> second approach.
>
> The second approach takes a number of splits and runs through an
> indexing job on the fly. It calls the indexing and scoring filters. It
> uses the linkdb, crawldb, and segments as input. As it indexes, it also
> splits the databases and indexes into the number of reduce tasks, so
> that the final output is multiple splits, each holding a part of the
> index and its supporting databases. Each of the databases holds only
> the information for the urls that are in its part of the index. These
> parts can then be pushed to separate search servers. This type of
> splitting works well, but you can NOT define a specific number of
> documents or urls per split, and sometimes one split will have a lot
> more urls than another if you are indexing some sites that have a lot
> of pages (e.g., Wikipedia or CNN archives). This is currently how our
> system works. We fetch, invert links, run through some other
> processes, and then index and split on the fly. Then we use Python
> scripts to pull each split directly from the DFS to each search server
> and then start the search servers.
>
> We are still working on the splitter because the ideal approach would
> be to be able to specify a number of documents per split as well as to
> group by different keys, not just url. I would be happy to share the
> current code, but it is highly integrated, so I would need to pull it
> out of our code base first. It would be best if I could send it to
> someone, say Andrzej, to take a look at first.
>
> Dennis
>
> Andrzej Bialecki wrote:
>> Dennis Kubes wrote:
>>
>> [...]
>>> Having a new index on each machine and having to create separate
>>> indexes is not the most elegant way to accomplish this architecture.
>>> The best way that we have found is to have a splitter job that
>>> indexes and splits the index and
>>
>> Have you implemented a Lucene index splitter, i.e. a tool that takes
>> an existing Lucene index and splits it into parts by document id? This
>> sounds very interesting - could you tell us a bit about this?
>>
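For anyone following along, the field-loss issue Dennis describes above comes
from the fact that Lucene only materializes stored fields when you read a
Document back out of an existing index; indexed-but-unstored fields (which,
as he notes, is most of them) are simply absent. A rough illustration,
assuming a Lucene 3.x-style API (exact class and method names differ a bit
between versions):

import java.io.File;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.FSDirectory;

public class ShowStoredFieldsOnly {
  public static void main(String[] args) throws Exception {
    IndexReader reader = IndexReader.open(FSDirectory.open(new File(args[0])));
    // document(n) rebuilds only the *stored* fields; anything that was
    // indexed but not stored is missing from the returned Document, so
    // adding that Document to a new index silently drops those fields.
    Document doc = reader.document(0);
    System.out.println(doc.getFields());
    reader.close();
  }
}

That is why the second approach re-runs the indexing and scoring filters on
the fly instead of copying documents out of a finished index.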
