Sure, it would all work, and it would be better than the "naive" one-term-per-UID index. Mapping several UIDs to one term enables the compromise: number of unique terms in the term dictionary traded against CPU spent resolving collisions during updates.
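For concreteness, a rough sketch of what an update could look like under that compromise (Lucene 3.x-era API; the "uid_hash"/"uid" field names and the hashOf()/rebuildDocument() helpers are made up for illustration, not from any real patch). The hash term may match several documents, so the stored UID picks the real target, and the "false" collisions get re-indexed after the delete:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

// Sketch only: update one document when several UIDs share a hash term.
void updateByHashedUid(IndexWriter writer, IndexSearcher searcher,
                       String uid, Document newVersion) throws IOException {
  Term hashTerm = new Term("uid_hash", hashOf(uid));
  TopDocs hits = searcher.search(new TermQuery(hashTerm), 1000);

  // Collect the documents that merely collide on the hash.
  List<Document> falseCollisions = new ArrayList<Document>();
  for (ScoreDoc sd : hits.scoreDocs) {
    Document doc = searcher.doc(sd.doc);
    if (!uid.equals(doc.get("uid"))) {
      falseCollisions.add(rebuildDocument(doc));
    }
  }

  writer.deleteDocuments(hashTerm);   // removes the target *and* the colliders
  writer.addDocument(newVersion);     // the actual update
  for (Document d : falseCollisions) {
    writer.addDocument(d);            // restore the innocent bystanders
  }
}

// Hypothetical: 16 bits of the UID's hash -> at most 65536 unique terms.
String hashOf(String uid) {
  return Integer.toHexString(uid.hashCode() & 0xFFFF);
}

// Hypothetical: only valid when every field is stored; otherwise the
// source data has to be re-fetched from outside Lucene by UID.
Document rebuildDocument(Document stored) {
  return stored;
}

The fewer bits hashOf() keeps, the smaller the term dictionary and the more documents each update has to re-index; that is the knob to tune.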
I like Paul's idea with multiple fields: it reduces the number of UID terms in the term dictionary, but increases the density of the postings lists for those terms. It simplifies updates, as no collisions are possible; it just makes them slower. It is all rather fiddly and suboptimal, and one needs to tune to find the optimum here, but hey, better than the naive approach. Both of these solutions are just better ways to do it wrong :)

The real solution is definitely somewhere around ParallelReader usage. Ideally, one should be able to declare, when opening an index, which parts of it one is going to use. One way to do that is to create parallel indexes; the searching part is fully functional and already there (see the sketch at the end of this mail). Anyone using ParallelReader, any tips on creating parallel indexes?

In my particular case, ParallelReader is not strictly necessary, because I "only" need to filter out one field from the term dictionary, along with its postings, while loading the index into a RAMDirectory. FileSwitchDirectory offers quite some flexibility here, but the postings for a single field are not kept in separate files...

Thanks for the good tips; we found two better solutions for our "UID use cases toolbox".

Cheers,
eks

----- Original Message ----
> From: Toke Eskildsen <t...@statsbiblioteket.dk>
> To: "dev@lucene.apache.org" <dev@lucene.apache.org>
> Sent: Fri, 22 October, 2010 0:32:04
> Subject: RE: Polymorphic Index
>
> From: Mark Harwood [markharw...@yahoo.co.uk]
> > Good point, Toke. Forgot about that. Of course doubling the number
> > of hash algos used to 4 increases the space massively.
>
> Maybe your hashing-idea could work even with collisions?
>
> Using your original two-hash suggestion, we're just about sure to get
> collisions. However, we are still able to uniquely identify the right
> document, as the UID is also stored (search for the hashes, iterate over
> the results and get the UID for each). When an update is requested for an
> existing document, the indexer extracts the UIDs from all the documents
> that match the hash. Then it performs a delete of the hash-terms and
> re-indexes all the documents that had "false" collisions. As the number
> of unique hash-values as well as the hash-function can be adjusted, this
> could be a nicely tweakable performance-vs-space trade-off.
>
> This will only work if it is possible to re-create the documents from
> stored terms or by requesting the data from outside of Lucene by UID.
> Is this possible with your setup, eks dev?
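P.S. Since ParallelReader came up, a minimal sketch of the search side (Lucene 3.x-era API; the directory paths and the uid/content field split are made up). The catch, and the reason creating parallel indexes is tricky, is that both sub-indexes must contain exactly the same documents in the same order, so the docIDs line up:

import java.io.File;
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.ParallelReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;

// One sub-index holds only the UID field, the other everything else;
// the small one is loaded into RAM while the big one stays on disk.
IndexSearcher openSplitIndex() throws IOException {
  Directory uidDir     = new RAMDirectory(FSDirectory.open(new File("/idx/uid")));
  Directory contentDir = FSDirectory.open(new File("/idx/content"));

  ParallelReader pr = new ParallelReader();
  pr.add(IndexReader.open(uidDir));      // docID N here...
  pr.add(IndexReader.open(contentDir));  // ...must be the same logical doc here

  return new IndexSearcher(pr);          // sees the union of the fields
}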