Perhaps another way of thinking about the problem: given a large range of IDs (e.g. your 300 million) you could constrain the number of unique terms using a double-hashing technique. Pick a number "n" for the maximum number of unique terms you'll tolerate (e.g. 1 million) and store 2 terms for every primary key, each derived with a different hash function, e.g.:
int hashedKey1 = hashFunction1(myKey) % maxNumUniqueTerms;
int hashedKey2 = hashFunction2(myKey) % maxNumUniqueTerms;

Queries to retrieve/delete a record then search for hashedKey1 AND hashedKey2. The probability of the same collision occurring under two different hash functions is minimal, so the search should return only the original record. Obviously you would still have the postings recorded, but these would be slightly more compact: each of your 1 million unique terms would have ~300 gap-encoded vint entries, as opposed to 300m postings of one full int each.

Cheers
Mark

On 21 Oct 2010, at 20:44, eks dev wrote:

> Hi All,
> I am trying to figure out a way to implement the following use case with
> lucene/solr.
>
> In order to support simple incremental updates (master) I need to index and
> store a UID field on a 300Mio collection. (My UID is a 32 byte sequence.) But I
> do not need it indexed (only stored) during normal searching (slaves).
>
> The problem is that my term dictionary gets blown away by the sheer number of
> unique IDs. The number of unique terms in this collection, excluding UID, is less
> than 7Mio.
> I can tolerate a resource hit on the Updater (big hardware, on-disk index...).
>
> This is a master-slave setup, where searchers run from RAMDisk, and having
> 300Mio * 32 (give or take prefix compression) plus pointers to postings and
> postings is something I would really love to avoid, as this is significant
> compared to the really small documents I have.
>
> Cutting to the chase:
> How can I have an indexed UID field, and when done with indexing:
> 1) Load a "searchable" index into RAM from such an index on disk, without one
> field?
> 2) Create 2 indices in sync on docIDs, one containing only the indexed UID.
> 3) Somehow transform the index with the indexed UID by dropping the UID field,
> preserving docIds. A kind of smart index-editing tool.
>
> Something else already there I do not know about?
>
> Preserving docIds is crucial, as I need support for lovely incremental updates
> (like in solr master-slave update). Also the stored field should remain!
> I am not looking for "use an MMAPed index and let the OS deal with it" advice...
> I do not mind doing it with the flex branch 4.0, not being in a hurry.
>
> Thanks in advance,
> Eks
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
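The double-hashing scheme above can be sketched roughly as follows. This is a minimal, self-contained illustration, not Lucene code: the concrete hash functions (Java's String.hashCode and a 32-bit FNV-1a) and the class/method names are my own assumptions; in practice you would index hashedKey1 and hashedKey2 as two fields and combine them in a BooleanQuery with two MUST TermQuery clauses.

```java
import java.nio.charset.StandardCharsets;

public class DoubleHashedUid {

    // "n": the maximum number of unique terms you'll tolerate (1 million in the mail).
    static final int MAX_UNIQUE_TERMS = 1_000_000;

    // Hash function 1 (assumed choice): Java's built-in String hash,
    // folded into the [0, MAX_UNIQUE_TERMS) range. floorMod keeps the
    // result non-negative even when hashCode() is negative.
    static int hashedKey1(String uid) {
        return Math.floorMod(uid.hashCode(), MAX_UNIQUE_TERMS);
    }

    // Hash function 2 (assumed choice): 32-bit FNV-1a over the UID bytes.
    static int hashedKey2(String uid) {
        int h = 0x811c9dc5; // FNV offset basis
        for (byte b : uid.getBytes(StandardCharsets.UTF_8)) {
            h = (h ^ (b & 0xff)) * 0x01000193; // FNV prime
        }
        return Math.floorMod(h, MAX_UNIQUE_TERMS);
    }

    public static void main(String[] args) {
        // Example 32-byte UID as described in the original mail.
        String uid = "0123456789abcdef0123456789abcdef";
        int k1 = hashedKey1(uid);
        int k2 = hashedKey2(uid);
        // A retrieve/delete would search for (uidHash1:k1 AND uidHash2:k2);
        // a single-field collision is likely at these scales, but a
        // simultaneous collision on both independent hashes is very unlikely.
        System.out.println("hashedKey1=" + k1 + " hashedKey2=" + k2);
    }
}
```

With n = 1 million buckets and 300 million documents, each single hashed term matches ~300 docs, so the AND of the two hashed terms is what narrows the result back down to (almost certainly) the one original record.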