I didn't think, it was just a spontaneous reaction :) At the moment I am using static dictionaries to at least get a grip on the size of stored fields (escaping encoded terms).
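To make "static dictionaries with escaping" a bit more concrete, here is a rough sketch of the general technique. Everything in it is an assumption made for illustration: the class name StaticDictCodec, the ESC marker character, the base-36 ordinals and the single-space tokenization are hypothetical, and none of it is Lucene API or necessarily the way my actual code does it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not a Lucene class): static-dictionary encoding of a stored field value. */
final class StaticDictCodec {
    // Assumed escape marker: a control character that marks "an ordinal follows".
    static final char ESC = '\u0001';

    private final String[] terms;                               // ord -> term
    private final Map<String, Integer> ords = new HashMap<>();  // term -> ord

    StaticDictCodec(List<String> dictionary) {
        terms = dictionary.toArray(new String[0]);
        for (int i = 0; i < terms.length; i++) {
            ords.putIfAbsent(terms[i], i);
        }
    }

    /** Replaces dictionary tokens with ESC + ordinal; other tokens stay verbatim. */
    String encode(String value) {
        List<String> out = new ArrayList<>();
        for (String token : value.split(" ", -1)) {
            Integer ord = ords.get(token);
            if (ord != null) {
                out.add(ESC + Integer.toString(ord, Character.MAX_RADIX));
            } else if (!token.isEmpty() && token.charAt(0) == ESC) {
                out.add(ESC + token);   // literal token starting with ESC: double the marker
            } else {
                out.add(token);
            }
        }
        return String.join(" ", out);
    }

    /** Inverse of encode(): expands ordinals and removes the doubled marker. */
    String decode(String stored) {
        List<String> out = new ArrayList<>();
        for (String token : stored.split(" ", -1)) {
            if (token.isEmpty() || token.charAt(0) != ESC) {
                out.add(token);                                   // plain literal
            } else if (token.length() > 1 && token.charAt(1) == ESC) {
                out.add(token.substring(1));                      // escaped literal
            } else {
                int ord = Integer.parseInt(token.substring(1), Character.MAX_RADIX);
                out.add(terms[ord]);                              // dictionary term
            }
        }
        return String.join(" ", out);
    }

    public static void main(String[] args) {
        StaticDictCodec codec = new StaticDictCodec(Arrays.asList("category", "electronics"));
        String value = "category electronics misc";
        String stored = codec.encode(value);
        System.out.println(value.length() + " chars -> " + stored.length() + " chars");
        System.out.println("round trip ok: " + codec.decode(stored).equals(value));
    }
}
```

The escaping is there so that a literal token which happens to start with the marker cannot be mistaken for an encoded ordinal. Because the dictionary is static, the same term list has to be available at read time.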
Re: Global

Maybe the trick would be to somehow use the term dictionary, as it must be *eventually* updated anyway? One idea is to write the raw token stream for atomicity and reduce it later in a compaction phase (e.g. on Lucene commit())... no matter what we plan to do, TL (transaction log) compaction is going to be needed? It is slightly a "moving target" problem (the TL chases the term dictionary), but I am sure the benefits can be huge.

A compacted TL entry would need a pointer to the Term[] used to encode it, but this is by all means doable, just a simple Term[].

It surely does not make much sense for high-cardinality fields, but if you have something with low cardinality (indexed and stored) on a big (100 million) collection, this reduces space by an exorbitant amount.

I do not know, I am just trying to build upon the fact that we have the term dictionary updated in any case. This works not only for transaction logging, but also for (Analyzed)->{Stored, indexed} fields.

By the way, I never looked at how our term vectors work: do they keep a reference to the token or a verbatim copy of the term?

On Fri, Sep 9, 2011 at 12:31 PM, Andrzej Bialecki <a...@getopt.org> wrote:
> On 09/09/2011 12:07, eks dev wrote:
>>
>> +1
>> indeed! All possibilities are needed.
>>
>> One might do wild things if it is somehow typed. For example,
>> dictionary compression for fields that are tokenized (not only
>> stored), as we already have Term dictionary supporting ord-s. Keeping
>> just a map Token<-> ord with transaction log...
>
> Hmm, you mean a per-doc map? because a global map would have to be updated
> as we add new docs, which would make the writing process non-atomic, which
> is the last thing you want from a transaction log :)
>
> As a per-doc compression, sure. In fact, what you describe is essentially a
> single doc mini-index, because the map is a term dict, the token streams
> with ords are postings, etc.
>
> --
> Best regards,
> Andrzej Bialecki     <><
> Information Retrieval, Semantic Web
> Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org