I didn't think, it was just a spontaneous reaction :)

At the moment I am using static dictionaries to at least get a grip on
the size of stored fields (escaping encoded terms).
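
A minimal sketch of what such a static dictionary could look like, under one reading of "escaping": tokens found in the dictionary are stored as ords, and anything else is kept verbatim as an escaped literal. The class name StaticDictCodec and the List<Object> representation are hypothetical, chosen for illustration only; nothing here is Lucene API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration, not Lucene code.
final class StaticDictCodec {
    private final String[] ordToTerm;
    private final Map<String, Integer> termToOrd = new HashMap<>();

    // The dictionary is fixed at construction time ("static").
    StaticDictCodec(String... terms) {
        this.ordToTerm = terms.clone();
        for (int i = 0; i < terms.length; i++) {
            termToOrd.put(terms[i], i);
        }
    }

    // Known tokens become Integer ords; unknown tokens are kept
    // verbatim as Strings (the "escape" path).
    List<Object> encode(List<String> tokens) {
        List<Object> out = new ArrayList<>(tokens.size());
        for (String token : tokens) {
            Integer ord = termToOrd.get(token);
            if (ord != null) {
                out.add(ord);
            } else {
                out.add(token);
            }
        }
        return out;
    }

    // Reverses encode(): ords are looked up, literals pass through.
    List<String> decode(List<Object> encoded) {
        List<String> out = new ArrayList<>(encoded.size());
        for (Object o : encoded) {
            if (o instanceof Integer) {
                out.add(ordToTerm[(Integer) o]);
            } else {
                out.add((String) o);
            }
        }
        return out;
    }
}
```

A real on-disk format would need a byte-level marker to tell ords from literals; the Integer/String distinction above only stands in for that.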

Re: Global
Maybe the trick would be to somehow use the term dictionary, as it must be
*eventually* updated anyway? One idea is to write the raw token stream for
atomicity and reduce it later in a compaction phase (e.g. on Lucene
commit())... no matter what we plan to do, TL (transaction log) compaction
is going to be needed?

It is slightly a "moving target" problem (the TL chases the term dictionary),
but I am sure the benefits can be huge. A compacted TL entry would need to
have a pointer to the Term[] used to encode it, but this is by all means
doable, it is just a simple Term[].
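
A rough sketch of the compaction step, assuming the dictionary snapshot is a sorted array and using String[] as a stand-in for Term[]. The names TlCompactionSketch, CompactedEntry and compact are hypothetical, and this is plain Java rather than Lucene API. The raw entry is written first; compaction replaces tokens with ords only when every token is already in the snapshot, and the compacted entry keeps a reference to that exact snapshot so it can still be decoded after the dictionary moves on.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration, not Lucene code.
final class TlCompactionSketch {

    // A compacted TL entry: ords plus a pointer to the dictionary
    // snapshot that was used to encode them.
    static final class CompactedEntry {
        final String[] dict; // stand-in for the Term[] from the thread
        final int[] ords;

        CompactedEntry(String[] dict, int[] ords) {
            this.dict = dict;
            this.ords = ords;
        }

        List<String> decode() {
            List<String> out = new ArrayList<>(ords.length);
            for (int ord : ords) {
                out.add(dict[ord]);
            }
            return out;
        }
    }

    // Compacts a raw token stream against a sorted dictionary snapshot.
    // Returns null if some token is not in the snapshot yet, in which
    // case the caller keeps the raw entry and retries later.
    static CompactedEntry compact(List<String> rawTokens, String[] sortedDict) {
        int[] ords = new int[rawTokens.size()];
        for (int i = 0; i < ords.length; i++) {
            int ord = Arrays.binarySearch(sortedDict, rawTokens.get(i));
            if (ord < 0) {
                return null; // dictionary has not caught up with this token
            }
            ords[i] = ord;
        }
        return new CompactedEntry(sortedDict, ords);
    }
}
```

Because the raw entry is appended before any dictionary lookup, the write itself stays atomic; only the later compaction depends on the state of the dictionary.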

It surely does not make much sense for high-cardinality fields, but if you
have something with low cardinality (indexed and stored) on a big
(100 million) collection, this reduces space by a very large amount.


I do not know, I am just trying to build upon the fact that the term
dictionary gets updated in any case.


This works not only for transaction logging, but also for
(Analyzed)->{Stored, indexed} fields. By the way, I never looked at how
our term vectors work: do they keep a reference to the token or a
verbatim copy of the term?





On Fri, Sep 9, 2011 at 12:31 PM, Andrzej Bialecki <a...@getopt.org> wrote:
> On 09/09/2011 12:07, eks dev wrote:
>>
>> +1
>> indeed! All possibilities are needed.
>>
>> One might do wild things if it is somehow  typed. For example,
>> dictionary compression for fields that are tokenized (not only
>> stored), as we already have Term dictionary supporting ord-s. Keeping
>> just a map Token<->  ord with transaction log...
>
> Hmm, you mean a per-doc map? because a global map would have to be updated
> as we add new docs, which would make the writing process non-atomic, which
> is the last thing you want from a transaction log :)
>
> As a per-doc compression, sure. In fact, what you describe is essentially a
> single doc mini-index, because the map is a term dict, the token streams
> with ords are postings, etc.
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>
