On 09/09/2011 13:20, eks dev wrote:
I didn't think, it was just a spontaneous reaction :)

At the moment I am using static dictionaries to at least get a grip on
the size of stored fields (escaping encoded terms).

Re: Global
Maybe the trick would be to somehow use the term dictionary, as it must be
*eventually* updated anyway? One idea is to write the raw token stream for
atomicity and reduce it later in a compaction phase (e.g. on Lucene
commit())... no matter what we plan to do, TL compaction is going to be
needed?

Compaction - not sure, it would have to preserve the ordering of ops. But some form of primitive compression - certainly: delta coding, vints, etc., anything that can be done per doc, without needing data that spans more than one record.
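
A minimal sketch of what such per-record compression could look like, assuming each record carries a list of ascending integers (token positions, say). The function names are hypothetical and this is not Lucene code; the vint layout (7 bits per byte, low-order group first, high bit set when more bytes follow) is meant to mirror the usual variable-length int scheme. Nothing here needs data from any other record.

```python
def encode_vint(n):
    """Encode a non-negative int, 7 bits per byte, low-order group first.
    The high bit of a byte is set when more bytes follow."""
    if n < 0:
        raise ValueError("vint requires a non-negative integer")
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)


def decode_vints(data):
    """Decode a byte string holding back-to-back vints."""
    values, cur, shift = [], 0, 0
    for b in data:
        cur |= (b & 0x7F) << shift
        if b & 0x80:
            shift += 7          # continuation bit set: keep accumulating
        else:
            values.append(cur)  # last byte of this value
            cur, shift = 0, 0
    return values


def encode_positions(positions):
    """Delta-code the ascending positions of ONE record, then vint each gap."""
    out, prev = bytearray(), 0
    for p in positions:
        out += encode_vint(p - prev)
        prev = p
    return bytes(out)


def decode_positions(data):
    """Inverse of encode_positions: running sum over the decoded gaps."""
    positions, prev = [], 0
    for gap in decode_vints(data):
        prev += gap
        positions.append(prev)
    return positions
```

Because every record is encoded independently, log entries can still be appended and replayed one at a time, in order.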


It is slightly a "moving target" problem (the TL chases the term dictionary),
but I am sure the benefits can be huge. A compacted TL entry would need to
have a pointer to the Term[] used to encode it, but this is by all means
doable, just a simple Term[].

It surely does not make much sense for high-cardinality fields, but if you
have something with low cardinality (indexed and stored) on a big
(100 million docs) collection, this reduces space by exorbitant amounts.
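
A minimal sketch of that idea, assuming the Term[] is an immutable snapshot identified by an id. TermSnapshot, compact_entry and expand_entry are hypothetical names for illustration, not existing Lucene classes. The compacted entry stores ordinals plus the id of the snapshot used to encode it.

```python
class TermSnapshot:
    """Hypothetical immutable Term[]: a term's ordinal is its list index."""

    def __init__(self, snapshot_id, terms):
        self.snapshot_id = snapshot_id
        self.terms = list(terms)
        self.ords = {term: i for i, term in enumerate(self.terms)}


def compact_entry(tokens, snapshot):
    """Replace raw tokens with ordinals and keep a pointer (the id) to the
    snapshot that was used. A token missing from the snapshot raises
    KeyError: that is the "moving target" case, where the entry has to
    stay raw until the dictionary catches up."""
    return (snapshot.snapshot_id, [snapshot.ords[t] for t in tokens])


def expand_entry(entry, snapshots):
    """Recover the raw tokens by looking up the snapshot the entry points to."""
    snapshot_id, ords = entry
    terms = snapshots[snapshot_id].terms
    return [terms[o] for o in ords]
```

For a low-cardinality field, the same few ordinals repeat across all entries, so each value costs a small integer instead of a full term copy.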


I do not know, just trying to build upon the fact that we have the term
dictionary updated in any case.

If the tlog has a Commit op, then you could theoretically compact all preceding entries ... at least their term dicts. If you compacted the postings, too, then you would essentially have a multi-doc index ("naked segment"), but it would not be a transaction log anymore, because the update ordering wouldn't be preserved (e.g. intermediate Delete ops would have a different effect).
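
A toy illustration of why the ordering matters, assuming a log of (kind, doc_id, body) tuples. The op names and the replay function are made up for this sketch. Replaying the ops in log order and applying the same ops "segment style" (all adds merged first, deletes afterwards) give different results.

```python
def replay(ops):
    """Apply ops in the given order; returns the surviving docs keyed by id."""
    index = {}
    for kind, doc_id, body in ops:
        if kind == "add":
            index[doc_id] = body
        elif kind == "delete":
            index.pop(doc_id, None)
        # "commit" is only a marker in this toy model
    return index


log = [
    ("add", 1, "a b"),
    ("delete", 1, None),     # intermediate delete
    ("add", 1, "c d"),
    ("commit", None, None),
]

# Transaction-log semantics: strict log order.
in_order = replay(log)

# "Naked segment" semantics: the adds are merged into one multi-doc
# structure and the deletes are applied afterwards, losing the interleaving.
adds = [op for op in log if op[0] == "add"]
deletes = [op for op in log if op[0] == "delete"]
merged = replay(adds + deletes)
```

Here in_order still contains doc 1 with its second body, while merged is empty, because the delete now lands after the re-add.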



This works not only for transaction logging, but also for
(Analyzed)->{Stored, indexed} fields. By the way, I never looked at how
our term vectors work: do they keep a reference to the token or a
verbatim copy of the term?

It's like term dict + postings; terms are delta front coded, as in the main term dictionary. It does not reuse terms from the main dict. I think this representation was chosen to avoid ord renumbering when the main term dict is updated - otherwise you would have to renumber all term vectors on each commit...
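
A minimal sketch of front coding over a sorted term list, for illustration only; the actual on-disk layout differs, and front_encode / front_decode are hypothetical names. Each term is stored as the length of the prefix it shares with the previous term, plus the remaining suffix.

```python
def front_encode(sorted_terms):
    """Encode each term as (shared_prefix_length, suffix) relative to the
    previous term. Assumes the input is already sorted."""
    out, prev = [], ""
    for term in sorted_terms:
        n = 0
        limit = min(len(prev), len(term))
        while n < limit and prev[n] == term[n]:
            n += 1
        out.append((n, term[n:]))
        prev = term
    return out


def front_decode(entries):
    """Rebuild the terms by reusing the prefix of the previously decoded term."""
    terms, prev = [], ""
    for n, suffix in entries:
        prev = prev[:n] + suffix
        terms.append(prev)
    return terms
```

Since the encoding depends only on the document's own sorted terms, it stays valid no matter how the main term dictionary changes.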

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

