On 09/09/2011 11:00, Simon Willnauer wrote:
> I created LUCENE-3424 for this. But I still would like to keep the
> discussion open here rather than moving this entirely to an issue.
> There is more to this than only the seq. ids.

I'm also concerned about the content of the transaction log. In Solr it stores javabin-encoded UpdateCommands (either SolrInputDocuments or delete/commit commands). Documents in the log are raw documents, i.e. captured before analysis.

This may have some merit for Solr (e.g. you could imagine running different analysis chains on the Solr slaves), but IMHO it's more of a hassle for Lucene, because it means the analysis has to be repeated over and over again on every client. If the analysis chain is costly (e.g. NLP), it would make sense to have an option to log documents post-analysis, i.e. as correctly typed stored values (e.g. string -> numeric) AND the resulting TokenStreams. This would also have the advantage of moving us towards the "dumb IndexWriter" concept, i.e. separating analysis from the core inverted-index functionality.

So I'd argue for recording post-analysis documents in the tlog, either exclusively or as the default option.
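To make the idea concrete, here is a minimal sketch of what a post-analysis tlog entry could look like. This is NOT existing Lucene or Solr code: the Token record and the read/write helpers are hypothetical, standing in for the attributes (term, position increment, offsets) one would pull off a real TokenStream via CharTermAttribute, PositionIncrementAttribute and OffsetAttribute. The point is only that once a field is analyzed, its token stream is a flat sequence that can be serialized and replayed on any replica without re-running the analysis chain.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical post-analysis token: in real Lucene these values would be
// read from the attributes of a TokenStream after analysis has run once.
record Token(String term, int posInc, int startOffset, int endOffset) {}

public class TlogSketch {

    // Serialize one analyzed field into a tlog entry, so replicas can
    // replay the tokens directly instead of re-analyzing the raw text.
    static byte[] writeTokens(List<Token> tokens) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(tokens.size());
            for (Token t : tokens) {
                out.writeUTF(t.term());
                out.writeInt(t.posInc());
                out.writeInt(t.startOffset());
                out.writeInt(t.endOffset());
            }
        }
        return bytes.toByteArray();
    }

    // Replay side: decode the entry back into the token sequence that
    // would be fed straight to the (analysis-free) IndexWriter.
    static List<Token> readTokens(byte[] data) throws IOException {
        List<Token> tokens = new ArrayList<>();
        try (DataInputStream in =
                 new DataInputStream(new ByteArrayInputStream(data))) {
            int n = in.readInt();
            for (int i = 0; i < n; i++) {
                tokens.add(new Token(in.readUTF(), in.readInt(),
                                     in.readInt(), in.readInt()));
            }
        }
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        List<Token> analyzed = List.of(
            new Token("quick", 1, 0, 5),
            new Token("brown", 1, 6, 11));
        byte[] entry = writeTokens(analyzed);
        System.out.println(readTokens(entry).equals(analyzed));
    }
}
```

A real implementation would of course have to cover per-token payloads and custom attributes, which is where the format gets hairy; but for the common attributes the round-trip is this simple, and the replica never needs the analyzer at all.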

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
