[jira] [Comment Edited] (LUCENE-5422) Postings lists deduplication

Vishmi Money (JIRA) Thu, 06 Mar 2014 03:05:23 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-5422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922311#comment-13922311
 ]


Vishmi Money edited comment on LUCENE-5422 at 3/6/14 11:03 AM:
---------------------------------------------------------------

Dmitry Kan, Otis Gospodnetic,
Thank you very much for your explanations; I now have a clear idea of the two 
issues. As new documents are added, segments are merged into the index, but if 
some documents are deleted, we have to keep track of them using skip entries. 
Meanwhile, we have to preserve or improve the performance of the operation. 
That is the area discussed in LUCENE-2082. 
In LUCENE-5422, we want synonyms and exact/inexact terms to point to the same 
postings list, while also providing wildcard support. The main objective is to 
save space. Meanwhile, we also have to avoid index bloat as much as possible. 
LUCENE-5422 relates to LUCENE-2082 because LUCENE-5422 has to deal with segment 
merging in any case. This is my understanding; please let me know if I have 
anything wrong.

Currently I am following the Lucene 4.7.0 documentation and familiarizing 
myself with the source code and coding conventions. I also follow Michael 
McCandless's blog and have read a few related posts, such as "Visualizing 
Lucene's segment merges" and "Building a new Lucene posting format". I also 
started reading the book "Lucene in Action, Second Edition", but then I noticed 
that it covers Lucene 3.0. As Lucene 4.0 switched to a new pluggable codec 
architecture, I wonder whether all of the book's content is still relevant. 
Shall I continue reading it, or should I look only at the documentation for 
Lucene 4.0 and above?


> Postings lists deduplication
> ----------------------------
>
>                 Key: LUCENE-5422
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5422
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/codecs, core/index
>            Reporter: Dmitry Kan
>              Labels: gsoc2014
>
> The context:
> http://markmail.org/thread/tywtrjjcfdbzww6f
> Robert Muir and I discussed what Robert eventually named "postings
> lists deduplication" at the Berlin Buzzwords 2013 conference.
> The idea is to allow multiple terms to point to the same postings list to
> save space. This can be achieved by a new index codec implementation, but
> this jira is open to other ideas as well.
> The application / impact of this is positive for synonyms, exact / inexact
> terms, leading wildcard support via storing reversed terms, etc.
> For example, at the moment, when supporting exact (unstemmed) and inexact
> (stemmed) searches, we store both the unstemmed and the stemmed variant of a
> word form, and that leads to index bloating. For the same index size
> considerations, we had to remove the leading wildcard support that reversed
> each token at index and query time.
> Comment from Mike McCandless:
> Neat idea!
> Would this idea allow a single term to point to (the union of) N other
> posting lists?  It seems like that's necessary e.g. to handle the
> exact/inexact case.
> And then, to produce the Docs/AndPositionsEnum you'd need to do the
> merge sort across those N posting lists?
> Such a thing might also be doable as a runtime-only wrapper around the
> postings API (FieldsProducer), if you could at runtime do the reverse
> expansion (e.g. stem -> all of its surface forms).
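
The union described in the comment above can be sketched as a k-way merge over sorted doc ID lists. This is only an illustration in plain Java, not code against Lucene's postings API: the class name PostingsUnionSketch and the method union are hypothetical, postings are simplified to sorted int arrays of doc IDs, and positions are ignored.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical sketch: union of N sorted postings lists (doc IDs only). */
class PostingsUnionSketch {

    /** Merges N sorted doc ID arrays into one sorted list without duplicates. */
    static List<Integer> union(int[][] lists) {
        // Each queue entry is {listIndex, positionInList}, ordered by the
        // doc ID that entry currently points at.
        PriorityQueue<int[]> queue = new PriorityQueue<>(
            (a, b) -> Integer.compare(lists[a[0]][a[1]], lists[b[0]][b[1]]));
        for (int i = 0; i < lists.length; i++) {
            if (lists[i].length > 0) {
                queue.add(new int[] {i, 0});
            }
        }

        List<Integer> merged = new ArrayList<>();
        while (!queue.isEmpty()) {
            int[] head = queue.poll();
            int doc = lists[head[0]][head[1]];
            // The same document may appear in several lists; emit it once.
            if (merged.isEmpty() || merged.get(merged.size() - 1) != doc) {
                merged.add(doc);
            }
            // Advance the list we just consumed from, if it has more entries.
            if (head[1] + 1 < lists[head[0]].length) {
                queue.add(new int[] {head[0], head[1] + 1});
            }
        }
        return merged;
    }
}
```

A stemmed term could then be served by calling union on the postings of its surface forms. A real implementation would have to merge positions and frequencies as well, which this sketch does not attempt.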
> Comment from Robert Muir:
> I think the exact/inexact is trickier (detecting it would be the hard
> part), and you are right, another solution might work better.
> But for the reverse wildcard and synonyms situation, it seems we could even
> detect it on write if we created some hash of the previous terms' postings.
> If the hash matches for the current term, we know it might be a "duplicate"
> and would have to actually do the costly check that they are the same.
> Maybe there are better ways to do it, but it might be a fun postings format
> experiment to try.
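
The hash-then-verify idea from the last comment can be sketched roughly as follows. This is a simplified illustration in plain Java and not a Lucene PostingsFormat: the class PostingsDedupSketch, its fields, and the addTerm method are hypothetical names, postings are reduced to int arrays of doc IDs, and the hash is kept for all earlier terms in memory, which is an assumption made here for brevity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: detect duplicate postings lists at write time. */
class PostingsDedupSketch {
    // Each distinct postings list is stored exactly once.
    final List<int[]> storedLists = new ArrayList<>();
    // Maps a term to the index of its postings list in storedLists.
    final Map<String, Integer> termToList = new HashMap<>();
    // Maps a postings hash to the stored lists that share that hash.
    private final Map<Integer, List<Integer>> byHash = new HashMap<>();

    void addTerm(String term, int[] docIds) {
        int hash = Arrays.hashCode(docIds);
        List<Integer> candidates =
            byHash.computeIfAbsent(hash, h -> new ArrayList<>());
        for (int idx : candidates) {
            // A matching hash only means "might be a duplicate";
            // the full comparison is the costly check.
            if (Arrays.equals(storedLists.get(idx), docIds)) {
                termToList.put(term, idx); // reuse the existing list
                return;
            }
        }
        // No identical list seen so far: store a new one.
        storedLists.add(docIds.clone());
        int idx = storedLists.size() - 1;
        candidates.add(idx);
        termToList.put(term, idx);
    }
}
```

With this sketch, a term and its reversed form (or two synonyms) that occur in exactly the same documents end up pointing at one stored list. How such sharing would be encoded on disk and kept correct across segment merges is left open here.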



