[jira] [Comment Edited] (LUCENE-5422) Postings lists deduplication

Dmitry Kan (JIRA) Mon, 10 Mar 2014 12:29:18 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-5422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13926081#comment-13926081
 ]


Dmitry Kan edited comment on LUCENE-5422 at 3/10/14 7:27 PM:
-------------------------------------------------------------

I agree with [~mikemccand] that the issue should be better scoped. The case 
of compressing the posting lists of stemmed / non-stemmed terms is quite tricky 
and requires more thought.

One clear case for this issue is storing a reversed term along with its 
original non-reversed version. Both should point to the same posting list 
(subject to some after-stemming-hash-check).

What do you guys think?


was (Author: dmitry_key):
I agree with [~mikemccand] in that the issue should be better scoped. The case 
with compressing stemmed / non-stemmed terms posting lists is quite tricky and 
requires more thought.

One clear case for this issue is storing reversed term along with it is 
original non-reversed version. Both should point to the same posting list 
(subject to some after-stemming-hash-check).

What do you guys think?

> Postings lists deduplication
> ----------------------------
>
>                 Key: LUCENE-5422
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5422
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/codecs, core/index
>            Reporter: Dmitry Kan
>              Labels: gsoc2014
>
> The context:
> http://markmail.org/thread/tywtrjjcfdbzww6f
> Robert Muir and I discussed what Robert eventually named "postings
> lists deduplication" at the Berlin Buzzwords 2013 conference.
> The idea is to allow multiple terms to point to the same postings list to
> save space. This can be achieved by a new index codec implementation, but
> this jira is open to other ideas as well.
> The application / impact of this is positive for synonyms, exact / inexact
> terms, leading wildcard support via storing reversed terms, etc.
> For example, at the moment, when supporting exact (unstemmed) and inexact
> (stemmed) searches, we store both the unstemmed and the stemmed variant of
> a word form, and that leads to index bloat. For the same index size
> reasons, we had to remove leading wildcard support, which relied on
> reversing a token at index and query time.
> Comment from Mike McCandless:
> Neat idea!
> Would this idea allow a single term to point to (the union of) N other
> posting lists?  It seems like that's necessary e.g. to handle the
> exact/inexact case.
> And then, to produce the Docs/AndPositionsEnum you'd need to do the
> merge sort across those N posting lists?
> Such a thing might also be doable as a runtime-only wrapper around the
> postings API (FieldsProducer), if you could at runtime do the reverse
> expansion (e.g. stem -> all of its surface forms).
> Comment from Robert Muir:
> I think the exact/inexact case is trickier (detecting it would be the hard
> part), and you are right, another solution might work better.
> But for the reverse wildcard and synonyms situation, it seems we could even
> detect it on write if we created some hash of the previous term's postings.
> If the hash matches for the current term, we know it might be a "duplicate"
> and would have to actually do the costly check that they are the same.
> Maybe there are better ways to do it, but it might be a fun postingformat
> experiment to try.
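
The write-time check described in the last comment can be sketched in plain
Java as below. This is only a toy model under stated assumptions, not Lucene's
codec or PostingsFormat API: the class PostingsDedupSketch and its methods are
hypothetical names, a postings list is reduced to a sorted int[] of doc IDs
(no frequencies or positions), and the hash is compared against all earlier
lists rather than only the previous term's, since a reversed term is usually
not adjacent to its original in term order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of write-time postings deduplication; hypothetical, not a Lucene API. */
final class PostingsDedupSketch {
    // Each distinct postings list is stored once; terms hold an index into this list.
    private final List<int[]> storedPostings = new ArrayList<>();
    // Hash of a postings list -> indices of stored lists that produced that hash.
    private final Map<Integer, List<Integer>> byHash = new HashMap<>();
    // Term -> index of its (possibly shared) postings list.
    private final Map<String, Integer> termToPostings = new HashMap<>();

    /** Adds a term with its sorted doc IDs; returns true if an existing list was reused. */
    boolean addTerm(String term, int[] docIds) {
        int hash = Arrays.hashCode(docIds);
        List<Integer> candidates = byHash.computeIfAbsent(hash, h -> new ArrayList<>());
        for (int idx : candidates) {
            // A matching hash only means "might be a duplicate";
            // the element-by-element comparison is the costly confirmation.
            if (Arrays.equals(storedPostings.get(idx), docIds)) {
                termToPostings.put(term, idx);
                return true;
            }
        }
        // No identical list seen so far: store a private copy and index it by hash.
        storedPostings.add(docIds.clone());
        int newIdx = storedPostings.size() - 1;
        candidates.add(newIdx);
        termToPostings.put(term, newIdx);
        return false;
    }

    /** Returns the stored postings list for a term, or null if the term is unknown. */
    int[] postings(String term) {
        Integer idx = termToPostings.get(term);
        return idx == null ? null : storedPostings.get(idx);
    }

    /** Number of physically stored postings lists. */
    int storedListCount() {
        return storedPostings.size();
    }
}
```

With this sketch, adding "lucene" and its reversed form "enecul" with the same
doc IDs stores one list and maps both terms to it. A real postings format would
also have to compare frequencies, positions and payloads before sharing a list.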



