[ https://issues.apache.org/jira/browse/LUCENE-5422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13926081#comment-13926081 ]
Dmitry Kan edited comment on LUCENE-5422 at 3/10/14 7:27 PM:
-------------------------------------------------------------

I agree with [~mikemccand] in that the issue should be better scoped. The case with compressing stemmed / non-stemmed terms posting lists is quite tricky and requires more thought.

One clear case for this issue is storing reversed term along with its original non-reversed version. Both should point to the same posting list (subject to some after-stemming-hash-check).

What do you guys think?

was (Author: dmitry_key):
I agree with [~mikemccand] in that the issue should be better scoped. The case with compressing stemmed / non-stemmed terms posting lists is quite tricky and requires more thought.

One clear case for this issue is storing reversed term along with it is original non-reversed version. Both should point to the same posting list (subject to some after-stemming-hash-check).

What do you guys think?

> Postings lists deduplication
> ----------------------------
>
>                 Key: LUCENE-5422
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5422
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/codecs, core/index
>            Reporter: Dmitry Kan
>              Labels: gsoc2014
>
> The context:
> http://markmail.org/thread/tywtrjjcfdbzww6f
> Robert Muir and I have discussed what Robert eventually named "postings
> lists deduplication" at Berlin Buzzwords 2013 conference.
> The idea is to allow multiple terms to point to the same postings list to
> save space. This can be achieved by new index codec implementation, but this
> jira is open to other ideas as well.
> The application / impact of this is positive for synonyms, exact / inexact
> terms, leading wildcard support via storing reversed term etc.
> For example, at the moment, when supporting exact (unstemmed) and inexact
> (stemmed) searches, we store both unstemmed and stemmed variant of a word
> form and that leads to index bloating.
> That is why we had to remove the leading
> wildcard support via reversing a token on index and query time because of
> the same index size considerations.
>
> Comment from Mike McCandless:
> Neat idea!
> Would this idea allow a single term to point to (the union of) N other
> posting lists? It seems like that's necessary e.g. to handle the
> exact/inexact case.
> And then, to produce the Docs/AndPositionsEnum you'd need to do the
> merge sort across those N posting lists?
> Such a thing might also be do-able as runtime only wrapper around the
> postings API (FieldsProducer), if you could at runtime do the reverse
> expansion (e.g. stem -> all of its surface forms).
>
> Comment from Robert Muir:
> I think the exact/inexact is trickier (detecting it would be the hard
> part), and you are right, another solution might work better.
> but for the reverse wildcard and synonyms situation, it seems we could even
> detect it on write if we created some hash of the previous terms postings.
> if the hash matches for the current term, we know it might be a "duplicate"
> and would have to actually do the costly check they are the same.
> maybe there are better ways to do it, but it might be a fun postingformat
> experiment to try.
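
For illustration, the detect-on-write idea from Robert Muir's comment (hash the postings, and on a hash match do the costly equality check) might look roughly like the sketch below. This is plain Java and not a Lucene PostingsFormat: the class PostingsDedupSketch and its methods are hypothetical names, a postings list is reduced to an int[] of doc IDs, and the hash is kept for every list written so far, which is one possible reading of "the previous terms postings".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of hash-then-verify deduplication; not a Lucene API. */
public class PostingsDedupSketch {
    // Distinct postings lists "written" so far, addressed by their index.
    private final List<int[]> stored = new ArrayList<>();
    // Hash of a postings list -> indices of stored lists with that hash.
    private final Map<Integer, List<Integer>> byHash = new HashMap<>();
    // Term -> index of the postings list it points to.
    private final Map<String, Integer> termToList = new HashMap<>();

    /** Adds a term and returns the index of the list it ends up pointing to. */
    public int addTerm(String term, int[] docIds) {
        int hash = Arrays.hashCode(docIds);
        List<Integer> candidates = byHash.computeIfAbsent(hash, h -> new ArrayList<>());
        for (int id : candidates) {
            // A matching hash only means "might be a duplicate":
            // do the costly full comparison before sharing the list.
            if (Arrays.equals(stored.get(id), docIds)) {
                termToList.put(term, id);
                return id;
            }
        }
        // No identical list exists yet, so store a new one.
        int id = stored.size();
        stored.add(docIds.clone());
        candidates.add(id);
        termToList.put(term, id);
        return id;
    }

    /** Returns the doc IDs for a term, or null if the term is unknown. */
    public int[] postings(String term) {
        Integer id = termToList.get(term);
        return id == null ? null : stored.get(id);
    }

    /** Number of distinct postings lists actually stored. */
    public int storedLists() {
        return stored.size();
    }

    public static void main(String[] args) {
        PostingsDedupSketch index = new PostingsDedupSketch();
        index.addTerm("lucene", new int[] {1, 4, 9});
        index.addTerm("enecul", new int[] {1, 4, 9}); // reversed form, same documents
        index.addTerm("solr", new int[] {2, 4});
        System.out.println(index.storedLists()); // 3 terms, 2 stored lists
    }
}
```

The sketch only covers the case where two terms have identical postings (reversed terms, some synonyms). It does not address the exact/inexact case, where a stemmed term would need the union of several lists.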