[ https://issues.apache.org/jira/browse/LUCENE-5422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922553#comment-13922553 ]
Otis Gospodnetic commented on LUCENE-5422:
------------------------------------------

Maybe [~mikemccand] can comment, but I think you are right as far as the Codecs part of Lucene and LIA are concerned.

> Postings lists deduplication
> ----------------------------
>
>          Key: LUCENE-5422
>          URL: https://issues.apache.org/jira/browse/LUCENE-5422
>      Project: Lucene - Core
>   Issue Type: Improvement
>   Components: core/codecs, core/index
>     Reporter: Dmitry Kan
>       Labels: gsoc2014
>
> The context:
> http://markmail.org/thread/tywtrjjcfdbzww6f
> Robert Muir and I have discussed what Robert eventually named "postings
> lists deduplication" at Berlin Buzzwords 2013 conference.
> The idea is to allow multiple terms to point to the same postings list to
> save space. This can be achieved by new index codec implementation, but this
> jira is open to other ideas as well.
> The application / impact of this is positive for synonyms, exact / inexact
> terms, leading wildcard support via storing reversed term etc.
> For example, at the moment, when supporting exact (unstemmed) and inexact
> (stemmed)
> searches, we store both unstemmed and stemmed variant of a word form and
> that leads to index bloating. That is why we had to remove the leading
> wildcard support via reversing a token on index and query time because of
> the same index size considerations.
> Comment from Mike McCandless:
> Neat idea!
> Would this idea allow a single term to point to (the union of) N other
> posting lists? It seems like that's necessary e.g. to handle the
> exact/inexact case.
> And then, to produce the Docs/AndPositionsEnum you'd need to do the
> merge sort across those N posting lists?
> Such a thing might also be do-able as runtime only wrapper around the
> postings API (FieldsProducer), if you could at runtime do the reverse
> expansion (e.g. stem -> all of its surface forms).
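
To make the "merge sort across those N posting lists" question concrete, here is a rough sketch of what such a union could look like. It is only an illustration under stated assumptions: postings are modeled as plain sorted int[] arrays of doc IDs, and the class PostingsUnion and its Cursor helper are hypothetical names, not part of the Lucene API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical helper, not a Lucene class: postings are modeled as sorted int[] doc IDs.
public class PostingsUnion {

    /** Cursor over one postings list whose doc IDs are in ascending order. */
    private static final class Cursor {
        final int[] docs;
        int pos;

        Cursor(int[] docs) {
            this.docs = docs;
        }

        int current() {
            return docs[pos];
        }
    }

    /** Merges N sorted doc-ID lists into one sorted list without duplicate doc IDs. */
    public static List<Integer> union(List<int[]> postingsLists) {
        // The queue always exposes the cursor positioned on the smallest doc ID.
        PriorityQueue<Cursor> queue =
                new PriorityQueue<>((a, b) -> Integer.compare(a.current(), b.current()));
        for (int[] docs : postingsLists) {
            if (docs.length > 0) {
                queue.add(new Cursor(docs));
            }
        }

        List<Integer> merged = new ArrayList<>();
        while (!queue.isEmpty()) {
            Cursor top = queue.poll();
            int doc = top.current();
            // Emit each doc ID once, even if several lists contain it.
            if (merged.isEmpty() || merged.get(merged.size() - 1) != doc) {
                merged.add(doc);
            }
            // Advance the cursor and re-queue it if it has more doc IDs.
            if (++top.pos < top.docs.length) {
                queue.add(top);
            }
        }
        return merged;
    }
}
```

A stemmed term could then be served by calling union over the lists of its surface forms. Positions and frequencies are left out of this sketch.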
> Comment from Robert Muir:
> I think the exact/inexact is trickier (detecting it would be the hard
> part), and you are right, another solution might work better.
> but for the reverse wildcard and synonyms situation, it seems we could even
> detect it on write if we created some hash of the previous terms postings.
> if the hash matches for the current term, we know it might be a "duplicate"
> and would have to actually do the costly check they are the same.
> maybe there are better ways to do it, but it might be a fun postingformat
> experiment to try.
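
The detect-on-write idea from Robert's comment could be sketched roughly as follows. This is a toy model, not a PostingsFormat: postings are plain int[] doc IDs, the class DedupPostingsWriter and its methods are hypothetical names, and the hash is kept for all previously written terms rather than only the immediately preceding one, which is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy writer, not a Lucene PostingsFormat.
public class DedupPostingsWriter {
    // Each distinct postings list is stored once; its index is its "slot".
    private final List<int[]> storedPostings = new ArrayList<>();
    // Cheap first-pass filter: hash of a postings list -> slots with that hash.
    private final Map<Integer, List<Integer>> slotsByHash = new HashMap<>();
    // Several terms may point to the same slot.
    private final Map<String, Integer> termToSlot = new HashMap<>();

    /** Writes a term and returns the slot its postings ended up in. */
    public int addTerm(String term, int[] docs) {
        int hash = Arrays.hashCode(docs);
        List<Integer> candidates = slotsByHash.computeIfAbsent(hash, h -> new ArrayList<>());
        for (int slot : candidates) {
            // A hash match only means "maybe a duplicate": do the costly full check.
            if (Arrays.equals(storedPostings.get(slot), docs)) {
                termToSlot.put(term, slot);
                return slot;
            }
        }
        // No identical list seen so far: store a new copy.
        int slot = storedPostings.size();
        storedPostings.add(docs.clone());
        candidates.add(slot);
        termToSlot.put(term, slot);
        return slot;
    }

    /** Returns the shared postings for a term, or null if the term is unknown. */
    public int[] postings(String term) {
        Integer slot = termToSlot.get(term);
        return slot == null ? null : storedPostings.get(slot);
    }

    /** Number of postings lists physically stored. */
    public int storedListCount() {
        return storedPostings.size();
    }
}
```

With this, a term and its reversed form (or two synonyms) that hit exactly the same documents share one stored list, while the full equality check guards against hash collisions.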