+1 to opening an issue, thanks for exploring this! It's hairy :)

Your Windows test failures complaining about FSTOrd50 missing are curious ... I don't run Windows, but maybe someone who does has an idea? That postings format comes from lucene/codecs, which should be on the class path during tests...

Mike McCandless
http://blog.mikemccandless.com

On Wed, Jun 17, 2015 at 10:21 PM, Robert Muir <rcm...@gmail.com> wrote:
> Hey, thanks for tackling this! That synonymfilter is a beast...
>
> Can you open a JIRA issue with your patch?
>
> To me the interesting part is this change in the test:
>
>     if (posInc > 0) {
>       // This token increments position, so it is starting a new position.
>       // Its position is the last position plus the posLength of the
>       // last token that started a position.
>       pos += lastPosLength;
>       lastPosLength = posLength;
>     }
>
> This currently implies some change to how posInc/posLen are treated on
> the consumer side: it would need changes to queryparsers and
> indexwriter to work (which is fine, we could figure out those
> semantics). But it's my understanding this logic might be based on some
> properties specific to synonymfilter being greedy, and not really
> general to all streams. So maybe synonymfilter or some other filter
> needs to do this adjustment internally instead.
>
> Anyway, I think we should make an issue and investigate it.
>
> On Wed, Jun 17, 2015 at 9:56 PM, Ian <ianri...@hotmail.com> wrote:
>> Hello,
>>
>> Some time ago, I had a problem with synonyms and phrase type queries
>> (actually, it was elasticsearch and I was using a match query with multiple
>> terms and the "and" operator, as better explained here:
>> https://github.com/elastic/elasticsearch/issues/10394).
>>
>> That issue led to some work on Lucene:
>> https://issues.apache.org/jira/browse/LUCENE-6400 (where I helped a little
>> with tests) and https://issues.apache.org/jira/browse/LUCENE-6401. This
>> issue is also related to https://issues.apache.org/jira/browse/LUCENE-3843.
>>
>> Starting from the discussion on LUCENE-6400, I'm attempting to implement a
>> solution. Here is a patch with a first step - the implementation to fix
>> "SynFilter to be able to 'make positions'" (as was mentioned on the issue).
>> In this way, the synonym filter generates a correct (or, at least, better)
>> graph.
>>
>> As the synonym matching is greedy, I only had to worry about fixing the
>> position length of the rules of the current match; no future or past
>> synonyms would "span" over this match (please correct me if I'm wrong!). It
>> did require more buffering, twice as much.
>>
>> The new behavior I added is not active by default; a new parameter has to be
>> passed in a new constructor for SynonymFilter. The changes I made do change
>> the token stream generated by the synonym filter, and I thought it would be
>> better to let that be a voluntary decision for now.
>>
>> I did some refactoring on the code, but mostly on what I had to change for
>> my implementation, so that the patch was not too hard to read. I created
>> specific unit tests for the new implementation (TestMultiWordSynonymFilter)
>> that should show how things will be with the new behavior.
>>
>> Speaking of tests, I ran "analysis-common" tests locally (Windows 8, Java
>> 8), and had only 2 unrelated failures (as far as I can tell) complaining of
>> missing PostingsFormat "FSTOrd50".
>>
>> Thanks for any help, comment, adjustment on the patch. I'll do my best to
>> make the necessary adjustments.
>>
>> Please forgive me if I did not follow any rule, of the code or of the list,
>> and I would be grateful to be able to learn from my mistakes.
>>
>> Regards,
>> Ian
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
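
For anyone following the posInc/posLength discussion, here is a minimal standalone sketch (plain Java, no Lucene dependency) of the rule in the snippet Robert quotes from the test, next to the usual consumer-side rule where a token's position is the running sum of posInc. It is not code from the patch: the class name PositionSketch, the Tok holder, the starting values of pos and lastPosLength, and the three sample tokens are all assumptions made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PositionSketch {
    /** Hypothetical stand-in for a token's term, posInc and posLength attributes. */
    static final class Tok {
        final String term;
        final int posInc;
        final int posLength;

        Tok(String term, int posInc, int posLength) {
            this.term = term;
            this.posInc = posInc;
            this.posLength = posLength;
        }
    }

    /**
     * Rule from the quoted test snippet: a token with posInc > 0 starts a new
     * position located after the full span of the last token that started one.
     * Returns one {startNode, endNode} pair per token.
     */
    static List<int[]> quotedRule(List<Tok> tokens) {
        List<int[]> spans = new ArrayList<>();
        int pos = -1;          // assumption: chosen so the first token starts at node 0
        int lastPosLength = 1; // assumption: the initial value is not shown in the thread
        for (Tok t : tokens) {
            if (t.posInc > 0) {
                pos += lastPosLength;
                lastPosLength = t.posLength;
            }
            spans.add(new int[] {pos, pos + t.posLength});
        }
        return spans;
    }

    /** Conventional rule: a token's position is the running sum of posInc. */
    static List<int[]> conventionalRule(List<Tok> tokens) {
        List<int[]> spans = new ArrayList<>();
        int pos = -1;
        for (Tok t : tokens) {
            pos += t.posInc;
            spans.add(new int[] {pos, pos + t.posLength});
        }
        return spans;
    }

    public static void main(String[] args) {
        // Made-up stream: "a" spans two positions, "b" is stacked on it, "c" follows.
        List<Tok> tokens = Arrays.asList(
            new Tok("a", 1, 2), new Tok("b", 0, 1), new Tok("c", 1, 1));
        List<int[]> quoted = quotedRule(tokens);
        List<int[]> conventional = conventionalRule(tokens);
        for (int i = 0; i < tokens.size(); i++) {
            System.out.println(tokens.get(i).term
                + " quoted=" + Arrays.toString(quoted.get(i))
                + " conventional=" + Arrays.toString(conventional.get(i)));
        }
    }
}
```

With these made-up tokens, "c" starts at node 2 under the quoted rule but at node 1 under the running-sum rule. That gap is one way to see why a consumer that only sums posInc would read the same stream differently. Note also that the quoted rule only checks whether posInc is greater than zero, so in this sketch the size of the increment is ignored.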