Re: Initial work on multi word synonyms and phrase queries

Michael McCandless Fri, 19 Jun 2015 14:44:31 -0700

Ahh, thanks for bringing closure.

Others do successfully run tests from IntelliJ, I think, so I'm not
sure why you see intermittent issues...



Mike McCandless

http://blog.mikemccandless.com


On Fri, Jun 19, 2015 at 5:10 PM, Ian <ianri...@hotmail.com> wrote:
> The problem with the tests was actually caused by the IDE (IntelliJ).
> Running the tests with ant directly works just fine. Just thought I would
> have this registered for the record.
>
> ________________________________
> From: ianri...@hotmail.com
> To: dev@lucene.apache.org
> Subject: RE: Initial work on multi word synonyms and phrase queries
> Date: Thu, 18 Jun 2015 11:53:23 +0000
>
>
> Issue opened: https://issues.apache.org/jira/browse/LUCENE-6582.
>
> @rcmuir, that change on the test is actually a leftover from one of my
> previous solutions while exploring the problem. It is no longer necessary
> and I removed it from the patch added to the issue above.
>
> To explain a little, in an earlier solution, the current inputs were always
> the first tokens on the output, even if there were longer synonyms (in
> number of terms). That created an inconsistency between position increments
> and position lengths, as I wasn't sure I could have a position increment
> greater than 1. So I changed it to have the first tokens, the ones that
> actually increment the positions, come from the longer synonym. In this way,
> the token stream has the same behavior as before: whenever the position
> increment is 1, the position length is also 1. But that means that, when
> keepOriginal = true and there are synonyms with more terms than the input,
> the original input (tokens with type="word") will come, on the output
> stream, "stacked" on top of synonym tokens. This seemed to me less likely
> to impact elsewhere.
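
A minimal sketch of the stacking described above, assuming the usual
reading of the two attributes: a token's position is the previous position
plus its position increment, and it spans position-length positions from
there. The Tok class, the "dns" / "domain name service" rule and the exact
token order are hypothetical illustrations, not taken from the patch.

```java
import java.util.ArrayList;
import java.util.List;

class StackedTokensSketch {
    /** Hypothetical stand-in for a token: term text plus the two position attributes. */
    static final class Tok {
        final String term;
        final int posInc;
        final int posLen;

        Tok(String term, int posInc, int posLen) {
            this.term = term;
            this.posInc = posInc;
            this.posLen = posLen;
        }
    }

    /** Renders each token as "term[start-end]" under the usual position rules. */
    static List<String> spans(List<Tok> stream) {
        List<String> out = new ArrayList<>();
        int pos = -1; // one before the first position, so the first increment lands on 0
        for (Tok t : stream) {
            pos += t.posInc;
            out.add(t.term + "[" + pos + "-" + (pos + t.posLen) + "]");
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed output for input "dns" with keepOriginal = true:
        // the longer synonym advances the positions, the original is stacked on it.
        List<Tok> stream = new ArrayList<>();
        stream.add(new Tok("domain", 1, 1));
        stream.add(new Tok("dns", 0, 3)); // original input, same start, spans all three
        stream.add(new Tok("name", 1, 1));
        stream.add(new Tok("service", 1, 1));
        System.out.println(spans(stream));
    }
}
```

In this sketch every token with a position increment of 1 also has a
position length of 1; only the stacked original (increment 0) carries a
longer length.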
>
> Glad to hear you also deem that code complicated. I was assuming it was hard
> for me because I'm a beginner on the code base ;-)
>
> About the failing tests: in my setup, they are flaky. Sometimes passing,
> sometimes failing, and not always the same ones. But always complaining of
> missing postings formats (last time it was 'FST50'). I'll look around a
> little more to see if I can figure out what's wrong.
>
> Ian
>
>> From: luc...@mikemccandless.com
>> Date: Thu, 18 Jun 2015 06:02:09 -0400
>> Subject: Re: Initial work on multi word synonyms and phrase queries
>> To: dev@lucene.apache.org; ianri...@hotmail.com
>>
>> +1 to opening an issue, thanks for exploring this! It's hairy :)
>>
>> Your Windows test failures complaining about FSTOrd50 missing are
>> curious ... I don't run Windows but maybe someone who does has an
>> idea? That postings format comes from lucene/codecs which should be
>> on the class path during tests...
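
One way to check the class path from inside the IDE, as a minimal sketch:
as far as I know, Lucene discovers postings formats through Java's
service-provider files (META-INF/services/org.apache.lucene.codecs.PostingsFormat),
so listing what those files register shows whether lucene/codecs is
visible to the test run. The class and method names below are hypothetical
and only the JDK is used.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;

class SpiRegistrationCheck {
    /** Lists the implementation class names registered for a service interface. */
    static List<String> registered(String serviceInterface) throws IOException {
        List<String> names = new ArrayList<>();
        ClassLoader cl = Thread.currentThread().getContextClassLoader();
        if (cl == null) {
            cl = ClassLoader.getSystemClassLoader();
        }
        // Every jar or directory on the class path may carry its own copy of this file.
        Enumeration<URL> files = cl.getResources("META-INF/services/" + serviceInterface);
        while (files.hasMoreElements()) {
            URL url = files.nextElement();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = in.readLine()) != null) {
                    int hash = line.indexOf('#'); // strip trailing comments
                    if (hash >= 0) {
                        line = line.substring(0, hash);
                    }
                    line = line.trim();
                    if (!line.isEmpty()) {
                        names.add(line);
                    }
                }
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        for (String name : registered("org.apache.lucene.codecs.PostingsFormat")) {
            System.out.println(name);
        }
    }
}
```

If no class from the lucene/codecs module shows up when this runs under
the IDE's test class path, the module (or its resources directory) is
probably missing there. With Lucene itself on the class path,
PostingsFormat.availablePostingsFormats() should give the registered names
directly.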
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>> On Wed, Jun 17, 2015 at 10:21 PM, Robert Muir <rcm...@gmail.com> wrote:
>> > Hey, thanks for tackling this! That synonymfilter is a beast...
>> >
>> > Can you open a JIRA issue with your patch?
>> >
>> > To me the interesting part is this change in the test:
>> >
>> > if (posInc > 0) {
>> >   // This token increments position, so it is starting a new position.
>> >   // Its position is the last position plus the posLength of the
>> >   // last token that started a position.
>> >   pos += lastPosLength;
>> >   lastPosLength = posLength;
>> > }
>> >
>> > This currently implies some change to how posInc/posLen are treated on
>> > the consumer side: it would need changes to queryparsers and
>> > indexwriter to work (which is fine, we could figure out those
>> > semantics). But it's my understanding this logic might be based on some
>> > properties specific to synonymfilter being greedy, and not really
>> > general to all streams. So maybe synonymfilter or some other filter
>> > needs to do this adjustment internally instead.
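
To make the consumer-side difference concrete, here is a minimal sketch
that contrasts the usual rule (position = previous position + posInc) with
the rule in the quoted test snippet. The class name, the starting values
(pos = -1, lastPosLength = 1) and the two-token stream are assumptions for
illustration, not taken from the patch or its test.

```java
import java.util.Arrays;

class PositionRules {
    /** Usual rule: each token's position is the previous position plus its increment. */
    static int[] usual(int[] posInc) {
        int[] out = new int[posInc.length];
        int pos = -1;
        for (int i = 0; i < posInc.length; i++) {
            pos += posInc[i];
            out[i] = pos;
        }
        return out;
    }

    /** Rule from the quoted snippet: a token that starts a new position is placed
     *  after the full length of the last token that started a position. */
    static int[] lengthAware(int[] posInc, int[] posLen) {
        int[] out = new int[posInc.length];
        int pos = -1;          // assumed start, so the first token lands on 0
        int lastPosLength = 1; // assumed start value
        for (int i = 0; i < posInc.length; i++) {
            if (posInc[i] > 0) {
                pos += lastPosLength;
                lastPosLength = posLen[i];
            }
            out[i] = pos;
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical stream: token A (posInc=1, posLen=2), then token B (posInc=1, posLen=1).
        int[] posInc = {1, 1};
        int[] posLen = {2, 1};
        System.out.println(Arrays.toString(usual(posInc)));
        System.out.println(Arrays.toString(lengthAware(posInc, posLen)));
    }
}
```

Under the usual rule B sits at position 1, inside A's span; under the
snippet's rule B sits at position 2, after it. A consumer that only knows
the usual rule would therefore read the same stream differently.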
>> >
>> > Anyway, I think we should make an issue and investigate it.
>> >
>> > On Wed, Jun 17, 2015 at 9:56 PM, Ian <ianri...@hotmail.com> wrote:
>> >> Hello,
>> >>
>> >> Some time ago, I had a problem with synonyms and phrase type queries
>> >> (actually, it was elasticsearch and I was using a match query with
>> >> multiple terms and the "and" operator, as better explained here:
>> >> https://github.com/elastic/elasticsearch/issues/10394).
>> >>
>> >> That issue led to some work on Lucene:
>> >> https://issues.apache.org/jira/browse/LUCENE-6400 (where I helped a
>> >> little with tests) and https://issues.apache.org/jira/browse/LUCENE-6401.
>> >> This issue is also related to
>> >> https://issues.apache.org/jira/browse/LUCENE-3843.
>> >>
>> >> Starting from the discussion on LUCENE-6400, I'm attempting to
>> >> implement a solution. Here is a patch with a first step - the
>> >> implementation to fix "SynFilter to be able to 'make positions'" (as
>> >> was mentioned on the issue). In this way, the synonym filter generates
>> >> a correct (or, at least, better) graph.
>> >>
>> >> As the synonym matching is greedy, I only had to worry about fixing the
>> >> position length of the rules of the current match: no future or past
>> >> synonyms would "span" over this match (please correct me if I'm
>> >> wrong!). It did require more buffering, twice as much.
>> >>
>> >> The new behavior I added is not active by default; a new parameter has
>> >> to be passed in a new constructor for SynonymFilter. The changes I made
>> >> do change the token stream generated by the synonym filter, and I
>> >> thought it would be better to let that be a voluntary decision for now.
>> >>
>> >> I did some refactoring on the code, but mostly on what I had to change
>> >> for my implementation, so that the patch was not too hard to read. I
>> >> created specific unit tests for the new implementation
>> >> (TestMultiWordSynonymFilter) that should show how things will be with
>> >> the new behavior.
>> >>
>> >> Speaking of tests, I ran "analysis-common" tests locally (Windows 8,
>> >> Java 8), and had only 2 unrelated failures (as far as I can tell)
>> >> complaining of missing PostingsFormat "FSTOrd50".
>> >>
>> >> Thanks for any help, comment, adjustment on the patch. I'll do my best
>> >> to make the necessary adjustments.
>> >>
>> >> Please forgive me if I did not follow any rule, of the code or of the
>> >> list, and I would be grateful to be able to learn from my mistakes.
>> >>
>> >> Regards,
>> >> Ian
>> >>
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> >> For additional commands, e-mail: dev-h...@lucene.apache.org