RE: Initial work on multi word synonyms and phrase queries

Ian Tue, 23 Jun 2015 05:13:21 -0700

Thanks for the information. That workflow seems good: if it works from ant, 
it's ok.
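
For anyone comparing an IDE run with an ant run, here is a minimal diagnostic sketch, not something from the patch. It assumes the "missing postings format" errors come from the lucene/codecs classes not being visible to the test classpath, which nobody in this thread confirmed. It relies on the fact that Lucene registers postings formats through service files named META-INF/services/org.apache.lucene.codecs.PostingsFormat; the class and method names below (SpiClasspathCheck, parseProviders, registeredProviders) are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

public class SpiClasspathCheck {
    // Service file through which Lucene postings formats are registered.
    public static final String SERVICE_FILE =
        "META-INF/services/org.apache.lucene.codecs.PostingsFormat";

    /** Extracts provider class names from service-file lines, dropping '#' comments and blanks. */
    public static List<String> parseProviders(List<String> lines) {
        List<String> providers = new ArrayList<>();
        for (String line : lines) {
            int hash = line.indexOf('#');
            if (hash >= 0) {
                line = line.substring(0, hash);
            }
            line = line.trim();
            if (!line.isEmpty()) {
                providers.add(line);
            }
        }
        return providers;
    }

    /** Collects provider class names from every copy of the service file the loader can see. */
    public static List<String> registeredProviders(ClassLoader loader, String serviceFile)
            throws IOException {
        List<String> all = new ArrayList<>();
        for (URL url : Collections.list(loader.getResources(serviceFile))) {
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
                all.addAll(parseProviders(reader.lines().collect(Collectors.toList())));
            }
        }
        return all;
    }

    public static void main(String[] args) throws IOException {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        List<String> providers = registeredProviders(loader, SERVICE_FILE);
        System.out.println("java.class.path = " + System.getProperty("java.class.path"));
        System.out.println(providers.size() + " postings format provider(s) registered:");
        for (String provider : providers) {
            System.out.println("  " + provider);
        }
    }
}
```

Running this under each launcher and comparing the two provider lists should show whether one of them is missing the codecs entries. It only lists what is registered; it does not load or validate the classes.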


> Date: Sat, 20 Jun 2015 09:19:35 -0700
> Subject: Re: Initial work on multi word synonyms and phrase queries
> From: [email protected]
> To: [email protected]
> 
> I have had both things happen: tests that run fine in IntelliJ fail
> with ant, and vice versa. Not often, but occasionally. If it passes when
> run from ant I consider it done. I've never dug too far into that
> anomaly, but I've guessed it may be related to temp directory
> handling.
> 
> FWIW
> 
> On Fri, Jun 19, 2015 at 2:43 PM, Michael McCandless
> <[email protected]> wrote:
> > Ahh, thanks for bringing closure.
> >
> > Others do successfully run tests from Intellij, I think, so I'm not
> > sure why you see intermittent issues...
> >
> >
> > Mike McCandless
> >
> > http://blog.mikemccandless.com
> >
> >
> > On Fri, Jun 19, 2015 at 5:10 PM, Ian <[email protected]> wrote:
> >> The problem with the tests was actually caused by the IDE (IntelliJ).
> >> Running the tests with ant directly works just fine. Just thought I would
> >> put this on the record.
> >>
> >> ________________________________
> >> From: [email protected]
> >> To: [email protected]
> >> Subject: RE: Initial work on multi word synonyms and phrase queries
> >> Date: Thu, 18 Jun 2015 11:53:23 +0000
> >>
> >>
> >> Issue opened: https://issues.apache.org/jira/browse/LUCENE-6582.
> >>
> >> @rcmuir, that change on the test is actually a leftover from one of my
> >> previous solutions while exploring the problem. It is no longer necessary
> >> and I removed it from the patch added to the issue above.
> >>
> >> To explain a little: in an earlier solution, the current inputs were always
> >> the first tokens on the output, even if there were longer synonyms (in
> >> number of terms). That created an inconsistency between position increments
> >> and position lengths, as I wasn't sure I could have a position increment
> >> greater than 1. So I changed it so that the first tokens, the ones that
> >> actually increment the positions, come from the longer synonym. This way,
> >> the token stream has the same behavior as before: whenever the position
> >> increment is 1, the position length is also 1. But that means that, when
> >> keepOriginal = true and there are synonyms with more terms than the input,
> >> the original input (tokens with type="word") will come out, on the output
> >> stream, "stacked" on top of synonym tokens. This seemed to me less likely
> >> to have an impact elsewhere.
> >>
> >> Glad to hear you also find that code complicated. I was assuming it was
> >> hard for me because I'm a beginner on the code base ;-)
> >>
> >> About the failing tests: in my setup, they are flaky. Sometimes passing,
> >> sometimes failing, and not always the same ones. But they always complain
> >> of missing postings formats (last time it was 'FST50'). I'll look around a
> >> little more to see if I can figure out what's wrong.
> >>
> >> Ian
> >>
> >>> From: [email protected]
> >>> Date: Thu, 18 Jun 2015 06:02:09 -0400
> >>> Subject: Re: Initial work on multi word synonyms and phrase queries
> >>> To: [email protected]; [email protected]
> >>>
> >>> +1 to opening an issue, thanks for exploring this! It's hairy :)
> >>>
> >>> Your Windows test failures complaining about FSTOrd50 missing are
> >>> curious ... I don't run Windows but maybe someone who does has an
> >>> idea? That postings format comes from lucene/codecs which should be
> >>> on the class path during tests...
> >>>
> >>> Mike McCandless
> >>>
> >>> http://blog.mikemccandless.com
> >>>
> >>> On Wed, Jun 17, 2015 at 10:21 PM, Robert Muir <[email protected]> wrote:
> >>> > Hey, thanks for tackling this! That synonymfilter is a beast...
> >>> >
> >>> > Can you open a JIRA issue with your patch?
> >>> >
> >>> > To me the interesting part is this change in the test:
> >>> >
> >>> >   if (posInc > 0) {
> >>> >     // This token increments position, so it is starting a new position.
> >>> >     // Its position is the last position plus the posLength of the
> >>> >     // last token that started a position.
> >>> >     pos += lastPosLength;
> >>> >     lastPosLength = posLength;
> >>> >   }
> >>> >
> >>> > This currently implies some change to how posInc/posLen are treated on
> >>> > the consumer side: it would need changes to queryparsers and
> >>> > indexwriter to work (which is fine, we could figure out those
> >>> > semantics). But it's my understanding this logic might be based on some
> >>> > properties specific to synonymfilter being greedy, and not really
> >>> > general to all streams. So maybe synonymfilter or some other filter
> >>> > needs to do this adjustment internally instead.
> >>> >
> >>> > Anyway, I think we should make an issue and investigate it.
> >>> >
> >>> > On Wed, Jun 17, 2015 at 9:56 PM, Ian <[email protected]> wrote:
> >>> >> Hello,
> >>> >>
> >>> >> Some time ago, I had a problem with synonyms and phrase type queries
> >>> >> (actually, it was elasticsearch and I was using a match query with
> >>> >> multiple terms and the "and" operator, as better explained here:
> >>> >> https://github.com/elastic/elasticsearch/issues/10394).
> >>> >>
> >>> >> That issue led to some work on Lucene:
> >>> >> https://issues.apache.org/jira/browse/LUCENE-6400 (where I helped a
> >>> >> little with tests) and
> >>> >> https://issues.apache.org/jira/browse/LUCENE-6401. This issue is also
> >>> >> related to https://issues.apache.org/jira/browse/LUCENE-3843.
> >>> >>
> >>> >> Starting from the discussion on LUCENE-6400, I'm attempting to
> >>> >> implement a solution. Here is a patch with a first step - the
> >>> >> implementation to fix "SynFilter to be able to 'make positions'" (as
> >>> >> was mentioned on the issue). In this way, the synonym filter generates
> >>> >> a correct (or, at least, better) graph.
> >>> >>
> >>> >> As the synonym matching is greedy, I only had to worry about fixing the
> >>> >> position length of the rules of the current match; no future or past
> >>> >> synonyms would "span" over this match (please correct me if I'm
> >>> >> wrong!). It did require more buffering, twice as much.
> >>> >>
> >>> >> The new behavior I added is not active by default; a new parameter has
> >>> >> to be passed in a new constructor for SynonymFilter. The changes I made
> >>> >> do change the token stream generated by the synonym filter, and I
> >>> >> thought it would be better to let that be a voluntary decision for now.
> >>> >>
> >>> >> I did some refactoring on the code, but mostly on what I had to change
> >>> >> for my implementation, so that the patch was not too hard to read. I
> >>> >> created specific unit tests for the new implementation
> >>> >> (TestMultiWordSynonymFilter) that should show how things will be with
> >>> >> the new behavior.
> >>> >>
> >>> >> Speaking of tests, I ran "analysis-common" tests locally (Windows 8,
> >>> >> Java 8), and had only 2 unrelated failures (as far as I can tell)
> >>> >> complaining of missing PostingsFormat "FSTOrd50".
> >>> >>
> >>> >> Thanks for any help, comment, adjustment on the patch. I'll do my best
> >>> >> to make the necessary adjustments.
> >>> >>
> >>> >> Please forgive me if I did not follow any rule, of the code or of the
> >>> >> list, and I would be grateful to be able to learn from my mistakes.
> >>> >>
> >>> >> Regards,
> >>> >> Ian
> >>> >>
> >>> >>
> >>> >> ---------------------------------------------------------------------
> >>> >> To unsubscribe, e-mail: [email protected]
> >>> >> For additional commands, e-mail: [email protected]
> >>> >
>
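
The posInc/posLength discussion in the thread can be made concrete with a small sketch. This is not code from the patch: Tok and arcs are hypothetical names, and the sample stream is only my reading of the "stacked" output described above for keepOriginal = true (a one-term input "dns" with a three-term synonym). It uses the usual consumer-side arithmetic, where a token's position is the running sum of position increments and it ends posLength positions later.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PositionGraphSketch {
    /** Hypothetical stand-in for a token's term, position increment and position length. */
    public static final class Tok {
        final String term;
        final int posInc;
        final int posLen;

        public Tok(String term, int posInc, int posLen) {
            this.term = term;
            this.posInc = posInc;
            this.posLen = posLen;
        }
    }

    /** Turns a token stream into "term:start-end" arcs of the position graph. */
    public static List<String> arcs(List<Tok> stream) {
        List<String> out = new ArrayList<>();
        int pos = -1; // so that a first token with posInc = 1 lands on position 0
        for (Tok t : stream) {
            pos += t.posInc;           // posInc = 0 stacks the token on the previous position
            int end = pos + t.posLen;  // posLen > 1 lets the token span several positions
            out.add(t.term + ":" + pos + "-" + end);
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed output shape: synonym terms advance the position,
        // the original "dns" is stacked on the first one and spans all three.
        List<Tok> stream = Arrays.asList(
            new Tok("domain", 1, 1),
            new Tok("dns", 0, 3),
            new Tok("name", 1, 1),
            new Tok("service", 1, 1));
        System.out.println(arcs(stream));
        // prints [domain:0-1, dns:0-3, name:1-2, service:2-3]
    }
}
```

In this reading, every token with a position increment of 1 also has a position length of 1, and only the stacked original token carries a longer position length, which matches the invariant described in the thread.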
