Re: Initial work on multi word synonyms and phrase queries

Erick Erickson Sat, 20 Jun 2015 09:20:04 -0700

I have had both things happen: tests that run fine in IntelliJ fail
with ant, and vice versa. Not often, but occasionally. If a test passes
when run from ant, I consider it done. I've never dug too far into that
anomaly, but my guess is that it's related to temp directory
handling.


FWIW
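
Since the failures discussed further down the thread complain about a missing postings format, one thing that can be compared between an IDE run and an ant run is which SPI registration files are visible on the classpath. The following is only a minimal sketch under stated assumptions: the class and method names (SpiClasspathCheck, visibleRegistrations) are hypothetical, and it assumes that PostingsFormat implementations are registered through the standard META-INF/services file named after the org.apache.lucene.codecs.PostingsFormat class. It uses only the JDK, so it says nothing about why an entry might be missing, only whether it is visible.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;

final class SpiClasspathCheck {
    // Assumed location of the SPI registration file for postings formats.
    static final String SERVICE_FILE =
        "META-INF/services/org.apache.lucene.codecs.PostingsFormat";

    /** Returns one "url -> implementation class" entry per registration visible to the loader. */
    static List<String> visibleRegistrations(ClassLoader loader) {
        List<String> found = new ArrayList<>();
        try {
            Enumeration<URL> files = loader.getResources(SERVICE_FILE);
            while (files.hasMoreElements()) {
                URL url = files.nextElement();
                try (BufferedReader in = new BufferedReader(
                        new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
                    String line;
                    while ((line = in.readLine()) != null) {
                        // Service files allow '#' comments; drop them and blank lines.
                        int hash = line.indexOf('#');
                        if (hash >= 0) {
                            line = line.substring(0, hash);
                        }
                        line = line.trim();
                        if (!line.isEmpty()) {
                            found.add(url + " -> " + line);
                        }
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return found;
    }

    public static void main(String[] args) {
        // Prints nothing if no registration file is on the classpath.
        for (String entry : visibleRegistrations(ClassLoader.getSystemClassLoader())) {
            System.out.println(entry);
        }
    }
}
```

Running this once from the IDE and once from the ant classpath and diffing the output would show whether the jar or directory from lucene/codecs is present in one setup and absent in the other.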

On Fri, Jun 19, 2015 at 2:43 PM, Michael McCandless
<[email protected]> wrote:
> Ahh, thanks for bringing closure.
>
> Others do successfully run tests from Intellij, I think, so I'm not
> sure why you see intermittent issues...
>
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Fri, Jun 19, 2015 at 5:10 PM, Ian <[email protected]> wrote:
>> The problem with the tests was actually caused by the IDE (IntelliJ).
>> Running the tests with ant directly works just fine. Just thought I would
>> have this registered for the record.
>>
>> ________________________________
>> From: [email protected]
>> To: [email protected]
>> Subject: RE: Initial work on multi word synonyms and phrase queries
>> Date: Thu, 18 Jun 2015 11:53:23 +0000
>>
>>
>> Issue opened: https://issues.apache.org/jira/browse/LUCENE-6582.
>>
>> @rcmuir, that change on the test is actually a leftover from one of my
>> previous solutions while exploring the problem. It is no longer necessary
>> and I removed it from the patch added to the issue above.
>>
>> To explain a little, in an earlier solution, the current inputs were always
>> the first tokens on the output, even if there were longer synonyms (in
>> number of terms). That created an inconsistency between position increments
>> and position lengths, as I wasn't sure I could have a position increment
>> greater than 1. So I changed it to have the first tokens, the ones that
>> actually increment the positions, come from the longer synonym. In this way,
>> the token stream has the same behavior as before: whenever the position
>> increment is 1, the position length is also 1. But that means that, when
>> keepOriginal = true and there are synonyms with more terms than the input,
>> the original input (tokens with type="word") will appear in the output
>> stream "stacked" on top of the synonym tokens. This seemed to me less likely
>> to have an impact elsewhere.
>>
>> Glad to hear you also deem that code complicated. I was assuming it was hard
>> for me because I'm a beginner on the code base ;-)
>>
>> About the failing tests: in my setup, they are flaky. They sometimes pass
>> and sometimes fail, and not always the same ones, but they always complain of
>> missing postings formats (last time it was 'FST50'). I'll look around a
>> little more to see if I can figure out what's wrong.
>>
>> Ian
>>
>>> From: [email protected]
>>> Date: Thu, 18 Jun 2015 06:02:09 -0400
>>> Subject: Re: Initial work on multi word synonyms and phrase queries
>>> To: [email protected]; [email protected]
>>>
>>> +1 to opening an issue, thanks for exploring this! It's hairy :)
>>>
>>> Your Windows test failures complaining about FSTOrd50 missing are
>>> curious ... I don't run Windows but maybe someone who does has an
>>> idea? That postings format comes from lucene/codecs which should be
>>> on the class path during tests...
>>>
>>> Mike McCandless
>>>
>>> http://blog.mikemccandless.com
>>>
>>> On Wed, Jun 17, 2015 at 10:21 PM, Robert Muir <[email protected]> wrote:
>>> > Hey, thanks for tackling this! That synonymfilter is a beast...
>>> >
>>> > Can you open a JIRA issue with your patch?
>>> >
>>> > To me the interesting part is this change in the test:
>>> >
>>> > if (posInc > 0) {
>>> >   // This token increments position, so it is starting a new position.
>>> >   // Its position is the last position plus the posLength of the
>>> >   // last token that started a position.
>>> >   pos += lastPosLength;
>>> >   lastPosLength = posLength;
>>> > }
>>> >
>>> > This currently implies some change to how posInc/posLen are treated on
>>> > the consumer side: it would need changes to queryparsers and
>>> > indexwriter to work (which is fine, we could figure out those
>>> > semantics). But it's my understanding this logic might be based on some
>>> > properties specific to synonymfilter being greedy, and not really
>>> > general to all streams. So maybe synonymfilter or some other filter
>>> > needs to do this adjustment internally instead.
>>> >
>>> > Anyway, I think we should make an issue and investigate it.
>>> >
>>> > On Wed, Jun 17, 2015 at 9:56 PM, Ian <[email protected]> wrote:
>>> >> Hello,
>>> >>
>>> >> Some time ago, I had a problem with synonyms and phrase type queries
>>> >> (actually, it was elasticsearch and I was using a match query with multiple
>>> >> terms and the "and" operator, as better explained here:
>>> >> https://github.com/elastic/elasticsearch/issues/10394).
>>> >>
>>> >> That issue led to some work on Lucene:
>>> >> https://issues.apache.org/jira/browse/LUCENE-6400 (where I helped a little
>>> >> with tests) and https://issues.apache.org/jira/browse/LUCENE-6401. This
>>> >> issue is also related to https://issues.apache.org/jira/browse/LUCENE-3843.
>>> >>
>>> >> Starting from the discussion on LUCENE-6400, I'm attempting to implement a
>>> >> solution. Here is a patch with a first step - the implementation to fix
>>> >> "SynFilter to be able to 'make positions'" (as was mentioned on the issue).
>>> >> In this way, the synonym filter generates a correct (or, at least, better)
>>> >> graph.
>>> >>
>>> >> As the synonym matching is greedy, I only had to worry about fixing the
>>> >> position length of the rules of the current match; no future or past
>>> >> synonyms would "span" over this match (please correct me if I'm wrong!). It
>>> >> did require more buffering, twice as much.
>>> >>
>>> >> The new behavior I added is not active by default; a new parameter has to be
>>> >> passed in a new constructor for SynonymFilter. The changes I made do change
>>> >> the token stream generated by the synonym filter, and I thought it would be
>>> >> better to let that be a voluntary decision for now.
>>> >>
>>> >> I did some refactoring on the code, but mostly on what I had to change for
>>> >> my implementation, so that the patch was not too hard to read. I created
>>> >> specific unit tests for the new implementation (TestMultiWordSynonymFilter)
>>> >> that should show how things will be with the new behavior.
>>> >>
>>> >> Speaking of tests, I ran "analysis-common" tests locally (Windows 8, Java
>>> >> 8), and had only 2 unrelated failures (as far as I can tell) complaining of
>>> >> missing PostingsFormat "FSTOrd50".
>>> >>
>>> >> Thanks for any help, comment, adjustment on the patch. I'll do my best to
>>> >> make the necessary adjustments.
>>> >>
>>> >> Please forgive me if I did not follow any rule, of the code or of the list,
>>> >> and I would be grateful to be able to learn from my mistakes.
>>> >>
>>> >> Regards,
>>> >> Ian
>>> >>
>>> >>
>>> >> ---------------------------------------------------------------------
>>> >> To unsubscribe, e-mail: [email protected]
>>> >> For additional commands, e-mail: [email protected]
>>> >
>>>
>
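
The two ways of turning posInc/posLength into positions that come up in this thread (the usual consumer-side reading, and the `if (posInc > 0)` test change quoted above) can be compared side by side. This is only a minimal sketch with no Lucene dependency: the Token class, the method names, the starting values of `pos` and `lastPosLength`, and the three-token stream are all hypothetical and chosen for illustration. It is not the code from the patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class PositionRules {
    /** Hypothetical stand-in for a token's term, position increment and position length. */
    static final class Token {
        final String term;
        final int posInc;
        final int posLength;

        Token(String term, int posInc, int posLength) {
            this.term = term;
            this.posInc = posInc;
            this.posLength = posLength;
        }
    }

    /** Usual reading: a token's position is the previous position plus its posInc. */
    static List<Integer> standardPositions(List<Token> tokens) {
        List<Integer> out = new ArrayList<>();
        int pos = -1; // so that a first token with posInc=1 lands on position 0
        for (Token t : tokens) {
            pos += t.posInc;
            out.add(pos);
        }
        return out;
    }

    /** Rule from the quoted test change: advance by the posLength of the last position starter. */
    static List<Integer> patchedPositions(List<Token> tokens) {
        List<Integer> out = new ArrayList<>();
        int pos = 0;           // assumed starting value; the quoted snippet does not show it
        int lastPosLength = 0; // assumed starting value, same caveat
        for (Token t : tokens) {
            if (t.posInc > 0) {
                pos += lastPosLength;
                lastPosLength = t.posLength;
            }
            out.add(pos);
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up stream: "a" spans two positions, "b" is stacked on it, "c" follows.
        List<Token> stream = Arrays.asList(
            new Token("a", 1, 2),
            new Token("b", 0, 1),
            new Token("c", 1, 1));
        System.out.println(standardPositions(stream)); // [0, 0, 1]
        System.out.println(patchedPositions(stream));  // [0, 0, 2]
    }
}
```

On this stream the two rules disagree about where "c" starts, which is the kind of consumer-side difference the thread is concerned with. When every token that increments the position also has a position length of 1, both rules give the same positions.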
