subject:"\[jira\] \[Commented\] \(LUCENE\-6582\) SynonymFilter should generate a correct \(or, at least, better\) graph"

[jira] [Commented] (LUCENE-6582) SynonymFilter should generate a correct (or, at least, better) graph

2015-07-08 Thread Michael McCandless (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-6582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14618511#comment-14618511
]

Michael McCandless commented on LUCENE-6582:

bq. I'm going to now try to simplify SynFilter by removing the hairy graph
flattening it must do today

I opened LUCENE-6664 for this... it seems to work!

SynonymFilter should generate a correct (or, at least, better) graph

Key: LUCENE-6582
URL: https://issues.apache.org/jira/browse/LUCENE-6582
Project: Lucene - Core
Issue Type: Bug
Reporter: Ian Ribas
Attachments: LUCENE-6582.patch, LUCENE-6582.patch, LUCENE-6582.patch,
after.png, after2.png, after3.png, before.png

Some time ago, I had a problem with synonyms and phrase type queries
(actually, it was elasticsearch and I was using a match query with multiple
terms and the and operator, as better explained here:
https://github.com/elastic/elasticsearch/issues/10394).
That issue led to some work on Lucene: LUCENE-6400 (where I helped a little
with tests) and LUCENE-6401. This issue is also related to LUCENE-3843.
Starting from the discussion on LUCENE-6400, I'm attempting to implement a
solution. Here is a patch with a first step - the implementation to fix
SynFilter to be able to 'make positions' (as was mentioned on the
[issue|https://issues.apache.org/jira/browse/LUCENE-6400?focusedCommentId=14498554page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14498554]).
In this way, the synonym filter generates a correct (or, at least, better)
graph.
As the synonym matching is greedy, I only had to worry about fixing the
position length of the rules of the current match, no future or past synonyms
would span over this match (please correct me if I'm wrong!). It did
require more buffering, twice as much.
The new behavior I added is not active by default, a new parameter has to be
passed in a new constructor for {{SynonymFilter}}. The changes I made do
change the token stream generated by the synonym filter, and I thought it
would be better to let that be a voluntary decision for now.
I did some refactoring on the code, but mostly on what I had to change for
may implementation, so that the patch was not too hard to read. I created
specific unit tests for the new implementation
({{TestMultiWordSynonymFilter}}) that should show how things will be with the
new behavior.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-6582) SynonymFilter should generate a correct (or, at least, better) graph

2015-07-02 Thread Michael McCandless (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-6582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611746#comment-14611746
]

Michael McCandless commented on LUCENE-6582:

Thanks [~ianribas]!

I've opened LUCENE-6638 to create a graph flattener TokenFilter, and it seems
to work well ... I'm going to now try to simplify SynFilter by removing the
hairy graph flattening it must do today, and have it create correct graph
outputs. I think with the combination of these two we can then have 100%
accurate synonym graphs for accurate query-time searches, but also have the
sausagized version that indexing needs to match the graph corruption we do
today.

SynonymFilter should generate a correct (or, at least, better) graph

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-6582) SynonymFilter should generate a correct (or, at least, better) graph

2015-06-23 Thread Ian Ribas (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-6582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14597570#comment-14597570
]

Ian Ribas commented on LUCENE-6582:
---

I'll be glad to help either way. I trust whatever priority you both
decide.Where should I start looking for the next step? QueryBuilder?

bq. some of the complexity in the current one is because it's being forced to
sausagify at the same time as producing synonyms, so if we can more cleanly
split out those two functions maybe we get simpler code

Very good point! It does depend, though, on how the graph is represented, of
course. I'll try to keep this in mind when working on the correct graph (now
or later).

SynonymFilter should generate a correct (or, at least, better) graph

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-6582) SynonymFilter should generate a correct (or, at least, better) graph

2015-06-23 Thread Michael McCandless (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-6582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14597359#comment-14597359
]

Michael McCandless commented on LUCENE-6582:

I agree [~rcmuir], we can do the correct graph as a follow-on issue; this one
is already a good step towards that.

I'll look more closely at this patch.

I can also help with phase 2 ... e.g. maybe build the re-sausagizer filter.
I do think phase 2 may result in a simpler syn filter implementation ... some
of the complexity in the current one is because it's being forced to sausagify
at the same time as producing synonyms, so if we can more cleanly split out
those two functions maybe we get simpler code ... not sure.

SynonymFilter should generate a correct (or, at least, better) graph

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-6582) SynonymFilter should generate a correct (or, at least, better) graph

2015-06-22 Thread Robert Muir (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-6582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14596804#comment-14596804
 ] 

Robert Muir commented on LUCENE-6582:
-

Can we consider mike's crazy graph as a second step? I don't know, it just 
seems like the current patch might be an easier-to-digest improvement. Then 
again I get totally lost in the synonymfilter :)

There is a lot still to do after even this first step before users get real 
improvements, for instance fixing querybuilder for the boolean case (hopefully 
easier now), and improving stuff like sandbox/TermAutomatonQuery (maybe?) for 
the positional case, and so on.

If we spend the work on a hypercorrect graph for synonymfilter I'm just afraid 
of things getting lost, queryparser can't make use of it because posinc/poslen 
semantics become too confusing, analysis is hard to reason about because 
posinc/poslen semantics are too confusing, more performance tradeoffs (like 
graph re-sausaging at index time), ...


 SynonymFilter should generate a correct (or, at least, better) graph
 

 Key: LUCENE-6582
 URL: https://issues.apache.org/jira/browse/LUCENE-6582
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Ian Ribas
 Attachments: LUCENE-6582.patch, LUCENE-6582.patch, after.png, 
 after2.png, after3.png, before.png


 Some time ago, I had a problem with synonyms and phrase type queries 
 (actually, it was elasticsearch and I was using a match query with multiple 
 terms and the and operator, as better explained here: 
 https://github.com/elastic/elasticsearch/issues/10394).
 That issue led to some work on Lucene: LUCENE-6400 (where I helped a little 
 with tests) and  LUCENE-6401. This issue is also related to LUCENE-3843.
 Starting from the discussion on LUCENE-6400, I'm attempting to implement a 
 solution. Here is a patch with a first step - the implementation to fix 
 SynFilter to be able to 'make positions' (as was mentioned on the 
 [issue|https://issues.apache.org/jira/browse/LUCENE-6400?focusedCommentId=14498554page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14498554]).
  In this way, the synonym filter generates a correct (or, at least, better) 
 graph.
 As the synonym matching is greedy, I only had to worry about fixing the 
 position length of the rules of the current match, no future or past synonyms 
 would span over this match (please correct me if I'm wrong!). It did 
 require more buffering, twice as much.
 The new behavior I added is not active by default, a new parameter has to be 
 passed in a new constructor for {{SynonymFilter}}. The changes I made do 
 change the token stream generated by the synonym filter, and I thought it 
 would be better to let that be a voluntary decision for now.
 I did some refactoring on the code, but mostly on what I had to change for 
 may implementation, so that the patch was not too hard to read. I created 
 specific unit tests for the new implementation 
 ({{TestMultiWordSynonymFilter}}) that should show how things will be with the 
 new behavior.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-6582) SynonymFilter should generate a correct (or, at least, better) graph

2015-06-22 Thread Ian Ribas (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-6582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14596772#comment-14596772
]

Ian Ribas commented on LUCENE-6582:
---

Ok, I think I have an idea on how to do this, but it will definitely need some
thinking, so it will probably take a while.

bq. I think that's fine, I think better correctness trumps the added buffering
cost.

I completely agree.

bq. I think if we fix syn filter here to produce the correct graph, we should
also insert a sausagizer phase that turns this graph back into a sausage for
indexing?

I think Robert also commented something on these lines in his answer to my
email. I think I understand the general idea of what that means, but I would
certainly appreciate some guidance, when the time comes. I'll focus on
producing a correct graph first.

This also means that maybe I'll need changes on the test validations, since we
might run into conditions that are considered wrong now. Specially regarding
offsets (start and end) and their relation to position lengths. But I'll see
what I can do about that too.

bq. However, if you apply syn filter at search time, we could fix query parsers
to possibly do the right thing here

I was planning taking a shot at that too, once this part is finished. To make
the solution more complete. And, again, I'll certainly appreciate ideas when
the time comes.

SynonymFilter should generate a correct (or, at least, better) graph

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-6582) SynonymFilter should generate a correct (or, at least, better) graph

2015-06-22 Thread Michael McCandless (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-6582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14596490#comment-14596490
]

Michael McCandless commented on LUCENE-6582:

bq. I really hadn't thought of using position lengths as references, like
this!

It's hard to think about :) But it just means the positions become node IDs,
and you must number the nodes properly (so that any token always goes from
node X to Y where Y X).

bq. One problem that I see is that I'll need more buffering

I think that's fine, I think better correctness trumps the added buffering cost.

bq. One other doubt I have is how this affects the indexer. I imagine it saves
position lengths on the index too, so this shouldn't be a problem, right?

The index does NOT record position length today... I think if we fix syn filter
here to produce the correct graph, we should also insert a sausagizer phase
that turns this graph back into a sausage for indexing? (So that what the
fudge and wow that's funny will in fact match a document that had wtf).

However, if you apply syn filter at search time, we could fix query parsers to
possibly do the right thing here, e.g. translating this graph into a union of
phrase queries, or using TermAutomatonQuery (in sandbox still), or something ...

SynonymFilter should generate a correct (or, at least, better) graph

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-6582) SynonymFilter should generate a correct (or, at least, better) graph

2015-06-22 Thread Ian Ribas (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-6582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14596318#comment-14596318
]

Ian Ribas commented on LUCENE-6582:
---

I really hadn't thought of using position lengths as references, like this!
For some reason I could only think of position length spanning over terms on
the base stream, not creating independent branches. And so I was stuck in
thinking how to represent the graph correctly once I saw that my
implementation was still a sausage on the overlapping phrases.

I will try to implement this solution and see how far I can get.

One problem that I see is that I'll need more buffering as now the future
output is no longer the size of the longest synonym but the added sizes of all
synonyms that are phrases (each minus one). Getting this number at the
beginning might require a change on the synonym parser...

One other doubt I have is how this affects the indexer. I imagine it saves
position lengths on the index too, so this shouldn't be a problem, right?

Thanks very much for such clear guidelines!

SynonymFilter should generate a correct (or, at least, better) graph

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-6582) SynonymFilter should generate a correct (or, at least, better) graph

2015-06-21 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-6582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14595017#comment-14595017
 ] 

Michael McCandless commented on LUCENE-6582:


bq.  changed -10 to -1

Thank you!

bq. Unfortunately, I could not merge the tests into a single file because they 
have different base classes and the new tests depend on asserts on the base 
class.

Ugh, OK, thanks for trying.  It's fine to leave it separate...

bq. The absolute positions still have meaning.

OK but with this change we have now created new nodes (something SynFilter does 
not do today), because wtf now has posLen=3.  This is great (necessary!) for 
SynFilter to be correct...

bq. And there is no difference in the representation of the tokens 
unexpensive as a synonym for cheap from wow as a synonym for what on 
the previous rules, using the attributes.

Well, it is entirely possible for PosInc/PosLenAtt to express the correct 
graph, it's just hairy to implement, but I think your patch is part way there 
(it creates new positions!).

E.g. here's a sequence of tokens that would be the fully correct graph output 
for the wtf example:

||token||posInc||posLen||
|wtf|1|5|
|what|0|1|
|wow|0|3|
|the|1|1|
|fudge|1|3|
|that's|1|1|
|funny|1|1|
|happened|1|1|

It corresponds to this graph:

!after3.png!

The token posInc/posLen is just a rote serialization of the arcs of the graph 
based on how the states are numbered, and other numberings would be possible 
resulting in different token outputs because there is inherent ambiguity in how 
you serialize a graph.  I think the only constraints are that 1) all arcs 
leaving a given state must be serialized one after another (exactly like 
Lucene's Automaton class!), 2) an arc from node X must go to another node  X 
(i.e., arcs cannot go to an earlier numbered node).

 SynonymFilter should generate a correct (or, at least, better) graph
 

 Key: LUCENE-6582
 URL: https://issues.apache.org/jira/browse/LUCENE-6582
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Ian Ribas
 Attachments: LUCENE-6582.patch, LUCENE-6582.patch, after.png, 
 after2.png, after3.png, before.png


 Some time ago, I had a problem with synonyms and phrase type queries 
 (actually, it was elasticsearch and I was using a match query with multiple 
 terms and the and operator, as better explained here: 
 https://github.com/elastic/elasticsearch/issues/10394).
 That issue led to some work on Lucene: LUCENE-6400 (where I helped a little 
 with tests) and  LUCENE-6401. This issue is also related to LUCENE-3843.
 Starting from the discussion on LUCENE-6400, I'm attempting to implement a 
 solution. Here is a patch with a first step - the implementation to fix 
 SynFilter to be able to 'make positions' (as was mentioned on the 
 [issue|https://issues.apache.org/jira/browse/LUCENE-6400?focusedCommentId=14498554page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14498554]).
  In this way, the synonym filter generates a correct (or, at least, better) 
 graph.
 As the synonym matching is greedy, I only had to worry about fixing the 
 position length of the rules of the current match, no future or past synonyms 
 would span over this match (please correct me if I'm wrong!). It did 
 require more buffering, twice as much.
 The new behavior I added is not active by default, a new parameter has to be 
 passed in a new constructor for {{SynonymFilter}}. The changes I made do 
 change the token stream generated by the synonym filter, and I thought it 
 would be better to let that be a voluntary decision for now.
 I did some refactoring on the code, but mostly on what I had to change for 
 may implementation, so that the patch was not too hard to read. I created 
 specific unit tests for the new implementation 
 ({{TestMultiWordSynonymFilter}}) that should show how things will be with the 
 new behavior.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-6582) SynonymFilter should generate a correct (or, at least, better) graph

2015-06-18 Thread Ian Ribas (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-6582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14592896#comment-14592896
]

Ian Ribas commented on LUCENE-6582:
---

But after this change, the expanded synonyms become separate paths in
the graph right? So it will look like this?
Matching exactly the right phrases?

Actually, I think it looks more like this:

!after2.png!

That means that wtf the fudge and wow happened funny will no longer match,
but wow the fudge would.

The absolute positions still have meaning. I couldn't figure out how to clearly
separate the two phrases from the rules (as was mentioned on the TODO comment
above) using the token attributes. They will be stacked in the same order on
each position, but that doesn't seem to be enough to make things unambiguous.
Specially since with rules like:

{noformat}
pass, ticket
cheap, unexpensive
{noformat}

and tokenizing:

{noformat}
cheap pass
{noformat}

I would expect to mach all the resulting combinations: cheap pass,
unexpensive pass, cheap ticket and unexpensive ticket. And there is no
difference in the representation of the tokens unexpensive as a synonym for
cheap from wow as a synonym for what on the previous rules, using the
attributes.

Why -10?

I was unsure if -1 as an invalid value was clear enough and ended up using -10.
It could probably just be -1. I'll check.

Can we just add the new test cases into the existing (tiny)
TestMultiWordSynonyms.java?

Probably. Since all tests in the new file use the new constructor to force the
new behavior, and TestMultiWordSynonyms tests the old behavior, I didn'tt want
to mix things. But I'll just join them, with a comment on the test to make it
clear its the old behavior.

In the case where a given input token matches no rules in the FST, are
we still able to pass that through to the output without buffering
(calling capture())?

I didn't test this specifically, but would think yes. The new behavior only
handles matches differently, I didn't have to do any extra buffering before the
match, nor did I change anything on the matching part of the code. The extra
buffering is needed only when there was already buffering (lookahead for a
partial match) and the synonym was longer than the match.

SynonymFilter should generate a correct (or, at least, better) graph

Key: LUCENE-6582
URL: https://issues.apache.org/jira/browse/LUCENE-6582
Project: Lucene - Core
Issue Type: Bug
Reporter: Ian Ribas
Attachments: LUCENE-6582.patch, after.png, after2.png, before.png

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-6582) SynonymFilter should generate a correct (or, at least, better) graph

2015-06-18 Thread Ian Ribas (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-6582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14592848#comment-14592848
]

Ian Ribas commented on LUCENE-6582:
---

This comment was written early in my attempts and I'm afraid I didn't review it
later. In the final solution, it is much clearer that the longer terms are not
individual synonyms of the shorter terms, specially since the longest version
is what appears first on the output. I'll have to change the comment, and take
away the TODO (it was a marker for things to remember to ask when submitting).

I don't think things will get sausaged, but I'll look into
TokenStreamToAutomaton to see if I can understand it enough to use it.

What might still happen is that a phrase u s of america would match, as is
implied on the second question on the comment above, which is a lesser problem
but still kind of wrong. More on that below.

SynonymFilter should generate a correct (or, at least, better) graph

Key: LUCENE-6582
URL: https://issues.apache.org/jira/browse/LUCENE-6582
Project: Lucene - Core
Issue Type: Bug
Reporter: Ian Ribas
Attachments: LUCENE-6582.patch, after.png, before.png

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-6582) SynonymFilter should generate a correct (or, at least, better) graph

2015-06-18 Thread Ian Ribas (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-6582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14592849#comment-14592849
]

Ian Ribas commented on LUCENE-6582:
---

I don't think things will get sausaged, but I'll look into
TokenStreamToAutomaton to see if I can understand it enough to use it.

What might still happen is that a phrase u s of america would match, as is
implied on the second question on the comment above, which is a lesser problem
but still kind of wrong. More on that below.

SynonymFilter should generate a correct (or, at least, better) graph

Key: LUCENE-6582
URL: https://issues.apache.org/jira/browse/LUCENE-6582
Project: Lucene - Core
Issue Type: Bug
Reporter: Ian Ribas
Attachments: LUCENE-6582.patch, after.png, before.png

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-6582) SynonymFilter should generate a correct (or, at least, better) graph

2015-06-18 Thread Michael McCandless (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-6582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14592430#comment-14592430
]

Michael McCandless commented on LUCENE-6582:

I will dig more into this patch, it is a nice (big!) change, but this TODO
caught my eye:

{noformat}
// TODO: Problems: In the substitution below, how to identify that the
terms united states of america
// are actually a phrase and not individual synonyms of usa? And how to
differentiate that phrase from the
// phrase u s a? We can do that adding position lengths ...
{noformat}

One approach could be to use TokenStreamToAutomaton, then enumerate all finite
strings from the resulting automaton, and assert it's as expected? I.e. that
things did not get unexpectedly sausaged?

SynonymFilter should generate a correct (or, at least, better) graph

Key: LUCENE-6582
URL: https://issues.apache.org/jira/browse/LUCENE-6582
Project: Lucene - Core
Issue Type: Bug
Reporter: Ian Ribas
Attachments: LUCENE-6582.patch

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-6582) SynonymFilter should generate a correct (or, at least, better) graph

2015-06-18 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-6582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14592501#comment-14592501
 ] 

Michael McCandless commented on LUCENE-6582:


Just to confirm the scope of this change, if I have these synonyms:

{noformat}
  wtf -- what the fudge
  wtf -- wow that's funny
{noformat}

And then I'm tokenizing this:

{noformat}
  wtf happened
{noformat}

Before this change (today) I get this crazy sausage incorrectly
matching phrases like wtf the fudge and wow happened funny:

!before.png!

But after this change, the expanded synonyms become separate paths in
the graph right?  So it will look like this?:

!after.png!

Matching exactly the right phrases?  Note that absolute position
numbers become somewhat meaningless now ... they are really node
IDs.

Some small things I noticed:

{noformat}
   originalInputIdx = -10;
{noformat}

Why -10?

Can we just add the new test cases into the existing (tiny)
TestMultiWordSynonyms.java?

Related: LUCENE-5012 is an effort to make such graph consuming and
producing token streams easier ... but that's a major change to the
TokenStream API.

In the case where a given input token matches no rules in the FST, are
we still able to pass that through to the output without buffering
(calling capture())?


 SynonymFilter should generate a correct (or, at least, better) graph
 

 Key: LUCENE-6582
 URL: https://issues.apache.org/jira/browse/LUCENE-6582
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Ian Ribas
 Attachments: LUCENE-6582.patch, after.png, before.png


 Some time ago, I had a problem with synonyms and phrase type queries 
 (actually, it was elasticsearch and I was using a match query with multiple 
 terms and the and operator, as better explained here: 
 https://github.com/elastic/elasticsearch/issues/10394).
 That issue led to some work on Lucene: LUCENE-6400 (where I helped a little 
 with tests) and  LUCENE-6401. This issue is also related to LUCENE-3843.
 Starting from the discussion on LUCENE-6400, I'm attempting to implement a 
 solution. Here is a patch with a first step - the implementation to fix 
 SynFilter to be able to 'make positions' (as was mentioned on the 
 [issue|https://issues.apache.org/jira/browse/LUCENE-6400?focusedCommentId=14498554page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14498554]).
  In this way, the synonym filter generates a correct (or, at least, better) 
 graph.
 As the synonym matching is greedy, I only had to worry about fixing the 
 position length of the rules of the current match, no future or past synonyms 
 would span over this match (please correct me if I'm wrong!). It did 
 require more buffering, twice as much.
 The new behavior I added is not active by default, a new parameter has to be 
 passed in a new constructor for {{SynonymFilter}}. The changes I made do 
 change the token stream generated by the synonym filter, and I thought it 
 would be better to let that be a voluntary decision for now.
 I did some refactoring on the code, but mostly on what I had to change for 
 may implementation, so that the patch was not too hard to read. I created 
 specific unit tests for the new implementation 
 ({{TestMultiWordSynonymFilter}}) that should show how things will be with the 
 new behavior.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-6582) SynonymFilter should generate a correct (or, at least, better) graph

[jira] [Commented] (LUCENE-6582) SynonymFilter should generate a correct (or, at least, better) graph

[jira] [Commented] (LUCENE-6582) SynonymFilter should generate a correct (or, at least, better) graph

[jira] [Commented] (LUCENE-6582) SynonymFilter should generate a correct (or, at least, better) graph

[jira] [Commented] (LUCENE-6582) SynonymFilter should generate a correct (or, at least, better) graph

[jira] [Commented] (LUCENE-6582) SynonymFilter should generate a correct (or, at least, better) graph

[jira] [Commented] (LUCENE-6582) SynonymFilter should generate a correct (or, at least, better) graph

[jira] [Commented] (LUCENE-6582) SynonymFilter should generate a correct (or, at least, better) graph

[jira] [Commented] (LUCENE-6582) SynonymFilter should generate a correct (or, at least, better) graph

[jira] [Commented] (LUCENE-6582) SynonymFilter should generate a correct (or, at least, better) graph

[jira] [Commented] (LUCENE-6582) SynonymFilter should generate a correct (or, at least, better) graph

[jira] [Commented] (LUCENE-6582) SynonymFilter should generate a correct (or, at least, better) graph

[jira] [Commented] (LUCENE-6582) SynonymFilter should generate a correct (or, at least, better) graph

[jira] [Commented] (LUCENE-6582) SynonymFilter should generate a correct (or, at least, better) graph

14 matches

Site Navigation

Mail list logo

Footer information