[jira] [Commented] (LUCENE-6582) SynonymFilter should generate a correct (or, at least, better) graph

2015-07-08 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14618511#comment-14618511
 ] 

Michael McCandless commented on LUCENE-6582:


bq. I'm going to now try to simplify SynFilter by removing the hairy graph 
flattening it must do today

I opened LUCENE-6664 for this... it seems to work!

 SynonymFilter should generate a correct (or, at least, better) graph
 

 Key: LUCENE-6582
 URL: https://issues.apache.org/jira/browse/LUCENE-6582
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Ian Ribas
 Attachments: LUCENE-6582.patch, LUCENE-6582.patch, LUCENE-6582.patch, 
 after.png, after2.png, after3.png, before.png


 Some time ago, I had a problem with synonyms and phrase type queries 
 (actually, it was elasticsearch and I was using a match query with multiple 
 terms and the and operator, as better explained here: 
 https://github.com/elastic/elasticsearch/issues/10394).
 That issue led to some work on Lucene: LUCENE-6400 (where I helped a little 
 with tests) and  LUCENE-6401. This issue is also related to LUCENE-3843.
 Starting from the discussion on LUCENE-6400, I'm attempting to implement a 
 solution. Here is a patch with a first step - the implementation to fix 
 SynFilter to be able to 'make positions' (as was mentioned on the 
 [issue|https://issues.apache.org/jira/browse/LUCENE-6400?focusedCommentId=14498554page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14498554]).
  In this way, the synonym filter generates a correct (or, at least, better) 
 graph.
 As the synonym matching is greedy, I only had to worry about fixing the 
 position length of the rules of the current match, no future or past synonyms 
 would span over this match (please correct me if I'm wrong!). It did 
 require more buffering, twice as much.
 The new behavior I added is not active by default, a new parameter has to be 
 passed in a new constructor for {{SynonymFilter}}. The changes I made do 
 change the token stream generated by the synonym filter, and I thought it 
 would be better to let that be a voluntary decision for now.
 I did some refactoring on the code, but mostly on what I had to change for 
 may implementation, so that the patch was not too hard to read. I created 
 specific unit tests for the new implementation 
 ({{TestMultiWordSynonymFilter}}) that should show how things will be with the 
 new behavior.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6582) SynonymFilter should generate a correct (or, at least, better) graph

2015-07-02 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611746#comment-14611746
 ] 

Michael McCandless commented on LUCENE-6582:


Thanks [~ianribas]!

I've opened LUCENE-6638 to create a graph flattener TokenFilter, and it seems 
to work well ... I'm going to now try to simplify SynFilter by removing the 
hairy graph flattening it must do today, and have it create correct graph 
outputs.  I think with the combination of these two we can then have 100% 
accurate synonym graphs for accurate query-time searches, but also have the 
sausagized version that indexing needs to match the graph corruption we do 
today.

 SynonymFilter should generate a correct (or, at least, better) graph
 

 Key: LUCENE-6582
 URL: https://issues.apache.org/jira/browse/LUCENE-6582
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Ian Ribas
 Attachments: LUCENE-6582.patch, LUCENE-6582.patch, LUCENE-6582.patch, 
 after.png, after2.png, after3.png, before.png


 Some time ago, I had a problem with synonyms and phrase type queries 
 (actually, it was elasticsearch and I was using a match query with multiple 
 terms and the and operator, as better explained here: 
 https://github.com/elastic/elasticsearch/issues/10394).
 That issue led to some work on Lucene: LUCENE-6400 (where I helped a little 
 with tests) and  LUCENE-6401. This issue is also related to LUCENE-3843.
 Starting from the discussion on LUCENE-6400, I'm attempting to implement a 
 solution. Here is a patch with a first step - the implementation to fix 
 SynFilter to be able to 'make positions' (as was mentioned on the 
 [issue|https://issues.apache.org/jira/browse/LUCENE-6400?focusedCommentId=14498554page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14498554]).
  In this way, the synonym filter generates a correct (or, at least, better) 
 graph.
 As the synonym matching is greedy, I only had to worry about fixing the 
 position length of the rules of the current match, no future or past synonyms 
 would span over this match (please correct me if I'm wrong!). It did 
 require more buffering, twice as much.
 The new behavior I added is not active by default, a new parameter has to be 
 passed in a new constructor for {{SynonymFilter}}. The changes I made do 
 change the token stream generated by the synonym filter, and I thought it 
 would be better to let that be a voluntary decision for now.
 I did some refactoring on the code, but mostly on what I had to change for 
 may implementation, so that the patch was not too hard to read. I created 
 specific unit tests for the new implementation 
 ({{TestMultiWordSynonymFilter}}) that should show how things will be with the 
 new behavior.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6582) SynonymFilter should generate a correct (or, at least, better) graph

2015-06-23 Thread Ian Ribas (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14597570#comment-14597570
 ] 

Ian Ribas commented on LUCENE-6582:
---

I'll be glad to help either way. I trust whatever priority you both 
decide.Where should I start looking for the next step? QueryBuilder? 

bq. some of the complexity in the current one is because it's being forced to 
sausagify at the same time as producing synonyms, so if we can more cleanly 
split out those two functions maybe we get simpler code 

Very good point! It does depend, though, on how the graph is represented, of 
course. I'll try to keep this in mind when working on the correct graph (now 
or later).

 SynonymFilter should generate a correct (or, at least, better) graph
 

 Key: LUCENE-6582
 URL: https://issues.apache.org/jira/browse/LUCENE-6582
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Ian Ribas
 Attachments: LUCENE-6582.patch, LUCENE-6582.patch, after.png, 
 after2.png, after3.png, before.png


 Some time ago, I had a problem with synonyms and phrase type queries 
 (actually, it was elasticsearch and I was using a match query with multiple 
 terms and the and operator, as better explained here: 
 https://github.com/elastic/elasticsearch/issues/10394).
 That issue led to some work on Lucene: LUCENE-6400 (where I helped a little 
 with tests) and  LUCENE-6401. This issue is also related to LUCENE-3843.
 Starting from the discussion on LUCENE-6400, I'm attempting to implement a 
 solution. Here is a patch with a first step - the implementation to fix 
 SynFilter to be able to 'make positions' (as was mentioned on the 
 [issue|https://issues.apache.org/jira/browse/LUCENE-6400?focusedCommentId=14498554page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14498554]).
  In this way, the synonym filter generates a correct (or, at least, better) 
 graph.
 As the synonym matching is greedy, I only had to worry about fixing the 
 position length of the rules of the current match, no future or past synonyms 
 would span over this match (please correct me if I'm wrong!). It did 
 require more buffering, twice as much.
 The new behavior I added is not active by default, a new parameter has to be 
 passed in a new constructor for {{SynonymFilter}}. The changes I made do 
 change the token stream generated by the synonym filter, and I thought it 
 would be better to let that be a voluntary decision for now.
 I did some refactoring on the code, but mostly on what I had to change for 
 may implementation, so that the patch was not too hard to read. I created 
 specific unit tests for the new implementation 
 ({{TestMultiWordSynonymFilter}}) that should show how things will be with the 
 new behavior.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6582) SynonymFilter should generate a correct (or, at least, better) graph

2015-06-23 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14597359#comment-14597359
 ] 

Michael McCandless commented on LUCENE-6582:


I agree [~rcmuir], we can do the correct graph as a follow-on issue; this one 
is already a good step towards that.

I'll look more closely at this patch.

I can also help with phase 2 ... e.g. maybe build the re-sausagizer filter.  
I do think phase 2 may result in a simpler syn filter implementation ... some 
of the complexity in the current one is because it's being forced to sausagify 
at the same time as producing synonyms, so if we can more cleanly split out 
those two functions maybe we get simpler code ... not sure.

 SynonymFilter should generate a correct (or, at least, better) graph
 

 Key: LUCENE-6582
 URL: https://issues.apache.org/jira/browse/LUCENE-6582
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Ian Ribas
 Attachments: LUCENE-6582.patch, LUCENE-6582.patch, after.png, 
 after2.png, after3.png, before.png


 Some time ago, I had a problem with synonyms and phrase type queries 
 (actually, it was elasticsearch and I was using a match query with multiple 
 terms and the and operator, as better explained here: 
 https://github.com/elastic/elasticsearch/issues/10394).
 That issue led to some work on Lucene: LUCENE-6400 (where I helped a little 
 with tests) and  LUCENE-6401. This issue is also related to LUCENE-3843.
 Starting from the discussion on LUCENE-6400, I'm attempting to implement a 
 solution. Here is a patch with a first step - the implementation to fix 
 SynFilter to be able to 'make positions' (as was mentioned on the 
 [issue|https://issues.apache.org/jira/browse/LUCENE-6400?focusedCommentId=14498554page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14498554]).
  In this way, the synonym filter generates a correct (or, at least, better) 
 graph.
 As the synonym matching is greedy, I only had to worry about fixing the 
 position length of the rules of the current match, no future or past synonyms 
 would span over this match (please correct me if I'm wrong!). It did 
 require more buffering, twice as much.
 The new behavior I added is not active by default, a new parameter has to be 
 passed in a new constructor for {{SynonymFilter}}. The changes I made do 
 change the token stream generated by the synonym filter, and I thought it 
 would be better to let that be a voluntary decision for now.
 I did some refactoring on the code, but mostly on what I had to change for 
 may implementation, so that the patch was not too hard to read. I created 
 specific unit tests for the new implementation 
 ({{TestMultiWordSynonymFilter}}) that should show how things will be with the 
 new behavior.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6582) SynonymFilter should generate a correct (or, at least, better) graph

2015-06-22 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14596804#comment-14596804
 ] 

Robert Muir commented on LUCENE-6582:
-

Can we consider mike's crazy graph as a second step? I don't know, it just 
seems like the current patch might be an easier-to-digest improvement. Then 
again I get totally lost in the synonymfilter :)

There is a lot still to do after even this first step before users get real 
improvements, for instance fixing querybuilder for the boolean case (hopefully 
easier now), and improving stuff like sandbox/TermAutomatonQuery (maybe?) for 
the positional case, and so on.

If we spend the work on a hypercorrect graph for synonymfilter I'm just afraid 
of things getting lost, queryparser can't make use of it because posinc/poslen 
semantics become too confusing, analysis is hard to reason about because 
posinc/poslen semantics are too confusing, more performance tradeoffs (like 
graph re-sausaging at index time), ...


 SynonymFilter should generate a correct (or, at least, better) graph
 

 Key: LUCENE-6582
 URL: https://issues.apache.org/jira/browse/LUCENE-6582
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Ian Ribas
 Attachments: LUCENE-6582.patch, LUCENE-6582.patch, after.png, 
 after2.png, after3.png, before.png


 Some time ago, I had a problem with synonyms and phrase type queries 
 (actually, it was elasticsearch and I was using a match query with multiple 
 terms and the and operator, as better explained here: 
 https://github.com/elastic/elasticsearch/issues/10394).
 That issue led to some work on Lucene: LUCENE-6400 (where I helped a little 
 with tests) and  LUCENE-6401. This issue is also related to LUCENE-3843.
 Starting from the discussion on LUCENE-6400, I'm attempting to implement a 
 solution. Here is a patch with a first step - the implementation to fix 
 SynFilter to be able to 'make positions' (as was mentioned on the 
 [issue|https://issues.apache.org/jira/browse/LUCENE-6400?focusedCommentId=14498554page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14498554]).
  In this way, the synonym filter generates a correct (or, at least, better) 
 graph.
 As the synonym matching is greedy, I only had to worry about fixing the 
 position length of the rules of the current match, no future or past synonyms 
 would span over this match (please correct me if I'm wrong!). It did 
 require more buffering, twice as much.
 The new behavior I added is not active by default, a new parameter has to be 
 passed in a new constructor for {{SynonymFilter}}. The changes I made do 
 change the token stream generated by the synonym filter, and I thought it 
 would be better to let that be a voluntary decision for now.
 I did some refactoring on the code, but mostly on what I had to change for 
 may implementation, so that the patch was not too hard to read. I created 
 specific unit tests for the new implementation 
 ({{TestMultiWordSynonymFilter}}) that should show how things will be with the 
 new behavior.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6582) SynonymFilter should generate a correct (or, at least, better) graph

2015-06-22 Thread Ian Ribas (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14596772#comment-14596772
 ] 

Ian Ribas commented on LUCENE-6582:
---

Ok, I think I have an idea on how to do this, but it will definitely need some 
thinking, so it will probably take a while.

bq. I think that's fine, I think better correctness trumps the added buffering 
cost.

I completely agree.

bq. I think if we fix syn filter here to produce the correct graph, we should 
also insert a sausagizer phase that turns this graph back into a sausage for 
indexing?

I think Robert also commented something on these lines in his answer to my 
email. I think I understand the general idea of what that means, but I would 
certainly appreciate some guidance, when the time comes. I'll focus on 
producing a correct graph first.

This also means that maybe I'll need changes on the test validations, since we 
might run into conditions that are considered wrong now. Specially regarding 
offsets (start and end) and their relation to position lengths. But I'll see 
what I can do about that too.

bq. However, if you apply syn filter at search time, we could fix query parsers 
to possibly do the right thing here

I was planning taking a shot at that too, once this part is finished. To make 
the solution more complete. And, again, I'll certainly appreciate ideas when 
the time comes.

 SynonymFilter should generate a correct (or, at least, better) graph
 

 Key: LUCENE-6582
 URL: https://issues.apache.org/jira/browse/LUCENE-6582
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Ian Ribas
 Attachments: LUCENE-6582.patch, LUCENE-6582.patch, after.png, 
 after2.png, after3.png, before.png


 Some time ago, I had a problem with synonyms and phrase type queries 
 (actually, it was elasticsearch and I was using a match query with multiple 
 terms and the and operator, as better explained here: 
 https://github.com/elastic/elasticsearch/issues/10394).
 That issue led to some work on Lucene: LUCENE-6400 (where I helped a little 
 with tests) and  LUCENE-6401. This issue is also related to LUCENE-3843.
 Starting from the discussion on LUCENE-6400, I'm attempting to implement a 
 solution. Here is a patch with a first step - the implementation to fix 
 SynFilter to be able to 'make positions' (as was mentioned on the 
 [issue|https://issues.apache.org/jira/browse/LUCENE-6400?focusedCommentId=14498554page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14498554]).
  In this way, the synonym filter generates a correct (or, at least, better) 
 graph.
 As the synonym matching is greedy, I only had to worry about fixing the 
 position length of the rules of the current match, no future or past synonyms 
 would span over this match (please correct me if I'm wrong!). It did 
 require more buffering, twice as much.
 The new behavior I added is not active by default, a new parameter has to be 
 passed in a new constructor for {{SynonymFilter}}. The changes I made do 
 change the token stream generated by the synonym filter, and I thought it 
 would be better to let that be a voluntary decision for now.
 I did some refactoring on the code, but mostly on what I had to change for 
 may implementation, so that the patch was not too hard to read. I created 
 specific unit tests for the new implementation 
 ({{TestMultiWordSynonymFilter}}) that should show how things will be with the 
 new behavior.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6582) SynonymFilter should generate a correct (or, at least, better) graph

2015-06-22 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14596490#comment-14596490
 ] 

Michael McCandless commented on LUCENE-6582:


bq. I really hadn't thought of using position lengths as references, like 
this!

It's hard to think about :)  But it just means the positions become node IDs, 
and you must number the nodes properly (so that any token always goes from 
node X to Y where Y  X).

bq. One problem that I see is that I'll need more buffering 

I think that's fine, I think better correctness trumps the added buffering cost.

bq. One other doubt I have is how this affects the indexer. I imagine it saves 
position lengths on the index too, so this shouldn't be a problem, right?

The index does NOT record position length today... I think if we fix syn filter 
here to produce the correct graph, we should also insert a sausagizer phase 
that turns this graph back into a sausage for indexing?  (So that what the 
fudge and wow that's funny will in fact match a document that had wtf).

However, if you apply syn filter at search time, we could fix query parsers to 
possibly do the right thing here, e.g. translating this graph into a union of 
phrase queries, or using TermAutomatonQuery (in sandbox still), or something ...

 SynonymFilter should generate a correct (or, at least, better) graph
 

 Key: LUCENE-6582
 URL: https://issues.apache.org/jira/browse/LUCENE-6582
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Ian Ribas
 Attachments: LUCENE-6582.patch, LUCENE-6582.patch, after.png, 
 after2.png, after3.png, before.png


 Some time ago, I had a problem with synonyms and phrase type queries 
 (actually, it was elasticsearch and I was using a match query with multiple 
 terms and the and operator, as better explained here: 
 https://github.com/elastic/elasticsearch/issues/10394).
 That issue led to some work on Lucene: LUCENE-6400 (where I helped a little 
 with tests) and  LUCENE-6401. This issue is also related to LUCENE-3843.
 Starting from the discussion on LUCENE-6400, I'm attempting to implement a 
 solution. Here is a patch with a first step - the implementation to fix 
 SynFilter to be able to 'make positions' (as was mentioned on the 
 [issue|https://issues.apache.org/jira/browse/LUCENE-6400?focusedCommentId=14498554page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14498554]).
  In this way, the synonym filter generates a correct (or, at least, better) 
 graph.
 As the synonym matching is greedy, I only had to worry about fixing the 
 position length of the rules of the current match, no future or past synonyms 
 would span over this match (please correct me if I'm wrong!). It did 
 require more buffering, twice as much.
 The new behavior I added is not active by default, a new parameter has to be 
 passed in a new constructor for {{SynonymFilter}}. The changes I made do 
 change the token stream generated by the synonym filter, and I thought it 
 would be better to let that be a voluntary decision for now.
 I did some refactoring on the code, but mostly on what I had to change for 
 may implementation, so that the patch was not too hard to read. I created 
 specific unit tests for the new implementation 
 ({{TestMultiWordSynonymFilter}}) that should show how things will be with the 
 new behavior.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6582) SynonymFilter should generate a correct (or, at least, better) graph

2015-06-22 Thread Ian Ribas (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14596318#comment-14596318
 ] 

Ian Ribas commented on LUCENE-6582:
---

I really hadn't thought of using position lengths as references, like this! 
For some reason I could only think of position length spanning over terms on 
the base stream, not creating independent branches. And so I was stuck in 
thinking  how to represent the graph correctly once I saw that my 
implementation was still a sausage on the overlapping phrases.

I will try to implement this solution and see how far I can get.

One problem that I see is that I'll need more buffering as now the future 
output is no longer the size of the longest synonym but the added sizes of all 
synonyms that are phrases (each minus one). Getting this number at the 
beginning might require a change on the synonym parser...

One other doubt I have is how this affects the indexer. I imagine it saves 
position lengths on the index too, so this shouldn't be a problem, right?

Thanks very much for such clear guidelines!

 SynonymFilter should generate a correct (or, at least, better) graph
 

 Key: LUCENE-6582
 URL: https://issues.apache.org/jira/browse/LUCENE-6582
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Ian Ribas
 Attachments: LUCENE-6582.patch, LUCENE-6582.patch, after.png, 
 after2.png, after3.png, before.png


 Some time ago, I had a problem with synonyms and phrase type queries 
 (actually, it was elasticsearch and I was using a match query with multiple 
 terms and the and operator, as better explained here: 
 https://github.com/elastic/elasticsearch/issues/10394).
 That issue led to some work on Lucene: LUCENE-6400 (where I helped a little 
 with tests) and  LUCENE-6401. This issue is also related to LUCENE-3843.
 Starting from the discussion on LUCENE-6400, I'm attempting to implement a 
 solution. Here is a patch with a first step - the implementation to fix 
 SynFilter to be able to 'make positions' (as was mentioned on the 
 [issue|https://issues.apache.org/jira/browse/LUCENE-6400?focusedCommentId=14498554page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14498554]).
  In this way, the synonym filter generates a correct (or, at least, better) 
 graph.
 As the synonym matching is greedy, I only had to worry about fixing the 
 position length of the rules of the current match, no future or past synonyms 
 would span over this match (please correct me if I'm wrong!). It did 
 require more buffering, twice as much.
 The new behavior I added is not active by default, a new parameter has to be 
 passed in a new constructor for {{SynonymFilter}}. The changes I made do 
 change the token stream generated by the synonym filter, and I thought it 
 would be better to let that be a voluntary decision for now.
 I did some refactoring on the code, but mostly on what I had to change for 
 may implementation, so that the patch was not too hard to read. I created 
 specific unit tests for the new implementation 
 ({{TestMultiWordSynonymFilter}}) that should show how things will be with the 
 new behavior.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6582) SynonymFilter should generate a correct (or, at least, better) graph

2015-06-21 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14595017#comment-14595017
 ] 

Michael McCandless commented on LUCENE-6582:


bq.  changed -10 to -1

Thank you!

bq. Unfortunately, I could not merge the tests into a single file because they 
have different base classes and the new tests depend on asserts on the base 
class.

Ugh, OK, thanks for trying.  It's fine to leave it separate...

bq. The absolute positions still have meaning.

OK but with this change we have now created new nodes (something SynFilter does 
not do today), because wtf now has posLen=3.  This is great (necessary!) for 
SynFilter to be correct...

bq. And there is no difference in the representation of the tokens 
unexpensive as a synonym for cheap from wow as a synonym for what on 
the previous rules, using the attributes.

Well, it is entirely possible for PosInc/PosLenAtt to express the correct 
graph, it's just hairy to implement, but I think your patch is part way there 
(it creates new positions!).

E.g. here's a sequence of tokens that would be the fully correct graph output 
for the wtf example:

||token||posInc||posLen||
|wtf|1|5|
|what|0|1|
|wow|0|3|
|the|1|1|
|fudge|1|3|
|that's|1|1|
|funny|1|1|
|happened|1|1|

It corresponds to this graph:

!after3.png!

The token posInc/posLen is just a rote serialization of the arcs of the graph 
based on how the states are numbered, and other numberings would be possible 
resulting in different token outputs because there is inherent ambiguity in how 
you serialize a graph.  I think the only constraints are that 1) all arcs 
leaving a given state must be serialized one after another (exactly like 
Lucene's Automaton class!), 2) an arc from node X must go to another node  X 
(i.e., arcs cannot go to an earlier numbered node).

 SynonymFilter should generate a correct (or, at least, better) graph
 

 Key: LUCENE-6582
 URL: https://issues.apache.org/jira/browse/LUCENE-6582
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Ian Ribas
 Attachments: LUCENE-6582.patch, LUCENE-6582.patch, after.png, 
 after2.png, after3.png, before.png


 Some time ago, I had a problem with synonyms and phrase type queries 
 (actually, it was elasticsearch and I was using a match query with multiple 
 terms and the and operator, as better explained here: 
 https://github.com/elastic/elasticsearch/issues/10394).
 That issue led to some work on Lucene: LUCENE-6400 (where I helped a little 
 with tests) and  LUCENE-6401. This issue is also related to LUCENE-3843.
 Starting from the discussion on LUCENE-6400, I'm attempting to implement a 
 solution. Here is a patch with a first step - the implementation to fix 
 SynFilter to be able to 'make positions' (as was mentioned on the 
 [issue|https://issues.apache.org/jira/browse/LUCENE-6400?focusedCommentId=14498554page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14498554]).
  In this way, the synonym filter generates a correct (or, at least, better) 
 graph.
 As the synonym matching is greedy, I only had to worry about fixing the 
 position length of the rules of the current match, no future or past synonyms 
 would span over this match (please correct me if I'm wrong!). It did 
 require more buffering, twice as much.
 The new behavior I added is not active by default, a new parameter has to be 
 passed in a new constructor for {{SynonymFilter}}. The changes I made do 
 change the token stream generated by the synonym filter, and I thought it 
 would be better to let that be a voluntary decision for now.
 I did some refactoring on the code, but mostly on what I had to change for 
 may implementation, so that the patch was not too hard to read. I created 
 specific unit tests for the new implementation 
 ({{TestMultiWordSynonymFilter}}) that should show how things will be with the 
 new behavior.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6582) SynonymFilter should generate a correct (or, at least, better) graph

2015-06-18 Thread Ian Ribas (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14592896#comment-14592896
 ] 

Ian Ribas commented on LUCENE-6582:
---

 But after this change, the expanded synonyms become separate paths in
 the graph right? So it will look like this?
 Matching exactly the right phrases? 

Actually, I think it looks more like this:

!after2.png!

That means that wtf the fudge and wow happened funny will no longer match, 
but wow the fudge would.

The absolute positions still have meaning. I couldn't figure out how to clearly 
separate the two phrases from the rules (as was mentioned on the TODO comment 
above) using the token attributes. They will be stacked in the same order on 
each position, but that doesn't seem to be enough to make things unambiguous. 
Specially since with rules like:

{noformat}
pass, ticket
cheap, unexpensive
{noformat}

and tokenizing:

{noformat}
cheap pass
{noformat}

I would expect to mach all the resulting combinations: cheap pass, 
unexpensive pass, cheap ticket and unexpensive ticket. And there is no 
difference in the representation of the tokens unexpensive as a synonym for 
cheap from wow as a synonym for what on the previous rules, using the 
attributes.

 Why -10?

I was unsure if -1 as an invalid value was clear enough and ended up using -10. 
It could probably just be -1. I'll check.

 Can we just add the new test cases into the existing (tiny) 
 TestMultiWordSynonyms.java?

Probably. Since all tests in the new file use the new constructor to force the 
new behavior, and TestMultiWordSynonyms tests the old behavior, I didn'tt want 
to mix things. But I'll just join them, with a comment on the test to make it 
clear its the old behavior.

 In the case where a given input token matches no rules in the FST, are
 we still able to pass that through to the output without buffering
 (calling capture())?

I didn't test this specifically, but would think yes. The new behavior only 
handles matches differently, I didn't have to do any extra buffering before the 
match, nor did I change anything on the matching part of the code. The extra 
buffering is needed only when there was already buffering (lookahead for a 
partial match) and the synonym was longer than the match.

 SynonymFilter should generate a correct (or, at least, better) graph
 

 Key: LUCENE-6582
 URL: https://issues.apache.org/jira/browse/LUCENE-6582
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Ian Ribas
 Attachments: LUCENE-6582.patch, after.png, after2.png, before.png


 Some time ago, I had a problem with synonyms and phrase type queries 
 (actually, it was elasticsearch and I was using a match query with multiple 
 terms and the and operator, as better explained here: 
 https://github.com/elastic/elasticsearch/issues/10394).
 That issue led to some work on Lucene: LUCENE-6400 (where I helped a little 
 with tests) and  LUCENE-6401. This issue is also related to LUCENE-3843.
 Starting from the discussion on LUCENE-6400, I'm attempting to implement a 
 solution. Here is a patch with a first step - the implementation to fix 
 SynFilter to be able to 'make positions' (as was mentioned on the 
 [issue|https://issues.apache.org/jira/browse/LUCENE-6400?focusedCommentId=14498554page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14498554]).
  In this way, the synonym filter generates a correct (or, at least, better) 
 graph.
 As the synonym matching is greedy, I only had to worry about fixing the 
 position length of the rules of the current match, no future or past synonyms 
 would span over this match (please correct me if I'm wrong!). It did 
 require more buffering, twice as much.
 The new behavior I added is not active by default, a new parameter has to be 
 passed in a new constructor for {{SynonymFilter}}. The changes I made do 
 change the token stream generated by the synonym filter, and I thought it 
 would be better to let that be a voluntary decision for now.
 I did some refactoring on the code, but mostly on what I had to change for 
 may implementation, so that the patch was not too hard to read. I created 
 specific unit tests for the new implementation 
 ({{TestMultiWordSynonymFilter}}) that should show how things will be with the 
 new behavior.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6582) SynonymFilter should generate a correct (or, at least, better) graph

2015-06-18 Thread Ian Ribas (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14592848#comment-14592848
 ] 

Ian Ribas commented on LUCENE-6582:
---

This comment was written early in my attempts and I'm afraid I didn't review it 
later. In the final solution, it is much clearer that the longer terms are not 
individual synonyms of the shorter terms, specially since the longest version 
is what appears first on the output. I'll have to change the comment, and take 
away the TODO (it was a marker for things to remember to ask when submitting).

I don't think things will get sausaged, but I'll look into 
TokenStreamToAutomaton to see if I can understand it enough to use it.

What might still happen is that a phrase u s of america would match, as is 
implied on the second question on the comment above, which is a lesser problem 
but still kind of wrong. More on that below.

 SynonymFilter should generate a correct (or, at least, better) graph
 

 Key: LUCENE-6582
 URL: https://issues.apache.org/jira/browse/LUCENE-6582
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Ian Ribas
 Attachments: LUCENE-6582.patch, after.png, before.png


 Some time ago, I had a problem with synonyms and phrase type queries 
 (actually, it was elasticsearch and I was using a match query with multiple 
 terms and the and operator, as better explained here: 
 https://github.com/elastic/elasticsearch/issues/10394).
 That issue led to some work on Lucene: LUCENE-6400 (where I helped a little 
 with tests) and  LUCENE-6401. This issue is also related to LUCENE-3843.
 Starting from the discussion on LUCENE-6400, I'm attempting to implement a 
 solution. Here is a patch with a first step - the implementation to fix 
 SynFilter to be able to 'make positions' (as was mentioned on the 
 [issue|https://issues.apache.org/jira/browse/LUCENE-6400?focusedCommentId=14498554page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14498554]).
  In this way, the synonym filter generates a correct (or, at least, better) 
 graph.
 As the synonym matching is greedy, I only had to worry about fixing the 
 position length of the rules of the current match, no future or past synonyms 
 would span over this match (please correct me if I'm wrong!). It did 
 require more buffering, twice as much.
 The new behavior I added is not active by default, a new parameter has to be 
 passed in a new constructor for {{SynonymFilter}}. The changes I made do 
 change the token stream generated by the synonym filter, and I thought it 
 would be better to let that be a voluntary decision for now.
 I did some refactoring on the code, but mostly on what I had to change for 
 may implementation, so that the patch was not too hard to read. I created 
 specific unit tests for the new implementation 
 ({{TestMultiWordSynonymFilter}}) that should show how things will be with the 
 new behavior.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6582) SynonymFilter should generate a correct (or, at least, better) graph

2015-06-18 Thread Ian Ribas (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14592849#comment-14592849
 ] 

Ian Ribas commented on LUCENE-6582:
---

This comment was written early in my attempts and I'm afraid I didn't review it 
later. In the final solution, it is much clearer that the longer terms are not 
individual synonyms of the shorter terms, specially since the longest version 
is what appears first on the output. I'll have to change the comment, and take 
away the TODO (it was a marker for things to remember to ask when submitting).

I don't think things will get sausaged, but I'll look into 
TokenStreamToAutomaton to see if I can understand it enough to use it.

What might still happen is that a phrase u s of america would match, as is 
implied on the second question on the comment above, which is a lesser problem 
but still kind of wrong. More on that below.

 SynonymFilter should generate a correct (or, at least, better) graph
 

 Key: LUCENE-6582
 URL: https://issues.apache.org/jira/browse/LUCENE-6582
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Ian Ribas
 Attachments: LUCENE-6582.patch, after.png, before.png


 Some time ago, I had a problem with synonyms and phrase type queries 
 (actually, it was elasticsearch and I was using a match query with multiple 
 terms and the and operator, as better explained here: 
 https://github.com/elastic/elasticsearch/issues/10394).
 That issue led to some work on Lucene: LUCENE-6400 (where I helped a little 
 with tests) and  LUCENE-6401. This issue is also related to LUCENE-3843.
 Starting from the discussion on LUCENE-6400, I'm attempting to implement a 
 solution. Here is a patch with a first step - the implementation to fix 
 SynFilter to be able to 'make positions' (as was mentioned on the 
 [issue|https://issues.apache.org/jira/browse/LUCENE-6400?focusedCommentId=14498554page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14498554]).
  In this way, the synonym filter generates a correct (or, at least, better) 
 graph.
 As the synonym matching is greedy, I only had to worry about fixing the 
 position length of the rules of the current match, no future or past synonyms 
 would span over this match (please correct me if I'm wrong!). It did 
 require more buffering, twice as much.
 The new behavior I added is not active by default, a new parameter has to be 
 passed in a new constructor for {{SynonymFilter}}. The changes I made do 
 change the token stream generated by the synonym filter, and I thought it 
 would be better to let that be a voluntary decision for now.
 I did some refactoring on the code, but mostly on what I had to change for 
 may implementation, so that the patch was not too hard to read. I created 
 specific unit tests for the new implementation 
 ({{TestMultiWordSynonymFilter}}) that should show how things will be with the 
 new behavior.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6582) SynonymFilter should generate a correct (or, at least, better) graph

2015-06-18 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14592430#comment-14592430
 ] 

Michael McCandless commented on LUCENE-6582:


I will dig more into this patch, it is a nice (big!) change, but this TODO 
caught my eye:

{noformat}
// TODO: Problems: In the substitution below, how to identify that the 
terms united states of america
// are actually a phrase and not individual synonyms of usa? And how to 
differentiate that phrase from the
// phrase u s a? We can do that adding position lengths ...
{noformat}

One approach could be to use TokenStreamToAutomaton, then enumerate all finite 
strings from the resulting automaton, and assert it's as expected?  I.e. that 
things did not get unexpectedly sausaged?

 SynonymFilter should generate a correct (or, at least, better) graph
 

 Key: LUCENE-6582
 URL: https://issues.apache.org/jira/browse/LUCENE-6582
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Ian Ribas
 Attachments: LUCENE-6582.patch


 Some time ago, I had a problem with synonyms and phrase type queries 
 (actually, it was elasticsearch and I was using a match query with multiple 
 terms and the and operator, as better explained here: 
 https://github.com/elastic/elasticsearch/issues/10394).
 That issue led to some work on Lucene: LUCENE-6400 (where I helped a little 
 with tests) and  LUCENE-6401. This issue is also related to LUCENE-3843.
 Starting from the discussion on LUCENE-6400, I'm attempting to implement a 
 solution. Here is a patch with a first step - the implementation to fix 
 SynFilter to be able to 'make positions' (as was mentioned on the 
 [issue|https://issues.apache.org/jira/browse/LUCENE-6400?focusedCommentId=14498554page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14498554]).
  In this way, the synonym filter generates a correct (or, at least, better) 
 graph.
 As the synonym matching is greedy, I only had to worry about fixing the 
 position length of the rules of the current match, no future or past synonyms 
 would span over this match (please correct me if I'm wrong!). It did 
 require more buffering, twice as much.
 The new behavior I added is not active by default, a new parameter has to be 
 passed in a new constructor for {{SynonymFilter}}. The changes I made do 
 change the token stream generated by the synonym filter, and I thought it 
 would be better to let that be a voluntary decision for now.
 I did some refactoring on the code, but mostly on what I had to change for 
 may implementation, so that the patch was not too hard to read. I created 
 specific unit tests for the new implementation 
 ({{TestMultiWordSynonymFilter}}) that should show how things will be with the 
 new behavior.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6582) SynonymFilter should generate a correct (or, at least, better) graph

2015-06-18 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14592501#comment-14592501
 ] 

Michael McCandless commented on LUCENE-6582:


Just to confirm the scope of this change, if I have these synonyms:

{noformat}
  wtf -- what the fudge
  wtf -- wow that's funny
{noformat}

And then I'm tokenizing this:

{noformat}
  wtf happened
{noformat}

Before this change (today) I get this crazy sausage incorrectly
matching phrases like wtf the fudge and wow happened funny:

!before.png!

But after this change, the expanded synonyms become separate paths in
the graph right?  So it will look like this?:

!after.png!

Matching exactly the right phrases?  Note that absolute position
numbers become somewhat meaningless now ... they are really node
IDs.

Some small things I noticed:

{noformat}
   originalInputIdx = -10;
{noformat}

Why -10?

Can we just add the new test cases into the existing (tiny)
TestMultiWordSynonyms.java?

Related: LUCENE-5012 is an effort to make such graph consuming and
producing token streams easier ... but that's a major change to the
TokenStream API.

In the case where a given input token matches no rules in the FST, are
we still able to pass that through to the output without buffering
(calling capture())?


 SynonymFilter should generate a correct (or, at least, better) graph
 

 Key: LUCENE-6582
 URL: https://issues.apache.org/jira/browse/LUCENE-6582
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Ian Ribas
 Attachments: LUCENE-6582.patch, after.png, before.png


 Some time ago, I had a problem with synonyms and phrase type queries 
 (actually, it was elasticsearch and I was using a match query with multiple 
 terms and the and operator, as better explained here: 
 https://github.com/elastic/elasticsearch/issues/10394).
 That issue led to some work on Lucene: LUCENE-6400 (where I helped a little 
 with tests) and  LUCENE-6401. This issue is also related to LUCENE-3843.
 Starting from the discussion on LUCENE-6400, I'm attempting to implement a 
 solution. Here is a patch with a first step - the implementation to fix 
 SynFilter to be able to 'make positions' (as was mentioned on the 
 [issue|https://issues.apache.org/jira/browse/LUCENE-6400?focusedCommentId=14498554page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14498554]).
  In this way, the synonym filter generates a correct (or, at least, better) 
 graph.
 As the synonym matching is greedy, I only had to worry about fixing the 
 position length of the rules of the current match, no future or past synonyms 
 would span over this match (please correct me if I'm wrong!). It did 
 require more buffering, twice as much.
 The new behavior I added is not active by default, a new parameter has to be 
 passed in a new constructor for {{SynonymFilter}}. The changes I made do 
 change the token stream generated by the synonym filter, and I thought it 
 would be better to let that be a voluntary decision for now.
 I did some refactoring on the code, but mostly on what I had to change for 
 may implementation, so that the patch was not too hard to read. I created 
 specific unit tests for the new implementation 
 ({{TestMultiWordSynonymFilter}}) that should show how things will be with the 
 new behavior.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org