[jira] [Commented] (LUCENE-5012) Make graph-based TokenFilters easier

2017-12-17 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16294139#comment-16294139
 ] 

Michael McCandless commented on LUCENE-5012:


This issue should make it easier to fix the bug you're seeing, but we can also 
fix the bug (in {{ShingleFilter}} I'm guessing?) before doing this more 
ambitious change.

It sounds like {{ShingleFilter}} is not looking at {{PositionLengthAttribute}}?
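
For context, here is a minimal sketch of how PosInc/PosLen describe a token graph, and why a consumer that ignores position length goes wrong. It uses hypothetical {{Tok}} and {{Arc}} records rather than Lucene's attribute classes, and the token values for "wi-fi network" are assumed for illustration, not captured from a real analyzer (Java 16+ for records).

```java
import java.util.ArrayList;
import java.util.List;

final class PosLenGraphSketch {
    // Hypothetical stand-in for a token's term, position increment and position length.
    record Tok(String term, int posInc, int posLen) {}

    // An arc in the token graph: the term leaves node `from` and arrives at node `to`.
    record Arc(String term, int from, int to) {}

    // Decode posInc/posLen into arcs: the start node advances by posInc,
    // and the end node is start + posLen.
    static List<Arc> toArcs(List<Tok> tokens) {
        List<Arc> arcs = new ArrayList<>();
        int pos = -1; // so that a first posInc of 1 lands on node 0
        for (Tok t : tokens) {
            pos += t.posInc();
            arcs.add(new Arc(t.term(), pos, pos + t.posLen()));
        }
        return arcs;
    }

    public static void main(String[] args) {
        // Assumed output of a split + catenate step for "wi-fi network".
        List<Tok> tokens = List.of(
            new Tok("wifi", 1, 2),
            new Tok("wi", 0, 1),
            new Tok("fi", 1, 1),
            new Tok("network", 1, 1));
        toArcs(tokens).forEach(System.out::println);
    }
}
```

Here {{wifi}} spans nodes 0 to 2, so only {{network}} may follow it. A consumer that assumes every token has position length 1 would place the end of {{wifi}} at node 1 and could pair it with {{fi}}.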

> Make graph-based TokenFilters easier
> 
>
> Key: LUCENE-5012
> URL: https://issues.apache.org/jira/browse/LUCENE-5012
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Attachments: LUCENE-5012.patch, LUCENE-5012.patch
>
>
> SynonymFilter has two limitations today:
>   * It cannot create positions, so eg dns -> domain name service
> creates blatantly wrong highlights (SOLR-3390, LUCENE-4499 and
> others).
>   * It cannot consume a graph, so e.g. if you try to apply synonyms
> after Kuromoji tokenizer I'm not sure what will happen.
> I've thought about how to fix these issues but it's really quite
> difficult with the current PosInc/PosLen graph representation, so I'd
> like to explore an alternative approach.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-5012) Make graph-based TokenFilters easier

2017-11-28 Thread Ryan Pedela (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16269235#comment-16269235
 ] 

Ryan Pedela commented on LUCENE-5012:
-

I just have a question. I am using the word delimiter graph filter, set to both 
split and catenate, followed by a shingle filter set to bigrams. I am getting 
what appear to be incorrect results. Is that expected, and is the aim of this 
issue to fix that?




[jira] [Commented] (LUCENE-5012) Make graph-based TokenFilters easier

2017-01-19 Thread Matt Weber (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15830136#comment-15830136
 ] 

Matt Weber commented on LUCENE-5012:


Thanks [~mikemccand], there were a lot of additional changes!  I am going to 
start getting familiar with this and hopefully will be able to help move it 
forward as I get time.




[jira] [Commented] (LUCENE-5012) Make graph-based TokenFilters easier

2017-01-18 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15828939#comment-15828939
 ] 

Michael McCandless commented on LUCENE-5012:


[~mattweber], I realized I had more private changes that I never pushed to that 
old branch, so I recovered them, fixed to apply to current master, and pushed 
here: https://github.com/mikemccand/lucene-solr/commits/graph_token_filters

I also removed the controversial {{InsertDeletedPunctuationStage}}.

Some tests are still failing ... I'll try to fix them.

I think the ideas here are very promising.  The write-once attributes 
(LUCENE-2450, folded into this branch) are cleaner than what Lucene has today, 
and the ease of making new positions without having to re-number previous ones 
makes graph token streams much easier.
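
To illustrate why new positions are easy when nothing has to be re-numbered, here is a minimal sketch. It assumes a representation in which each token is an arc between two integer node IDs; the {{NodeArcSketch}} class and its methods are hypothetical and are not taken from the branch (Java 16+ for records).

```java
import java.util.ArrayList;
import java.util.List;

final class NodeArcSketch {
    // A token modelled as an arc between two graph nodes (hypothetical).
    record Arc(String term, int from, int to) {}

    private final List<Arc> arcs = new ArrayList<>();
    private int nextNode = 0;

    // Allocate a fresh node ID; existing IDs are never changed.
    int newNode() { return nextNode++; }

    void addArc(String term, int from, int to) { arcs.add(new Arc(term, from, to)); }

    // Add a multi-token alternative between two existing nodes,
    // allocating new nodes only for the interior positions.
    void addPath(int from, int to, String... terms) {
        int prev = from;
        for (int i = 0; i < terms.length; i++) {
            int next = (i == terms.length - 1) ? to : newNode();
            addArc(terms[i], prev, next);
            prev = next;
        }
    }

    List<Arc> arcs() { return List.copyOf(arcs); }

    public static void main(String[] args) {
        NodeArcSketch g = new NodeArcSketch();
        int n0 = g.newNode(), n1 = g.newNode(), n2 = g.newNode();
        g.addArc("dns", n0, n1);
        g.addArc("lookup", n1, n2);
        // The synonym path gets two new interior nodes; "lookup" is untouched.
        g.addPath(n0, n1, "domain", "name", "service");
        g.arcs().forEach(System.out::println);
    }
}
```

With PosInc/PosLen, inserting a three-token synonym for {{dns}} would shift the position of every later token; in this sketch the arc for {{lookup}} keeps its original node IDs.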

I tried to add the equivalent of {{CharFilter}} here, by using a new 
{{TextAttribute}} that stages before tokenization can use to read from a 
{{Reader}} or a {{String}}, and remap; I like that this makes offset correction 
more local than what the {{correctOffset}} exposes today.  And it means char 
filtering is simply another stage, not a separate class.

I also added {{int[] parts}} to {{OffsetAttribute}}; the idea here is to 
empower token filters (not just tokenizers) to properly correct offsets, so 
that e.g. WDGF could work "correctly", but I'm not sure it's worth the hassle: 
I haven't fully implemented it, and doing so is surprisingly tricky.
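
As a rough illustration of local offset correction, the sketch below models a char-filtering step that deletes characters and records, for each output offset, the matching offset in the original text. The {{Remapped}} record and {{strip}} method are hypothetical names; this is not the {{TextAttribute}} or {{int[] parts}} design from the branch, only the general idea of carrying an offset map alongside the text (Java 16+ for records).

```java
import java.util.Arrays;

final class OffsetRemapSketch {
    // Filtered text plus a map from each output offset to its original offset.
    // origOffsets has text.length() + 1 entries so end offsets can be mapped too.
    record Remapped(String text, int[] origOffsets) {
        int correct(int offset) { return origOffsets[offset]; }
    }

    // Delete every occurrence of `toDelete`, remembering where each kept char came from.
    static Remapped strip(String input, char toDelete) {
        StringBuilder out = new StringBuilder();
        int[] map = new int[input.length() + 1];
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c != toDelete) {
                map[out.length()] = i;
                out.append(c);
            }
        }
        map[out.length()] = input.length(); // end-of-text maps to end-of-text
        return new Remapped(out.toString(), Arrays.copyOf(map, out.length() + 1));
    }

    public static void main(String[] args) {
        Remapped r = strip("wi-fi net", '-');
        // "net" sits at [5, 8) in the filtered text; map it back to the original.
        System.out.println(r.text() + " -> net at [" + r.correct(5) + ", " + r.correct(8) + ")");
    }
}
```

For the input "wi-fi net", the token "net" at filtered offsets 5 to 8 maps back to original offsets 6 to 9. A later stage only needs the map it was handed, not a callback into the reader chain.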




[jira] [Commented] (LUCENE-5012) Make graph-based TokenFilters easier

2017-01-18 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15827939#comment-15827939
 ] 

Michael McCandless commented on LUCENE-5012:


Wow, thanks for modernizing the patch [~mattweber]; I'll push this to a branch 
on my github account for easier iterating...




[jira] [Commented] (LUCENE-5012) Make graph-based TokenFilters easier

2017-01-17 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15826675#comment-15826675
 ] 

Michael McCandless commented on LUCENE-5012:


I think we really should advance this: the synonym filter here already handles 
incoming graphs correctly, and the token stream API improvements here make it 
much easier to consume and create graph token streams.  I think this will only 
get more important with time, e.g. the new {{WordDelimiterGraphFilter}} now 
creates correct graphs, but if you want to run the current 
{{SynonymGraphFilter}} after it, it won't work.

That said, there is still a lot of work to make this committable.  I think we 
need to find a way to make the stage-based analysis components interchangeable 
with the current API, to give us the freedom to gradually cut over the many 
tokenizers and token filters Lucene has today.




[jira] [Commented] (LUCENE-5012) Make graph-based TokenFilters easier

2017-01-17 Thread Matt Weber (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15826257#comment-15826257
 ] 

Matt Weber commented on LUCENE-5012:


[~mikemccand] So I was looking into supporting incoming graphs in 
{{SynonymGraphFilter}} and found this when you mentioned it in LUCENE-7638.  
What do you think the state of this patch is?  Would it be best to look into 
advancing this instead of just {{SynonymGraphFilter}} itself?




[jira] [Commented] (LUCENE-5012) Make graph-based TokenFilters easier

2016-11-16 Thread David Smiley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15670617#comment-15670617
 ] 

David Smiley commented on LUCENE-5012:
--

Seems very promising. Is LUCENE-2450 a dependency of this issue?  There's no 
dependency JIRA issue link, but the first comment suggests it is.




[jira] [Commented] (LUCENE-5012) Make graph-based TokenFilters easier

2015-08-06 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14660091#comment-14660091
 ] 

ASF subversion and git services commented on LUCENE-5012:
-

Commit 1694511 from [~mikemccand] in branch 'dev/branches/lucene5012'
[ https://svn.apache.org/r1694511 ]

LUCENE-5012: don't separate interface from impl for attributes




[jira] [Commented] (LUCENE-5012) Make graph-based TokenFilters easier

2014-05-25 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14008345#comment-14008345
 ] 

ASF subversion and git services commented on LUCENE-5012:
-

Commit 1597427 from [~mikemccand] in branch 'dev/branches/lucene5012'
[ https://svn.apache.org/r1597427 ]

LUCENE-5012: get tests passing again




[jira] [Commented] (LUCENE-5012) Make graph-based TokenFilters easier

2014-05-23 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14007257#comment-14007257
 ] 

ASF subversion and git services commented on LUCENE-5012:
-

Commit 1597118 from [~mikemccand] in branch 'dev/branches/lucene5012'
[ https://svn.apache.org/r1597118 ]

LUCENE-5012: merge trunk, but some tests are failing




[jira] [Commented] (LUCENE-5012) Make graph-based TokenFilters easier

2013-07-01 Thread Artem Lukanin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13696626#comment-13696626
 ] 

Artem Lukanin commented on LUCENE-5012:
---

I guess WordDelimiterFilter is a good candidate for transforming into a 
graph-based filter (see LUCENE-5051).


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-5012) Make graph-based TokenFilters easier

2013-05-26 Thread Commit Tag Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13667418#comment-13667418
 ] 

Commit Tag Bot commented on LUCENE-5012:


[lucene5012 commit] mikemccand
http://svn.apache.org/viewvc?view=revision&revision=1486483

LUCENE-5012: add CharFilter, fix some bugs with SynFilter, add new 
InsertDeletedPunctuationStage




[jira] [Commented] (LUCENE-5012) Make graph-based TokenFilters easier

2013-05-26 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13667419#comment-13667419
 ] 

Michael McCandless commented on LUCENE-5012:


I committed some changes:

  * Got charFilter working

  * Fixed a few bugs in SynFilterStage not clearing its state on end
/ reset

  * Created a new fun stage: InsertDeletedPunctuationStage.  This
stage detects when a tokenizer has skipped over punctuation chars
and inserts a deleted token representing the punctuation, e.g. to
prevent a synonym or phrase query from matching over the
punctuation.  I had previously thought we would need to modify
Tokenizers to do this but now I think maybe this Stage could do it
for any Tokenizer ...
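
The idea in the last bullet can be sketched as follows. This is an illustrative guess at the behaviour, not the stage's actual code: {{Tok}} and {{insertDeleted}} are hypothetical, and "punctuation" is assumed to mean any character that is neither a letter, a digit, nor whitespace (Java 16+ for records).

```java
import java.util.ArrayList;
import java.util.List;

final class DeletedPunctuationSketch {
    // Hypothetical token: term text, start/end offsets, and a "deleted" marker.
    record Tok(String term, int start, int end, boolean deleted) {}

    // True if text[from, to) contains at least one punctuation character.
    static boolean hasPunctuation(String text, int from, int to) {
        for (int i = from; i < to; i++) {
            char c = text.charAt(i);
            if (!Character.isLetterOrDigit(c) && !Character.isWhitespace(c)) return true;
        }
        return false;
    }

    // Look at the gap the tokenizer skipped before each token; if the gap
    // holds punctuation, emit a deleted token covering it.
    // Punctuation after the last token is ignored in this sketch.
    static List<Tok> insertDeleted(String text, List<Tok> tokens) {
        List<Tok> out = new ArrayList<>();
        int prevEnd = 0;
        for (Tok t : tokens) {
            if (hasPunctuation(text, prevEnd, t.start())) {
                out.add(new Tok(text.substring(prevEnd, t.start()), prevEnd, t.start(), true));
            }
            out.add(t);
            prevEnd = t.end();
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "foo. bar baz";
        List<Tok> tokens = List.of(
            new Tok("foo", 0, 3, false),
            new Tok("bar", 5, 8, false),
            new Tok("baz", 9, 12, false));
        insertDeleted(text, tokens).forEach(System.out::println);
    }
}
```

For "foo. bar baz" a deleted token covering offsets 3 to 5 lands between {{foo}} and {{bar}}, so a phrase or synonym match for "foo bar" would no longer see the two terms as adjacent, while "bar baz" is unaffected.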





[jira] [Commented] (LUCENE-5012) Make graph-based TokenFilters easier

2013-05-24 Thread Artem Lukanin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13666519#comment-13666519
 ] 

Artem Lukanin commented on LUCENE-5012:
---

Great! I was asking several people about it at Lucene/Solr Revolution 2013.
Just don't forget that if you set your default operator to AND, you should 
still use ORs for synonyms. I was trying to solve this problem partly in 
https://issues.apache.org/jira/browse/SOLR-4533. I'm not sure how the 
combination of ORs and ANDs is done in an FSA if you are going to use 
WordAutomaton right away.




[jira] [Commented] (LUCENE-5012) Make graph-based TokenFilters easier

2013-05-21 Thread Jack Krupansky (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13663340#comment-13663340
 ] 

Jack Krupansky commented on LUCENE-5012:


Will this Jira include some test code that query parsers can use so that they 
can retrieve the graph for a stream containing multiple multi-term synonyms so 
that they can then individually sausage the term sequences as well as generate 
OR operators for string of sausages?





[jira] [Commented] (LUCENE-5012) Make graph-based TokenFilters easier

2013-05-21 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13663402#comment-13663402
 ] 

Michael McCandless commented on LUCENE-5012:


bq. Will this Jira include some test code that query parsers can use so that 
they can retrieve the graph for a stream containing multiple multi-term 
synonyms so that they can then individually sausage the term sequences as well 
as generate OR operators for string of sausages?

I think so ... the test case (TestStages) sort of does that already: it turns 
the tokens into an automaton, to check that the automaton accepts the specified 
sequence of tokens (and ONLY those sequences of tokens).

But, I think ideally we'd have a WordAutomatonQuery, which could take the 
Automaton directly and match documents using that, instead of enumerating all 
phrases and OR'ing them (which would be equivalent but presumably slower...).  
I would do WordAutomatonQuery on a separate issue ...
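
The "enumerate all phrases and OR them" alternative can be sketched in a few lines. The {{Arc}} record and {{phrases}} method are hypothetical, the graph is assumed to be acyclic with a single start and end node, and the output is plain strings rather than Lucene query objects (Java 16+ for records).

```java
import java.util.ArrayList;
import java.util.List;

final class PhraseEnumerationSketch {
    // A token modelled as an arc between two graph nodes (hypothetical).
    record Arc(String term, int from, int to) {}

    // Collect every term sequence that leads from `start` to `end`.
    static List<String> phrases(List<Arc> arcs, int start, int end) {
        List<String> out = new ArrayList<>();
        walk(arcs, start, end, "", out);
        return out;
    }

    // Depth-first walk; assumes the graph has no cycles.
    private static void walk(List<Arc> arcs, int node, int end, String prefix, List<String> out) {
        if (node == end) {
            out.add(prefix);
            return;
        }
        for (Arc a : arcs) {
            if (a.from() == node) {
                String next = prefix.isEmpty() ? a.term() : prefix + " " + a.term();
                walk(arcs, a.to(), end, next, out);
            }
        }
    }

    // Join the phrases into a single OR expression, one quoted phrase per path.
    static String orQuery(List<String> phrases) {
        List<String> quoted = new ArrayList<>();
        for (String p : phrases) quoted.add("\"" + p + "\"");
        return String.join(" OR ", quoted);
    }

    public static void main(String[] args) {
        List<Arc> arcs = List.of(
            new Arc("dns", 0, 1),
            new Arc("domain", 0, 2),
            new Arc("name", 2, 3),
            new Arc("service", 3, 1),
            new Arc("lookup", 1, 4));
        System.out.println(orQuery(phrases(arcs, 0, 4)));
    }
}
```

The number of paths can grow multiplicatively with each synonym alternative in the stream, which is one reason matching against the automaton directly may be preferable.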




[jira] [Commented] (LUCENE-5012) Make graph-based TokenFilters easier

2013-05-21 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13663416#comment-13663416
 ] 

Adrien Grand commented on LUCENE-5012:
--

This looks very promising! I've been looking at a few TokenFilters recently and 
anything that would make working with graphs easier is very welcome! Maybe we 
should create a branch to make it easier to collaborate and to track 
incremental updates?




[jira] [Commented] (LUCENE-5012) Make graph-based TokenFilters easier

2013-05-21 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13663433#comment-13663433
 ] 

Michael McCandless commented on LUCENE-5012:


Good idea Adrien! I'll cut a branch and commit the patch ...





[jira] [Commented] (LUCENE-5012) Make graph-based TokenFilters easier

2013-05-21 Thread Commit Tag Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13663434#comment-13663434
 ] 

Commit Tag Bot commented on LUCENE-5012:


[lucene5012 commit] mikemccand
http://svn.apache.org/viewvc?view=revision&revision=1484977

LUCENE-5012: create branch




[jira] [Commented] (LUCENE-5012) Make graph-based TokenFilters easier

2013-05-21 Thread Commit Tag Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13663444#comment-13663444
 ] 

Commit Tag Bot commented on LUCENE-5012:


[lucene5012 commit] mikemccand
http://svn.apache.org/viewvc?view=revision&revision=1484980

LUCENE-5012: initial prototype




[jira] [Commented] (LUCENE-5012) Make graph-based TokenFilters easier

2013-05-21 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13663445#comment-13663445
 ] 

Michael McCandless commented on LUCENE-5012:


OK I committed the initial patch to this branch: 
https://svn.apache.org/repos/asf/lucene/dev/branches/lucene5012




[jira] [Commented] (LUCENE-5012) Make graph-based TokenFilters easier

2013-05-21 Thread Jack Krupansky (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13663471#comment-13663471
 ] 

Jack Krupansky commented on LUCENE-5012:


bq. WordAutomatonQuery

Sounds quite promising.

Back to the query parsers... So, they would present a term or quoted string - 
and eventually hopefully a sequence of terms if the query parser sees that 
there is only white space between them (an issue Robert filed long ago) - and 
invoke analysis. Then what? Sometimes a single term or a clean sausage string 
comes out and a TermQuery or simple BooleanQuery or PhraseQuery needs to be 
generated, but if synonym-like filtering has generated a graph, then the query 
parser would hand it directly to WordAutomatonQuery, if I understand 
correctly. Then the question becomes how to tell that a WordAutomatonQuery 
graph is needed - unless WordAutomatonQuery automatically detects the cases for 
TermQuery and BooleanQuery/PhraseQuery as internal optimizations. (Well, I 
don't expect that WordAutomatonQuery would know how to do BooleanQuery vs. 
PhraseQuery, unless it has a phrase flag.)

In short, it would be nice if this issue directly or at least partially 
produced enough logic for that Term vs. Boolean vs. Phrase vs. WordAutomaton 
Query generation. Either to actually generate the final query, or at least some 
example code that documents the design pattern that a query parser needs for 
consumption of a query phrase graph.

In other words, the query parsers should not simply loop over next for the 
entire output of query term analysis. A new design pattern is needed.
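
To make the Term vs. Boolean vs. Phrase vs. WordAutomaton decision concrete, 
here is a rough sketch of what such a pattern might look like. It is plain 
Java with no Lucene dependency, and every name in it (QueryShapeSketch, Tok, 
Shape, classify, the quoted flag) is made up for illustration; none of it is 
an existing Lucene or query parser API. It assumes the parser buffers the 
whole analyzed output first, and it treats any token with a position length 
greater than 1 as the sign that a graph came out. Stacked single-position 
synonyms are ignored here to keep the sketch short.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch only: these types are not Lucene classes.
final class QueryShapeSketch {

    /** The four query types discussed in this comment. */
    enum Shape { TERM, BOOLEAN, PHRASE, WORD_AUTOMATON }

    /** One analyzed token: its text, position increment and position length. */
    static final class Tok {
        final String text;
        final int posInc;
        final int posLen;

        Tok(String text, int posInc, int posLen) {
            this.text = text;
            this.posInc = posInc;
            this.posLen = posLen;
        }
    }

    /**
     * Picks a query shape from the fully buffered analysis output.
     * quoted plays the role of the "phrase flag": true when the user
     * wrote a quoted string.
     */
    static Shape classify(List<Tok> toks, boolean quoted) {
        if (toks.isEmpty()) {
            throw new IllegalArgumentException("analysis produced no tokens");
        }
        if (toks.size() == 1) {
            return Shape.TERM;
        }
        for (Tok t : toks) {
            // A token spanning several positions means the output is a
            // graph, not a sausage.
            if (t.posLen > 1) {
                return Shape.WORD_AUTOMATON;
            }
        }
        return quoted ? Shape.PHRASE : Shape.BOOLEAN;
    }

    public static void main(String[] args) {
        List<Tok> graph = Arrays.asList(
            new Tok("dns", 1, 3),
            new Tok("domain", 0, 1),
            new Tok("name", 1, 1),
            new Tok("service", 1, 1));
        System.out.println(classify(graph, true));
    }
}
```

The point of the sketch is only that the decision needs the whole token 
sequence, including position lengths, before any query object is built.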

Also, at index time, the output of analysis is consumed as a single sausage 
stream, using next and token position increment, but any multiple multi-word 
synonyms would traditionally get somewhat mangled. There may not be a clean 
solution for the current index term posting format, but at a minimum we should 
reconsider how the output of index-time term analysis is consumed and flag 
potential improvements for the future for posting of multiple multi-term 
phrases at the same token position.
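
As a small illustration of the mangling, the sketch below turns position 
increments into absolute positions the way a sausage consumer would: start 
before the first position and add each increment. The class and method names 
are hypothetical, and the token stream in main is an assumed example (the 
text "dns is up" with dns expanded to "domain name service" and the synonym 
words stacked onto the following input tokens), not captured SynonymFilter 
output.

```java
// Hypothetical sketch only: not a Lucene class.
final class SausagePositionsSketch {

    /** Converts position increments into absolute positions. */
    static int[] positions(int[] posIncs) {
        int[] out = new int[posIncs.length];
        int pos = -1; // the first token with increment 1 lands on position 0
        for (int i = 0; i < posIncs.length; i++) {
            pos += posIncs[i];
            out[i] = pos;
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed stream: dns, domain, is, name, up, service
        String[] terms = {"dns", "domain", "is", "name", "up", "service"};
        int[] posIncs = {1, 0, 1, 0, 1, 0};
        int[] pos = positions(posIncs);
        for (int i = 0; i < terms.length; i++) {
            System.out.println(terms[i] + " @ " + pos[i]);
        }
    }
}
```

In this assumed stream "name" shares a position with "is" and "service" 
shares one with "up", so the index can no longer tell that "domain name 
service" as a whole was an alternative to "dns" alone.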

In any case, thanks for moving the multiple multi-term synonym ball forward!

