[jira] [Commented] (LUCENE-8332) New ConcatenateGraphTokenStream (move/rename CompletionTokenStream)

2018-06-04 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16501239#comment-16501239
 ] 

ASF subversion and git services commented on LUCENE-8332:
-

Commit 9b61121ffb65d59f49429aba99b1c1b641ddb3c6 in lucene-solr's branch 
refs/heads/branch_7x from [~dsmiley]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=9b61121 ]

LUCENE-8332: New ConcatenateGraphFilter (from CompletionTokenStream).
* Added a test for FingerprintFilter and clarified FF's end condition.

(cherry picked from commit f9f5e83)


> New ConcatenateGraphTokenStream (move/rename CompletionTokenStream)
> ---
>
> Key: LUCENE-8332
> URL: https://issues.apache.org/jira/browse/LUCENE-8332
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Reporter: David Smiley
>Assignee: David Smiley
>Priority: Major
> Attachments: LUCENE-8332.patch, LUCENE-8332.patch, LUCENE-8332.patch
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> Lets move and rename the CompletionTokenStream in the suggest module into the 
> analysis module renamed as ConcatenateGraphTokenStream. See comments in 
> LUCENE-8323 leading to this idea. Such a TokenStream (or TokenFilter?) has 
> several uses:
>  * for the suggest module
>  * by the SolrTextTagger for NER/ERD use cases – SOLR-12376
>  * for doing complete match search efficiently
> It will need a factory – a TokenFilterFactory, even though we don't have a 
> TokenFilter based subclass of TokenStream.
> It appears there is no back-compat concern in it suddenly disappearing from 
> the suggest module as it's marked experimental and it only seems to be public 
> now perhaps due to some technicality (it has package level constructors).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8332) New ConcatenateGraphTokenStream (move/rename CompletionTokenStream)

2018-06-04 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16501229#comment-16501229
 ] 

ASF subversion and git services commented on LUCENE-8332:
-

Commit f9f5e837450e082ae7e1a82a0693760af7485a1b in lucene-solr's branch 
refs/heads/master from [~dsmiley]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=f9f5e83 ]

LUCENE-8332: New ConcatenateGraphFilter (from CompletionTokenStream).
* Added a test for FingerprintFilter and clarified FF's end condition.


> New ConcatenateGraphTokenStream (move/rename CompletionTokenStream)
> ---
>
> Key: LUCENE-8332
> URL: https://issues.apache.org/jira/browse/LUCENE-8332
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Reporter: David Smiley
>Assignee: David Smiley
>Priority: Major
> Attachments: LUCENE-8332.patch, LUCENE-8332.patch, LUCENE-8332.patch
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> Lets move and rename the CompletionTokenStream in the suggest module into the 
> analysis module renamed as ConcatenateGraphTokenStream. See comments in 
> LUCENE-8323 leading to this idea. Such a TokenStream (or TokenFilter?) has 
> several uses:
>  * for the suggest module
>  * by the SolrTextTagger for NER/ERD use cases – SOLR-12376
>  * for doing complete match search efficiently
> It will need a factory – a TokenFilterFactory, even though we don't have a 
> TokenFilter based subclass of TokenStream.
> It appears there is no back-compat concern in it suddenly disappearing from 
> the suggest module as it's marked experimental and it only seems to be public 
> now perhaps due to some technicality (it has package level constructors).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8332) New ConcatenateGraphTokenStream (move/rename CompletionTokenStream)

2018-06-04 Thread Steve Rowe (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16500280#comment-16500280
 ] 

Steve Rowe commented on LUCENE-8332:


bq. I hate to bother you Steve Rowe but if you have any tips on diagnosing 
puzzling Yetus build failures then I'd appreciate it.

The patch looks like it was produced by IntelliJ, which appears not to be 
compatible with some git tooling.  Perhaps just regen using {{git diff}}?

> New ConcatenateGraphTokenStream (move/rename CompletionTokenStream)
> ---
>
> Key: LUCENE-8332
> URL: https://issues.apache.org/jira/browse/LUCENE-8332
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Reporter: David Smiley
>Assignee: David Smiley
>Priority: Major
> Attachments: LUCENE-8332.patch, LUCENE-8332.patch, LUCENE-8332.patch
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> Lets move and rename the CompletionTokenStream in the suggest module into the 
> analysis module renamed as ConcatenateGraphTokenStream. See comments in 
> LUCENE-8323 leading to this idea. Such a TokenStream (or TokenFilter?) has 
> several uses:
>  * for the suggest module
>  * by the SolrTextTagger for NER/ERD use cases – SOLR-12376
>  * for doing complete match search efficiently
> It will need a factory – a TokenFilterFactory, even though we don't have a 
> TokenFilter based subclass of TokenStream.
> It appears there is no back-compat concern in it suddenly disappearing from 
> the suggest module as it's marked experimental and it only seems to be public 
> now perhaps due to some technicality (it has package level constructors).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8332) New ConcatenateGraphTokenStream (move/rename CompletionTokenStream)

2018-06-02 Thread David Smiley (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16499180#comment-16499180
 ] 

David Smiley commented on LUCENE-8332:
--

This Yetus/QA report doesn't make sense to me.  The QA build failed, apparently 
due to some compilation issue:
{quote}
[javac] Compiling 35 source files to 
/x1/jenkins/jenkins-slave/workspace/PreCommit-LUCENE-Build/sourcedir/lucene/build/suggest/classes/test
[javac] 
/x1/jenkins/jenkins-slave/workspace/PreCommit-LUCENE-Build/sourcedir/lucene/suggest/src/test/org/apache/lucene/search/suggest/document/CompletionTokenStreamTest.java:34:
 error: class TestConcatenateGraphFilter is public, should be declared in a 
file named TestConcatenateGraphFilter.java
[javac] public class TestConcatenateGraphFilter extends 
BaseTokenStreamTestCase {
[javac]^
{quote}

But the patch *renames* CompletionTokenStreamTest.java to 
TestConcatenateGraphFilter.java (and manually inspecting the patch reflects 
this) so I don't see out this came to be.  I hate to bother you [~steve_rowe] 
but if you have any tips on diagnosing puzzling Yetus build failures then I'd 
appreciate it.

I intend on committing the patch Monday, BTW.

> New ConcatenateGraphTokenStream (move/rename CompletionTokenStream)
> ---
>
> Key: LUCENE-8332
> URL: https://issues.apache.org/jira/browse/LUCENE-8332
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Reporter: David Smiley
>Assignee: David Smiley
>Priority: Major
> Attachments: LUCENE-8332.patch, LUCENE-8332.patch, LUCENE-8332.patch
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> Lets move and rename the CompletionTokenStream in the suggest module into the 
> analysis module renamed as ConcatenateGraphTokenStream. See comments in 
> LUCENE-8323 leading to this idea. Such a TokenStream (or TokenFilter?) has 
> several uses:
>  * for the suggest module
>  * by the SolrTextTagger for NER/ERD use cases – SOLR-12376
>  * for doing complete match search efficiently
> It will need a factory – a TokenFilterFactory, even though we don't have a 
> TokenFilter based subclass of TokenStream.
> It appears there is no back-compat concern in it suddenly disappearing from 
> the suggest module as it's marked experimental and it only seems to be public 
> now perhaps due to some technicality (it has package level constructors).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8332) New ConcatenateGraphTokenStream (move/rename CompletionTokenStream)

2018-06-02 Thread Lucene/Solr QA (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16499167#comment-16499167
 ] 

Lucene/Solr QA commented on LUCENE-8332:


| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 7 new or modified test 
files. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
5s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:red}-1{color} | {color:red} compile {color} | {color:red}  0m 
22s{color} | {color:red} suggest in the patch failed. {color} |
| {color:red}-1{color} | {color:red} javac {color} | {color:red}  0m 22s{color} 
| {color:red} suggest in the patch failed. {color} |
| {color:green}+1{color} | {color:green} Release audit (RAT) {color} | 
{color:green}  0m 41s{color} | {color:green} Release audit (RAT) rat-sources 
passed {color} |
| {color:red}-1{color} | {color:red} Release audit (RAT) {color} | {color:red}  
0m  5s{color} | {color:red} suggest in the patch failed. {color} |
| {color:red}-1{color} | {color:red} Check forbidden APIs {color} | {color:red} 
 0m 41s{color} | {color:red} Check forbidden APIs check-forbidden-apis failed 
{color} |
| {color:red}-1{color} | {color:red} Validate source patterns {color} | 
{color:red}  0m 41s{color} | {color:red} Check forbidden APIs 
check-forbidden-apis failed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  7m 
47s{color} | {color:green} common in the patch passed. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red}  0m 12s{color} 
| {color:red} suggest in the patch failed. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 12m 31s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | LUCENE-8332 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12926180/LUCENE-8332.patch |
| Optional Tests |  compile  javac  unit  ratsources  checkforbiddenapis  
validatesourcepatterns  |
| uname | Linux lucene1-us-west 3.13.0-88-generic #135-Ubuntu SMP Wed Jun 8 
21:10:42 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | ant |
| Personality | 
/home/jenkins/jenkins-slave/workspace/PreCommit-LUCENE-Build/sourcedir/dev-tools/test-patch/lucene-solr-yetus-personality.sh
 |
| git revision | master / 3dc4fa1 |
| ant | version: Apache Ant(TM) version 1.9.3 compiled on April 8 2014 |
| Default Java | 1.8.0_172 |
| compile | 
https://builds.apache.org/job/PreCommit-LUCENE-Build/24/artifact/out/patch-compile-lucene_suggest.txt
 |
| javac | 
https://builds.apache.org/job/PreCommit-LUCENE-Build/24/artifact/out/patch-compile-lucene_suggest.txt
 |
| Release audit (RAT) | 
https://builds.apache.org/job/PreCommit-LUCENE-Build/24/artifact/out/patch-compile-lucene_suggest.txt
 |
| Check forbidden APIs | 
https://builds.apache.org/job/PreCommit-LUCENE-Build/24/artifact/out/patch-check-forbidden-apis-lucene.txt
 |
| Validate source patterns | 
https://builds.apache.org/job/PreCommit-LUCENE-Build/24/artifact/out/patch-check-forbidden-apis-lucene.txt
 |
| unit | 
https://builds.apache.org/job/PreCommit-LUCENE-Build/24/artifact/out/patch-unit-lucene_suggest.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-LUCENE-Build/24/testReport/ |
| modules | C: lucene/analysis/common lucene/suggest U: lucene |
| Console output | 
https://builds.apache.org/job/PreCommit-LUCENE-Build/24/console |
| Powered by | Apache Yetus 0.7.0   http://yetus.apache.org |


This message was automatically generated.



> New ConcatenateGraphTokenStream (move/rename CompletionTokenStream)
> ---
>
> Key: LUCENE-8332
> URL: https://issues.apache.org/jira/browse/LUCENE-8332
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Reporter: David Smiley
>Assignee: David Smiley
>Priority: Major
> Attachments: LUCENE-8332.patch, LUCENE-8332.patch, LUCENE-8332.patch
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> Lets move and rename the CompletionTokenStream in the suggest module into the 
> analysis module renamed as ConcatenateGraphTokenStream. See comments in 
> LUCENE-8323 leading to this idea. Such a TokenStream (or TokenFilter?) has 
> several uses:
>  * for the suggest module
>  * by the SolrTextTagger for NER/ERD use cases – SOLR-12376
>  * for doing complete match search efficiently
> It will need a factory – a TokenFilterFactory, 

[jira] [Commented] (LUCENE-8332) New ConcatenateGraphTokenStream (move/rename CompletionTokenStream)

2018-06-01 Thread David Smiley (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16498605#comment-16498605
 ] 

David Smiley commented on LUCENE-8332:
--

Precommit now passes.  I needed more javadocs on CompletionTokenStream.

> New ConcatenateGraphTokenStream (move/rename CompletionTokenStream)
> ---
>
> Key: LUCENE-8332
> URL: https://issues.apache.org/jira/browse/LUCENE-8332
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Reporter: David Smiley
>Assignee: David Smiley
>Priority: Major
> Attachments: LUCENE-8332.patch, LUCENE-8332.patch, LUCENE-8332.patch
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> Lets move and rename the CompletionTokenStream in the suggest module into the 
> analysis module renamed as ConcatenateGraphTokenStream. See comments in 
> LUCENE-8323 leading to this idea. Such a TokenStream (or TokenFilter?) has 
> several uses:
>  * for the suggest module
>  * by the SolrTextTagger for NER/ERD use cases – SOLR-12376
>  * for doing complete match search efficiently
> It will need a factory – a TokenFilterFactory, even though we don't have a 
> TokenFilter based subclass of TokenStream.
> It appears there is no back-compat concern in it suddenly disappearing from 
> the suggest module as it's marked experimental and it only seems to be public 
> now perhaps due to some technicality (it has package level constructors).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8332) New ConcatenateGraphTokenStream (move/rename CompletionTokenStream)

2018-06-01 Thread David Smiley (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16498554#comment-16498554
 ] 

David Smiley commented on LUCENE-8332:
--

After running tests, I realized I needed to make some tweaks to end() and 
close() to delegate properly.  I love the rigorousness of TestFactories.java 
and the underlying BaseTokenStreamTestCase  :)

> New ConcatenateGraphTokenStream (move/rename CompletionTokenStream)
> ---
>
> Key: LUCENE-8332
> URL: https://issues.apache.org/jira/browse/LUCENE-8332
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Reporter: David Smiley
>Assignee: David Smiley
>Priority: Major
> Attachments: LUCENE-8332.patch, LUCENE-8332.patch
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> Lets move and rename the CompletionTokenStream in the suggest module into the 
> analysis module renamed as ConcatenateGraphTokenStream. See comments in 
> LUCENE-8323 leading to this idea. Such a TokenStream (or TokenFilter?) has 
> several uses:
>  * for the suggest module
>  * by the SolrTextTagger for NER/ERD use cases – SOLR-12376
>  * for doing complete match search efficiently
> It will need a factory – a TokenFilterFactory, even though we don't have a 
> TokenFilter based subclass of TokenStream.
> It appears there is no back-compat concern in it suddenly disappearing from 
> the suggest module as it's marked experimental and it only seems to be public 
> now perhaps due to some technicality (it has package level constructors).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8332) New ConcatenateGraphTokenStream (move/rename CompletionTokenStream)

2018-06-01 Thread David Smiley (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16498323#comment-16498323
 ] 

David Smiley commented on LUCENE-8332:
--

Final patch from the GitHub PR.  Thanks [~jimczi] for your reviewing. I did 
make one change:  I removed the bug fix to TokenStreamToAutomaton as it 
deserved its own issue – LUCENE-8344.  I commented out the trailing stopword in 
org.apache.lucene.analysis.miscellaneous.TestConcatenateGraphFilter#testWithStopword
 so this test wouldn't fail, leaving a reference to the other JIRA.

It's kinda a shame the Git history may be less than ideal on 
ConcatenateGraphFilter since CompletionTokenStream will stay in a gutted form 
that delegates to it.  I could commit the new CTS in a second commit (first 
commit would be broken), and push the two together at once to ASF git servers.  
Probably not worth the hassle though.

 

> New ConcatenateGraphTokenStream (move/rename CompletionTokenStream)
> ---
>
> Key: LUCENE-8332
> URL: https://issues.apache.org/jira/browse/LUCENE-8332
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Reporter: David Smiley
>Assignee: David Smiley
>Priority: Major
> Attachments: LUCENE-8332.patch
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> Lets move and rename the CompletionTokenStream in the suggest module into the 
> analysis module renamed as ConcatenateGraphTokenStream. See comments in 
> LUCENE-8323 leading to this idea. Such a TokenStream (or TokenFilter?) has 
> several uses:
>  * for the suggest module
>  * by the SolrTextTagger for NER/ERD use cases – SOLR-12376
>  * for doing complete match search efficiently
> It will need a factory – a TokenFilterFactory, even though we don't have a 
> TokenFilter based subclass of TokenStream.
> It appears there is no back-compat concern in it suddenly disappearing from 
> the suggest module as it's marked experimental and it only seems to be public 
> now perhaps due to some technicality (it has package level constructors).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8332) New ConcatenateGraphTokenStream (move/rename CompletionTokenStream)

2018-05-28 Thread David Smiley (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16492904#comment-16492904
 ] 

David Smiley commented on LUCENE-8332:
--

Oh I wanted to mention one thing; perhaps just here though I could put in the 
docs.

An alternative approach to this tagger might be to use the SynonymGraphFilter 
(with other steps/configuration),
 which has a lot of similarities with the Tagger's algorithm.  I've heard of 
others that have done this (Dice.com?), and before I created the tagger I 
thought about this approach too.  There are some issues/barriers to "just" 
using the synonym filter::
* if the filter finds multiple overlapping matches, it only returns one without 
any control over its choice.  (compare to the STT's "overlaps" param with 
several choices and it's pluggable)
* the filter doesn't hold any metadata; it's just a set of names.  Though you 
could use synonyms to map to an ID that you then lookup in something else (e.g. 
some DB or Solr index).
* the synonym filter must re-construct its FST on startup each time; 
customizations are necessary to load an existing one from disk.
* you have to arrange for any text processing/analysis (e.g. tokenization rules 
or phonetic filters) of the dictionary to create synonym entries.  With the STT 
this is all configurable in a standard way like any text field.
* and of course you'd have to glue it all together somehow.

> New ConcatenateGraphTokenStream (move/rename CompletionTokenStream)
> ---
>
> Key: LUCENE-8332
> URL: https://issues.apache.org/jira/browse/LUCENE-8332
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Reporter: David Smiley
>Assignee: David Smiley
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Lets move and rename the CompletionTokenStream in the suggest module into the 
> analysis module renamed as ConcatenateGraphTokenStream. See comments in 
> LUCENE-8323 leading to this idea. Such a TokenStream (or TokenFilter?) has 
> several uses:
>  * for the suggest module
>  * by the SolrTextTagger for NER/ERD use cases – SOLR-12376
>  * for doing complete match search efficiently
> It will need a factory – a TokenFilterFactory, even though we don't have a 
> TokenFilter based subclass of TokenStream.
> It appears there is no back-compat concern in it suddenly disappearing from 
> the suggest module as it's marked experimental and it only seems to be public 
> now perhaps due to some technicality (it has package level constructors).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8332) New ConcatenateGraphTokenStream (move/rename CompletionTokenStream)

2018-05-25 Thread David Smiley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16491501#comment-16491501
 ] 

David Smiley commented on LUCENE-8332:
--

I had TestRandomChains go at it and it uncovered a couple things.

{{org.apache.lucene.analysis.BaseTokenStreamTestCase#checkResetException}} has 
two checks:
 # ensures incrementToken() fails if reset() wasn't first called.  This was 
pretty straight-forward to fix by adding an IllegalStateException throw at the 
start in ConcatenateGraphFilter.incrementToken.
 # ensures if you forgot to close(), that trying to get the tokenStream again 
fails.  This one is tricky.  ConcatenateGraphFilter.reset() will consume the 
whole tokenStream including closing it... and it's hard to disagree with that.  
It calls toAutomaton which does this, and there are even some callers of this 
toAutomaton method in the NRTSuggester which is assuming it's going to be 
closed.  I think adding some closed flag isn't enough since when 
Analyzer.tokenStream() is called we want it to fail but all that does is set 
the reader (which throws if it wasn't closed).  I could make  toAutomaton not 
close the input but then the callers need to deal with that; I'm not which path 
to go or if I'm missing something.  Or maybe just punt and have 
TestRandomChains ignore as it's a bit too pedantic here?

> New ConcatenateGraphTokenStream (move/rename CompletionTokenStream)
> ---
>
> Key: LUCENE-8332
> URL: https://issues.apache.org/jira/browse/LUCENE-8332
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Reporter: David Smiley
>Assignee: David Smiley
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Lets move and rename the CompletionTokenStream in the suggest module into the 
> analysis module renamed as ConcatenateGraphTokenStream. See comments in 
> LUCENE-8323 leading to this idea. Such a TokenStream (or TokenFilter?) has 
> several uses:
>  * for the suggest module
>  * by the SolrTextTagger for NER/ERD use cases – SOLR-12376
>  * for doing complete match search efficiently
> It will need a factory – a TokenFilterFactory, even though we don't have a 
> TokenFilter based subclass of TokenStream.
> It appears there is no back-compat concern in it suddenly disappearing from 
> the suggest module as it's marked experimental and it only seems to be public 
> now perhaps due to some technicality (it has package level constructors).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8332) New ConcatenateGraphTokenStream (move/rename CompletionTokenStream)

2018-05-25 Thread David Smiley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16491312#comment-16491312
 ] 

David Smiley commented on LUCENE-8332:
--

See GitHub PR.  comments welcome!

* Made it a TokenFilter.
* Move some package accessible constants from CompletionAnalyzer to public ones 
on the moved TokenStream class.
* Renamed SEP_LABEL => SEP_CHAR and made it of type char, not an int. This 
seems more natural/clear to me and it removed some now needless casting in 
tests. I'm tempted to make this char configurable but I suppose it's value 
doesn't really matter.
* Changed position increment handling of this filter to be stacked instead of 
consecutive. I think this just makes sense for what's going on. This affected 
some test assertions but otherwise it shouldn't matter for its practical uses. 
Changing this also revealed a bug...
* clearAttributes() was being called before toAutomaton() consumed the input 
but this left the attribute state dirty; a test failed on me relating to offset 
ordering for this reason. I moved toAutomaton to occur in reset() and now 
that's a non-issue. This is also a slight optimization since clearAttributes() 
need not be called if finiteStrings.next() returns false (thus 
clearAttributes() was being called one too many times per stream).
* added captureState and restoreState for end(). Though I do wonder if we 
actually need to anything in end(); based on my reading of 
TokenStreamToAutomaton end of stream calculations, the automaton contains the 
final token state already; so perhaps end() ought to do nothing? When I tried 
that, the state machine of the tests complained I didn't propagate end() -- but 
this didn't account for the fact that end() was in fact called earlier via 
reset() since the tokenStream is consumed there. _Any way_, at least what's in 
this patch is no worse than what was before, I think.
* in reset(), removed needless hasAttribute() guard around getAttribute()
* There are nocommits surrounding about the configuration aspects of the filter 
being public so that 
org.apache.lucene.search.suggest.document.ContextSuggestField#wrapTokenStream 
can access it. IMO this seems quirky... it'd be nice to remove the need to 
access it. But I suppose just adding some getters to the filter is innocent 
enough.
* The setPayload feature of this filter is debatable. Arguably this ought to be 
done by a standalone TokenFilter or be done in some internal way to the NRT 
suggester. I marked the setter lucene.internal.
* added testWithStopword which found a bug...
* Bug: TokenStreamToAutomaton does not handle trailing stop words correctly 
when preservePositionIncrements==false

> New ConcatenateGraphTokenStream (move/rename CompletionTokenStream)
> ---
>
> Key: LUCENE-8332
> URL: https://issues.apache.org/jira/browse/LUCENE-8332
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Reporter: David Smiley
>Assignee: David Smiley
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Lets move and rename the CompletionTokenStream in the suggest module into the 
> analysis module renamed as ConcatenateGraphTokenStream. See comments in 
> LUCENE-8323 leading to this idea. Such a TokenStream (or TokenFilter?) has 
> several uses:
>  * for the suggest module
>  * by the SolrTextTagger for NER/ERD use cases – SOLR-12376
>  * for doing complete match search efficiently
> It will need a factory – a TokenFilterFactory, even though we don't have a 
> TokenFilter based subclass of TokenStream.
> It appears there is no back-compat concern in it suddenly disappearing from 
> the suggest module as it's marked experimental and it only seems to be public 
> now perhaps due to some technicality (it has package level constructors).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8332) New ConcatenateGraphTokenStream (move/rename CompletionTokenStream)

2018-05-24 Thread David Smiley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16490210#comment-16490210
 ] 

David Smiley commented on LUCENE-8332:
--

{quote}It's a TokenStream because it consumes the entire input in the first 
call to incrementToken (which invokes input.reset(), input.end(), input.close())
{quote}
Is that policy documented anywhere?  FingerprintFilter behaves similarly yet is 
a TokenFilter.  So I'm thinking that ConcatenateGraphFilter has a nice ring to 
it ;)

I'm working on it; I can post a patch or GitHub PR when done; sometime Friday I 
expect.

> New ConcatenateGraphTokenStream (move/rename CompletionTokenStream)
> ---
>
> Key: LUCENE-8332
> URL: https://issues.apache.org/jira/browse/LUCENE-8332
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Reporter: David Smiley
>Assignee: David Smiley
>Priority: Major
>
> Lets move and rename the CompletionTokenStream in the suggest module into the 
> analysis module renamed as ConcatenateGraphTokenStream. See comments in 
> LUCENE-8323 leading to this idea. Such a TokenStream (or TokenFilter?) has 
> several uses:
>  * for the suggest module
>  * by the SolrTextTagger for NER/ERD use cases – SOLR-12376
>  * for doing complete match search efficiently
> It will need a factory – a TokenFilterFactory, even though we don't have a 
> TokenFilter based subclass of TokenStream.
> It appears there is no back-compat concern in it suddenly disappearing from 
> the suggest module as it's marked experimental and it only seems to be public 
> now perhaps due to some technicality (it has package level constructors).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org