[jira] [Commented] (LUCENE-8332) New ConcatenateGraphTokenStream (move/rename CompletionTokenStream)
[ https://issues.apache.org/jira/browse/LUCENE-8332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16501239#comment-16501239 ] ASF subversion and git services commented on LUCENE-8332: - Commit 9b61121ffb65d59f49429aba99b1c1b641ddb3c6 in lucene-solr's branch refs/heads/branch_7x from [~dsmiley] [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=9b61121 ] LUCENE-8332: New ConcatenateGraphFilter (from CompletionTokenStream). * Added a test for FingerprintFilter and clarified FF's end condition. (cherry picked from commit f9f5e83) > New ConcatenateGraphTokenStream (move/rename CompletionTokenStream) > --- > > Key: LUCENE-8332 > URL: https://issues.apache.org/jira/browse/LUCENE-8332 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/analysis >Reporter: David Smiley >Assignee: David Smiley >Priority: Major > Attachments: LUCENE-8332.patch, LUCENE-8332.patch, LUCENE-8332.patch > > Time Spent: 3h 10m > Remaining Estimate: 0h > > Lets move and rename the CompletionTokenStream in the suggest module into the > analysis module renamed as ConcatenateGraphTokenStream. See comments in > LUCENE-8323 leading to this idea. Such a TokenStream (or TokenFilter?) has > several uses: > * for the suggest module > * by the SolrTextTagger for NER/ERD use cases – SOLR-12376 > * for doing complete match search efficiently > It will need a factory – a TokenFilterFactory, even though we don't have a > TokenFilter based subclass of TokenStream. > It appears there is no back-compat concern in it suddenly disappearing from > the suggest module as it's marked experimental and it only seems to be public > now perhaps due to some technicality (it has package level constructors). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8332) New ConcatenateGraphTokenStream (move/rename CompletionTokenStream)
[ https://issues.apache.org/jira/browse/LUCENE-8332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16501229#comment-16501229 ] ASF subversion and git services commented on LUCENE-8332: - Commit f9f5e837450e082ae7e1a82a0693760af7485a1b in lucene-solr's branch refs/heads/master from [~dsmiley] [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=f9f5e83 ] LUCENE-8332: New ConcatenateGraphFilter (from CompletionTokenStream). * Added a test for FingerprintFilter and clarified FF's end condition. > New ConcatenateGraphTokenStream (move/rename CompletionTokenStream) > --- > > Key: LUCENE-8332 > URL: https://issues.apache.org/jira/browse/LUCENE-8332 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/analysis >Reporter: David Smiley >Assignee: David Smiley >Priority: Major > Attachments: LUCENE-8332.patch, LUCENE-8332.patch, LUCENE-8332.patch > > Time Spent: 3h 10m > Remaining Estimate: 0h > > Lets move and rename the CompletionTokenStream in the suggest module into the > analysis module renamed as ConcatenateGraphTokenStream. See comments in > LUCENE-8323 leading to this idea. Such a TokenStream (or TokenFilter?) has > several uses: > * for the suggest module > * by the SolrTextTagger for NER/ERD use cases – SOLR-12376 > * for doing complete match search efficiently > It will need a factory – a TokenFilterFactory, even though we don't have a > TokenFilter based subclass of TokenStream. > It appears there is no back-compat concern in it suddenly disappearing from > the suggest module as it's marked experimental and it only seems to be public > now perhaps due to some technicality (it has package level constructors). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8332) New ConcatenateGraphTokenStream (move/rename CompletionTokenStream)
[ https://issues.apache.org/jira/browse/LUCENE-8332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16500280#comment-16500280 ] Steve Rowe commented on LUCENE-8332: bq. I hate to bother you Steve Rowe but if you have any tips on diagnosing puzzling Yetus build failures then I'd appreciate it. The patch looks like it was produced by IntelliJ, which appears not to be compatible with some git tooling. Perhaps just regen using {{git diff}}? > New ConcatenateGraphTokenStream (move/rename CompletionTokenStream) > --- > > Key: LUCENE-8332 > URL: https://issues.apache.org/jira/browse/LUCENE-8332 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/analysis >Reporter: David Smiley >Assignee: David Smiley >Priority: Major > Attachments: LUCENE-8332.patch, LUCENE-8332.patch, LUCENE-8332.patch > > Time Spent: 3h 10m > Remaining Estimate: 0h > > Lets move and rename the CompletionTokenStream in the suggest module into the > analysis module renamed as ConcatenateGraphTokenStream. See comments in > LUCENE-8323 leading to this idea. Such a TokenStream (or TokenFilter?) has > several uses: > * for the suggest module > * by the SolrTextTagger for NER/ERD use cases – SOLR-12376 > * for doing complete match search efficiently > It will need a factory – a TokenFilterFactory, even though we don't have a > TokenFilter based subclass of TokenStream. > It appears there is no back-compat concern in it suddenly disappearing from > the suggest module as it's marked experimental and it only seems to be public > now perhaps due to some technicality (it has package level constructors). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8332) New ConcatenateGraphTokenStream (move/rename CompletionTokenStream)
[ https://issues.apache.org/jira/browse/LUCENE-8332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16499180#comment-16499180 ] David Smiley commented on LUCENE-8332: -- This Yetus/QA report doesn't make sense to me. The QA build failed, apparently due to some compilation issue: {quote} [javac] Compiling 35 source files to /x1/jenkins/jenkins-slave/workspace/PreCommit-LUCENE-Build/sourcedir/lucene/build/suggest/classes/test [javac] /x1/jenkins/jenkins-slave/workspace/PreCommit-LUCENE-Build/sourcedir/lucene/suggest/src/test/org/apache/lucene/search/suggest/document/CompletionTokenStreamTest.java:34: error: class TestConcatenateGraphFilter is public, should be declared in a file named TestConcatenateGraphFilter.java [javac] public class TestConcatenateGraphFilter extends BaseTokenStreamTestCase { [javac]^ {quote} But the patch *renames* CompletionTokenStreamTest.java to TestConcatenateGraphFilter.java (and manually inspecting the patch reflects this) so I don't see out this came to be. I hate to bother you [~steve_rowe] but if you have any tips on diagnosing puzzling Yetus build failures then I'd appreciate it. I intend on committing the patch Monday, BTW. > New ConcatenateGraphTokenStream (move/rename CompletionTokenStream) > --- > > Key: LUCENE-8332 > URL: https://issues.apache.org/jira/browse/LUCENE-8332 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/analysis >Reporter: David Smiley >Assignee: David Smiley >Priority: Major > Attachments: LUCENE-8332.patch, LUCENE-8332.patch, LUCENE-8332.patch > > Time Spent: 3h 10m > Remaining Estimate: 0h > > Lets move and rename the CompletionTokenStream in the suggest module into the > analysis module renamed as ConcatenateGraphTokenStream. See comments in > LUCENE-8323 leading to this idea. Such a TokenStream (or TokenFilter?) has > several uses: > * for the suggest module > * by the SolrTextTagger for NER/ERD use cases – SOLR-12376 > * for doing complete match search efficiently > It will need a factory – a TokenFilterFactory, even though we don't have a > TokenFilter based subclass of TokenStream. > It appears there is no back-compat concern in it suddenly disappearing from > the suggest module as it's marked experimental and it only seems to be public > now perhaps due to some technicality (it has package level constructors). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8332) New ConcatenateGraphTokenStream (move/rename CompletionTokenStream)
[ https://issues.apache.org/jira/browse/LUCENE-8332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16499167#comment-16499167 ] Lucene/Solr QA commented on LUCENE-8332: | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 7 new or modified test files. {color} | || || || || {color:brown} master Compile Tests {color} || | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 5s{color} | {color:green} master passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:red}-1{color} | {color:red} compile {color} | {color:red} 0m 22s{color} | {color:red} suggest in the patch failed. {color} | | {color:red}-1{color} | {color:red} javac {color} | {color:red} 0m 22s{color} | {color:red} suggest in the patch failed. {color} | | {color:green}+1{color} | {color:green} Release audit (RAT) {color} | {color:green} 0m 41s{color} | {color:green} Release audit (RAT) rat-sources passed {color} | | {color:red}-1{color} | {color:red} Release audit (RAT) {color} | {color:red} 0m 5s{color} | {color:red} suggest in the patch failed. {color} | | {color:red}-1{color} | {color:red} Check forbidden APIs {color} | {color:red} 0m 41s{color} | {color:red} Check forbidden APIs check-forbidden-apis failed {color} | | {color:red}-1{color} | {color:red} Validate source patterns {color} | {color:red} 0m 41s{color} | {color:red} Check forbidden APIs check-forbidden-apis failed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 7m 47s{color} | {color:green} common in the patch passed. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 0m 12s{color} | {color:red} suggest in the patch failed. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 12m 31s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | JIRA Issue | LUCENE-8332 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12926180/LUCENE-8332.patch | | Optional Tests | compile javac unit ratsources checkforbiddenapis validatesourcepatterns | | uname | Linux lucene1-us-west 3.13.0-88-generic #135-Ubuntu SMP Wed Jun 8 21:10:42 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | ant | | Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-LUCENE-Build/sourcedir/dev-tools/test-patch/lucene-solr-yetus-personality.sh | | git revision | master / 3dc4fa1 | | ant | version: Apache Ant(TM) version 1.9.3 compiled on April 8 2014 | | Default Java | 1.8.0_172 | | compile | https://builds.apache.org/job/PreCommit-LUCENE-Build/24/artifact/out/patch-compile-lucene_suggest.txt | | javac | https://builds.apache.org/job/PreCommit-LUCENE-Build/24/artifact/out/patch-compile-lucene_suggest.txt | | Release audit (RAT) | https://builds.apache.org/job/PreCommit-LUCENE-Build/24/artifact/out/patch-compile-lucene_suggest.txt | | Check forbidden APIs | https://builds.apache.org/job/PreCommit-LUCENE-Build/24/artifact/out/patch-check-forbidden-apis-lucene.txt | | Validate source patterns | https://builds.apache.org/job/PreCommit-LUCENE-Build/24/artifact/out/patch-check-forbidden-apis-lucene.txt | | unit | https://builds.apache.org/job/PreCommit-LUCENE-Build/24/artifact/out/patch-unit-lucene_suggest.txt | | Test Results | https://builds.apache.org/job/PreCommit-LUCENE-Build/24/testReport/ | | modules | C: lucene/analysis/common lucene/suggest U: lucene | | Console output | https://builds.apache.org/job/PreCommit-LUCENE-Build/24/console | | Powered by | Apache Yetus 0.7.0 http://yetus.apache.org | This message was automatically generated. > New ConcatenateGraphTokenStream (move/rename CompletionTokenStream) > --- > > Key: LUCENE-8332 > URL: https://issues.apache.org/jira/browse/LUCENE-8332 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/analysis >Reporter: David Smiley >Assignee: David Smiley >Priority: Major > Attachments: LUCENE-8332.patch, LUCENE-8332.patch, LUCENE-8332.patch > > Time Spent: 3h 10m > Remaining Estimate: 0h > > Lets move and rename the CompletionTokenStream in the suggest module into the > analysis module renamed as ConcatenateGraphTokenStream. See comments in > LUCENE-8323 leading to this idea. Such a TokenStream (or TokenFilter?) has > several uses: > * for the suggest module > * by the SolrTextTagger for NER/ERD use cases – SOLR-12376 > * for doing complete match search efficiently > It will need a factory – a TokenFilterFactory,
[jira] [Commented] (LUCENE-8332) New ConcatenateGraphTokenStream (move/rename CompletionTokenStream)
[ https://issues.apache.org/jira/browse/LUCENE-8332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16498605#comment-16498605 ] David Smiley commented on LUCENE-8332: -- Precommit now passes. I needed more javadocs on CompletionTokenStream. > New ConcatenateGraphTokenStream (move/rename CompletionTokenStream) > --- > > Key: LUCENE-8332 > URL: https://issues.apache.org/jira/browse/LUCENE-8332 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/analysis >Reporter: David Smiley >Assignee: David Smiley >Priority: Major > Attachments: LUCENE-8332.patch, LUCENE-8332.patch, LUCENE-8332.patch > > Time Spent: 3h 10m > Remaining Estimate: 0h > > Lets move and rename the CompletionTokenStream in the suggest module into the > analysis module renamed as ConcatenateGraphTokenStream. See comments in > LUCENE-8323 leading to this idea. Such a TokenStream (or TokenFilter?) has > several uses: > * for the suggest module > * by the SolrTextTagger for NER/ERD use cases – SOLR-12376 > * for doing complete match search efficiently > It will need a factory – a TokenFilterFactory, even though we don't have a > TokenFilter based subclass of TokenStream. > It appears there is no back-compat concern in it suddenly disappearing from > the suggest module as it's marked experimental and it only seems to be public > now perhaps due to some technicality (it has package level constructors). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8332) New ConcatenateGraphTokenStream (move/rename CompletionTokenStream)
[ https://issues.apache.org/jira/browse/LUCENE-8332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16498554#comment-16498554 ] David Smiley commented on LUCENE-8332: -- After running tests, I realized I needed to make some tweaks to end() and close() to delegate properly. I love the rigorousness of TestFactories.java and the underlying BaseTokenStreamTestCase :) > New ConcatenateGraphTokenStream (move/rename CompletionTokenStream) > --- > > Key: LUCENE-8332 > URL: https://issues.apache.org/jira/browse/LUCENE-8332 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/analysis >Reporter: David Smiley >Assignee: David Smiley >Priority: Major > Attachments: LUCENE-8332.patch, LUCENE-8332.patch > > Time Spent: 3h 10m > Remaining Estimate: 0h > > Lets move and rename the CompletionTokenStream in the suggest module into the > analysis module renamed as ConcatenateGraphTokenStream. See comments in > LUCENE-8323 leading to this idea. Such a TokenStream (or TokenFilter?) has > several uses: > * for the suggest module > * by the SolrTextTagger for NER/ERD use cases – SOLR-12376 > * for doing complete match search efficiently > It will need a factory – a TokenFilterFactory, even though we don't have a > TokenFilter based subclass of TokenStream. > It appears there is no back-compat concern in it suddenly disappearing from > the suggest module as it's marked experimental and it only seems to be public > now perhaps due to some technicality (it has package level constructors). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8332) New ConcatenateGraphTokenStream (move/rename CompletionTokenStream)
[ https://issues.apache.org/jira/browse/LUCENE-8332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16498323#comment-16498323 ] David Smiley commented on LUCENE-8332: -- Final patch from the GitHub PR. Thanks [~jimczi] for your reviewing. I did make one change: I removed the bug fix to TokenStreamToAutomaton as it deserved its own issue – LUCENE-8344. I commented out the trailing stopword in org.apache.lucene.analysis.miscellaneous.TestConcatenateGraphFilter#testWithStopword so this test wouldn't fail, leaving a reference to the other JIRA. It's kinda a shame the Git history may be less than ideal on ConcatenateGraphFilter since CompletionTokenStream will stay in a gutted form that delegates to it. I could commit the new CTS in a second commit (first commit would be broken), and push the two together at once to ASF git servers. Probably not worth the hassle though. > New ConcatenateGraphTokenStream (move/rename CompletionTokenStream) > --- > > Key: LUCENE-8332 > URL: https://issues.apache.org/jira/browse/LUCENE-8332 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/analysis >Reporter: David Smiley >Assignee: David Smiley >Priority: Major > Attachments: LUCENE-8332.patch > > Time Spent: 3h 10m > Remaining Estimate: 0h > > Lets move and rename the CompletionTokenStream in the suggest module into the > analysis module renamed as ConcatenateGraphTokenStream. See comments in > LUCENE-8323 leading to this idea. Such a TokenStream (or TokenFilter?) has > several uses: > * for the suggest module > * by the SolrTextTagger for NER/ERD use cases – SOLR-12376 > * for doing complete match search efficiently > It will need a factory – a TokenFilterFactory, even though we don't have a > TokenFilter based subclass of TokenStream. > It appears there is no back-compat concern in it suddenly disappearing from > the suggest module as it's marked experimental and it only seems to be public > now perhaps due to some technicality (it has package level constructors). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8332) New ConcatenateGraphTokenStream (move/rename CompletionTokenStream)
[ https://issues.apache.org/jira/browse/LUCENE-8332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16492904#comment-16492904 ] David Smiley commented on LUCENE-8332: -- Oh I wanted to mention one thing; perhaps just here though I could put in the docs. An alternative approach to this tagger might be to use the SynonymGraphFilter (with other steps/configuration), which has a lot of similarities with the Tagger's algorithm. I've heard of others that have done this (Dice.com?), and before I created the tagger I thought about this approach too. There are some issues/barriers to "just" using the synonym filter:: * if the filter finds multiple overlapping matches, it only returns one without any control over its choice. (compare to the STT's "overlaps" param with several choices and it's pluggable) * the filter doesn't hold any metadata; it's just a set of names. Though you could use synonyms to map to an ID that you then lookup in something else (e.g. some DB or Solr index). * the synonym filter must re-construct its FST on startup each time; customizations are necessary to load an existing one from disk. * you have to arrange for any text processing/analysis (e.g. tokenization rules or phonetic filters) of the dictionary to create synonym entries. With the STT this is all configurable in a standard way like any text field. * and of course you'd have to glue it all together somehow. > New ConcatenateGraphTokenStream (move/rename CompletionTokenStream) > --- > > Key: LUCENE-8332 > URL: https://issues.apache.org/jira/browse/LUCENE-8332 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/analysis >Reporter: David Smiley >Assignee: David Smiley >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > Lets move and rename the CompletionTokenStream in the suggest module into the > analysis module renamed as ConcatenateGraphTokenStream. See comments in > LUCENE-8323 leading to this idea. Such a TokenStream (or TokenFilter?) has > several uses: > * for the suggest module > * by the SolrTextTagger for NER/ERD use cases – SOLR-12376 > * for doing complete match search efficiently > It will need a factory – a TokenFilterFactory, even though we don't have a > TokenFilter based subclass of TokenStream. > It appears there is no back-compat concern in it suddenly disappearing from > the suggest module as it's marked experimental and it only seems to be public > now perhaps due to some technicality (it has package level constructors). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8332) New ConcatenateGraphTokenStream (move/rename CompletionTokenStream)
[ https://issues.apache.org/jira/browse/LUCENE-8332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16491501#comment-16491501 ] David Smiley commented on LUCENE-8332: -- I had TestRandomChains go at it and it uncovered a couple things. {{org.apache.lucene.analysis.BaseTokenStreamTestCase#checkResetException}} has two checks: # ensures incrementToken() fails if reset() wasn't first called. This was pretty straight-forward to fix by adding an IllegalStateException throw at the start in ConcatenateGraphFilter.incrementToken. # ensures if you forgot to close(), that trying to get the tokenStream again fails. This one is tricky. ConcatenateGraphFilter.reset() will consume the whole tokenStream including closing it... and it's hard to disagree with that. It calls toAutomaton which does this, and there are even some callers of this toAutomaton method in the NRTSuggester which is assuming it's going to be closed. I think adding some closed flag isn't enough since when Analyzer.tokenStream() is called we want it to fail but all that does is set the reader (which throws if it wasn't closed). I could make toAutomaton not close the input but then the callers need to deal with that; I'm not which path to go or if I'm missing something. Or maybe just punt and have TestRandomChains ignore as it's a bit too pedantic here? > New ConcatenateGraphTokenStream (move/rename CompletionTokenStream) > --- > > Key: LUCENE-8332 > URL: https://issues.apache.org/jira/browse/LUCENE-8332 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/analysis >Reporter: David Smiley >Assignee: David Smiley >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > Lets move and rename the CompletionTokenStream in the suggest module into the > analysis module renamed as ConcatenateGraphTokenStream. See comments in > LUCENE-8323 leading to this idea. Such a TokenStream (or TokenFilter?) has > several uses: > * for the suggest module > * by the SolrTextTagger for NER/ERD use cases – SOLR-12376 > * for doing complete match search efficiently > It will need a factory – a TokenFilterFactory, even though we don't have a > TokenFilter based subclass of TokenStream. > It appears there is no back-compat concern in it suddenly disappearing from > the suggest module as it's marked experimental and it only seems to be public > now perhaps due to some technicality (it has package level constructors). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8332) New ConcatenateGraphTokenStream (move/rename CompletionTokenStream)
[ https://issues.apache.org/jira/browse/LUCENE-8332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16491312#comment-16491312 ] David Smiley commented on LUCENE-8332: -- See GitHub PR. comments welcome! * Made it a TokenFilter. * Move some package accessible constants from CompletionAnalyzer to public ones on the moved TokenStream class. * Renamed SEP_LABEL => SEP_CHAR and made it of type char, not an int. This seems more natural/clear to me and it removed some now needless casting in tests. I'm tempted to make this char configurable but I suppose it's value doesn't really matter. * Changed position increment handling of this filter to be stacked instead of consecutive. I think this just makes sense for what's going on. This affected some test assertions but otherwise it shouldn't matter for its practical uses. Changing this also revealed a bug... * clearAttributes() was being called before toAutomaton() consumed the input but this left the attribute state dirty; a test failed on me relating to offset ordering for this reason. I moved toAutomaton to occur in reset() and now that's a non-issue. This is also a slight optimization since clearAttributes() need not be called if finiteStrings.next() returns false (thus clearAttributes() was being called one too many times per stream). * added captureState and restoreState for end(). Though I do wonder if we actually need to anything in end(); based on my reading of TokenStreamToAutomaton end of stream calculations, the automaton contains the final token state already; so perhaps end() ought to do nothing? When I tried that, the state machine of the tests complained I didn't propagate end() -- but this didn't account for the fact that end() was in fact called earlier via reset() since the tokenStream is consumed there. _Any way_, at least what's in this patch is no worse than what was before, I think. * in reset(), removed needless hasAttribute() guard around getAttribute() * There are nocommits surrounding about the configuration aspects of the filter being public so that org.apache.lucene.search.suggest.document.ContextSuggestField#wrapTokenStream can access it. IMO this seems quirky... it'd be nice to remove the need to access it. But I suppose just adding some getters to the filter is innocent enough. * The setPayload feature of this filter is debatable. Arguably this ought to be done by a standalone TokenFilter or be done in some internal way to the NRT suggester. I marked the setter lucene.internal. * added testWithStopword which found a bug... * Bug: TokenStreamToAutomaton does not handle trailing stop words correctly when preservePositionIncrements==false > New ConcatenateGraphTokenStream (move/rename CompletionTokenStream) > --- > > Key: LUCENE-8332 > URL: https://issues.apache.org/jira/browse/LUCENE-8332 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/analysis >Reporter: David Smiley >Assignee: David Smiley >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > Lets move and rename the CompletionTokenStream in the suggest module into the > analysis module renamed as ConcatenateGraphTokenStream. See comments in > LUCENE-8323 leading to this idea. Such a TokenStream (or TokenFilter?) has > several uses: > * for the suggest module > * by the SolrTextTagger for NER/ERD use cases – SOLR-12376 > * for doing complete match search efficiently > It will need a factory – a TokenFilterFactory, even though we don't have a > TokenFilter based subclass of TokenStream. > It appears there is no back-compat concern in it suddenly disappearing from > the suggest module as it's marked experimental and it only seems to be public > now perhaps due to some technicality (it has package level constructors). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8332) New ConcatenateGraphTokenStream (move/rename CompletionTokenStream)
[ https://issues.apache.org/jira/browse/LUCENE-8332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16490210#comment-16490210 ] David Smiley commented on LUCENE-8332: -- {quote}It's a TokenStream because it consumes the entire input in the first call to incrementToken (which invokes input.reset(), input.end(), input.close()) {quote} Is that policy documented anywhere? FingerprintFilter behaves similarly yet is a TokenFilter. So I'm thinking that ConcatenateGraphFilter has a nice ring to it ;) I'm working on it; I can post a patch or GitHub PR when done; sometime Friday I expect. > New ConcatenateGraphTokenStream (move/rename CompletionTokenStream) > --- > > Key: LUCENE-8332 > URL: https://issues.apache.org/jira/browse/LUCENE-8332 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/analysis >Reporter: David Smiley >Assignee: David Smiley >Priority: Major > > Lets move and rename the CompletionTokenStream in the suggest module into the > analysis module renamed as ConcatenateGraphTokenStream. See comments in > LUCENE-8323 leading to this idea. Such a TokenStream (or TokenFilter?) has > several uses: > * for the suggest module > * by the SolrTextTagger for NER/ERD use cases – SOLR-12376 > * for doing complete match search efficiently > It will need a factory – a TokenFilterFactory, even though we don't have a > TokenFilter based subclass of TokenStream. > It appears there is no back-compat concern in it suddenly disappearing from > the suggest module as it's marked experimental and it only seems to be public > now perhaps due to some technicality (it has package level constructors). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org