[jira] [Commented] (LUCENE-8273) Add a BypassingTokenFilter

2018-04-25 Thread Alan Woodward (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16452407#comment-16452407
 ] 

Alan Woodward commented on LUCENE-8273:
---

bq. Perhaps this can be extended to handle the case of ShingleFilter

I'm not sure that this makes sense in those cases though?  For example, what if 
the first token in the tokenstream matches the condition and is passed to the 
ShingleFilter, but the second one doesn't?

> Add a BypassingTokenFilter
> --
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Priority: Major
> Attachments: LUCENE-8273.patch, LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8273) Add a BypassingTokenFilter

2018-04-25 Thread Mike Sokolov (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16452353#comment-16452353
 ] 

Mike Sokolov commented on LUCENE-8273:
--

Also, along the lines of making general-purpose compositional tools for 
TokenFilters, I think this idea can be extended to handle multiple branches: 
not just an 'if' statement, but also a 'switch' statement that invokes one of 
a set of different wrapped filters. If we did that, though, I think it should 
be a separate thing from this.
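
To make the 'switch' idea concrete, here is a toy model over plain strings. 
It is only a sketch under stated assumptions, not a proposed API: SwitchSketch, 
when() and apply() are hypothetical names, and real TokenFilters operate on 
attribute state rather than on String values. The first matching branch wins, 
and a term that matches no branch passes through unchanged.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

/** Hypothetical toy model: each term is routed to at most one wrapped "filter". */
final class SwitchSketch {
  // LinkedHashMap keeps branches in registration order, so the first match wins.
  private final Map<Predicate<String>, UnaryOperator<String>> branches = new LinkedHashMap<>();

  /** Registers a branch: terms matching the condition go through the given filter. */
  SwitchSketch when(Predicate<String> condition, UnaryOperator<String> filter) {
    branches.put(condition, filter);
    return this;
  }

  /** Applies the first matching branch to each term; unmatched terms pass through. */
  List<String> apply(List<String> terms) {
    List<String> out = new ArrayList<>();
    for (String term : terms) {
      String result = term; // default branch: bypass every wrapped filter
      for (Map.Entry<Predicate<String>, UnaryOperator<String>> branch : branches.entrySet()) {
        if (branch.getKey().test(term)) {
          result = branch.getValue().apply(term);
          break; // like a switch without fall-through
        }
      }
      out.add(result);
    }
    return out;
  }
}
```

With one branch that removes hyphens from terms containing "-" and a second 
that strips a trailing "s", a term such as "add-ons" is handled by the first 
branch only.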




[jira] [Commented] (LUCENE-8273) Add a BypassingTokenFilter

2018-04-25 Thread Mike Sokolov (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16452348#comment-16452348
 ] 

Mike Sokolov commented on LUCENE-8273:
--

Perhaps this can be extended to handle the case of ShingleFilter et al by 
tracking whether we are currently in a recursion. In the case of some filter 
that consumes multiple input tokens, it could call us multiple times while we 
are in DELEGATING, but if we remember that we called delegate.incrementToken() 
and it has not yet returned, then we should not recurse again, but should 
instead call input.incrementToken().
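
A rough sketch of that guard, using a toy stream model rather than the real 
TokenStream API. Everything here is an assumption made for illustration: 
ToyStream, PairJoiner (a stand-in for a filter that consumes two tokens at a 
time) and GuardedConditional are hypothetical names, and the patch's actual 
state machine is not reproduced. While delegate.next() has not yet returned, 
further requests from the delegate are served straight from the input.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

/** Toy stand-in for a TokenStream: returns the next term, or null when exhausted. */
interface ToyStream {
  String next();

  static ToyStream of(String... terms) {
    Iterator<String> it = Arrays.asList(terms).iterator();
    return () -> it.hasNext() ? it.next() : null;
  }
}

/** Hypothetical shingle-like filter: consumes up to two input terms per output term. */
final class PairJoiner implements ToyStream {
  private final ToyStream in;

  PairJoiner(ToyStream in) { this.in = in; }

  @Override public String next() {
    String first = in.next();
    if (first == null) return null;
    String second = in.next();
    return second == null ? first : first + " " + second;
  }
}

/** Applies the delegate only when the condition matches the current term. */
final class GuardedConditional implements ToyStream {
  private final ToyStream input;
  private final ToyStream delegate;
  private final Predicate<String> condition;
  private String pending;     // matched term waiting to be handed to the delegate
  private boolean delegating; // true while delegate.next() has not yet returned

  GuardedConditional(ToyStream input, Function<ToyStream, ToyStream> factory,
                     Predicate<String> condition) {
    this.input = input;
    this.condition = condition;
    this.delegate = factory.apply(this::pull); // the delegate reads through pull()
  }

  /** Called by the delegate whenever it wants another input term. */
  private String pull() {
    if (!delegating) throw new IllegalStateException("pull() outside delegate call");
    if (pending != null) { String t = pending; pending = null; return t; }
    // Already inside the delegate: do not re-check the condition, read the input.
    return input.next();
  }

  @Override public String next() {
    String term = input.next();
    if (term == null || !condition.test(term)) return term; // bypass the delegate
    pending = term;
    delegating = true;
    try { return delegate.next(); } finally { delegating = false; }
  }

  static List<String> drain(ToyStream s) {
    List<String> out = new ArrayList<>();
    for (String t = s.next(); t != null; t = s.next()) out.add(t);
    return out;
  }
}
```

One consequence is visible in this model: when the first term matches, the 
delegate also consumes the following term without the condition ever being 
evaluated on it.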





[jira] [Commented] (LUCENE-8273) Add a BypassingTokenFilter

2018-04-25 Thread Alan Woodward (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16452274#comment-16452274
 ] 

Alan Woodward commented on LUCENE-8273:
---

Here's an updated patch:
* now works with wrapped filters that emit more than one token (thanks David!)
* renamed to ConditionalTokenFilter and the logic reversed (thanks Robert!)
* cleaned up all the logic around reset(), close() and end()
* integrated into testRandomChains.

This last one is a bit clunky, as this TokenFilter won't work with filters 
that consume more than one token at a time, e.g. ShingleFilter or 
SynonymGraphFilter.  At the moment I have a blacklist, but there may be a 
better way of isolating that, preferably one that throws errors when you build 
the TokenStream.  Speak up if you have any suggestions.

I do like the idea of integrating things into CustomAnalyzer, will look at that 
next.




[jira] [Commented] (LUCENE-8273) Add a BypassingTokenFilter

2018-04-24 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16451271#comment-16451271
 ] 

Robert Muir commented on LUCENE-8273:
-

Also I am not sure if the name BypassingTokenFilter is the best.

It works well for your case (but I think "bypass" may be due to some 
inertia/history and maybe not the best going forward). Maybe it should be "if" 
instead of "unless".

{code}
// don't lowercase if the term contains an "o" character
TokenStream t = new BypassingTokenFilter(cts, AssertingLowerCaseFilter::new) {
  CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  @Override
  protected boolean bypass() throws IOException {
    return termAtt.toString().contains("o");
  }
};
{code}

But it will look awkward for other cases:
{code}
// apply greek stemmer ("don't bypass") if the token is written in the greek
// script.
TokenStream t = new BypassingTokenFilter(ts, GreekStemmer::new) {
  ScriptAttribute scriptAtt = addAttribute(ScriptAttribute.class);
  @Override
  protected boolean bypass() throws IOException {
    return scriptAtt.getCode() != UScript.GREEK;
  }
};
{code}





[jira] [Commented] (LUCENE-8273) Add a BypassingTokenFilter

2018-04-24 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16451265#comment-16451265
 ] 

Robert Muir commented on LUCENE-8273:
-

{quote}
I added this to core rather than to the analysis module as it seems to me to be 
a utility class like FilteringTokenFilter, which is also in core. But I'm 
perfectly happy to move it to analysis-common if that makes more sense to 
others.
{quote}

The idea is cool, but I would like to see it more fleshed out (e.g. marked 
experimental somewhere) before it goes into core/:
* improved testing: I'd like to see some edge cases tested, such as both the 
"true" and "false" cases on the final token for end(), etc. What happens is a 
little sneaky; I think it should be hooked into TestRandomChains (this should 
probably be explicitly added to that test, wrapping with a check of 
random.nextBoolean() or something simple that will test all cases). This may 
uncover some integration difficulties. In particular, it is not clear to me 
how some stuff such as end() works correctly in the general case with this 
filter right now.
* integration with CustomAnalyzer: as this would add a generic "if" to allow 
branching in analysis chains (there is an issue somewhere for this), which 
would be very powerful, it would be good to plumb it into CustomAnalyzer to 
make sure it can work well with the factory model. This seems doable with the 
functional interface, but it needs to be proven out.





[jira] [Commented] (LUCENE-8273) Add a BypassingTokenFilter

2018-04-24 Thread Mike Sokolov (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16449994#comment-16449994
 ] 

Mike Sokolov commented on LUCENE-8273:
--

The name "resetting" is a little confusing, since it controls propagation of 
calls in end() and close() as well. Maybe call it "recursing" or "once" or 
something else?




[jira] [Commented] (LUCENE-8273) Add a BypassingTokenFilter

2018-04-24 Thread David Smiley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16449993#comment-16449993
 ] 

David Smiley commented on LUCENE-8273:
--

Nice!

Could you add a test with a filter that may produce multiple terms instead of 
just one-to-one?  And maybe try the scenario where the filter swallows the 
token (e.g. WDF sees a token that is simply a symbol).  The documentation is 
OK, but I was confused about what usage would look like in practice until I 
looked at the test, so maybe a simple example in the class javadocs could shed 
light on this. 
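
Those two scenarios (one term becoming several, and a term being swallowed) 
can be pictured with a toy model over plain strings. This is only a sketch 
and not the patch: ConditionalSketch, applyIf and splitOnHyphen are 
hypothetical names, and splitOnHyphen is a crude stand-in for 
WordDelimiterFilter, not its real behaviour.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

/** Hypothetical toy model of a filter that is applied only when a condition holds. */
final class ConditionalSketch {

  /**
   * Runs each term through the filter if the condition matches; otherwise the
   * term bypasses the filter. The filter may return zero, one or many terms.
   */
  static List<String> applyIf(List<String> terms,
                              Predicate<String> condition,
                              Function<String, List<String>> filter) {
    List<String> out = new ArrayList<>();
    for (String term : terms) {
      if (condition.test(term)) {
        out.addAll(filter.apply(term)); // may expand or swallow the term
      } else {
        out.add(term);                  // bypassed: passed through untouched
      }
    }
    return out;
  }

  /** Crude stand-in for a word-delimiter filter: split on '-' and drop empty parts. */
  static List<String> splitOnHyphen(String term) {
    List<String> parts = new ArrayList<>();
    for (String part : term.split("-")) {
      if (!part.isEmpty()) {
        parts.add(part);
      }
    }
    return parts;
  }
}
```

With the condition "term contains a hyphen", the input [wi-fi, router, -] 
yields [wi, fi, router]: the first term is split in two, the second bypasses 
the filter, and the bare symbol is swallowed.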

With such a general utility, I wonder if the existing TokenFilters that have 
precondition checks (e.g. stemmers that check conditions) needn't bother doing 
this anymore since you could wrap the stemmer with the BypassingTokenFilter 
here with a check if the word is in a list?  Then we wouldn't even need 
KeywordAttribute!  I realize this is taking your simple proposal and taking it 
very far but I think it's worth discussing for 8.0.

An alternative to your BypassingTokenFilter is creating an intermediate base 
class between existing TokenFilters that bypass (e.g. stemmers + ones that 
ought to like WDF) and TokenFilter.  But thinking about this more, this seems 
like a bigger disruptive change and wouldn't cast a net as wide as 
BypassingTokenFilter which can filter anything, even filters where the author 
forgot to consider being filtered.




Re: [jira] [Commented] (LUCENE-8273) Add a BypassingTokenFilter

2018-04-24 Thread Michael Sokolov
+1

On Tue, Apr 24, 2018 at 9:58 AM, Alan Woodward (JIRA) 
wrote:

>
> [ https://issues.apache.org/jira/browse/LUCENE-8273?page=
> com.atlassian.jira.plugin.system.issuetabpanels:comment-
> tabpanel=16449897#comment-16449897 ]
>
> Alan Woodward commented on LUCENE-8273:
> ---
>
> I added this to core rather than to the analysis module as it seems to me
> to be a utility class like FilteringTokenFilter, which is also in core.
> But I'm perfectly happy to move it to analysis-common if that makes more
> sense to others.
>


[jira] [Commented] (LUCENE-8273) Add a BypassingTokenFilter

2018-04-24 Thread Alan Woodward (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16449897#comment-16449897
 ] 

Alan Woodward commented on LUCENE-8273:
---

I added this to core rather than to the analysis module as it seems to me to be 
a utility class like FilteringTokenFilter, which is also in core.  But I'm 
perfectly happy to move it to analysis-common if that makes more sense to 
others.




[jira] [Commented] (LUCENE-8273) Add a BypassingTokenFilter

2018-04-24 Thread Alan Woodward (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16449893#comment-16449893
 ] 

Alan Woodward commented on LUCENE-8273:
---

Here's a patch.
