[jira] [Commented] (LUCENE-8284) Add MultiTermsIntervalsSource

2018-05-02 Thread Matt Weber (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461774#comment-16461774
 ] 

Matt Weber commented on LUCENE-8284:
------------------------------------

Attached a patch that adds a per-segment expansion limit and just gathers the 
first terms we come across.  I'm not sure I like this; I am going to try a 
version that adds a rewrite method to {{IntervalsSource}} so we can use the 
existing rewrite methods, including the one [~dsmiley] mentioned.
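
For illustration, gathering the first matching terms up to a per-segment limit 
could look roughly like the sketch below. It uses only JDK collections: 
{{FirstTermsExpansion}}, {{expand}}, and the {{SortedSet}} standing in for one 
segment's term dictionary are assumptions made for this example and are not 
taken from the attached patch or from Lucene.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;

// Hypothetical sketch; the SortedSet stands in for one segment's sorted term dictionary.
class FirstTermsExpansion {

    // Returns at most maxExpansions terms starting with prefix, in dictionary order.
    static List<String> expand(SortedSet<String> segmentTerms, String prefix, int maxExpansions) {
        List<String> collected = new ArrayList<>();
        // tailSet(prefix) starts at the first term >= prefix, so matches are contiguous from here.
        for (String term : segmentTerms.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break; // walked past the range of matching terms
            }
            if (collected.size() == maxExpansions) {
                break; // limit reached; the remaining matching terms are ignored
            }
            collected.add(term);
        }
        return collected;
    }
}
```

In this sketch the selection depends only on dictionary order, not on how 
frequent a term is.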

> Add MultiTermsIntervalsSource
> -----------------------------
>
> Key: LUCENE-8284
> URL: https://issues.apache.org/jira/browse/LUCENE-8284
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Matt Weber
> Priority: Minor
> Attachments: LUCENE-8284.patch, LUCENE-8284.patch
>
>
> Add support for creating an {{IntervalsSource}} from multi-term expansions 
> such as wildcards, regular expressions, etc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8284) Add MultiTermsIntervalsSource

2018-05-02 Thread David Smiley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461601#comment-16461601
 ] 

David Smiley commented on LUCENE-8284:
--------------------------------------

LUCENE-6513 is related: it is about calculating the top terms more 
intelligently, using DF and TTF.  I have some WIP refreshing an existing patch 
on that ticket.
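
As a rough illustration of selecting top terms by document frequency, here is 
a sketch using a bounded min-heap over plain JDK types. The class name, the 
{{Map}} of term to docFreq, and the tie-breaking rule are assumptions for this 
example; it is not the approach of the LUCENE-6513 patch, which is not shown 
in this thread.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical sketch: keep the maxExpansions terms with the highest docFreq.
class TopTermsByDocFreq {

    static List<String> select(Map<String, Integer> docFreqs, int maxExpansions) {
        // Orders the weakest candidate first: lowest docFreq, and on ties the
        // lexicographically larger term (an arbitrary but deterministic choice).
        Comparator<String> weakestFirst = (a, b) -> {
            int byDocFreq = Integer.compare(docFreqs.get(a), docFreqs.get(b));
            return byDocFreq != 0 ? byDocFreq : b.compareTo(a);
        };
        PriorityQueue<String> heap = new PriorityQueue<>(weakestFirst);
        for (String term : docFreqs.keySet()) {
            heap.offer(term);
            if (heap.size() > maxExpansions) {
                heap.poll(); // evict the current weakest so the heap never exceeds the limit
            }
        }
        List<String> kept = new ArrayList<>(heap);
        kept.sort(weakestFirst.reversed()); // strongest term first
        return kept;
    }
}
```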




[jira] [Commented] (LUCENE-8284) Add MultiTermsIntervalsSource

2018-05-02 Thread Jim Ferenczi (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461126#comment-16461126
 ] 

Jim Ferenczi commented on LUCENE-8284:
--------------------------------------

bq. Is there anything I can do to move this forward? Add an expansion limit? 
Rewrite support?

Yes, as I said in my previous comment, we should have a way to limit the 
expansion through a top-terms rewrite.
I discussed this with [~jpountz] and [~romseygeek], and we agreed that the 
limit should be explicit (a parameter of the source) and that we should never 
create a disjunction bigger than BooleanQuery#MAX_CLAUSE_COUNT (the hard limit 
for the max expansion). If these restrictions are added, then we have no 
objections to adding these kinds of sources to intervals :).
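
A minimal sketch of the two restrictions, assuming a hypothetical 
{{ExpansionLimit}} holder: the limit has no default, so callers must pass it 
explicitly, and values above a hard cap are rejected. The {{HARD_LIMIT}} 
constant is a stand-in for Lucene's BooleanQuery max clause count (1024 by 
default); a real implementation would read that value from Lucene rather than 
hard-coding it.

```java
// Hypothetical sketch; not part of the patch or of Lucene.
class ExpansionLimit {

    // Stand-in for the BooleanQuery max clause count (assumed default of 1024).
    static final int HARD_LIMIT = 1024;

    final int maxExpansions;

    // No no-arg constructor on purpose: the caller must choose the limit explicitly.
    ExpansionLimit(int maxExpansions) {
        if (maxExpansions < 1 || maxExpansions > HARD_LIMIT) {
            throw new IllegalArgumentException(
                "maxExpansions must be between 1 and " + HARD_LIMIT + ", got " + maxExpansions);
        }
        this.maxExpansions = maxExpansions;
    }
}
```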




[jira] [Commented] (LUCENE-8284) Add MultiTermsIntervalsSource

2018-05-02 Thread Matt Weber (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461092#comment-16461092
 ] 

Matt Weber commented on LUCENE-8284:
------------------------------------

[~jpountz] [~jim.ferenczi]

I disagree.  There are many cases where they are not expensive and/or I, as a 
user, understand the consequences and am willing to live with them.  Indexing 
techniques (ngrams, etc.) will only go so far, and there are many cases where 
they might actually introduce issues once you're not working on a tiny dataset.  
I feel the type of restrictions or optimizations you talk about should be added 
at the usage level, i.e. Solr or Elasticsearch.

Is there anything I can do to move this forward?  Add an expansion limit?  
Rewrite support?




[jira] [Commented] (LUCENE-8284) Add MultiTermsIntervalsSource

2018-05-02 Thread Jim Ferenczi (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16460819#comment-16460819
 ] 

Jim Ferenczi commented on LUCENE-8284:
--------------------------------------

I agree with Adrien: just exposing multi-term queries without limitations is 
not going to scale and will certainly produce OOMEs or very slow queries.
(Edge) ngram indexing should be used for prefix and infix matching, but it 
won't work for fuzzy or regex queries, so another option would be to accept 
multi-term queries but only if they use top-terms rewriting.
So it would only select the top terms (and we can limit the number to 
BooleanQuery#MAX_CLAUSE_COUNT) and translate them into term intervals? 
maxExpansions should be a mandatory parameter for this kind of source in order 
to make sure that users are aware that only a subset of the matching terms is 
considered.




[jira] [Commented] (LUCENE-8284) Add MultiTermsIntervalsSource

2018-05-02 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16460767#comment-16460767
 ] 

Adrien Grand commented on LUCENE-8284:
--------------------------------------

I don't think we should expose intervals on multi-term queries: it doesn't 
scale. I know spans have SpanMultiTermQueryWrapper, which does the same, but 
I've seen users having issues with it as soon as they weren't working on a 
tiny dataset anymore. Intervals on prefixes or infixes should be doable by 
indexing (edge) ngrams; maybe there are things we could do in order to make 
that easier to use? This might be just documentation.
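
To make the edge-ngram idea concrete, here is a small JDK-only sketch of what 
gets indexed for a term. The {{EdgeNGrams}} class and its parameters are 
assumptions for this example (it assumes {{minGram >= 1}} and counts UTF-16 
code units); in practice the grams would be produced by an analysis chain at 
index time rather than by hand.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: every leading substring of the term between minGram and maxGram long.
class EdgeNGrams {

    static List<String> of(String term, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        int longest = Math.min(maxGram, term.length());
        for (int len = minGram; len <= longest; len++) {
            grams.add(term.substring(0, len)); // prefix of length len
        }
        return grams;
    }
}
```

With the grams indexed as terms, a prefix such as "luc" becomes an ordinary 
single-term lookup, so no multi-term expansion is needed at query time.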




[jira] [Commented] (LUCENE-8284) Add MultiTermsIntervalsSource

2018-04-30 Thread Matt Weber (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16458673#comment-16458673
 ] 

Matt Weber commented on LUCENE-8284:
------------------------------------

[~romseygeek] [~jimczi]

Since these expand terms per-segment, the terms are not available when creating 
the {{IntervalWeight}}, which results in a null {{simScorer}} if these are the 
only sources.  For now I use a constant {{1.0f}} in this case.  I'm not sure 
whether this is the best approach.
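
One reading of the fallback described above, as a sketch: when no scorer could 
be built, return a constant 1.0f instead of delegating. The nested {{Scorer}} 
interface is a hypothetical stand-in and does not mirror Lucene's SimScorer 
API; the patch itself may apply the constant differently.

```java
// Hypothetical sketch of a null-scorer fallback; not taken from the patch.
class ConstantScoreFallback {

    // Stand-in for a similarity scorer that maps a frequency to a score.
    interface Scorer {
        float score(float freq);
    }

    static float score(Scorer simScorer, float freq) {
        // No term statistics were available up front, so fall back to a constant.
        return simScorer == null ? 1.0f : simScorer.score(freq);
    }
}
```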
