[jira] [Commented] (SOLR-13094) NPE while doing regular Facet

2019-09-13 Thread Michael Gibney (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16929379#comment-16929379
 ] 

Michael Gibney commented on SOLR-13094:
---

It looks like there may be other things going on here, but based on the 
information included in the description, it's worth pointing out that there are 
fundamental issues with distributed faceting over SortableTextField, described 
in SOLR-13056. I realize that these issues may not be directly related to the 
problem(s?) described/linked to here.

But for now it might be best to avoid distributed faceting over 
SortableTextField (e.g., by using separate fields for sorting and for 
searching/faceting); doing so might address your immediate issue.
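
For example, a schema sketch along these lines (field and type names are 
hypothetical; the types would need to match definitions in your schema) keeps 
SortableTextField out of the facet/search path:
{code:xml}
<!-- analyzed field for searching -->
<field name="description" type="text_general" indexed="true" stored="true"/>
<!-- docValues string copy for faceting -->
<field name="description_str" type="string" indexed="true" stored="false" docValues="true"/>
<!-- SortableTextField-backed copy used only for sorting -->
<field name="description_sort" type="sortable_text" indexed="true" stored="false"/>

<copyField source="description" dest="description_str"/>
<copyField source="description" dest="description_sort"/>
{code}
Faceting would then target {{description_str}}, and sorting {{description_sort}}.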

> NPE while doing regular Facet
> -
>
> Key: SOLR-13094
> URL: https://issues.apache.org/jira/browse/SOLR-13094
> Project: Solr
>  Issue Type: Bug
>  Components: Facet Module
>Affects Versions: 7.5.0
>Reporter: Amrit Sarkar
>Priority: Major
>
> I am issuing a regular facet query:
> {code}
> ModifiableSolrParams params = new ModifiableSolrParams()
>     .add("q", query.trim())
>     .add("rows", "0")
>     .add("facet", "true")
>     .add("facet.field", "description")
>     .add("facet.limit", "200");
> {code}
> Exception:
> {code}
> 2018-12-24 15:50:20.843 ERROR (qtp690521419-130) [c:wiki s:shard2 
> r:core_node4 x:wiki_shard2_replica_n2] o.a.s.s.HttpSolrCall 
> null:org.apache.solr.common.SolrException: Exception during facet.field: 
> description
>   at 
> org.apache.solr.request.SimpleFacets.lambda$getFacetFieldCounts$0(SimpleFacets.java:832)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at org.apache.solr.request.SimpleFacets$3.execute(SimpleFacets.java:765)
>   at 
> org.apache.solr.request.SimpleFacets.getFacetFieldCounts(SimpleFacets.java:841)
>   at 
> org.apache.solr.handler.component.FacetComponent.getFacetCounts(FacetComponent.java:329)
>   at 
> org.apache.solr.handler.component.FacetComponent.process(FacetComponent.java:273)
>   at 
> org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:298)
>   at 
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:199)
>   at org.apache.solr.core.SolrCore.execute(SolrCore.java:2541)
>   at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:709)
>   at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:515)
>   at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:377)
>   at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:323)
>   at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1634)
>   at 
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:533)
>   at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:146)
>   at 
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
>   at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
>   at 
> org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)
>   at 
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1595)
>   at 
> org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)
>   at 
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1317)
>   at 
> org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)
>   at 
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:473)
>   at 
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1564)
>   at 
> org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)
>   at 
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1219)
>   at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
>   at 
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:219)
>   at 
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:126)
>   at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
>   at 
> org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)
>   at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
>   at org.eclipse.jetty.server.Server.handle(Server.java:531)
>   at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:352)
>   ...
> {code}

[jira] [Commented] (LUCENE-8972) CharFilter version of ICUTransformFilter, to better support dictionary-based tokenization

2019-09-09 Thread Michael Gibney (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16925841#comment-16925841
 ] 

Michael Gibney commented on LUCENE-8972:


For consideration, I believe this issue has already been tackled by [~cbeer] 
and [~mejackreed]; the resulting implementation can be found 
[here|https://github.com/sul-dlss/CJKFilterUtils/blob/master/src/main/java/edu/stanford/lucene/analysis/ICUTransformCharFilter.java].

> CharFilter version of ICUTransformFilter, to better support dictionary-based 
> tokenization
> -
>
> Key: LUCENE-8972
> URL: https://issues.apache.org/jira/browse/LUCENE-8972
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Affects Versions: master (9.0), 8.2
>Reporter: Michael Gibney
>Priority: Minor
>
> The ICU Transliteration API is currently exposed through Lucene only 
> post-tokenizer, via ICUTransformFilter. Some tokenizers (particularly 
> dictionary-based) may assume pre-normalized input (e.g., for Chinese 
> characters, there may be an assumption of traditional-only or simplified-only 
> input characters, at the level of either all input, or 
> per-dictionary-defined-token).
> The potential usefulness of a CharFilter that exposes the ICU Transliteration 
> API was suggested in a [thread on the Solr mailing 
> list|https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201807.mbox/%3C4DAB7BA7-42A8-4009-8B49-60822B00DE7D%40wunderwood.org%3E],
>  and my hope is that this issue can facilitate more detailed discussion of 
> the proposed addition.
> Concrete examples of mixed traditional/simplified characters that are 
> currently tokenized differently by the ICUTokenizer are:
>  * 红楼梦 (SSS)
>  * 紅樓夢 (TTT)
>  * 紅楼夢 (TST)
> The first two tokens (simplified-only and traditional-only, respectively) are 
> included in the [CJ dictionary that backs 
> ICUTokenizer|https://raw.githubusercontent.com/unicode-org/icu/release-62-1/icu4c/source/data/brkitr/dictionaries/cjdict.txt],
>  but the last (a mixture of traditional and simplified characters) is not, 
> and is not recognized as a token. Even _if_ we assume this to be an 
> intentional omission from the dictionary that results in behavior that could 
> be desirable for some use cases, there are surely some use cases that would 
> benefit from a more permissive dictionary-based tokenization strategy (such 
> as could be supported by pre-tokenizer transliteration).



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-8972) CharFilter version of ICUTransformFilter, to better support dictionary-based tokenization

2019-09-09 Thread Michael Gibney (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Gibney updated LUCENE-8972:
---
Summary: CharFilter version of ICUTransformFilter, to better support 
dictionary-based tokenization  (was: CharFilter version ICUTransformFilter, to 
better support dictionary-based tokenization)

> CharFilter version of ICUTransformFilter, to better support dictionary-based 
> tokenization
> -
>
> Key: LUCENE-8972
> URL: https://issues.apache.org/jira/browse/LUCENE-8972
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Affects Versions: master (9.0), 8.2
>Reporter: Michael Gibney
>Priority: Minor
>
> The ICU Transliteration API is currently exposed through Lucene only 
> post-tokenizer, via ICUTransformFilter. Some tokenizers (particularly 
> dictionary-based) may assume pre-normalized input (e.g., for Chinese 
> characters, there may be an assumption of traditional-only or simplified-only 
> input characters, at the level of either all input, or 
> per-dictionary-defined-token).
> The potential usefulness of a CharFilter that exposes the ICU Transliteration 
> API was suggested in a [thread on the Solr mailing 
> list|https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201807.mbox/%3C4DAB7BA7-42A8-4009-8B49-60822B00DE7D%40wunderwood.org%3E],
>  and my hope is that this issue can facilitate more detailed discussion of 
> the proposed addition.
> Concrete examples of mixed traditional/simplified characters that are 
> currently tokenized differently by the ICUTokenizer are:
>  * 红楼梦 (SSS)
>  * 紅樓夢 (TTT)
>  * 紅楼夢 (TST)
> The first two tokens (simplified-only and traditional-only, respectively) are 
> included in the [CJ dictionary that backs 
> ICUTokenizer|https://raw.githubusercontent.com/unicode-org/icu/release-62-1/icu4c/source/data/brkitr/dictionaries/cjdict.txt],
>  but the last (a mixture of traditional and simplified characters) is not, 
> and is not recognized as a token. Even _if_ we assume this to be an 
> intentional omission from the dictionary that results in behavior that could 
> be desirable for some use cases, there are surely some use cases that would 
> benefit from a more permissive dictionary-based tokenization strategy (such 
> as could be supported by pre-tokenizer transliteration).



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-8972) CharFilter version ICUTransformFilter, to better support dictionary-based tokenization

2019-09-09 Thread Michael Gibney (Jira)
Michael Gibney created LUCENE-8972:
--

 Summary: CharFilter version ICUTransformFilter, to better support 
dictionary-based tokenization
 Key: LUCENE-8972
 URL: https://issues.apache.org/jira/browse/LUCENE-8972
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Affects Versions: 8.2, master (9.0)
Reporter: Michael Gibney


The ICU Transliteration API is currently exposed through Lucene only 
post-tokenizer, via ICUTransformFilter. Some tokenizers (particularly 
dictionary-based) may assume pre-normalized input (e.g., for Chinese 
characters, there may be an assumption of traditional-only or simplified-only 
input characters, at the level of either all input, or 
per-dictionary-defined-token).

The potential usefulness of a CharFilter that exposes the ICU Transliteration 
API was suggested in a [thread on the Solr mailing 
list|https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201807.mbox/%3C4DAB7BA7-42A8-4009-8B49-60822B00DE7D%40wunderwood.org%3E],
 and my hope is that this issue can facilitate more detailed discussion of the 
proposed addition.

Concrete examples of mixed traditional/simplified characters that are 
currently tokenized differently by the ICUTokenizer are:
 * 红楼梦 (SSS)
 * 紅樓夢 (TTT)
 * 紅楼夢 (TST)

The first two tokens (simplified-only and traditional-only, respectively) are 
included in the [CJ dictionary that backs 
ICUTokenizer|https://raw.githubusercontent.com/unicode-org/icu/release-62-1/icu4c/source/data/brkitr/dictionaries/cjdict.txt],
 but the last (a mixture of traditional and simplified characters) is not, and 
is not recognized as a token. Even _if_ we assume this to be an intentional 
omission from the dictionary that results in behavior that could be desirable 
for some use cases, there are surely some use cases that would benefit from a 
more permissive dictionary-based tokenization strategy (such as could be 
supported by pre-tokenizer transliteration).
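
As a rough illustration of the idea, a minimal sketch assuming ICU4J's 
{{Transliterator}} API (which a CharFilter implementation would wrap):
{code:java}
import com.ibm.icu.text.Transliterator;

public class PreTokenizeTransliterate {
  public static void main(String[] args) {
    // normalize mixed traditional/simplified input *before* tokenization
    Transliterator t = Transliterator.getInstance("Traditional-Simplified");
    // the mixed TST form above becomes the simplified-only SSS form,
    // which the dictionary backing ICUTokenizer does recognize as a token
    System.out.println(t.transliterate("紅楼夢")); // -> 红楼梦
  }
}
{code}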



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (SOLR-13714) Incorrect shardHandlerFactory config element documented in refguide for "distributed requests"

2019-08-23 Thread Michael Gibney (Jira)
Michael Gibney created SOLR-13714:
-

 Summary: Incorrect shardHandlerFactory config element documented 
in refguide for "distributed requests"
 Key: SOLR-13714
 URL: https://issues.apache.org/jira/browse/SOLR-13714
 Project: Solr
  Issue Type: Bug
  Security Level: Public (Default Security Level. Issues are Public)
  Components: documentation
Affects Versions: 8.1.1, 7.7.2
Reporter: Michael Gibney


Reference guide documentation is inconsistent with respect to configuration of 
{{shardHandlerFactory}} in {{solrconfig.xml}}.

The correct config element name is "{{shardHandlerFactory}}", as reflected in 
code [in 
SolrXmlConfig.java|https://github.com/apache/lucene-solr/blob/301ea0e/solr/core/src/java/org/apache/solr/core/SolrXmlConfig.java#L460]
 and [in 
SearchHandler.java|https://github.com/apache/lucene-solr/blob/43fc05c/solr/core/src/java/org/apache/solr/handler/component/SearchHandler.java#L97].

The element name is documented correctly in the [refGuide page for "Format of 
solr.xml"|https://lucene.apache.org/solr/guide/8_1/format-of-solr-xml.html#the-shardhandlerfactory-element],
 but it is documented incorrectly (as "{{shardHandler}}", not 
"{{shardHandlerFactory}}" in the [refGuide page for "Distributed 
Requests"|https://lucene.apache.org/solr/guide/8_1/distributed-requests.html#configuring-the-shardhandlerfactory].



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-13257) Enable replica routing affinity for better cache usage

2019-08-21 Thread Michael Gibney (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16912638#comment-16912638
 ] 

Michael Gibney commented on SOLR-13257:
---

Thanks [~tomasflobbe] and [~cpoerschke]!

> Enable replica routing affinity for better cache usage
> --
>
> Key: SOLR-13257
> URL: https://issues.apache.org/jira/browse/SOLR-13257
> Project: Solr
>  Issue Type: New Feature
>  Components: SolrCloud
>Reporter: Michael Gibney
>Assignee: Tomás Fernández Löbbe
>Priority: Minor
> Fix For: master (9.0), 8.3
>
> Attachments: AffinityShardHandlerFactory.java, SOLR-13257.patch, 
> SOLR-13257.patch
>
>  Time Spent: 4h 20m
>  Remaining Estimate: 0h
>
> For each shard in a distributed request, Solr currently routes each request 
> randomly via 
> [ShufflingReplicaListTransformer|https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/handler/component/ShufflingReplicaListTransformer.java]
>  to a particular replica. In setups with replication factor >1, this normally 
> results in a situation where subsequent requests (which one would hope/expect 
> to leverage cached results from previous related requests) end up getting 
> routed to a replica that hasn't seen any related requests.
> The problem can be replicated by issuing a relatively expensive query (maybe 
> containing common terms?). The first request initializes the 
> {{queryResultCache}} on the consulted replicas. If replication factor >1 and 
> there are a sufficient number of shards, subsequent requests will likely be 
> routed to at least one replica that _hasn't_ seen the query before. The 
> replicas with uninitialized caches become a bottleneck, and from the client's 
> perspective, many subsequent requests appear not to benefit from caching at 
> all.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-13257) Enable replica routing affinity for better cache usage

2019-08-17 Thread Michael Gibney (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16909874#comment-16909874
 ] 

Michael Gibney commented on SOLR-13257:
---

No more changes; looks ready to merge from my perspective. Thanks!

> Enable replica routing affinity for better cache usage
> --
>
> Key: SOLR-13257
> URL: https://issues.apache.org/jira/browse/SOLR-13257
> Project: Solr
>  Issue Type: New Feature
>  Components: SolrCloud
>Reporter: Michael Gibney
>Assignee: Tomás Fernández Löbbe
>Priority: Minor
> Attachments: AffinityShardHandlerFactory.java, SOLR-13257.patch, 
> SOLR-13257.patch
>
>  Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> For each shard in a distributed request, Solr currently routes each request 
> randomly via 
> [ShufflingReplicaListTransformer|https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/handler/component/ShufflingReplicaListTransformer.java]
>  to a particular replica. In setups with replication factor >1, this normally 
> results in a situation where subsequent requests (which one would hope/expect 
> to leverage cached results from previous related requests) end up getting 
> routed to a replica that hasn't seen any related requests.
> The problem can be replicated by issuing a relatively expensive query (maybe 
> containing common terms?). The first request initializes the 
> {{queryResultCache}} on the consulted replicas. If replication factor >1 and 
> there are a sufficient number of shards, subsequent requests will likely be 
> routed to at least one replica that _hasn't_ seen the query before. The 
> replicas with uninitialized caches become a bottleneck, and from the client's 
> perspective, many subsequent requests appear not to benefit from caching at 
> all.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-13257) Enable replica routing affinity for better cache usage

2019-08-05 Thread Michael Gibney (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16900547#comment-16900547
 ] 

Michael Gibney commented on SOLR-13257:
---

Thanks [~tomasflobbe] and [~cpoerschke] for your feedback, and for helping move 
this feature along! I just pushed some commits to address some points raised by 
[~cpoerschke]: opting for stricter handling of the {{dividend}} param (if 
present, it must be an integer), and adding/clarifying documentation and 
javadocs.

Most of the changes have to do with added documentation. One functional change 
adds the ability to explicitly disable implicit hash-based routing (by setting 
the {{hash}} param config to the empty String, in 
solr.xml). This idea came up as a consequence of thinking about edge cases 
while documenting the behavior of replica routing configuration in solr.xml.
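
As a usage illustration (the {{routeKey}} param name is hypothetical; the 
syntax follows the {{shards.preference}} examples discussed in this issue):
{code:java}
import org.apache.solr.common.params.ModifiableSolrParams;

// stable hash-based routing: otherwise-equivalent replicas are ordered
// deterministically based on a hash of the named request param's value
ModifiableSolrParams params = new ModifiableSolrParams();
params.add("q", "some query");
params.add("shards.preference", "replica.base:stable:hash:routeKey");
params.add("routeKey", "user-42"); // same key => same replica ordering
{code}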

> Enable replica routing affinity for better cache usage
> --
>
> Key: SOLR-13257
> URL: https://issues.apache.org/jira/browse/SOLR-13257
> Project: Solr
>  Issue Type: New Feature
>  Components: SolrCloud
>Reporter: Michael Gibney
>Assignee: Tomás Fernández Löbbe
>Priority: Minor
> Attachments: AffinityShardHandlerFactory.java, SOLR-13257.patch, 
> SOLR-13257.patch
>
>  Time Spent: 3h 40m
>  Remaining Estimate: 0h
>
> For each shard in a distributed request, Solr currently routes each request 
> randomly via 
> [ShufflingReplicaListTransformer|https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/handler/component/ShufflingReplicaListTransformer.java]
>  to a particular replica. In setups with replication factor >1, this normally 
> results in a situation where subsequent requests (which one would hope/expect 
> to leverage cached results from previous related requests) end up getting 
> routed to a replica that hasn't seen any related requests.
> The problem can be replicated by issuing a relatively expensive query (maybe 
> containing common terms?). The first request initializes the 
> {{queryResultCache}} on the consulted replicas. If replication factor >1 and 
> there are a sufficient number of shards, subsequent requests will likely be 
> routed to at least one replica that _hasn't_ seen the query before. The 
> replicas with uninitialized caches become a bottleneck, and from the client's 
> perspective, many subsequent requests appear not to benefit from caching at 
> all.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-12532) Slop specified in query string is not preserved for certain phrase searches

2019-07-26 Thread Michael Gibney (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16893976#comment-16893976
 ] 

Michael Gibney commented on SOLR-12532:
---

[~steve_rowe], I'd be curious to know what you think of this issue/PR, if you 
have a chance to take a look.

> Slop specified in query string is not preserved for certain phrase searches
> ---
>
> Key: SOLR-12532
> URL: https://issues.apache.org/jira/browse/SOLR-12532
> Project: Solr
>  Issue Type: Bug
>  Components: query parsers
>Affects Versions: 7.4
>Reporter: Brad Sumersford
>Priority: Major
> Attachments: SOLR-12532.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Note: This only impacts specific settings for the WordDelimiterGraphFilter as 
> detailed below.
> When a phrase search is parsed by the SolrQueryParser, and the phrase search 
> results in a graph token stream, the resulting SpanNearQuery created does not 
> have the slop correctly set.
> h4. Conditions
>  - Slop provided in query string (ex: "you can't"~2)
>  - WordDelimiterGraphFilterFactory with query time preserveOriginal and 
> generateWordParts
>  - query string includes a term that contains a word delimiter
> h4. Example
> Field: wdf_partspreserve 
>  – WordDelimiterGraphFilterFactory 
>   preserveOriginal="1"
>   generateWordParts="1"
> Data: you just can't
>  Search: wdf_partspreserve:"you can't"~2 -> 0 Results
> h4. Cause
> The slop supplied by the query string is applied in 
> SolrQueryParserBase#getFieldQuery which will set the slop only for 
> PhraseQuery and MultiPhraseQuery. Since "can't" will be broken down into 
> multiple tokens analyzeGraphPhrase will be triggered when the Query is being 
> constructed which will return a SpanNearQuery instead of a (Multi)PhraseQuery.
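
For illustration only (not necessarily what the attached patch does), the slop 
could be propagated onto the graph-derived span query roughly as follows:
{code:java}
import org.apache.lucene.search.Query;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;

// rebuild a graph-derived SpanNearQuery so the query-string slop is applied
static Query applySlop(Query query, int slop) {
  if (query instanceof SpanNearQuery) {
    SpanNearQuery snq = (SpanNearQuery) query;
    SpanNearQuery.Builder builder =
        new SpanNearQuery.Builder(snq.getField(), snq.isInOrder());
    builder.setSlop(slop);
    for (SpanQuery clause : snq.getClauses()) {
      builder.addClause(clause);
    }
    return builder.build();
  }
  return query; // (Multi)PhraseQuery slop is already handled elsewhere
}
{code}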



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-7798) Improve robustness of ExpandComponent

2019-07-26 Thread Michael Gibney (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-7798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16893958#comment-16893958
 ] 

Michael Gibney commented on SOLR-7798:
--

[~joel.bernstein], I circled back to this, and squash-rebased [PR 
325|https://github.com/apache/lucene-solr/pull/325] on current master. The 
patch applies cleanly and passes precommit and all tests, so it should be 
solid. I'm sorry for the false start (in Feb. 2018); if you'd be willing to 
take another look at this, I think this will now _actually_ be as 
straightforward as it initially should have been!

> Improve robustness of ExpandComponent
> -
>
> Key: SOLR-7798
> URL: https://issues.apache.org/jira/browse/SOLR-7798
> Project: Solr
>  Issue Type: Improvement
>  Components: SearchComponents - other
>Reporter: Jörg Rathlev
>Assignee: Joel Bernstein
>Priority: Minor
> Attachments: expand-component.patch, expand-npe.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The {{ExpandComponent}} causes a {{NullPointerException}} if accidentally 
> used without prior collapsing of results.
> If there are multiple documents in the result which have the same term value 
> in the expand field, the size of the {{ordBytes}}/{{groupSet}} differs from 
> the {{count}} value, and the {{getGroupQuery}} method creates an incompletely 
> filled {{bytesRef}} array, which later causes a {{NullPointerException}} when 
> trying to sort the terms.
> The attached patch extends the test to demonstrate the error, and modifies 
> the {{getGroupQuery}} methods to create the array based on the size of the 
> input maps.
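
A sketch of the sizing fix described above (names and method shape simplified 
from the actual ExpandComponent code):
{code:java}
import java.util.Map;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermInSetQuery;
import org.apache.lucene.util.BytesRef;

// size the terms array from the group map itself, not from the outer count,
// so no null entries remain when the terms are later sorted
static Query getGroupQuery(String field, Map<BytesRef, ?> groupSet) {
  BytesRef[] terms = new BytesRef[groupSet.size()];
  int i = 0;
  for (BytesRef group : groupSet.keySet()) {
    terms[i++] = BytesRef.deepCopyOf(group);
  }
  return new TermInSetQuery(field, terms);
}
{code}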



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4312) Index format to store position length per position

2019-07-25 Thread Michael Gibney (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-4312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16892990#comment-16892990
 ] 

Michael Gibney commented on LUCENE-4312:


For the sake of facilitating discussion around something more concrete, I 
uploaded a patch ([^positionLength-postings.patch]) for a straw-man proposal 
for {{PostingsEnum}} modifications to support position length (also visible as 
a pseudo-PR here: [https://github.com/magibney/lucene-solr/pull/1]). The patch 
won't compile, of course (no corresponding modifications to subclasses of 
{{PostingsEnum}}).

The proposal goes a bit beyond simply adding a {{positionLength()}} method, 
with a few additional fundamental methods to support optimizations that proved 
helpful in implementing performant positional queries (for LUCENE-7398).

Any feedback would be much appreciated, especially given the acknowledged 
provisional (and potentially controversial) nature of this proposal.

[~jpountz], I've given some more thought to the challenge of not being able to 
"advance positions on terms in the order we want anymore". I think there should 
be a general-purpose way to preserve this ability (in a way that doesn't depend 
on the kind of corpus-specific shingle-based filtering that I previously 
suggested). I'm considering an approach leveraging something analogous to a 
reverse token filter, except rather than reversing the token text, it (sort of) 
reverses start/end positions: start position of the new token is end position 
of the original token, and end position of the new token is 
{{originalEndPosition + positionLength}}. Then you could use the least-cost 
term as an entrypoint, and build forward with original tokens, backward with 
the modified-positions tokens. Query implementation would be responsible for 
properly interpreting flipped positions.

 

> Index format to store position length per position
> --
>
> Key: LUCENE-4312
> URL: https://issues.apache.org/jira/browse/LUCENE-4312
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: 6.0
>Reporter: Gang Luo
>Priority: Minor
>  Labels: Suggestion
> Attachments: positionLength-postings.patch
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> Mike McCandless said: TokenStreams are actually graphs. The indexer ignores 
> PositionLengthAttribute. We need to change the index format (and Codec APIs) 
> to store an additional int position length per position.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-4312) Index format to store position length per position

2019-07-25 Thread Michael Gibney (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-4312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Gibney updated LUCENE-4312:
---
Attachment: positionLength-postings.patch

> Index format to store position length per position
> --
>
> Key: LUCENE-4312
> URL: https://issues.apache.org/jira/browse/LUCENE-4312
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: 6.0
>Reporter: Gang Luo
>Priority: Minor
>  Labels: Suggestion
> Attachments: positionLength-postings.patch
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> Mike McCandless said: TokenStreams are actually graphs. The indexer ignores 
> PositionLengthAttribute. We need to change the index format (and Codec APIs) 
> to store an additional int position length per position.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4312) Index format to store position length per position

2019-07-11 Thread Michael Gibney (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-4312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16883103#comment-16883103
 ] 

Michael Gibney commented on LUCENE-4312:


That PR looks like a good start, [~romseygeek].

I'd still like to have a better sense of what bar we're trying to clear as far 
as demonstrating the usefulness of recording position length in the index. It 
seems we have consensus (basically unchanged since August 2012) that:
 # Recording position length in the index is relatively straightforward, 
whether done via codecs and a modified postings API or via Payloads and custom 
Intervals/Spans implementations (a payload-based sketch follows below).
 # Performant positional query implementation leveraging position length is a 
significant challenge.
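
A minimal sketch of the Payloads route mentioned in point 1 (an illustrative 
stand-in, not the actual LUCENE-7398 code):
{code:java}
import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionLengthAttribute;
import org.apache.lucene.util.BytesRef;

// record each token's positionLength in its payload, so that a custom Spans
// or Intervals implementation can recover token-graph structure at query time
public final class PositionLengthPayloadFilter extends TokenFilter {
  private final PositionLengthAttribute posLenAtt =
      addAttribute(PositionLengthAttribute.class);
  private final PayloadAttribute payloadAtt =
      addAttribute(PayloadAttribute.class);

  public PositionLengthPayloadFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    int posLen = posLenAtt.getPositionLength();
    // simple 4-byte big-endian encoding; a varint would be more compact
    payloadAtt.setPayload(new BytesRef(new byte[] {
        (byte) (posLen >>> 24), (byte) (posLen >>> 16),
        (byte) (posLen >>> 8), (byte) posLen}));
    return true;
  }
}
{code}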

Would people be comfortable evaluating usefulness/performance based on the 
Spans implementation proposed for LUCENE-7398, as opposed to holding out for an 
Intervals query implementation? Particularly if we're talking proof-of-concept, 
all of the performance implications are directly comparable (between Spans and 
Intervals).

Is there a _conditional_ consensus that indexed position length would be a 
worthwhile feature, assuming the implementation of performant positional 
queries to leverage it?

> Index format to store position length per position
> --
>
> Key: LUCENE-4312
> URL: https://issues.apache.org/jira/browse/LUCENE-4312
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: 6.0
>Reporter: Gang Luo
>Priority: Minor
>  Labels: Suggestion
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> Mike McCandless said: TokenStreams are actually graphs. The indexer ignores 
> PositionLengthAttribute. We need to change the index format (and Codec APIs) 
> to store an additional int position length per position.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-13257) Enable replica routing affinity for better cache usage

2019-07-10 Thread Michael Gibney (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16882628#comment-16882628
 ] 

Michael Gibney commented on SOLR-13257:
---

Thanks, [~cpoerschke] and [~tomasflobbe]! Responses shortly to PR comments.

[~cpoerschke], I think the examples (and answers) you set out are all 
illustrative and accurate (with the exception that the syntax would be 
{{shards.preference=replica.base:stable:dividend:fooBar=0}} or 
{{shards.preference=replica.base:stable:hash:fooBar=0}}).

A couple of comments on the initial example cases (though again it seems that 
the answers you provide are accurate, as far as I understand it):

2. Example: opt for possibility 2. It's a nitpicky semantic point, but the 
deterministic preference param doesn't ever really resolve directly to a 
particular replica, rather it resolves to a particular rotation of the replica 
preferences list. The first of these will in most (all?) cases be chosen to 
serve the request, but there's nothing special about the "top" replica _per 
se_, and thus nothing different about the _other_ replicas (that would cause 
them to be sorted differently).

4. "Does the new shard affinity logic influence the ordering within the end 
portion of the list too?"
Yes; each grouping of otherwise equivalent options is sorted and 
deterministically rotated according to the specified affinity param.

5. Same as 4; the new affinity logic only affects groupings of options that are 
otherwise (according to specified {{shards.preference}} param) considered to be 
equivalent; and it affects all such groupings, no matter how many hierarchical 
preferences are specified, or how many otherwise-equivalent groups there are.
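
To make the rotation concrete, a toy sketch (illustrative only, not the actual 
implementation):
{code:java}
import java.util.ArrayList;
import java.util.List;

// rotate a group of otherwise-equivalent replicas by a hash of the routing
// key; every replica stays in the list, only the ordering changes
static List<String> rotate(List<String> equivalent, String routingKey) {
  int n = equivalent.size();
  int start = Math.floorMod(routingKey.hashCode(), n);
  List<String> rotated = new ArrayList<>(n);
  for (int i = 0; i < n; i++) {
    rotated.add(equivalent.get((start + i) % n));
  }
  return rotated;
}
{code}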

> Enable replica routing affinity for better cache usage
> --
>
> Key: SOLR-13257
> URL: https://issues.apache.org/jira/browse/SOLR-13257
> Project: Solr
>  Issue Type: New Feature
>  Components: SolrCloud
>Reporter: Michael Gibney
>Priority: Minor
> Attachments: AffinityShardHandlerFactory.java, SOLR-13257.patch, 
> SOLR-13257.patch
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> For each shard in a distributed request, Solr currently routes each request 
> randomly via 
> [ShufflingReplicaListTransformer|https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/handler/component/ShufflingReplicaListTransformer.java]
>  to a particular replica. In setups with replication factor >1, this normally 
> results in a situation where subsequent requests (which one would hope/expect 
> to leverage cached results from previous related requests) end up getting 
> routed to a replica that hasn't seen any related requests.
> The problem can be replicated by issuing a relatively expensive query (maybe 
> containing common terms?). The first request initializes the 
> {{queryResultCache}} on the consulted replicas. If replication factor >1 and 
> there are a sufficient number of shards, subsequent requests will likely be 
> routed to at least one replica that _hasn't_ seen the query before. The 
> replicas with uninitialized caches become a bottleneck, and from the client's 
> perspective, many subsequent requests appear not to benefit from caching at 
> all.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4312) Index format to store position length per position

2019-07-09 Thread Michael Gibney (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-4312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16881477#comment-16881477
 ] 

Michael Gibney commented on LUCENE-4312:


This sounds potentially like a good way to proceed. I appreciate the need for a 
high bar for getting things into the index – I suppose I was invoking 
"chicken/egg" not directly as an argument for inclusion in the index, but 
rather to highlight the interdependence of these features.

Essentially all of the proof-of-concept work that we're discussing here is 
already implemented as part of the LUCENE-7398 work, and has been running in 
production (and iteratively improved) for over a year. Before proceeding, I'd 
like to get some consensus on what the best way is to move forward, and also 
perhaps have some discussion of what bar we have in mind for "once these prove 
useful".

Regarding usefulness, and the question of to what extent this represents a 
corner-case: anybody interested in index-time synonyms and precise positional 
queries needs this feature. So in some sense this boils down to a question of 
the usefulness of index-time synonyms (or other index-time TokenStream graphs) 
... and since the standing recommendation has for some time been to _avoid_ 
using index-time synonyms, we have another chicken/egg :). I can say that this 
has been considerably helpful in my use case, and the problem it addresses is 
at the root of a number of consistently reported issues, among users of both 
[Elasticsearch|https://discuss.elastic.co/t/not-getting-results-from-a-phrase-query-using-query-string-of-the-form-x-a1-abc-in-6-6-0/179191]
 and 
[Solr|https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201905.mbox/%3CCAF%3DheHETPCqxUcqyu13tFfKFcALzD__-QrToRBP-VVWh1S3-Wg%40mail.gmail.com%3E].

Practically speaking, I'm wondering what's the best way to get the most eyes on 
this feature set, with the goal of evaluating its usefulness and performance. 
The fix as currently implemented is basically a wholesale rewrite of some of 
the Spans classes, but it seeks to correctly support existing Spans contracts; 
implemented as a branch, I was able to rely on existing tests against various 
Spans. For performance reasons, changes were also introduced in indexing code 
(e.g., DefaultIndexingChain). For these reasons, my sense is that it would be 
quite challenging to extract these features into the sandbox module. Even if 
such "sandbox" extraction were possible, it would render the task of evaluation 
more difficult for all but the most dedicated users (currently it suffices to 
run a forked build to swap out the backend Spans implementation in all of the 
parsers and components that rely on Spans). Could these potentially be reasons 
to opt for a "branch" approach (as opposed to "sandbox")?

> Index format to store position length per position
> --
>
> Key: LUCENE-4312
> URL: https://issues.apache.org/jira/browse/LUCENE-4312
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: 6.0
>Reporter: Gang Luo
>Priority: Minor
>  Labels: Suggestion
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> Mike McCandless said: TokenStreams are actually graphs. The indexer ignores 
> PositionLengthAttribute. We need to change the index format (and Codec APIs) 
> to store an additional int position length per position.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4312) Index format to store position length per position

2019-07-08 Thread Michael Gibney (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-4312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16880635#comment-16880635
 ] 

Michael Gibney commented on LUCENE-4312:


True, both good points. But it's kind of a chicken-or-egg situation ... there 
would have been no point in addressing these implied challenges so long as 
position length was not recorded in the index (and was thus not available at 
query time). That doesn't mean there _aren't_ ways to address the challenges.

Regarding the "A B C" example, I addressed this in the LUCENE-7398 work by 
indexing next start position as a lookahead. As a proof of concept this was 
done with Payloads, but in principle I could see slight modifications 
(somewhere at the intersection of codecs and postings API) that would natively 
read next start position "early" and expose it as a lookahead. This would avoid 
the type of problematic call to {{PostingsEnum.nextPosition()}} that would (as 
you correctly point out) result in the need to buffer all information 
associated with _every_ position. I've described this approach in more detail 
[here|https://michaelgibney.net/2018/09/lucene-graph-queries-2/#index-lookahead-don-t-buffer-positions-if-you-don-t-have-to].
{quote}we can't advance positions on terms in the order we want anymore.
{quote}
Yes, I'd argue that's the toughest challenge. I addressed it indirectly by 
constructing CommonGrams-style shingles used specifically for pre-filtering 
conjunctions in the "approximation" phase of two-phase iteration (ensuring that 
common terms at subclause index 0 don't kill performance). This is described in 
more detail 
[here|https://michaelgibney.net/2018/09/lucene-graph-queries-2/#shingle-based-pre-filtering-of-conjunctionspans].

I'm not intending this to be about these particular solutions, and you might 
take issue with the solutions themselves. The more general point I guess is 
that indexed position length is fundamental, and is a prerequisite for the 
development of ways to address these challenges.

> Index format to store position length per position
> --
>
> Key: LUCENE-4312
> URL: https://issues.apache.org/jira/browse/LUCENE-4312
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: 6.0
>Reporter: Gang Luo
>Priority: Minor
>  Labels: Suggestion
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> Mike McCandless said: TokenStreams are actually graphs. The indexer ignores 
> PositionLengthAttribute. We need to change the index format (and Codec APIs) 
> to store an additional int position length per position.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-13257) Enable replica routing affinity for better cache usage

2019-07-08 Thread Michael Gibney (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-13257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Gibney updated SOLR-13257:
--
Attachment: SOLR-13257.patch
Status: Patch Available  (was: Patch Available)

New patch to address test failures.

> Enable replica routing affinity for better cache usage
> --
>
> Key: SOLR-13257
> URL: https://issues.apache.org/jira/browse/SOLR-13257
> Project: Solr
>  Issue Type: New Feature
>  Components: SolrCloud
>Reporter: Michael Gibney
>Priority: Minor
> Attachments: AffinityShardHandlerFactory.java, SOLR-13257.patch, 
> SOLR-13257.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> For each shard in a distributed request, Solr currently routes each request 
> randomly via 
> [ShufflingReplicaListTransformer|https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/handler/component/ShufflingReplicaListTransformer.java]
>  to a particular replica. In setups with replication factor >1, this normally 
> results in a situation where subsequent requests (which one would hope/expect 
> to leverage cached results from previous related requests) end up getting 
> routed to a replica that hasn't seen any related requests.
> The problem can be replicated by issuing a relatively expensive query (maybe 
> containing common terms?). The first request initializes the 
> {{queryResultCache}} on the consulted replicas. If replication factor >1 and 
> there are a sufficient number of shards, subsequent requests will likely be 
> routed to at least one replica that _hasn't_ seen the query before. The 
> replicas with uninitialized caches become a bottleneck, and from the client's 
> perspective, many subsequent requests appear not to benefit from caching at 
> all.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4312) Index format to store position length per position

2019-07-08 Thread Michael Gibney (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-4312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16880490#comment-16880490
 ] 

Michael Gibney commented on LUCENE-4312:


Thank you for the feedback, [~sokolov] and [~jpountz]!
{quote}Recording position lengths in the index is the easy part of the problem 
in my opinion.
{quote}
Yes, this is my view as well; and looking to the future, _respecting_ position 
length would certainly add complexity to phrase queries. But in terms of 
performance impact, the complexity of query execution would be driven by what's 
actually in the index (so for many use cases performance should be roughly 
equivalent to that of an implementation that ignores position length).

Regarding the challenges of query implementation... I'm taking a fresh look at 
this issue in the context of work done on LUCENE-7398, which seeks to implement 
backtracking phrase queries in an efficient way (including sloppy, nested, 
etc.). Despite that issue being nominally about "nested Span queries", it's 
really more generally about "proximity search over variable-length subclauses", 
and the techniques used in the implementation for LUCENE-7398 would be 
transferable to interval queries as well.

It's a fair point about the arbitrariness of sloppy phrase queries with 
intervening multi-term synonyms, but I wouldn't call such queries 
"meaningless"; in any case, I think that problem already exists for multi-term 
indexed synonyms, and is not exacerbated by the introduction of indexed 
position length. Sloppy phrase queries (and, for that matter, tokenization 
itself) are somewhat arbitrary by nature. Following that tangent, I can imagine 
some potential ways to mitigate such arbitrariness ... all of which themselves 
rely on the ability to index token graph structure (i.e., position length).

> Index format to store position length per position
> --
>
> Key: LUCENE-4312
> URL: https://issues.apache.org/jira/browse/LUCENE-4312
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: 6.0
>Reporter: Gang Luo
>Priority: Minor
>  Labels: Suggestion
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> Mike McCandless said: TokenStreams are actually graphs. The indexer ignores 
> PositionLengthAttribute. We need to change the index format (and Codec APIs) 
> to store an additional int position length per position.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4312) Index format to store position length per position

2019-07-05 Thread Michael Gibney (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-4312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16879425#comment-16879425
 ] 

Michael Gibney commented on LUCENE-4312:


Following up on discussion at Berlin Buzzwords with [~mikemccand], [~sokolov], 
[~simonw], and [~romseygeek]:

A lot of useful context (for, e.g., synonym generation, etc.) is available at 
index time that is not available at query time. Leveraging this context can 
result in index-time TokenStream manipulations that produce token graphs. Since 
position length is not indexed, it is impossible at query time to reconstruct 
index-time TokenStream "graph" structure.

Indexed position length is a prerequisite for any use case that calls for:
1. index-time graph TokenStreams
2. precise/accurate proximity query (via spans, intervals, etc.)

Could we discuss adding first-class support for this structural "position 
length" information?

Updating PostingsEnum to include endPosition() -- returning {{position+1}} by 
default -- would be a meaningful first step. This would facilitate the 
development of query implementations without requiring an API fork, and would 
signal an intention to move in the direction of supporting index-time token 
graphs.
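
As a hypothetical sketch (explicitly not Lucene's actual API) of how such a 
default could preserve existing single-position semantics:
{code:java}
import java.io.IOException;

// toy contract: endPosition() defaults to startPosition() + 1, so existing
// (non-graph) postings implementations would need no changes
interface GraphPostings {
  /** Advance to the next position and return it. */
  int nextPosition() throws IOException;

  /** Start position of the current token (the last nextPosition() value). */
  int startPosition() throws IOException;

  /** Exclusive end position; the default corresponds to position length 1. */
  default int endPosition() throws IOException {
    return startPosition() + 1;
  }
}
{code}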

Beyond that, I'm optimistic that codecs could be enhanced to index position 
length without introducing much additional overhead (I'd guess that position 
length for the common case of linear/non-graph index-time token streams could 
compress quite well).

> Index format to store position length per position
> --
>
> Key: LUCENE-4312
> URL: https://issues.apache.org/jira/browse/LUCENE-4312
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: 6.0
>Reporter: Gang Luo
>Priority: Minor
>  Labels: Suggestion
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> Mike McCandless said: TokenStreams are actually graphs. The indexer ignores 
> PositionLengthAttribute. We need to change the index format (and Codec APIs) 
> to store an additional int position length per position.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-13132) Improve JSON "terms" facet performance when sorted by relatedness

2019-06-28 Thread Michael Gibney (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-13132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Gibney updated SOLR-13132:
--
Status: Open  (was: Patch Available)

> Improve JSON "terms" facet performance when sorted by relatedness 
> --
>
> Key: SOLR-13132
> URL: https://issues.apache.org/jira/browse/SOLR-13132
> Project: Solr
>  Issue Type: Improvement
>  Components: Facet Module
>Affects Versions: 7.4, master (9.0)
>Reporter: Michael Gibney
>Priority: Major
> Attachments: SOLR-13132-with-cache-01.patch, 
> SOLR-13132-with-cache.patch, SOLR-13132.patch
>
>
> When sorting buckets by {{relatedness}}, JSON "terms" facet must calculate 
> {{relatedness}} for every term. 
> The current implementation uses a standard uninverted approach (either 
> {{docValues}} or {{UnInvertedField}}) to get facet counts over the domain 
> base docSet, and then uses that initial pass as a pre-filter for a 
> second-pass, inverted approach of fetching docSets for each relevant term 
> (i.e., {{count > minCount}}?) and calculating intersection size of those sets 
> with the domain base docSet.
> Over high-cardinality fields, the overhead of per-term docSet creation and 
> set intersection operations increases request latency to the point where 
> relatedness sort may not be usable in practice (for my use case, even after 
> applying the patch for SOLR-13108, for a field with ~220k unique terms per 
> core, QTime for high-cardinality domain docSets were, e.g.: cardinality 
> 1816684=9000ms, cardinality 5032902=18000ms).
> The attached patch brings the above example QTimes down to a manageable 
> ~300ms and ~250ms respectively. The approach calculates uninverted facet 
> counts over domain base, foreground, and background docSets in parallel in a 
> single pass. This allows us to take advantage of the efficiencies built into 
> the standard uninverted {{FacetFieldProcessorByArray[DV|UIF]}}), and avoids 
> the per-term docSet creation and set intersection overhead.
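
For context, a request of the kind affected might look like the following 
(field names and the {{$fore}}/{{$back}} parameters are hypothetical 
placeholders):
{code:json}
{
  "facet": {
    "categories": {
      "type": "terms",
      "field": "category",
      "limit": 10,
      "sort": { "r1": "desc" },
      "facet": {
        "r1": "relatedness($fore,$back)"
      }
    }
  }
}
{code}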



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-13257) Enable replica routing affinity for better cache usage

2019-05-14 Thread Michael Gibney (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16839653#comment-16839653
 ] 

Michael Gibney commented on SOLR-13257:
---

[PR created|https://github.com/apache/lucene-solr/pull/677]. This ended up 
being a bit more involved than I'd anticipated, because I integrated 
replica affinity with {{shards.preference}}. This seemed like the most natural 
place to add this functionality, and it will allow deterministic routing to be 
used in conjunction with other replica routing preferences (as a 
fallback/tie-breaker). There's also support for configuring default "base 
routing" preference in solrconfig.xml.

> Enable replica routing affinity for better cache usage
> --
>
> Key: SOLR-13257
> URL: https://issues.apache.org/jira/browse/SOLR-13257
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Affects Versions: 7.4, master (9.0)
>Reporter: Michael Gibney
>Priority: Minor
> Attachments: AffinityShardHandlerFactory.java
>
>
> For each shard in a distributed request, Solr currently routes each request 
> randomly via 
> [ShufflingReplicaListTransformer|https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/handler/component/ShufflingReplicaListTransformer.java]
>  to a particular replica. In setups with replication factor >1, this normally 
> results in a situation where subsequent requests (which one would hope/expect 
> to leverage cached results from previous related requests) end up getting 
> routed to a replica that hasn't seen any related requests.
> The problem can be replicated by issuing a relatively expensive query (maybe 
> containing common terms?). The first request initializes the 
> {{queryResultCache}} on the consulted replicas. If replication factor >1 and 
> there are a sufficient number of shards, subsequent requests will likely be 
> routed to at least one replica that _hasn't_ seen the query before. The 
> replicas with uninitialized caches become a bottleneck, and from the client's 
> perspective, many subsequent requests appear not to benefit from caching at 
> all.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-12243) Edismax missing phrase queries when phrases contain multiterm synonyms

2019-05-09 Thread Michael Gibney (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16836448#comment-16836448
 ] 

Michael Gibney commented on SOLR-12243:
---

[~fmr], there are several issues relevant to the problem you've encountered:

Multi-term synonyms invoke graphPhraseQuery, implemented for {{6.5 <= _version_ 
< 7.6}} as SpanNearQuery, which was not prone to exponential growth; but (also 
prior to 7.6) that SpanNearQuery was completely ignored. It's the latter 
problem (ignoring) that this issue (SOLR-12243) fixes.

The exponential expansion is related to LUCENE-8531, which reverts LUCENE-7699 
by changing the SpanNearQuery graph phrase query implementation back to the 
pre-6.5 MultiPhraseQuery implementation (when slop>0), for semantic 
compatibility reasons.

MultiPhraseQuery is inherently susceptible to exponential expansion, so there 
is no workaround at the moment to fully support a high degree of synonym 
expansion in conjunction with slop>0. Regarding the manifestation of the 
problem as "single query taking down an entire solr-server", this should be 
mitigated starting in 8.1 (see SOLR-13336). Individual queries will still fail 
if expanded beyond a configurable threshold (number of clauses), but the type 
of systemic problem that you encountered will be prevented.

Regarding a potential longer-term solution, it might be worth looking at 
LUCENE-8544.

> Edismax missing phrase queries when phrases contain multiterm synonyms
> --
>
> Key: SOLR-12243
> URL: https://issues.apache.org/jira/browse/SOLR-12243
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: query parsers
>Affects Versions: 7.1
> Environment: RHEL, MacOS X
> Do not believe this is environment-specific.
>Reporter: Elizabeth Haubert
>Assignee: Steve Rowe
>Priority: Major
> Fix For: 7.6, 8.0
>
> Attachments: SOLR-12243.patch, SOLR-12243.patch, SOLR-12243.patch, 
> SOLR-12243.patch, SOLR-12243.patch, SOLR-12243.patch, SOLR-12243.patch, 
> multiword-synonyms.txt, schema.xml, solrconfig.xml
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> synonyms.txt:
> {code}
> allergic, hypersensitive
> aspirin, acetylsalicylic acid
> dog, canine, canis familiris, k 9
> rat, rattus
> {code}
> request handler:
> {code:xml}
> <requestHandler name="/select" class="solr.SearchHandler">
>   <lst name="defaults">
>     <str name="defType">edismax</str>
>     <float name="tie">0.4</float>
>     <str name="qf">title^100</str>
>     <str name="pf">title~20^5000</str>
>     <str name="pf2">title~11</str>
>     <str name="pf3">title~22^1000</str>
>     <str name="df">text</str>
>     <str name="mm">3<-1 6<-3 9<30%</str>
>     <str name="q.alt">*:*</str>
>     <int name="rows">25</int>
>   </lst>
> </requestHandler>
> {code}
> Phrase queries (pf, pf2, pf3) containing "dog" or "aspirin"  against the 
> above list will not be generated.
> "allergic reaction dog" will generate pf2: "allergic reaction", but not 
> pf:"allergic reaction dog", pf2: "reaction dog", or pf3: "allergic reaction 
> dog"
> "aspirin dose in rats" will generate pf3: "dose ? rats" but not pf2: "aspirin 
> dose" or pf3:"aspirin dose ?"
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-4312) Index format to store position length per position

2019-04-29 Thread Michael Gibney (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-4312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829599#comment-16829599
 ] 

Michael Gibney edited comment on LUCENE-4312 at 4/29/19 6:26 PM:
-

(following on discussion at 

[jira] [Commented] (LUCENE-4312) Index format to store position length per position

2019-04-29 Thread Michael Gibney (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-4312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829599#comment-16829599
 ] 

Michael Gibney commented on LUCENE-4312:


(following on discussion at 

[jira] [Commented] (LUCENE-8776) Start offset going backwards has a legitimate purpose

2019-04-26 Thread Michael Gibney (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16827142#comment-16827142
 ] 

Michael Gibney commented on LUCENE-8776:


{quote}Unfortunately, you cannot properly index a token graph: Lucene discards 
the PositionLengthAttribute which is why if you really want to index a token 
graph you should insert a FlattenGraphFilter at the end of your chain. This 
still discards information (loses the graph-ness) but tries to do so minimizing 
how queries are broken.
{quote}
Would there be any interest in revisiting LUCENE-4312: adding support for indexed 
position length (PositionLengthAttribute)?

Since Ram's is an index-time use case, I see only two options:
 1. [~mikemccand]'s suggestion, which would compromise phrase query accuracy 
(e.g., missing "light-emitting-diode glows"), and
 2. [~jpountz]'s initial suggestion, which would compromise highlighting 
precision.

For graph TokenStreams, indexed position length could be used to fully address 
(as opposed to minimize) "how queries are broken", in addition to avoiding the 
tradeoff/compromise in the case described here. It would also enable index-time 
multi-term synonyms, etc ... 
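
For context, here is a minimal sketch of the index-time flattening described in 
the quote above (assuming lucene-analyzers-common on the classpath; the class 
itself is illustrative, not from Lucene):

{code:java}
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.FlattenGraphFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.synonym.SynonymGraphFilter;
import org.apache.lucene.analysis.synonym.SynonymMap;

public class IndexTimeGraphAnalyzer extends Analyzer {
  private final SynonymMap synonyms;

  public IndexTimeGraphAnalyzer(SynonymMap synonyms) {
    this.synonyms = synonyms;
  }

  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer source = new WhitespaceTokenizer();
    TokenStream graph = new SynonymGraphFilter(source, synonyms, true);
    // FlattenGraphFilter discards positionLength (the "graph-ness") but keeps
    // the indexed positions self-consistent; indexed position length would
    // make this lossy step unnecessary
    return new TokenStreamComponents(source, new FlattenGraphFilter(graph));
  }
}
{code}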

> Start offset going backwards has a legitimate purpose
> -
>
> Key: LUCENE-8776
> URL: https://issues.apache.org/jira/browse/LUCENE-8776
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Affects Versions: 7.6
>Reporter: Ram Venkat
>Priority: Major
>
> Here is the use case where startOffset can go backwards:
> Say there is a line "Organic light-emitting-diode glows", and I want to run 
> span queries and highlight them properly. 
> During index time, light-emitting-diode is split into three words, which 
> allows me to search for 'light', 'emitting' and 'diode' individually. The 
> three words occupy adjacent positions in the index, since queries for 
> 'light' adjacent to 'emitting', or for 'light' within two words of 'diode', 
> need to match this word. So, the order of words after splitting is: Organic, 
> light, emitting, diode, glows. 
> But, I also want to search for 'organic' being adjacent to 
> 'light-emitting-diode' or 'light-emitting-diode' being adjacent to 'glows'. 
> The way I solved this was to also generate 'light-emitting-diode' at two 
> positions: (a) In the same position as 'light' and (b) in the same position 
> as 'glows', like below:
> ||organic||light||emitting||diode||glows||
> | |light-emitting-diode| |light-emitting-diode| |
> |0|1|2|3|4|
> The positions of the two 'light-emitting-diode' are 1 and 3, but the offsets 
> are obviously the same. This works beautifully in Lucene 5.x in both 
> searching and highlighting with span queries. 
> But when I try this in Lucene 7.6, it hits the condition "Offsets must not go 
> backwards" at DefaultIndexingChain:818. This IllegalArgumentException is 
> being thrown without any comments on why this check is needed. As I explained 
> above, startOffset going backwards is perfectly valid, to deal with word 
> splitting and span operations on these specialized use cases. On the other 
> hand, it is not clear what value is added by this check and which highlighter 
> code is affected by offsets going backwards. This same check is done at 
> BaseTokenStreamTestCase:245. 
> I see others talk about how this check found bugs in WordDelimiter etc. but 
> it also prevents legitimate use cases. Can this check be removed?  






[jira] [Comment Edited] (LUCENE-8776) Start offset going backwards has a legitimate purpose

2019-04-24 Thread Michael Gibney (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16825202#comment-16825202
 ] 

Michael Gibney edited comment on LUCENE-8776 at 4/24/19 2:20 PM:
-

Ram, it's good that this solution worked for you, but taking a step back, your 
solution seems like a workaround for LUCENE-7398 and LUCENE-4312. Workarounds 
aren't inherently _bad_ of course, but when they depend on ambiguity of (or 
lack of enforcement of) contracts, backward compatibility can't be guaranteed 
(to paraphrase what I take Robert and Adrien to be saying).

Of course, one person's "patch" is another person's "workaround", but I'd be 
curious to know whether any of the ["LUCENE-7398/*" 
branches|https://github.com/magibney/lucene-solr/tree/LUCENE-7398/branch_7_6] 
might help for your use case. (There's a high-level description in [this 
comment on the LUCENE-7398 
issue|https://issues.apache.org/jira/browse/LUCENE-7398?focusedCommentId=16630529#comment-16630529]).
 Particularly relevant to this discussion: the patch supports recording token 
positionLength in the index, and enforces index ordering by startPosition and 
endPosition (compatible with ordering specified for the Spans API).
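
As a rough illustration (a simplification, not code from the branch), the 
enforced index ordering over (startPosition, endPosition) pairs amounts to:

{code:java}
import java.util.Comparator;

// simplified stand-in for a token occupying positions [start, end)
final class PositionInterval {
  final int startPosition;
  final int endPosition;

  PositionInterval(int startPosition, int endPosition) {
    this.startPosition = startPosition;
    this.endPosition = endPosition;
  }

  // index order: by startPosition, ties broken by endPosition -- compatible
  // with the ordering specified for the Spans API
  static final Comparator<PositionInterval> INDEX_ORDER =
      Comparator.comparingInt((PositionInterval p) -> p.startPosition)
          .thenComparingInt(p -> p.endPosition);
}
{code}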


was (Author: mgibney):
Ram, it's good that this solution worked for you, but taking a step back, your 
solution seems like a workaround for LUCENE-7398 and LUCENE-4312. Workarounds 
aren't inherently _bad_ of course, but when they depend on ambiguity of (or 
lack of enforcement of) contracts, backward compatibility can't be guaranteed 
(to paraphrase what I take Robert and Adrien to be saying).

Of course, one person's "patch" is another person's "workaround", but I'd be 
curious to know whether any of the ["LUCENE-7398/*" 
branches|https://github.com/magibney/lucene-solr/tree/LUCENE-7398/branch_7_6] 
might help for your use case. (There's a high-level description in this comment 
on the LUCENE-7398 issue). Particularly relevant to this discussion: the patch 
supports recording token positionLength in the index, and enforces index 
ordering by startPosition and endPosition (compatible with ordering specified 
for the Spans API).

> Start offset going backwards has a legitimate purpose
> -
>
> Key: LUCENE-8776
> URL: https://issues.apache.org/jira/browse/LUCENE-8776
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Affects Versions: 7.6
>Reporter: Ram Venkat
>Priority: Major
>
> Here is the use case where startOffset can go backwards:
> Say there is a line "Organic light-emitting-diode glows", and I want to run 
> span queries and highlight them properly. 
> During index time, light-emitting-diode is split into three words, which 
> allows me to search for 'light', 'emitting' and 'diode' individually. The 
> three words occupy adjacent positions in the index, since queries for 
> 'light' adjacent to 'emitting', or for 'light' within two words of 'diode', 
> need to match this word. So, the order of words after splitting is: Organic, 
> light, emitting, diode, glows. 
> But, I also want to search for 'organic' being adjacent to 
> 'light-emitting-diode' or 'light-emitting-diode' being adjacent to 'glows'. 
> The way I solved this was to also generate 'light-emitting-diode' at two 
> positions: (a) In the same position as 'light' and (b) in the same position 
> as 'glows', like below:
> ||organic||light||emitting||diode||glows||
> | |light-emitting-diode| |light-emitting-diode| |
> |0|1|2|3|4|
> The positions of the two 'light-emitting-diode' are 1 and 3, but the offsets 
> are obviously the same. This works beautifully in Lucene 5.x in both 
> searching and highlighting with span queries. 
> But when I try this in Lucene 7.6, it hits the condition "Offsets must not go 
> backwards" at DefaultIndexingChain:818. This IllegalArgumentException is 
> being thrown without any comments on why this check is needed. As I explained 
> above, startOffset going backwards is perfectly valid, to deal with word 
> splitting and span operations on these specialized use cases. On the other 
> hand, it is not clear what value is added by this check and which highlighter 
> code is affected by offsets going backwards. This same check is done at 
> BaseTokenStreamTestCase:245. 
> I see others talk about how this check found bugs in WordDelimiter etc. but 
> it also prevents legitimate use cases. Can this check be removed?  






[jira] [Commented] (LUCENE-8776) Start offset going backwards has a legitimate purpose

2019-04-24 Thread Michael Gibney (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16825202#comment-16825202
 ] 

Michael Gibney commented on LUCENE-8776:


Ram, it's good that this solution worked for you, but taking a step back, your 
solution seems like a workaround for LUCENE-7398 and LUCENE-4312. Workarounds 
aren't inherently _bad_ of course, but when they depend on ambiguity of (or 
lack of enforcement of) contracts, backward compatibility can't be guaranteed 
(to paraphrase what I take Robert and Adrien to be saying).

Of course, one person's "patch" is another person's "workaround", but I'd be 
curious to know whether any of the ["LUCENE-7398/*" 
branches|https://github.com/magibney/lucene-solr/tree/LUCENE-7398/branch_7_6] 
might help for your use case. (There's a high-level description in this comment 
on the LUCENE-7398 issue). Particularly relevant to this discussion: the patch 
supports recording token positionLength in the index, and enforces index 
ordering by startPosition and endPosition (compatible with ordering specified 
for the Spans API).

> Start offset going backwards has a legitimate purpose
> -
>
> Key: LUCENE-8776
> URL: https://issues.apache.org/jira/browse/LUCENE-8776
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Affects Versions: 7.6
>Reporter: Ram Venkat
>Priority: Major
>
> Here is the use case where startOffset can go backwards:
> Say there is a line "Organic light-emitting-diode glows", and I want to run 
> span queries and highlight them properly. 
> During index time, light-emitting-diode is split into three words, which 
> allows me to search for 'light', 'emitting' and 'diode' individually. The 
> three words occupy adjacent positions in the index, since queries for 
> 'light' adjacent to 'emitting', or for 'light' within two words of 'diode', 
> need to match this word. So, the order of words after splitting is: Organic, 
> light, emitting, diode, glows. 
> But, I also want to search for 'organic' being adjacent to 
> 'light-emitting-diode' or 'light-emitting-diode' being adjacent to 'glows'. 
> The way I solved this was to also generate 'light-emitting-diode' at two 
> positions: (a) In the same position as 'light' and (b) in the same position 
> as 'glows', like below:
> ||organic||light||emitting||diode||glows||
> | |light-emitting-diode| |light-emitting-diode| |
> |0|1|2|3|4|
> The positions of the two 'light-emitting-diode' are 1 and 3, but the offsets 
> are obviously the same. This works beautifully in Lucene 5.x in both 
> searching and highlighting with span queries. 
> But when I try this in Lucene 7.6, it hits the condition "Offsets must not go 
> backwards" at DefaultIndexingChain:818. This IllegalArgumentException is 
> being thrown without any comments on why this check is needed. As I explained 
> above, startOffset going backwards is perfectly valid, to deal with word 
> splitting and span operations on these specialized use cases. On the other 
> hand, it is not clear what value is added by this check and which highlighter 
> code is affected by offsets going backwards. This same check is done at 
> BaseTokenStreamTestCase:245. 
> I see others talk about how this check found bugs in WordDelimiter etc. but 
> it also prevents legitimate use cases. Can this check be removed?  






[jira] [Commented] (SOLR-13336) maxBooleanClauses ignored; can result in exponential expansion of naive queries

2019-04-10 Thread Michael Gibney (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16814727#comment-16814727
 ] 

Michael Gibney commented on SOLR-13336:
---

+1, the patch looks good to me; thanks!

> maxBooleanClauses ignored; can result in exponential expansion of naive 
> queries
> ---
>
> Key: SOLR-13336
> URL: https://issues.apache.org/jira/browse/SOLR-13336
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: query parsers
>Affects Versions: 7.0, 7.6, master (9.0)
>Reporter: Michael Gibney
>Assignee: Hoss Man
>Priority: Major
> Attachments: SOLR-13336.patch, SOLR-13336.patch
>
>
> Since SOLR-10921 it appears that Solr always sets 
> {{BooleanQuery.maxClauseCount}} (at the Lucene level) to 
> {{Integer.MAX_VALUE-1}}. I assume this is because Solr parses 
> {{maxBooleanClauses}} out of the config and applies it externally.
> In any case, when used as part of 
> {{lucene.util.QueryBuilder.analyzeGraphPhrase}} (and possibly other places?), 
> the Lucene code checks internally against only the static {{maxClauseCount}} 
> variable (permanently set to {{Integer.MAX_VALUE-1}} in the context of Solr).
> Thus in at least one case ({{analyzeGraphPhrase()}}, but possibly others?), 
> {{maxBooleanClauses}} is having no effect. I'm pretty sure this is what's 
> underlying the [issue reported here as being related to Solr 
> 7.6|https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201902.mbox/%3CCAF%3DheHE6-MOtn2XRbEg7%3D1tpNEGtE8GaChnOhFLPeJzpF18SGA%40mail.gmail.com%3E].
> To summarize, users are definitely susceptible (to varying degrees of likely 
> severity, assuming no actual _malicious_ attack) if:
>  # Running Solr >= 7.6.0
>  # Using edismax with "ps" param set to >0
>  # Query-time analysis chain is _at all_ capable of producing graphs (e.g., 
> WordDelimiterGraphFilter, or SynonymGraphFilter with corresponding synonyms 
> of varying token lengths).
> Users are _particularly_ vulnerable in practice if they have query-time 
> {{WordDelimiterGraphFilter}} configured with {{preserveOriginal=true}}.
> To clarify, Lucene/Solr 7.6 didn't exactly _introduce_ the issue; it only 
> increased the likelihood of problems manifesting (as a result of 
> LUCENE-8531). Notably, the "enumerated strings" approach to graph phrase 
> query (reintroduced by LUCENE-8531) was previously in place pre-6.5 – at 
> which point it could rely on the default Lucene-level {{maxClauseCount}} 
> failsafe (removed as of 7.0). This explains the odd "Affects versions": 
> maxBooleanClauses was disabled at the Lucene level (in Solr contexts) 
> starting with version 7.0, but the change became more likely to manifest 
> problems for users as of 7.6.






[jira] [Commented] (SOLR-13257) Enable replica routing affinity for better cache usage

2019-04-01 Thread Michael Gibney (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16806843#comment-16806843
 ] 

Michael Gibney commented on SOLR-13257:
---

Thanks for the feedback, [~janhoy]! Good point re: {{getReplicaId()}} -- I 
think that was "proof-of-concept" code that escaped my notice, and should no 
doubt be replaced. I'm happy to submit as a PR, and will work on a test.
{quote}I wonder if this perhaps is such a common use case that we could 
consider folding it into the default shardHandler as a configuration option.
{quote}
I had been wondering this as well. I think that would make sense, but I wanted 
to get something out there initially that people could use without patching 
core classes. I think for a PR I'll try the "config option in default 
shardHandler" approach. Will be easy to extract again if necessary.

> Enable replica routing affinity for better cache usage
> --
>
> Key: SOLR-13257
> URL: https://issues.apache.org/jira/browse/SOLR-13257
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Affects Versions: 7.4, master (9.0)
>Reporter: Michael Gibney
>Priority: Minor
> Attachments: AffinityShardHandlerFactory.java
>
>
> For each shard in a distributed request, Solr currently routes each request 
> randomly via 
> [ShufflingReplicaListTransformer|https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/handler/component/ShufflingReplicaListTransformer.java]
>  to a particular replica. In setups with replication factor >1, this normally 
> results in a situation where subsequent requests (which one would hope/expect 
> to leverage cached results from previous related requests) end up getting 
> routed to a replica that hasn't seen any related requests.
> The problem can be replicated by issuing a relatively expensive query (maybe 
> containing common terms?). The first request initializes the 
> {{queryResultCache}} on the consulted replicas. If replication factor >1 and 
> there are a sufficient number of shards, subsequent requests will likely be 
> routed to at least one replica that _hasn't_ seen the query before. The 
> replicas with uninitialized caches become a bottleneck, and from the client's 
> perspective, many subsequent requests appear not to benefit from caching at 
> all.






[jira] [Commented] (SOLR-12532) Slop specified in query string is not preserved for certain phrase searches

2019-03-22 Thread Michael Gibney (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16799146#comment-16799146
 ] 

Michael Gibney commented on SOLR-12532:
---

+1

This extends the fix from SOLR-12243 (which only applied to slop as set in 
{{ps}}) to cover slop specified directly in the query string. It follows the 
pattern established for {{(Multi)PhraseQuery}} in 
{{ExtendedDismaxQParser.ExtendedSolrQueryParser.getQuery()}} (i.e., re-building 
the query using the slop as parsed out of the query string).
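
A minimal sketch of that re-building pattern (illustrative; the actual fix 
lives in the edismax parser):

{code:java}
import org.apache.lucene.search.spans.SpanNearQuery;

public class SlopRebuildSketch {
  // rebuild a SpanNearQuery so that the slop parsed out of the query string
  // (e.g., "you can't"~2 -> slop 2) is not silently dropped
  static SpanNearQuery withSlop(SpanNearQuery q, int slop) {
    return new SpanNearQuery(q.getClauses(), slop, q.isInOrder());
  }
}
{code}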

> Slop specified in query string is not preserved for certain phrase searches
> ---
>
> Key: SOLR-12532
> URL: https://issues.apache.org/jira/browse/SOLR-12532
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: query parsers
>Affects Versions: 7.4
>Reporter: Brad Sumersford
>Priority: Major
> Attachments: SOLR-12532.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Note: This only impacts specific settings for the WordDelimiterGraphFilter as 
> detailed below.
> When a phrase search is parsed by the SolrQueryParser, and the phrase search 
> results in a graph token stream, the resulting SpanNearQuery created does not 
> have the slop correctly set.
> h4. Conditions
>  - Slop provided in the query string (ex: "you can't"~2)
>  - WordDelimiterGraphFilterFactory with query time preserveOriginal and 
> generateWordParts
>  - query string includes a term that contains a word delimiter
> h4. Example
> Field: wdf_partspreserve 
>  – WordDelimiterGraphFilterFactory 
>   preserveOriginal="1"
>   generateWordParts="1"
> Data: you just can't
>  Search: wdf_partspreserve:"you can't"~2 -> 0 Results
> h4. Cause
> The slop supplied by the query string is applied in 
> SolrQueryParserBase#getFieldQuery, which will set the slop only for 
> PhraseQuery and MultiPhraseQuery. Since "can't" will be broken down into 
> multiple tokens, analyzeGraphPhrase will be triggered when the Query is being 
> constructed, which will return a SpanNearQuery instead of a (Multi)PhraseQuery.






[jira] [Commented] (SOLR-13056) SortableTextField is trappy for faceting

2019-03-01 Thread Michael Gibney (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782092#comment-16782092
 ] 

Michael Gibney commented on SOLR-13056:
---

Oh, but I think for distributed refinement against {{SortableTextField}}, 
there's a further problem: the initial pass works in an "uninverted" way (using 
DocValues), but refinement constructs TermQueries to run against the _inverted_ 
index, which contains the tokenized text. So this simply swaps an approach that 
got lots of _spurious_ hits during refinement, for an approach that will miss 
lots of _legitimate_ hits during refinement? ... I think this is the problem 
that [~toke] identified as the major impediment for this issue?

> SortableTextField is trappy for faceting
> 
>
> Key: SOLR-13056
> URL: https://issues.apache.org/jira/browse/SOLR-13056
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: search
>Affects Versions: 7.6
>Reporter: Toke Eskildsen
>Priority: Major
> Attachments: SOLR-13056.patch
>
>
> Using {{SortableTextField}} for distributed faceting can lead to wrong 
> results. This can be demonstrated by installing the cloud-version of the 
> {{gettingstarted}} sample with
> {{./solr -e cloud}}
> using defaults all the way, except for {{shards}} which should be {{3}}. 
> After that a corpus can be indexed with
> {{( echo '[' ; for J in $(seq 0 99); do ID=$((J)) ; echo 
> "\{\"id\":\"$ID\",\"facet_t_sort\":\"a b $J\"},"; done ; echo 
> '\{"id":"duplicate_1","facet_t_sort":"a 
> b"},\{"id":"duplicate_2","facet_t_sort":"a b"}]' ) | curl -s -d @- -X POST -H 
> 'Content-Type: application/json' 
> 'http://localhost:8983/solr/gettingstarted/update?commit=true'}}
> This will index 100 documents with a single-valued field {{facet_t_sort:"a b 
> X"}} where X is the document number + 2 documents with {{facet_t_sort:"a 
> b"}}. The call
> {{curl 
> 'http://localhost:8983/solr/gettingstarted/select?facet.field=facet_t_sort&facet.limit=5&facet=on&q=*:*&rows=0'}}
> should return "a b" as the top facet term with count 2, but returns
> {{ {}}
> {{ "responseHeader":{}}
> {{ "zkConnected":true,}}
> {{ "status":0,}}
> {{ "QTime":13,}}
> {{ "params":{}}
> {{ "facet.limit":"5",}}
> {{ "q":":",}}
> {{ "facet.field":"facet_t_sort",}}
> {{ "rows":"0",}}
> {{ "facet":"on"} },}}
> {{ "response":{"numFound":102,"start":0,"maxScore":1.0,"docs":[]}}
> {{ },}}
> {{ "facet_counts":{}}
> {{ "facet_queries":{},}}
> {{ "facet_fields":{}}
> {{ "facet_t_sort":[}}
> {{ "a b",36,}}
> {{ "a b 0",1,}}
> {{ "a b 1",1,}}
> {{ "a b 10",1,}}
> {{ "a b 11",1]},}}
> {{ "facet_ranges":{},}}
> {{ "facet_intervals":{},}}
> {{ "facet_heatmaps":{} } } }}
> The problem is the second phase of simple faceting, where the fine-counting 
> happens. In the first phase, "a b" is returned from 1 or 2 of the 3 shards. 
> It wins the popularity contest as there are 2 "a b"-terms and only 1 of all 
> the other terms. The 1 or 2 shards that did not deliver "a b" in the first 
> phase are then queried for the count for "a b", which happens in the form of 
> a {{facet_t_sort:"a b"}}-lookup. It seems that this lookup uses the analyzer 
> chain and thus matches _all_ the documents in that shard (approximately 
> 102/3).






[jira] [Updated] (SOLR-13056) SortableTextField is trappy for faceting

2019-03-01 Thread Michael Gibney (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-13056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Gibney updated SOLR-13056:
--
Attachment: SOLR-13056.patch

> SortableTextField is trappy for faceting
> 
>
> Key: SOLR-13056
> URL: https://issues.apache.org/jira/browse/SOLR-13056
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: search
>Affects Versions: 7.6
>Reporter: Toke Eskildsen
>Priority: Major
> Attachments: SOLR-13056.patch
>
>
> Using {{SortableTextField}} for distributed faceting can lead to wrong 
> results. This can be demonstrated by installing the cloud-version of the 
> {{gettingstarted}} sample with
> {{./solr -e cloud}}
> using defaults all the way, except for {{shards}} which should be {{3}}. 
> After that a corpus can be indexed with
> {{( echo '[' ; for J in $(seq 0 99); do ID=$((J)) ; echo 
> "\{\"id\":\"$ID\",\"facet_t_sort\":\"a b $J\"},"; done ; echo 
> '\{"id":"duplicate_1","facet_t_sort":"a 
> b"},\{"id":"duplicate_2","facet_t_sort":"a b"}]' ) | curl -s -d @- -X POST -H 
> 'Content-Type: application/json' 
> 'http://localhost:8983/solr/gettingstarted/update?commit=true'}}
> This will index 100 documents with a single-valued field {{facet_t_sort:"a b 
> X"}} where X is the document number + 2 documents with {{facet_t_sort:"a 
> b"}}. The call
> {{curl 
> 'http://localhost:8983/solr/gettingstarted/select?facet.field=facet_t_sort&facet.limit=5&facet=on&q=*:*&rows=0'}}
> should return "a b" as the top facet term with count 2, but returns
> {{ {}}
> {{ "responseHeader":{}}
> {{ "zkConnected":true,}}
> {{ "status":0,}}
> {{ "QTime":13,}}
> {{ "params":{}}
> {{ "facet.limit":"5",}}
> {{ "q":":",}}
> {{ "facet.field":"facet_t_sort",}}
> {{ "rows":"0",}}
> {{ "facet":"on"} },}}
> {{ "response":{"numFound":102,"start":0,"maxScore":1.0,"docs":[]}}
> {{ },}}
> {{ "facet_counts":{}}
> {{ "facet_queries":{},}}
> {{ "facet_fields":{}}
> {{ "facet_t_sort":[}}
> {{ "a b",36,}}
> {{ "a b 0",1,}}
> {{ "a b 1",1,}}
> {{ "a b 10",1,}}
> {{ "a b 11",1]},}}
> {{ "facet_ranges":{},}}
> {{ "facet_intervals":{},}}
> {{ "facet_heatmaps":{} } } }}
> The problem is the second phase of simple faceting, where the fine-counting 
> happens. In the first phase, "a b" is returned from 1 or 2 of the 3 shards. 
> It wins the popularity contest as there are 2 "a b"-terms and only 1 of all 
> the other terms. The 1 or 2 shards that did not deliver "a b" in the first 
> phase are then queried for the count for "a b", which happens in the form of 
> a {{facet_t_sort:"a b"}}-lookup. It seems that this lookup uses the analyzer 
> chain and thus matches _all_ the documents in that shard (approximately 
> 102/3).






[jira] [Commented] (SOLR-13056) SortableTextField is trappy for faceting

2019-03-01 Thread Michael Gibney (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782077#comment-16782077
 ] 

Michael Gibney commented on SOLR-13056:
---

The attached patch builds {{TermQuery}} directly (as [~hossman] suggests). In 
{{SimpleFacets.numDocs()}} (called by {{SimpleFacets.getListedTermCounts()}}), 
we expect the {{term}} param to not require any analysis, so we should be able 
to construct the {{TermQuery}} directly, instead of calling 
{{ft.getFieldQuery()}}. This fixes things for {{SortableTextField}}, but should 
be ok for other FieldTypes as well. Do we need to be concerned about the "{{if 
(field.hasDocValues() && !field.indexed())}}" condition in 
{{FieldType.getFieldQuery()}}, or would that not apply in the context of 
{{getListedTermCounts}}?
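
At the Lucene level, the gist of the change is something like the following 
sketch (an illustration of the idea, not the patch itself):

{code:java}
import java.io.IOException;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

public class RefinementCountSketch {
  // count documents matching the raw indexed term, bypassing the query-time
  // analysis that FieldType.getFieldQuery() would apply (which, for
  // SortableTextField, tokenizes "a b" and matches far too many documents)
  static int refinementCount(IndexSearcher searcher, String field, String value)
      throws IOException {
    return searcher.count(new TermQuery(new Term(field, value)));
  }
}
{code}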

> SortableTextField is trappy for faceting
> 
>
> Key: SOLR-13056
> URL: https://issues.apache.org/jira/browse/SOLR-13056
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: search
>Affects Versions: 7.6
>Reporter: Toke Eskildsen
>Priority: Major
> Attachments: SOLR-13056.patch
>
>
> Using {{SortableTextField}} for distributed faceting can lead to wrong 
> results. This can be demonstrated by installing the cloud-version of the 
> {{gettingstarted}} sample with
> {{./solr -e cloud}}
> using defaults all the way, except for {{shards}} which should be {{3}}. 
> After that a corpus can be indexed with
> {{( echo '[' ; for J in $(seq 0 99); do ID=$((J)) ; echo 
> "\{\"id\":\"$ID\",\"facet_t_sort\":\"a b $J\"},"; done ; echo 
> '\{"id":"duplicate_1","facet_t_sort":"a 
> b"},\{"id":"duplicate_2","facet_t_sort":"a b"}]' ) | curl -s -d @- -X POST -H 
> 'Content-Type: application/json' 
> 'http://localhost:8983/solr/gettingstarted/update?commit=true'}}
> This will index 100 documents with a single-valued field {{facet_t_sort:"a b 
> X"}} where X is the document number + 2 documents with {{facet_t_sort:"a 
> b"}}. The call
> {{curl 
> 'http://localhost:8983/solr/gettingstarted/select?facet.field=facet_t_sort&facet.limit=5&facet=on&q=*:*&rows=0'}}
> should return "a b" as the top facet term with count 2, but returns
> {{ {}}
> {{ "responseHeader":{}}
> {{ "zkConnected":true,}}
> {{ "status":0,}}
> {{ "QTime":13,}}
> {{ "params":{}}
> {{ "facet.limit":"5",}}
> {{ "q":":",}}
> {{ "facet.field":"facet_t_sort",}}
> {{ "rows":"0",}}
> {{ "facet":"on"} },}}
> {{ "response":{"numFound":102,"start":0,"maxScore":1.0,"docs":[]}}
> {{ },}}
> {{ "facet_counts":{}}
> {{ "facet_queries":{},}}
> {{ "facet_fields":{}}
> {{ "facet_t_sort":[}}
> {{ "a b",36,}}
> {{ "a b 0",1,}}
> {{ "a b 1",1,}}
> {{ "a b 10",1,}}
> {{ "a b 11",1]},}}
> {{ "facet_ranges":{},}}
> {{ "facet_intervals":{},}}
> {{ "facet_heatmaps":{} } } }}
> The problem is the second phase of simple faceting, where the fine-counting 
> happens. In the first phase, "a b" is returned from 1 or 2 of the 3 shards. 
> It wins the popularity contest as there are 2 "a b"-terms and only 1 of all 
> the other terms. The 1 or 2 shards that did not deliver "a b" in the first 
> phase are then queried for the count for "a b", which happens in the form of 
> a {{facet_t_sort:"a b"}}-lookup. It seems that this lookup uses the analyzer 
> chain and thus matches _all_ the documents in that shard (approximately 
> 102/3).






[jira] [Commented] (LUCENE-8695) Word delimiter graph or span queries bug

2019-02-18 Thread Michael Gibney (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16771144#comment-16771144
 ] 

Michael Gibney commented on LUCENE-8695:


I'll second [~khitrin]; if you're interested, I have pushed a branch that 
attempts to address this issue (linked to from 
[LUCENE-7398|https://issues.apache.org/jira/browse/LUCENE-7398#comment-16630529])
 ... feedback/testing welcome!

Regarding storing positionLength in the index -- would there be any interest in 
revisiting this possibility 
([LUCENE-4312|https://issues.apache.org/jira/browse/LUCENE-4312])? The 
branch/patch referenced above currently records positionLength in Payloads.
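
To sketch the payload idea (a simplified illustration; the branch's actual 
encoding may differ):

{code:java}
import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionLengthAttribute;
import org.apache.lucene.util.BytesRef;

public final class PositionLengthPayloadFilter extends TokenFilter {
  private final PositionLengthAttribute posLenAtt =
      addAttribute(PositionLengthAttribute.class);
  private final PayloadAttribute payloadAtt =
      addAttribute(PayloadAttribute.class);

  public PositionLengthPayloadFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    // record the (otherwise discarded) positionLength as a one-byte payload;
    // a real implementation would need a wider encoding for lengths > 127
    payloadAtt.setPayload(
        new BytesRef(new byte[] {(byte) posLenAtt.getPositionLength()}));
    return true;
  }
}
{code}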

> Word delimiter graph or span queries bug
> 
>
> Key: LUCENE-8695
> URL: https://issues.apache.org/jira/browse/LUCENE-8695
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 7.7
>Reporter: Pawel Rog
>Priority: Major
>
> I have a simple phrase query and a token stream which uses word delimiter 
> graph, but the query fails to match. I tried different configurations of 
> word delimiter graph but could not find a good solution for this. I don't 
> actually know if the problem is on the word delimiter side or maybe on the 
> span queries side.
> Query which is generated:
> {code:java}
>  spanNear([field:added, spanOr([field:foobarbaz, spanNear([field:foo, 
> field:bar, field:baz], 0, true)]), field:entry], 0, true)
> {code}
>  
> Code of test where I isolated the problem is attached below:
> {code:java}
> public class TestPhrase extends LuceneTestCase {
>   private static IndexSearcher searcher;
>   private static IndexReader reader;
>   private Query query;
>   private static Directory directory;
>   private static Analyzer searchAnalyzer = new Analyzer() {
> @Override
> public TokenStreamComponents createComponents(String fieldName) {
>   Tokenizer tokenizer = new MockTokenizer(MockTokenizer.WHITESPACE, 
> false);
>   TokenFilter filter1 = new WordDelimiterGraphFilter(tokenizer, 
> WordDelimiterIterator.DEFAULT_WORD_DELIM_TABLE,
>   WordDelimiterGraphFilter.GENERATE_WORD_PARTS |
>   WordDelimiterGraphFilter.CATENATE_WORDS |
>   WordDelimiterGraphFilter.CATENATE_NUMBERS |
>   WordDelimiterGraphFilter.SPLIT_ON_CASE_CHANGE,
>   CharArraySet.EMPTY_SET);
>   TokenFilter filter2 = new LowerCaseFilter(filter1);
>   return new TokenStreamComponents(tokenizer, filter2);
> }
>   };
>   private static Analyzer indexAnalyzer = new Analyzer() {
> @Override
> public TokenStreamComponents createComponents(String fieldName) {
>   Tokenizer tokenizer = new MockTokenizer(MockTokenizer.WHITESPACE, 
> false);
>   TokenFilter filter1 = new WordDelimiterGraphFilter(tokenizer, 
> WordDelimiterIterator.DEFAULT_WORD_DELIM_TABLE,
>   WordDelimiterGraphFilter.GENERATE_WORD_PARTS |
>   WordDelimiterGraphFilter.GENERATE_NUMBER_PARTS |
>   WordDelimiterGraphFilter.CATENATE_WORDS |
>   WordDelimiterGraphFilter.CATENATE_NUMBERS |
>   WordDelimiterGraphFilter.PRESERVE_ORIGINAL |
>   WordDelimiterGraphFilter.SPLIT_ON_CASE_CHANGE,
>   CharArraySet.EMPTY_SET);
>   TokenFilter filter2 = new LowerCaseFilter(filter1);
>   return new TokenStreamComponents(tokenizer, filter2);
> }
> @Override
> public int getPositionIncrementGap(String fieldName) {
>   return 100;
> }
>   };
>   @BeforeClass
>   public static void beforeClass() throws Exception {
> directory = newDirectory();
> RandomIndexWriter writer = new RandomIndexWriter(random(), directory, 
> indexAnalyzer);
> Document doc = new Document();
> doc.add(newTextField("field", "Added FooBarBaz entry", Field.Store.YES));
> writer.addDocument(doc);
> reader = writer.getReader();
> writer.close();
> searcher = new IndexSearcher(reader);
>   }
>   @Override
>   public void setUp() throws Exception {
> super.setUp();
>   }
>   @AfterClass
>   public static void afterClass() throws Exception {
> searcher = null;
> reader.close();
> reader = null;
> directory.close();
> directory = null;
>   }
>   public void testSearch() throws Exception {
> QueryParser parser = new QueryParser("field", searchAnalyzer);
> query = parser.parse("\"Added FooBarBaz entry\"");
> System.out.println(query);
> ScoreDoc[] hits = searcher.search(query, 1000).scoreDocs;
> assertEquals(1, hits.length);
>   }
> }
> {code}
>  
>  
> NOTE: I tested it on Lucene 7.1.0, 7.4.0 and 7.7.0






[jira] [Updated] (SOLR-13257) Enable replica routing affinity for better cache usage

2019-02-15 Thread Michael Gibney (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-13257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Gibney updated SOLR-13257:
--
Priority: Minor  (was: Major)

> Enable replica routing affinity for better cache usage
> --
>
> Key: SOLR-13257
> URL: https://issues.apache.org/jira/browse/SOLR-13257
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Affects Versions: 7.4, master (9.0)
>Reporter: Michael Gibney
>Priority: Minor
> Attachments: AffinityShardHandlerFactory.java
>
>
> For each shard in a distributed request, Solr currently routes each request 
> randomly via 
> [ShufflingReplicaListTransformer|https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/handler/component/ShufflingReplicaListTransformer.java]
>  to a particular replica. In setups with replication factor >1, this normally 
> results in a situation where subsequent requests (which one would hope/expect 
> to leverage cached results from previous related requests) end up getting 
> routed to a replica that hasn't seen any related requests.
> The problem can be replicated by issuing a relatively expensive query (maybe 
> containing common terms?). The first request initializes the 
> {{queryResultCache}} on the consulted replicas. If replication factor >1 and 
> there are a sufficient number of shards, subsequent requests will likely be 
> routed to at least one replica that _hasn't_ seen the query before. The 
> replicas with uninitialized caches become a bottleneck, and from the client's 
> perspective, many subsequent requests appear not to benefit from caching at 
> all.






[jira] [Comment Edited] (SOLR-13257) Enable replica routing affinity for better cache usage

2019-02-15 Thread Michael Gibney (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16769617#comment-16769617
 ] 

Michael Gibney edited comment on SOLR-13257 at 2/15/19 7:25 PM:


The attached standalone java class provides users the option to configure 
replica routing affinity by configuring their {{requestHandler}} with:

{code:xml}
<shardHandlerFactory class="...">
  <str name="routingParam">routingHash</str>
  <str name="hashParam">q</str>
</shardHandlerFactory>
{code}

The {{routingParam}} allows the user to specify a parameter that will be used 
to select a replica by {{routingValIntHash % replicationFactor}}. In the 
absence of a {{routingParam}} parameter in the request, the request parameter 
configured for {{hashParam}} is used to determine replica routing. 

The special value {{DISABLE_DETERMINISTIC_ROUTING}} may be passed as the value 
of the configured {{routingParam}} query parameter to explicitly fallback to 
default (random) replica routing.
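
The selection rule amounts to something like the following sketch 
(illustrative only, not the attached class):

{code:java}
import java.util.List;

public class AffinityRoutingSketch {
  // stable choice: the same routing value always hashes to the same replica,
  // so related requests repeatedly hit the same (warm) caches
  static String chooseReplica(List<String> replicaUrls, String routingVal) {
    // Math.floorMod keeps the index non-negative for negative hash codes
    int idx = Math.floorMod(routingVal.hashCode(), replicaUrls.size());
    return replicaUrls.get(idx);
  }

  public static void main(String[] args) {
    List<String> replicas = List.of("replica-0", "replica-1", "replica-2");
    System.out.println(chooseReplica(replicas, "some q value"));
  }
}
{code}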


was (Author: mgibney):
The attached standalone java class provides users the option to configure 
replica routing affinity by configuring their {{requestHandler}} with:

{code:xml}
<shardHandlerFactory class="...">
  <str name="routingParam">routingHash</str>
  <str name="hashParam">q</str>
</shardHandlerFactory>
{code}

The {{routingParam}} allows the user to specify a parameter that will be used 
to select a replica by {{routingValIntHash % replicationFactor}}. In the 
absence of a {{routingParam}} parameter in the request, the request parameter 
configured for {{hashParam}} is used to determine replica routing. 

The special value `DISABLE_DETERMINISTIC_ROUTING` may be passed as the value of 
the configured {{routingParam}} query parameter to explicitly fallback to 
default (random) replica routing.

> Enable replica routing affinity for better cache usage
> --
>
> Key: SOLR-13257
> URL: https://issues.apache.org/jira/browse/SOLR-13257
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Affects Versions: 7.4, master (9.0)
>Reporter: Michael Gibney
>Priority: Major
> Attachments: AffinityShardHandlerFactory.java
>
>
> For each shard in a distributed request, Solr currently routes each request 
> randomly via 
> [ShufflingReplicaListTransformer|https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/handler/component/ShufflingReplicaListTransformer.java]
>  to a particular replica. In setups with replication factor >1, this normally 
> results in a situation where subsequent requests (which one would hope/expect 
> to leverage cached results from previous related requests) end up getting 
> routed to a replica that hasn't seen any related requests.
> The problem can be replicated by issuing a relatively expensive query (maybe 
> containing common terms?). The first request initializes the 
> {{queryResultCache}} on the consulted replicas. If replication factor >1 and 
> there are a sufficient number of shards, subsequent requests will likely be 
> routed to at least one replica that _hasn't_ seen the query before. The 
> replicas with uninitialized caches become a bottleneck, and from the client's 
> perspective, many subsequent requests appear not to benefit from caching at 
> all.






[jira] [Created] (SOLR-13257) enable replica routing affinity for better cache usage

2019-02-15 Thread Michael Gibney (JIRA)
Michael Gibney created SOLR-13257:
-

 Summary: enable replica routing affinity for better cache usage
 Key: SOLR-13257
 URL: https://issues.apache.org/jira/browse/SOLR-13257
 Project: Solr
  Issue Type: Improvement
  Security Level: Public (Default Security Level. Issues are Public)
  Components: SolrCloud
Affects Versions: 7.4, master (9.0)
Reporter: Michael Gibney


For each shard in a distributed request, Solr currently routes each request 
randomly via 
[ShufflingReplicaListTransformer|https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/handler/component/ShufflingReplicaListTransformer.java]
 to a particular replica. In setups with replication factor >1, this normally 
results in a situation where subsequent requests (which one would hope/expect 
to leverage cached results from previous related requests) end up getting 
routed to a replica that hasn't seen any related requests.

The problem can be replicated by issuing a relatively expensive query (maybe 
containing common terms?). The first request initializes the 
{{queryResultCache}} on the consulted replicas. If replication factor >1 and 
there are a sufficient number of shards, subsequent requests will likely be 
routed to at least one replica that _hasn't_ seen the query before. The 
replicas with uninitialized caches become a bottleneck, and from the client's 
perspective, many subsequent requests appear not to benefit from caching at all.






[jira] [Updated] (SOLR-13257) Enable replica routing affinity for better cache usage

2019-02-15 Thread Michael Gibney (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-13257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Gibney updated SOLR-13257:
--
Summary: Enable replica routing affinity for better cache usage  (was: 
enable replica routing affinity for better cache usage)

> Enable replica routing affinity for better cache usage
> --
>
> Key: SOLR-13257
> URL: https://issues.apache.org/jira/browse/SOLR-13257
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Affects Versions: 7.4, master (9.0)
>Reporter: Michael Gibney
>Priority: Major
>
> For each shard in a distributed request, Solr currently routes each request 
> randomly via 
> [ShufflingReplicaListTransformer|https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/handler/component/ShufflingReplicaListTransformer.java]
>  to a particular replica. In setups with replication factor >1, this normally 
> results in a situation where subsequent requests (which one would hope/expect 
> to leverage cached results from previous related requests) end up getting 
> routed to a replica that hasn't seen any related requests.
> The problem can be replicated by issuing a relatively expensive query (maybe 
> containing common terms?). The first request initializes the 
> {{queryResultCache}} on the consulted replicas. If replication factor >1 and 
> there are a sufficient number of shards, subsequent requests will likely be 
> routed to at least one replica that _hasn't_ seen the query before. The 
> replicas with uninitialized caches become a bottleneck, and from the client's 
> perspective, many subsequent requests appear not to benefit from caching at 
> all.






[jira] [Commented] (SOLR-12743) Memory leak introduced in Solr 7.3.0

2019-01-31 Thread Michael Gibney (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16757467#comment-16757467
 ] 

Michael Gibney commented on SOLR-12743:
---

Ah, ok; so I guess looking for "overlapping onDeckSearcher" in logs is not 
productive.

[~markus17], thanks for the extra information! A few more questions/thoughts:
 # Does a thread dump provide any useful information? e.g., if an autowarm (or 
other) thread is blocked somewhere?
 # When the problem manifests, is the service running under load heavy enough 
that inserts/cleanup _could_ potentially monopolize a lock?
 # What are your {{autoCommit}} (and {{autoSoftCommit}}, {{commitWithin}}, 
etc.) settings? Are you also running manual commits?
 # Looking only at the code in {{SolrCore}}, it looks like the only way to get 
"PERFORMANCE WARNING: Overlapping onDeckSearchers" errors in your log is to 
have {{maxWarmingSearchers}} set to > 1. You could try setting this to "2" ... 
it's unlikely to hurt (in fact, unlikely to make a difference, per [~dsmiley]) 
– but there's a remote chance it could provide useful feedback.
 # I see you earlier noted that it's normal that two {{SolrIndexSearcher}}s 
should coexist immediately after a commit; so just to clarify, when you say it 
"immediately" leaks a {{SolrIndexSearcher}} instance, you mean it's hanging 
around longer than it should ...

> Memory leak introduced in Solr 7.3.0
> 
>
> Key: SOLR-12743
> URL: https://issues.apache.org/jira/browse/SOLR-12743
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 7.3, 7.3.1, 7.4
>Reporter: Tomás Fernández Löbbe
>Priority: Critical
> Attachments: SOLR-12743.patch
>
>
> Reported initially by [~markus17]([1], [2]), but other users have had the 
> same issue [3]. Some of the key parts:
> {noformat}
> Some facts:
> * problem started after upgrading from 7.2.1 to 7.3.0;
> * it occurs only in our main text search collection, all other collections 
> are unaffected;
> * despite what i said earlier, it is so far unreproducible outside 
> production, even when mimicking production as good as we can;
> * SortedIntDocSet instances and ConcurrentLRUCache$CacheEntry instances are 
> both leaked on commit;
> * filterCache is enabled using FastLRUCache;
> * filter queries are simple field:value using strings, and three filter query 
> for time range using [NOW/DAY TO NOW+1DAY/DAY] syntax for 'today', 'last 
> week' and 'last month', but rarely used;
> * reloading the core manually frees OldGen;
> * custom URP's don't cause the problem, disabling them doesn't solve it;
> * the collection uses custom extensions for QueryComponent and 
> QueryElevationComponent, ExtendedDismaxQParser and MoreLikeThisQParser, a 
> whole bunch of TokenFilters, and several DocTransformers and due it being 
> only reproducible on production, i really cannot switch these back to 
> Solr/Lucene versions;
> * useFilterForSortedQuery is/was not defined in schema so it was default 
> (true?), SOLR-11769 could be the culprit, i disabled it just now only for the 
> node running 7.4.0, rest of collection runs 7.2.1;
> {noformat}
> {noformat}
> You were right, it was leaking exactly one SolrIndexSearcher instance on each 
> commit. 
> {noformat}
> And from Björn Häuser ([3]):
> {noformat}
> Problem Suspect 1
> 91 instances of "org.apache.solr.search.SolrIndexSearcher", loaded by 
> "org.eclipse.jetty.webapp.WebAppClassLoader @ 0x6807d1048" occupy 
> 1.981.148.336 (38,26%) bytes. 
> Biggest instances:
>         • org.apache.solr.search.SolrIndexSearcher @ 0x6ffd47ea8 - 70.087.272 
> (1,35%) bytes. 
>         • org.apache.solr.search.SolrIndexSearcher @ 0x79ea9c040 - 65.678.264 
> (1,27%) bytes. 
>         • org.apache.solr.search.SolrIndexSearcher @ 0x6855ad680 - 63.050.600 
> (1,22%) bytes. 
> Problem Suspect 2
> 223 instances of "org.apache.solr.util.ConcurrentLRUCache", loaded by 
> "org.eclipse.jetty.webapp.WebAppClassLoader @ 0x6807d1048" occupy 
> 1.373.110.208 (26,52%) bytes. 
> {noformat}
> More details in the email threads.
> [1] 
> [http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201804.mbox/%3Czarafa.5ae201c6.2f85.218a781d795b07b1%40mail1.ams.nl.openindex.io%3E]
>  [2] 
> [http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201806.mbox/%3Czarafa.5b351537.7b8c.647ddc93059f68eb%40mail1.ams.nl.openindex.io%3E]
>  [3] 
> [http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201809.mbox/%3c7b5e78c6-8cf6-42ee-8d28-872230ded...@gmail.com%3E]






[jira] [Comment Edited] (SOLR-12743) Memory leak introduced in Solr 7.3.0

2019-01-31 Thread Michael Gibney (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16757467#comment-16757467
 ] 

Michael Gibney edited comment on SOLR-12743 at 1/31/19 4:39 PM:


Ah, ok; so I guess looking for "overlapping onDeckSearcher" in logs is not 
productive.

[~markus17], thanks for the extra information! A few more questions/thoughts:
 # Does a thread dump provide any useful information? e.g., if an autowarm (or 
other) thread is blocked somewhere?
 # When the problem manifests, is the service running under load heavy enough 
that inserts/cleanup _could_ potentially monopolize a lock?
 # What are your {{autoCommit}} (and {{autoSoftCommit}}, {{commitWithin}}, 
etc.) settings? Are you also running manual commits?
 # Looking only at the code in {{SolrCore}}, it looks like the only way to get 
"PERFORMANCE WARNING: Overlapping onDeckSearchers" errors in your log is to 
have {{maxWarmingSearchers}} set to > 1. You could try setting this to "2" ... 
it's unlikely to hurt (in fact, unlikely to make a difference, per [~dsmiley]) 
– but there's a remote chance it could provide useful feedback.
 # I see you earlier noted that it's normal that two {{SolrIndexSearchers}} 
should coexist immediately after a commit; so just to clarify, when you say it 
"immediately" leaks a {{SolrIndexSearcher}} instance, you mean it's hanging 
around longer than it should ...


was (Author: mgibney):
Ah, ok; so I guess looking for "overlapping onDeckSearcher" in logs is not 
productive.

[~markus17], thanks for the extra information! A few more questions/thoughts:
 # Does a thread dump provide any useful information? e.g., if an autowarm (or 
other) thread is blocked somewhere?
 # When the problem manifests, is the service running under load heavy enough 
that inserts/cleanup _could_ potentially monopolize a lock?
 # What are your {{autoCommit}} (and {{autoSoftCommit}}, {{commitWithin}}, 
etc.) settings? Are you also running manual commits?
 # Looking only at the code in {{SolrCore}}, it looks like the only way to get 
"PERFORMANCE WARNING: Overlapping onDeckSearchers" errors in your log is to 
have {{maxWarmingSearchers}} set to > 1. You could try setting this to "2" ... 
it's unlikely to hurt (in fact, unlikely to make a difference, per [~dsmiley]) 
– but there's a remote chance it could provide useful feedback.
 # I see you earlier noted that it's normal that two {{SolrIndexSearcher}}s 
should coexist immediately after a commit; so just to clarify, when you say it 
"immediately" leaks a {{SolrIndexSearcher}} instance, you mean it's hanging 
around longer than it should ...

> Memory leak introduced in Solr 7.3.0
> 
>
> Key: SOLR-12743
> URL: https://issues.apache.org/jira/browse/SOLR-12743
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 7.3, 7.3.1, 7.4
>Reporter: Tomás Fernández Löbbe
>Priority: Critical
> Attachments: SOLR-12743.patch
>
>
> Reported initially by [~markus17]([1], [2]), but other users have had the 
> same issue [3]. Some of the key parts:
> {noformat}
> Some facts:
> * problem started after upgrading from 7.2.1 to 7.3.0;
> * it occurs only in our main text search collection, all other collections 
> are unaffected;
> * despite what i said earlier, it is so far unreproducible outside 
> production, even when mimicking production as good as we can;
> * SortedIntDocSet instances and ConcurrentLRUCache$CacheEntry instances are 
> both leaked on commit;
> * filterCache is enabled using FastLRUCache;
> * filter queries are simple field:value using strings, and three filter query 
> for time range using [NOW/DAY TO NOW+1DAY/DAY] syntax for 'today', 'last 
> week' and 'last month', but rarely used;
> * reloading the core manually frees OldGen;
> * custom URP's don't cause the problem, disabling them doesn't solve it;
> * the collection uses custom extensions for QueryComponent and 
> QueryElevationComponent, ExtendedDismaxQParser and MoreLikeThisQParser, a 
> whole bunch of TokenFilters, and several DocTransformers and due it being 
> only reproducible on production, i really cannot switch these back to 
> Solr/Lucene versions;
> * useFilterForSortedQuery is/was not defined in schema so it was default 
> (true?), SOLR-11769 could be the culprit, i disabled it just now only for the 
> node running 7.4.0, rest of collection runs 7.2.1;
> {noformat}
> {noformat}
> You were right, it was leaking exactly one SolrIndexSearcher instance on each 
> commit. 
> {noformat}
> And from Björn Häuser ([3]):
> {noformat}
> Problem Suspect 1
> 91 instances of "org.apache.solr.search.SolrIndexSearcher", loaded by 
> "org.eclipse.jetty.webapp.WebAppClassLoader @ 0x6807d1048" occupy 
> 1.981.148.336 (38,26%) bytes. 
> Biggest 

[jira] [Commented] (SOLR-12743) Memory leak introduced in Solr 7.3.0

2019-01-30 Thread Michael Gibney (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16756284#comment-16756284
 ] 

Michael Gibney commented on SOLR-12743:
---

The patch just attached is a shot in the dark (I can't directly reproduce this 
problem). But I think it's probably a good patch either way, because:

I _was_ able to induce some weird behavior, artificially simulating a 
high-turnover cache environment (lots of inserts) and simultaneously trying to 
execute the {{get[Oldest|Latest]AccessedItems()}} method on 
{{ConcurrentLRUCache}}. This would be akin to what happens to an old cache that 
remains under heavy load (lots of inserts) while a new cache/searcher is being 
warmed (queries for autowarm are retrieved via the 
{{get[Oldest|Latest]AccessedItems()}} methods). 

The heart of the issue I observed is that {{get[Oldest|Latest]AccessedItems()}} 
use {{ReentrantLock.lock()}}, but {{markAndSweep()}} (for cleaning overflow 
entries) uses {{ReentrantLock.tryLock()}}. The latter is evidently much
faster, and by design does not respect the {{fairness=true}} setting on 
{{markAndSweepLock}}. So I was able to create a situation where, with heavy 
enough turnover, {{markAndSweep()}} was called regularly enough that it 
monopolized the lock, starving {{get[Oldest|Latest]AccessedItems()}}.

FWIW, I noticed that the official solr docker image moved from using openjdk 8 
to openjdk 11 in the version interval that seems to have triggered this issue.

I realize that this might fall short as an explanation for this issue, because 
the line of reasoning I'm following here would suggest that autowarming should 
block (not complete), which should (?) trigger "Overlapping onDeckSearcher" 
warnings. Also, it seems unlikely (though certainly not impossible) to 
consistently sustain a level of load sufficient to permanently monopolize the 
lock. 

Re: autowarming ... earlier comments are ambiguous wrt autowarm counts. _If_ 
the underlying issue is lock contention, then the _exact_ autowarm count should 
not matter, but I would expect that _disabling_ autowarm (setting to 0) would 
in fact be an effective workaround.
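
For example, a minimal sketch of a {{filterCache}} declaration in 
{{solrconfig.xml}} with autowarm disabled (sizes here are illustrative, not a 
recommendation):
{code:xml}
<!-- illustrative sizes; the relevant bit is autowarmCount="0" -->
<filterCache class="solr.FastLRUCache"
             size="512"
             initialSize="512"
             autowarmCount="0"/>
{code}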

> Memory leak introduced in Solr 7.3.0
> 
>
> Key: SOLR-12743
> URL: https://issues.apache.org/jira/browse/SOLR-12743
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 7.3, 7.3.1, 7.4
>Reporter: Tomás Fernández Löbbe
>Priority: Critical
> Attachments: SOLR-12743.patch
>
>
> Reported initially by [~markus17]([1], [2]), but other users have had the 
> same issue [3]. Some of the key parts:
> {noformat}
> Some facts:
> * problem started after upgrading from 7.2.1 to 7.3.0;
> * it occurs only in our main text search collection, all other collections 
> are unaffected;
> * despite what i said earlier, it is so far unreproducible outside 
> production, even when mimicking production as well as we can;
> * SortedIntDocSet instances and ConcurrentLRUCache$CacheEntry instances are 
> both leaked on commit;
> * filterCache is enabled using FastLRUCache;
> * filter queries are simple field:value using strings, plus three filter 
> queries for time ranges using [NOW/DAY TO NOW+1DAY/DAY] syntax for 'today', 
> 'last week' and 'last month', but these are rarely used;
> * reloading the core manually frees OldGen;
> * custom URPs don't cause the problem, disabling them doesn't solve it;
> * the collection uses custom extensions for QueryComponent and 
> QueryElevationComponent, ExtendedDismaxQParser and MoreLikeThisQParser, a 
> whole bunch of TokenFilters, and several DocTransformers; due to it being 
> only reproducible in production, i really cannot switch these back to stock 
> Solr/Lucene versions;
> * useFilterForSortedQuery is/was not defined in schema, so it was left at 
> the default (true?); SOLR-11769 could be the culprit. i disabled it just now, 
> only for the node running 7.4.0; the rest of the collection runs 7.2.1;
> {noformat}
> {noformat}
> You were right, it was leaking exactly one SolrIndexSearcher instance on each 
> commit. 
> {noformat}
> And from Björn Häuser ([3]):
> {noformat}
> Problem Suspect 1
> 91 instances of "org.apache.solr.search.SolrIndexSearcher", loaded by 
> "org.eclipse.jetty.webapp.WebAppClassLoader @ 0x6807d1048" occupy 
> 1.981.148.336 (38,26%) bytes. 
> Biggest instances:
>         • org.apache.solr.search.SolrIndexSearcher @ 0x6ffd47ea8 - 70.087.272 
> (1,35%) bytes. 
>         • org.apache.solr.search.SolrIndexSearcher @ 0x79ea9c040 - 65.678.264 
> (1,27%) bytes. 
>         • org.apache.solr.search.SolrIndexSearcher @ 0x6855ad680 - 63.050.600 
> (1,22%) bytes. 
> Problem Suspect 2
> 223 instances of "org.apache.solr.util.ConcurrentLRUCache", loaded by 
> "org.eclipse.jetty.webapp.WebAppClassLoader @ 0x6807d1048" occupy 
> 1.373.110.208 (26,52%) bytes. 
> {noformat}

[jira] [Updated] (SOLR-12743) Memory leak introduced in Solr 7.3.0

2019-01-30 Thread Michael Gibney (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-12743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Gibney updated SOLR-12743:
--
Attachment: SOLR-12743.patch

> Memory leak introduced in Solr 7.3.0
> 
>
> Key: SOLR-12743
> URL: https://issues.apache.org/jira/browse/SOLR-12743
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 7.3, 7.3.1, 7.4
>Reporter: Tomás Fernández Löbbe
>Priority: Critical
> Attachments: SOLR-12743.patch
>
>
> Reported initially by [~markus17]([1], [2]), but other users have had the 
> same issue [3]. Some of the key parts:
> {noformat}
> Some facts:
> * problem started after upgrading from 7.2.1 to 7.3.0;
> * it occurs only in our main text search collection, all other collections 
> are unaffected;
> * despite what i said earlier, it is so far unreproducible outside 
> production, even when mimicking production as well as we can;
> * SortedIntDocSet instances and ConcurrentLRUCache$CacheEntry instances are 
> both leaked on commit;
> * filterCache is enabled using FastLRUCache;
> * filter queries are simple field:value using strings, plus three filter 
> queries for time ranges using [NOW/DAY TO NOW+1DAY/DAY] syntax for 'today', 
> 'last week' and 'last month', but these are rarely used;
> * reloading the core manually frees OldGen;
> * custom URPs don't cause the problem, disabling them doesn't solve it;
> * the collection uses custom extensions for QueryComponent and 
> QueryElevationComponent, ExtendedDismaxQParser and MoreLikeThisQParser, a 
> whole bunch of TokenFilters, and several DocTransformers; due to it being 
> only reproducible in production, i really cannot switch these back to stock 
> Solr/Lucene versions;
> * useFilterForSortedQuery is/was not defined in schema, so it was left at 
> the default (true?); SOLR-11769 could be the culprit. i disabled it just now, 
> only for the node running 7.4.0; the rest of the collection runs 7.2.1;
> {noformat}
> {noformat}
> You were right, it was leaking exactly one SolrIndexSearcher instance on each 
> commit. 
> {noformat}
> And from Björn Häuser ([3]):
> {noformat}
> Problem Suspect 1
> 91 instances of "org.apache.solr.search.SolrIndexSearcher", loaded by 
> "org.eclipse.jetty.webapp.WebAppClassLoader @ 0x6807d1048" occupy 
> 1.981.148.336 (38,26%) bytes. 
> Biggest instances:
>         • org.apache.solr.search.SolrIndexSearcher @ 0x6ffd47ea8 - 70.087.272 
> (1,35%) bytes. 
>         • org.apache.solr.search.SolrIndexSearcher @ 0x79ea9c040 - 65.678.264 
> (1,27%) bytes. 
>         • org.apache.solr.search.SolrIndexSearcher @ 0x6855ad680 - 63.050.600 
> (1,22%) bytes. 
> Problem Suspect 2
> 223 instances of "org.apache.solr.util.ConcurrentLRUCache", loaded by 
> "org.eclipse.jetty.webapp.WebAppClassLoader @ 0x6807d1048" occupy 
> 1.373.110.208 (26,52%) bytes. 
> {noformat}
> More details in the email threads.
> [1] 
> [http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201804.mbox/%3Czarafa.5ae201c6.2f85.218a781d795b07b1%40mail1.ams.nl.openindex.io%3E]
>  [2] 
> [http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201806.mbox/%3Czarafa.5b351537.7b8c.647ddc93059f68eb%40mail1.ams.nl.openindex.io%3E]
>  [3] 
> [http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201809.mbox/%3c7b5e78c6-8cf6-42ee-8d28-872230ded...@gmail.com%3E]






[jira] [Comment Edited] (SOLR-13132) Improve JSON "terms" facet performance when sorted by relatedness

2019-01-28 Thread Michael Gibney (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16754164#comment-16754164
 ] 

Michael Gibney edited comment on SOLR-13132 at 1/29/19 3:03 AM:


I've refined the earlier patch (implementing parallel facet count collection 
for sort-by-relatedness). For consideration, the [^SOLR-13132-with-cache.patch] 
also implements a per-segment (and top-level) cache of facet counts (and inline 
"missing" bucket collection, fwiw).

As described [in a more discursive blog 
post|https://michaelgibney.net/2019/01/solr-terms-skg-performance/], the facet 
cache is something that's been in the back of my mind for a while, but would 
have a particular impact on sort-by-relatedness with parallel facet count 
collection, so I modified an initial implementation from simple facets 
{{DocValuesFacets}} to make it compatible with JSON facets as well.

For my use case this yields anywhere from 5x-450x latency reduction for high- 
and even modestly-high-cardinality domain queries with sort-by-relatedness. 
Facet cache alone yields ~10x latency reduction for simple sort-by-count facets 
over common/cached high-cardinality domains (e.g., {{*:*}}). More detail (rough 
benchmarks, etc.) can be found in the blog post linked above.

To enable facet cache, add a cache declaration to {{solrconfig.xml}} along 
these lines (a hypothetical sketch: the element name and attributes here are 
assumptions modeled on Solr's other cache configs, not necessarily the patch's 
exact syntax):
{code:xml}
<!-- hypothetical sketch; names and sizes are assumptions -->
<facetCache class="solr.FastLRUCache" size="16" initialSize="3"/>
{code}
(I realize the "facet cache" should probably be a separate issue, but given its 
particular relevance as a complement to this issue, I opted to include it in 
this patch. I hope that's ok ...)


was (Author: mgibney):
I've refined the earlier patch (implementing parallel facet count collection 
for sort-by-relatedness). For consideration, the [new 
patch|^SOLR-13132-with-cache.patch] also implements a per-segment (and 
top-level) cache of facet counts (and inline "missing" bucket collection, fwiw).

As described [in a more discursive blog 
post|https://michaelgibney.net/2019/01/solr-terms-skg-performance/], the facet 
cache is something that's been in the back of my mind for a while, but would 
have a particular impact on sort-by-relatedness with parallel facet count 
collection, so I modified an initial implementation from simple facets 
{{DocValuesFacets}} to make it compatible with JSON facets as well.

FYI, for my (real-world) test use case this yields anywhere from 5x-450x 
latency reduction for high- and even modestly-high-cardinality domain queries 
with sort-by-relatedness. Facet cache alone yields ~10x latency reduction for 
simple sort-by-count facets over common/cached high-cardinality domains (e.g., 
{{*:*}}). More detail (rough benchmarks, etc.) can be found in the blog post 
linked above.

To enable facet cache, add a cache declaration to {{solrconfig.xml}} along 
these lines (a hypothetical sketch: the element name and attributes here are 
assumptions modeled on Solr's other cache configs, not necessarily the patch's 
exact syntax):
{code:xml}
<!-- hypothetical sketch; names and sizes are assumptions -->
<facetCache class="solr.FastLRUCache" size="16" initialSize="3"/>
{code}
(I realize the "facet cache" should probably be a separate issue, but given its 
particular relevance as a complement to this issue, I opted to include it in 
this patch. I hope that's ok ...)

> Improve JSON "terms" facet performance when sorted by relatedness 
> --
>
> Key: SOLR-13132
> URL: https://issues.apache.org/jira/browse/SOLR-13132
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Facet Module
>Affects Versions: 7.4, master (9.0)
>Reporter: Michael Gibney
>Priority: Major
> Attachments: SOLR-13132-with-cache.patch, SOLR-13132.patch
>
>
> When sorting buckets by {{relatedness}}, JSON "terms" facet must calculate 
> {{relatedness}} for every term. 
> The current implementation uses a standard uninverted approach (either 
> {{docValues}} or {{UnInvertedField}}) to get facet counts over the domain 
> base docSet, and then uses that initial pass as a pre-filter for a 
> second-pass, inverted approach of fetching docSets for each relevant term 
> (i.e., {{count > minCount}}?) and calculating intersection size of those sets 
> with the domain base docSet.
> Over high-cardinality fields, the overhead of per-term docSet creation and 
> set intersection operations increases request latency to the point where 
> relatedness sort may not be usable in practice (for my use case, even after 
> applying the patch for SOLR-13108, for a field with ~220k unique terms per 
> core, QTime for high-cardinality domain docSets were, e.g.: cardinality 
> 1816684=9000ms, cardinality 5032902=18000ms).
> The attached patch brings the above example QTimes down to a manageable 
> ~300ms and ~250ms respectively. The approach calculates uninverted facet 
> counts over domain base, foreground, and background docSets in parallel in a 
> single pass. This allows us to take advantage of the efficiencies built into 
> the standard uninverted {{FacetFieldProcessorByArray[DV|UIF]}}, and avoids 
> the per-term docSet creation and set intersection overhead.

[jira] [Updated] (SOLR-13132) Improve JSON "terms" facet performance when sorted by relatedness

2019-01-10 Thread Michael Gibney (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-13132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Gibney updated SOLR-13132:
--
Attachment: (was: SOLR-13132.patch)

> Improve JSON "terms" facet performance when sorted by relatedness 
> --
>
> Key: SOLR-13132
> URL: https://issues.apache.org/jira/browse/SOLR-13132
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Facet Module
>Affects Versions: 7.4, master (9.0)
>Reporter: Michael Gibney
>Priority: Major
> Attachments: SOLR-13132.patch
>
>
> When sorting buckets by {{relatedness}}, JSON "terms" facet must calculate 
> {{relatedness}} for every term. 
> The current implementation uses a standard uninverted approach (either 
> {{docValues}} or {{UnInvertedField}}) to get facet counts over the domain 
> base docSet, and then uses that initial pass as a pre-filter for a 
> second-pass, inverted approach of fetching docSets for each relevant term 
> (i.e., {{count > minCount}}?) and calculating intersection size of those sets 
> with the domain base docSet.
> Over high-cardinality fields, the overhead of per-term docSet creation and 
> set intersection operations increases request latency to the point where 
> relatedness sort may not be usable in practice (for my use case, even after 
> applying the patch for SOLR-13108, for a field with ~220k unique terms per 
> core, QTime for high-cardinality domain docSets were, e.g.: cardinality 
> 1816684=9000ms, cardinality 5032902=18000ms).
> The attached patch brings the above example QTimes down to a manageable 
> ~300ms and ~250ms respectively. The approach calculates uninverted facet 
> counts over domain base, foreground, and background docSets in parallel in a 
> single pass. This allows us to take advantage of the efficiencies built into 
> the standard uninverted {{FacetFieldProcessorByArray[DV|UIF]}}, and avoids 
> the per-term docSet creation and set intersection overhead.






[jira] [Updated] (SOLR-13132) Improve JSON "terms" facet performance when sorted by relatedness

2019-01-10 Thread Michael Gibney (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-13132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Gibney updated SOLR-13132:
--
Attachment: SOLR-13132.patch

> Improve JSON "terms" facet performance when sorted by relatedness 
> --
>
> Key: SOLR-13132
> URL: https://issues.apache.org/jira/browse/SOLR-13132
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Facet Module
>Affects Versions: 7.4, master (9.0)
>Reporter: Michael Gibney
>Priority: Major
> Attachments: SOLR-13132.patch
>
>
> When sorting buckets by {{relatedness}}, JSON "terms" facet must calculate 
> {{relatedness}} for every term. 
> The current implementation uses a standard uninverted approach (either 
> {{docValues}} or {{UnInvertedField}}) to get facet counts over the domain 
> base docSet, and then uses that initial pass as a pre-filter for a 
> second-pass, inverted approach of fetching docSets for each relevant term 
> (i.e., {{count > minCount}}?) and calculating intersection size of those sets 
> with the domain base docSet.
> Over high-cardinality fields, the overhead of per-term docSet creation and 
> set intersection operations increases request latency to the point where 
> relatedness sort may not be usable in practice (for my use case, even after 
> applying the patch for SOLR-13108, for a field with ~220k unique terms per 
> core, QTime for high-cardinality domain docSets were, e.g.: cardinality 
> 1816684=9000ms, cardinality 5032902=18000ms).
> The attached patch brings the above example QTimes down to a manageable 
> ~300ms and ~250ms respectively. The approach calculates uninverted facet 
> counts over domain base, foreground, and background docSets in parallel in a 
> single pass. This allows us to take advantage of the efficiencies built into 
> the standard uninverted {{FacetFieldProcessorByArray[DV|UIF]}}, and avoids 
> the per-term docSet creation and set intersection overhead.






[jira] [Created] (SOLR-13132) Improve JSON "terms" facet performance when sorted by relatedness

2019-01-10 Thread Michael Gibney (JIRA)
Michael Gibney created SOLR-13132:
-

 Summary: Improve JSON "terms" facet performance when sorted by 
relatedness 
 Key: SOLR-13132
 URL: https://issues.apache.org/jira/browse/SOLR-13132
 Project: Solr
  Issue Type: Improvement
  Security Level: Public (Default Security Level. Issues are Public)
  Components: Facet Module
Affects Versions: 7.4, master (9.0)
Reporter: Michael Gibney


When sorting buckets by {{relatedness}}, JSON "terms" facet must calculate 
{{relatedness}} for every term. 

The current implementation uses a standard uninverted approach (either 
{{docValues}} or {{UnInvertedField}}) to get facet counts over the domain base 
docSet, and then uses that initial pass as a pre-filter for a second-pass, 
inverted approach of fetching docSets for each relevant term (i.e., {{count > 
minCount}}?) and calculating intersection size of those sets with the domain 
base docSet.

Over high-cardinality fields, the overhead of per-term docSet creation and set 
intersection operations increases request latency to the point where 
relatedness sort may not be usable in practice (for my use case, even after 
applying the patch for SOLR-13108, for a field with ~220k unique terms per 
core, QTime for high-cardinality domain docSets were, e.g.: cardinality 
1816684=9000ms, cardinality 5032902=18000ms).

The attached patch brings the above example QTimes down to a manageable ~300ms 
and ~250ms respectively. The approach calculates uninverted facet counts over 
domain base, foreground, and background docSets in parallel in a single pass. 
This allows us to take advantage of the efficiencies built into the standard 
uninverted {{FacetFieldProcessorByArray[DV|UIF]}}, and avoids the per-term 
docSet creation and set intersection overhead.
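
For reference, a request that exercises this code path might look like the 
following sketch (the field name and fore/back queries are illustrative 
placeholders, not the fields from the benchmarks above):
{code:java}
import org.apache.solr.common.params.ModifiableSolrParams;

// JSON "terms" facet sorted by relatedness (SKG); "subject_terms" and the
// foreground query are hypothetical.
ModifiableSolrParams params = new ModifiableSolrParams()
    .add("q", "*:*")
    .add("rows", "0")
    .add("fore", "category:electronics")  // hypothetical foreground query
    .add("back", "*:*")                   // background: the whole collection
    .add("json.facet", "{ skg : { type: terms, field: subject_terms, limit: 10,"
        + " sort: 'r1 desc', facet: { r1 : 'relatedness($fore,$back)' } } }");
{code}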






[jira] [Updated] (SOLR-13132) Improve JSON "terms" facet performance when sorted by relatedness

2019-01-10 Thread Michael Gibney (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-13132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Gibney updated SOLR-13132:
--
Attachment: SOLR-13132.patch

> Improve JSON "terms" facet performance when sorted by relatedness 
> --
>
> Key: SOLR-13132
> URL: https://issues.apache.org/jira/browse/SOLR-13132
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Facet Module
>Affects Versions: 7.4, master (9.0)
>Reporter: Michael Gibney
>Priority: Major
> Attachments: SOLR-13132.patch
>
>
> When sorting buckets by {{relatedness}}, JSON "terms" facet must calculate 
> {{relatedness}} for every term. 
> The current implementation uses a standard uninverted approach (either 
> {{docValues}} or {{UnInvertedField}}) to get facet counts over the domain 
> base docSet, and then uses that initial pass as a pre-filter for a 
> second-pass, inverted approach of fetching docSets for each relevant term 
> (i.e., {{count > minCount}}?) and calculating intersection size of those sets 
> with the domain base docSet.
> Over high-cardinality fields, the overhead of per-term docSet creation and 
> set intersection operations increases request latency to the point where 
> relatedness sort may not be usable in practice (for my use case, even after 
> applying the patch for SOLR-13108, for a field with ~220k unique terms per 
> core, QTime for high-cardinality domain docSets were, e.g.: cardinality 
> 1816684=9000ms, cardinality 5032902=18000ms).
> The attached patch brings the above example QTimes down to a manageable 
> ~300ms and ~250ms respectively. The approach calculates uninverted facet 
> counts over domain base, foreground, and background docSets in parallel in a 
> single pass. This allows us to take advantage of the efficiencies built into 
> the standard uninverted {{FacetFieldProcessorByArray[DV|UIF]}}, and avoids 
> the per-term docSet creation and set intersection overhead.






[jira] [Created] (SOLR-13108) RelatednessAgg ignores cacheDf, consults filterCache for every bucket/term

2019-01-03 Thread Michael Gibney (JIRA)
Michael Gibney created SOLR-13108:
-

 Summary: RelatednessAgg ignores cacheDf, consults filterCache for 
every bucket/term
 Key: SOLR-13108
 URL: https://issues.apache.org/jira/browse/SOLR-13108
 Project: Solr
  Issue Type: Improvement
  Security Level: Public (Default Security Level. Issues are Public)
  Components: Facet Module
Affects Versions: 7.4, master (8.0)
Reporter: Michael Gibney


The {{relatedness}} aggregation function in JSON facet API ignores {{cacheDf}} 
setting and consults the filterCache for every bucket. This is ok e.g. for 
"Query" facet type, where buckets are explicitly enumerated (and thus probably 
relatively low cardinality). But for "Terms" facet type, where bucket count is 
determined by the corpus, this can be a problem. When used over even modestly 
high-cardinality fields, this is very likely to blow out the filterCache.

See also issue with similar consequences: SOLR-9350
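
For a rough sense of scale (illustrative numbers, extrapolating from the 
description above): a single "terms" facet computing {{relatedness}} over a 
field with ~220k unique terms can attempt on the order of one filterCache 
entry per bucket evaluated, so a single such request can evict everything 
else from a typically-sized (e.g., 512-entry) filterCache.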






[jira] [Resolved] (LUCENE-8610) NPE in TermsHashPerField.add() for TokenStreams with lazily instantiated token Attributes

2018-12-17 Thread Michael Gibney (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Gibney resolved LUCENE-8610.

   Resolution: Not A Bug
Lucene Fields:   (was: New,Patch Available)

Will look at addressing this in {{solr.PreAnalyzedAnalyzer}}. There may be 
other {{TokenStreams}} that don't strictly follow the requirement to initialize 
all token Attributes on instantiation, but as of now nothing relating to this 
issue is known to be broken.

> NPE in TermsHashPerField.add() for TokenStreams with lazily instantiated 
> token Attributes
> -
>
> Key: LUCENE-8610
> URL: https://issues.apache.org/jira/browse/LUCENE-8610
> Project: Lucene - Core
>  Issue Type: Wish
>  Components: core/index
>Affects Versions: 7.4, master (8.0)
>Reporter: Michael Gibney
>Priority: Minor
> Attachments: LUCENE-8610.patch
>
>
> {{DefaultIndexingChain.invert(...)}} calls 
> {{invertState.setAttributeSource(stream)}} before {{stream.incrementToken()}} 
> is called.
> For instances of {{stream}} that lazily instantiate token attributes (e.g., 
> as {{solr.PreAnalyzedField$PreAnalyzedTokenizer}} does upon the first call to 
> {{incrementToken()}} that returns {{true}}), this can result in caching a 
> {{null}} value in {{invertState.termAttribute}} for a given {{stream}} 
> instance.
> Subsequent calls that reuse the same {{stream}} instance (reusing 
> {{TokenStreamComponents}}) for field values with at least 1 token will call 
> {{termHashPerField.start(...)}} which sets {{termsHashPerField.termAtt}} from 
> the {{null}} value cached in the {{FieldInvertState.termAttribute}}. An NPE 
> would be thrown when {{termsHashPerField.add()}} reasonably but incorrectly 
> assumes a non-null value for {{termAtt}}.






[jira] [Commented] (LUCENE-8610) NPE in TermsHashPerField.add() for TokenStreams with lazily instantiated token Attributes

2018-12-17 Thread Michael Gibney (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16723263#comment-16723263
 ] 

Michael Gibney commented on LUCENE-8610:


"'Must' implies a requirement" – yes, that's definitely how I initially read 
it. I guess any potential confusion would depend on how one reads the preceding 
"To make sure that filters and consumers know which attributes are available":
 1. if read as an _explanation_, then "must" implies a requirement
 2. if read as a _condition_, then "must" is a qualified requirement (or 
suggestion/warning)

Thanks for the clarification, and I'll see if I can take a stab at bringing 
{{PreAnalyzedTokenizer}} into line with the requirement.
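
For illustration, a sketch (my own, not the actual {{PreAnalyzedTokenizer}} 
change) of a Tokenizer that satisfies the requirement by declaring its 
Attributes at construction time:
{code:java}
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;

public final class EagerAttributeTokenizer extends Tokenizer {
  // Added during instantiation, not lazily in incrementToken(), so
  // filters/consumers see the full attribute set up front.
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);

  @Override
  public boolean incrementToken() {
    clearAttributes();
    // ... deserialize the next token into termAtt/payloadAtt here ...
    return false; // sketch only: emits no tokens
  }
}
{code}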

> NPE in TermsHashPerField.add() for TokenStreams with lazily instantiated 
> token Attributes
> -
>
> Key: LUCENE-8610
> URL: https://issues.apache.org/jira/browse/LUCENE-8610
> Project: Lucene - Core
>  Issue Type: Wish
>  Components: core/index
>Affects Versions: 7.4, master (8.0)
>Reporter: Michael Gibney
>Priority: Minor
> Attachments: LUCENE-8610.patch
>
>
> {{DefaultIndexingChain.invert(...)}} calls 
> {{invertState.setAttributeSource(stream)}} before {{stream.incrementToken()}} 
> is called.
> For instances of {{stream}} that lazily instantiate token attributes (e.g., 
> as {{solr.PreAnalyzedField$PreAnalyzedTokenizer}} does upon the first call to 
> {{incrementToken()}} that returns {{true}}), this can result in caching a 
> {{null}} value in {{invertState.termAttribute}} for a given {{stream}} 
> instance.
> Subsequent calls that reuse the same {{stream}} instance (reusing 
> {{TokenStreamComponents}}) for field values with at least 1 token will call 
> {{termHashPerField.start(...)}} which sets {{termsHashPerField.termAtt}} from 
> the {{null}} value cached in the {{FieldInvertState.termAttribute}}. An NPE 
> would be thrown when {{termsHashPerField.add()}} reasonably but incorrectly 
> assumes a non-null value for {{termAtt}}.






[jira] [Updated] (SOLR-13077) PreAnalyzedField TokenStreamComponents should be reusable

2018-12-17 Thread Michael Gibney (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-13077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Gibney updated SOLR-13077:
--
Description: 
{{TokenStreamComponents}} for {{PreAnalyzedField}} is currently recreated from 
scratch for every field value.

This is necessary at the moment because the current implementation has no a 
priori knowledge about the schema/TokenStream that it's deserializing – 
Attributes are implicit in the serialized token stream, and token Attributes 
are lazily instantiated in {{incrementToken()}}.

Reuse of {{TokenStreamComponents}} with the current implementation would at a 
minimum cause problems at index time, when Attributes are cached in indexing 
components (e.g., {{FieldInvertState}}), keyed per {{AttributeSource}}. For 
instance, if the first field encountered has no value specified for 
{{PayloadAttribute}}, a {{null}} value would be cached for that 
{{PayloadAttribute}} for the corresponding {{AttributeSource}}. If that 
{{AttributeSource}} were to be reused for a field that _does_ specify a 
{{PayloadAttribute}}, indexing components would "consult" the cached {{null}} 
value, and the payload (and all subsequent payloads) would be silently ignored 
(not indexed).

This is not exactly _broken_ currently, but I gather it's an unorthodox 
implementation of {{TokenStream}}, and the current workaround of disabling 
{{TokenStreamComponents}} reuse necessarily adds to object creation and GC 
overhead.

For reference (and see LUCENE-8610), the [TokenStream 
API|https://lucene.apache.org/core/7_5_0/core/org/apache/lucene/analysis/TokenStream.html]
 says:
{quote}To make sure that filters and consumers know which attributes are 
available, the attributes must be added during instantiation.
{quote}

  was:
{{TokenStreamComponents}} for {{PreAnalyzedField}} is currently recreated from 
scratch for every field value.

This is necessary at the moment because the current implementation has no a 
priori knowledge about the schema/TokenStream that it's deserializing -- 
Attributes are implicit in the serialized token stream, and token Attributes 
are lazily instantiated in {{incrementToken()}}.

Reuse of {{TokenStreamComponents}} with the current implementation would at a 
minimum cause problems at index time, when Attributes are cached in indexing 
components (e.g., {{FieldInvertState}}), keyed per {{AttributeSource}}. For 
instance, if the first field encountered has no value specified for 
{{PayloadAttribute}}, a {{null}} value would be cached for that 
{{PayloadAttribute}} for the corresponding {{AttributeSource}}. If that 
{{AttributeSource}} were to be reused for a field that _does_ specify a 
{{PayloadAttribute}}, indexing components would "consult" the cached {{null}} 
value, and the payload (and all subsequent payloads) would be silently ignored 
(not indexed).

This is not exactly _broken_ currently, but I gather it's an unorthodox 
implementation of {{TokenStream}}, and the current workaround of disabling 
{{TokenStreamComponents}} reuse necessarily adds to object creation and GC 
overhead.

For reference, the [TokenStream 
API|https://lucene.apache.org/core/7_5_0/core/org/apache/lucene/analysis/TokenStream.html]
 says:
bq.To make sure that filters and consumers know which attributes are available, 
the attributes must be added during instantiation.


> PreAnalyzedField TokenStreamComponents should be reusable
> -
>
> Key: SOLR-13077
> URL: https://issues.apache.org/jira/browse/SOLR-13077
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Schema and Analysis
>Reporter: Michael Gibney
>Priority: Minor
>
> {{TokenStreamComponents}} for {{PreAnalyzedField}} is currently recreated 
> from scratch for every field value.
> This is necessary at the moment because the current implementation has no a 
> priori knowledge about the schema/TokenStream that it's deserializing – 
> Attributes are implicit in the serialized token stream, and token Attributes 
> are lazily instantiated in {{incrementToken()}}.
> Reuse of {{TokenStreamComponents}} with the current implementation would at a 
> minimum cause problems at index time, when Attributes are cached in indexing 
> components (e.g., {{FieldInvertState}}), keyed per {{AttributeSource}}. For 
> instance, if the first field encountered has no value specified for 
> {{PayloadAttribute}}, a {{null}} value would be cached for that 
> {{PayloadAttribute}} for the corresponding {{AttributeSource}}. If that 
> {{AttributeSource}} were to be reused for a field that _does_ specify a 
> {{PayloadAttribute}}, indexing components would "consult" the cached {{null}} 
> value, and the payload (and all subsequent payloads) would be silently 
> ignored (not indexed).

[jira] [Commented] (LUCENE-8610) NPE in TermsHashPerField.add() for TokenStreams with lazily instantiated token Attributes

2018-12-17 Thread Michael Gibney (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16723164#comment-16723164
 ] 

Michael Gibney commented on LUCENE-8610:


Thanks, [~romseygeek]! Created SOLR-13077.

I do see that the [TokenStream 
API|https://lucene.apache.org/core/7_6_0/core/org/apache/lucene/analysis/TokenStream.html]
 says:
{quote}To make sure that filters and consumers know which attributes are 
available, the attributes must be added during instantiation.
{quote}

Is this a requirement, or a recommendation? It reads a bit like a requirement; 
but you could also read it as being a recommendation and a warning ("if you 
*do* add attributes after instantiation, filters and consumers might not know 
which attributes are available, so proceed at your own risk")

> NPE in TermsHashPerField.add() for TokenStreams with lazily instantiated 
> token Attributes
> -
>
> Key: LUCENE-8610
> URL: https://issues.apache.org/jira/browse/LUCENE-8610
> Project: Lucene - Core
>  Issue Type: Wish
>  Components: core/index
>Affects Versions: 7.4, master (8.0)
>Reporter: Michael Gibney
>Priority: Minor
> Attachments: LUCENE-8610.patch
>
>
> {{DefaultIndexingChain.invert(...)}} calls 
> {{invertState.setAttributeSource(stream)}} before {{stream.incrementToken()}} 
> is called.
> For instances of {{stream}} that lazily instantiate token attributes (e.g., 
> as {{solr.PreAnalyzedField$PreAnalyzedTokenizer}} does upon the first call to 
> {{incrementToken()}} that returns {{true}}), this can result in caching a 
> {{null}} value in {{invertState.termAttribute}} for a given {{stream}} 
> instance.
> Subsequent calls that reuse the same {{stream}} instance (reusing 
> {{TokenStreamComponents}}) for field values with at least 1 token will call 
> {{termHashPerField.start(...)}} which sets {{termsHashPerField.termAtt}} from 
> the {{null}} value cached in the {{FieldInvertState.termAttribute}}. An NPE 
> would be thrown when {{termsHashPerField.add()}} reasonably but incorrectly 
> assumes a non-null value for {{termAtt}}.






[jira] [Created] (SOLR-13077) PreAnalyzedField TokenStreamComponents should be reusable

2018-12-17 Thread Michael Gibney (JIRA)
Michael Gibney created SOLR-13077:
-

 Summary: PreAnalyzedField TokenStreamComponents should be reusable
 Key: SOLR-13077
 URL: https://issues.apache.org/jira/browse/SOLR-13077
 Project: Solr
  Issue Type: Improvement
  Security Level: Public (Default Security Level. Issues are Public)
  Components: Schema and Analysis
Reporter: Michael Gibney


{{TokenStreamComponents}} for {{PreAnalyzedField}} is currently recreated from 
scratch for every field value.

This is necessary at the moment because the current implementation has no a 
priori knowledge about the schema/TokenStream that it's deserializing -- 
Attributes are implicit in the serialized token stream, and token Attributes 
are lazily instantiated in {{incrementToken()}}.

Reuse of {{TokenStreamComponents}} with the current implementation would at a 
minimum cause problems at index time, when Attributes are cached in indexing 
components (e.g., {{FieldInvertState}}), keyed per {{AttributeSource}}. For 
instance, if the first field encountered has no value specified for 
{{PayloadAttribute}}, a {{null}} value would be cached for that 
{{PayloadAttribute}} for the corresponding {{AttributeSource}}. If that 
{{AttributeSource}} were to be reused for a field that _does_ specify a 
{{PayloadAttribute}}, indexing components would "consult" the cached {{null}} 
value, and the payload (and all subsequent payloads) would be silently ignored 
(not indexed).

This is not exactly _broken_ currently, but I gather it's an unorthodox 
implementation of {{TokenStream}}, and the current workaround of disabling 
{{TokenStreamComponents}} reuse necessarily adds to object creation and GC 
overhead.

For reference, the [TokenStream 
API|https://lucene.apache.org/core/7_5_0/core/org/apache/lucene/analysis/TokenStream.html]
 says:
bq.To make sure that filters and consumers know which attributes are available, 
the attributes must be added during instantiation.






[jira] [Updated] (LUCENE-8610) NPE in TermsHashPerField.add() for TokenStreams with lazily instantiated token Attributes

2018-12-14 Thread Michael Gibney (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Gibney updated LUCENE-8610:
---
Description: 
{{DefaultIndexingChain.invert(...)}} calls 
{{invertState.setAttributeSource(stream)}} before {{stream.incrementToken()}} 
is called.

For instances of {{stream}} that lazily instantiate token attributes (e.g., as 
{{solr.PreAnalyzedField$PreAnalyzedTokenizer}} does upon the first call to 
{{incrementToken()}} that returns {{true}}), this can result in caching a 
{{null}} value in {{invertState.termAttribute}} for a given {{stream}} instance.

Subsequent calls that reuse the same {{stream}} instance (reusing 
{{TokenStreamComponents}}) for field values with at least 1 token will call 
{{termHashPerField.start(...)}} which sets {{termsHashPerField.termAtt}} from 
the {{null}} value cached in the {{FieldInvertState.termAttribute}}. An NPE 
would be thrown when {{termsHashPerField.add()}} reasonably but incorrectly 
assumes a non-null value for {{termAtt}}.

  was:
{{DefaultIndexingChain.invert(...)}} calls 
{{invertState.setAttributeSource(stream)}} before {{stream.incrementToken()}} 
is called.

For instances of {{stream}} that lazily instantiate token attributes (e.g., as 
{{solr.PreAnalyzedField$PreAnalyzedTokenizer}} does upon the first call to 
{{incrementToken()}} that returns {{true}}), this can result in caching a 
{{null}} value in {{invertState.termAttribute}} for a given {{stream}} 
instance. 

Subsequent calls that reuse the same {{stream}} instance (reusing 
{{TokenStreamComponents}}) for field values with at least 1 token will call 
{{termHashPerField.start(...)}} which sets {{termsHashPerField.termAtt}} from 
the {{null}} value cached in the {{FieldInvertState.termAttribute}}. An NPE is 
thrown when {{termsHashPerField.add()}} reasonably but incorrectly assumes a 
non-null value for {{termAtt}}.


> NPE in TermsHashPerField.add() for TokenStreams with lazily instantiated 
> token Attributes
> -
>
> Key: LUCENE-8610
> URL: https://issues.apache.org/jira/browse/LUCENE-8610
> Project: Lucene - Core
>  Issue Type: Wish
>  Components: core/index
>Affects Versions: 7.4, master (8.0)
>Reporter: Michael Gibney
>Priority: Minor
> Attachments: LUCENE-8610.patch
>
>
> {{DefaultIndexingChain.invert(...)}} calls 
> {{invertState.setAttributeSource(stream)}} before {{stream.incrementToken()}} 
> is called.
> For instances of {{stream}} that lazily instantiate token attributes (e.g., 
> as {{solr.PreAnalyzedField$PreAnalyzedTokenizer}} does upon the first call to 
> {{incrementToken()}} that returns {{true}}), this can result in caching a 
> {{null}} value in {{invertState.termAttribute}} for a given {{stream}} 
> instance.
> Subsequent calls that reuse the same {{stream}} instance (reusing 
> {{TokenStreamComponents}}) for field values with at least 1 token will call 
> {{termHashPerField.start(...)}} which sets {{termsHashPerField.termAtt}} from 
> the {{null}} value cached in the {{FieldInvertState.termAttribute}}. An NPE 
> would be thrown when {{termsHashPerField.add()}} reasonably but incorrectly 
> assumes a non-null value for {{termAtt}}.






[jira] [Comment Edited] (LUCENE-8610) NPE in TermsHashPerField.add() for TokenStreams with lazily instantiated token Attributes

2018-12-14 Thread Michael Gibney (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16721664#comment-16721664
 ] 

Michael Gibney edited comment on LUCENE-8610 at 12/15/18 6:57 AM:
--

[^LUCENE-8610.patch] adds a test to illustrate the issue, and a minor change to 
delay the call to {{invertState.setAttributeSource(...)}} until after 
{{stream.incrementToken()}} first returns {{true}}.

It occurs to me that this might raise other issues about the way stream 
Attributes are cached in {{invertState}}, but in any case this patch fixes the 
only real-world use case I'm currently aware of (manifesting for 
{{solr.PreAnalyzedField$PreAnalyzedTokenizer}} instances for empty fields 
(i.e., no tokens) [*EDIT: PreAnalyzedTokenizer instances are _not_ reused at 
index time, so this is a bad example*]), and it doesn't seem likely to be 
otherwise problematic.

Briefly, the other issues around caching that this raises:
 1. Is there official guidance from the TokenStream API regarding whether it's 
acceptable to lazily instantiate token Attributes?
 2. If ok to lazily instantiate, how about lazily instantiating _after_ the 
initial call to {{incrementToken()}} returns true? (I'd guess not?)


was (Author: mgibney):
[^LUCENE-8610.patch] adds a test to illustrate the issue, and a minor change to 
delay the call to {{invertState.setAttributeSource(...)}} until after 
{{stream.incrementToken()}} first returns {{true}}. 

It occurs to me that this might raise other issues about the way stream 
Attributes are cached in {{invertState}}, but in any case this patch fixes the 
only real-world use case I'm currently aware of (manifesting for 
{{solr.PreAnalyzedField$PreAnalyzedTokenizer}} instances for empty fields 
(i.e., no tokens)), and it doesn't seem likely to be otherwise problematic.

Briefly, the other issues around caching that this raises:
1. Is there official guidance from the TokenStream API regarding whether it's 
acceptable to lazily instantiate token Attributes?
2. If ok to lazily instantiate, how about lazily instantiating _after_ the 
initial call to {{incrementToken()}} returns true? (I'd guess not?)
3. {{solr.PreAnalyzedField$PreAnalyzedTokenizer}} uses 
{{removeAllAttributes()}}, so caching is going to be a problem there under any 
circumstances, right? I'm going to dig into this and possibly submit a separate 
Solr issue for that one ...

> NPE in TermsHashPerField.add() for TokenStreams with lazily instantiated 
> token Attributes
> -
>
> Key: LUCENE-8610
> URL: https://issues.apache.org/jira/browse/LUCENE-8610
> Project: Lucene - Core
>  Issue Type: Wish
>  Components: core/index
>Affects Versions: 7.4, master (8.0)
>Reporter: Michael Gibney
>Priority: Minor
> Attachments: LUCENE-8610.patch
>
>
> {{DefaultIndexingChain.invert(...)}} calls 
> {{invertState.setAttributeSource(stream)}} before {{stream.incrementToken()}} 
> is called.
> For instances of {{stream}} that lazily instantiate token attributes (e.g., 
> as {{solr.PreAnalyzedField$PreAnalyzedTokenizer}} does upon the first call to 
> {{incrementToken()}} that returns {{true}}), this can result in caching a 
> {{null}} value in {{invertState.termAttribute}} for a given {{stream}} 
> instance.
> Subsequent calls that reuse the same {{stream}} instance (reusing 
> {{TokenStreamComponents}}) for field values with at least 1 token will call 
> {{termHashPerField.start(...)}} which sets {{termsHashPerField.termAtt}} from 
> the {{null}} value cached in the {{FieldInvertState.termAttribute}}. An NPE 
> would be thrown when {{termsHashPerField.add()}} reasonably but incorrectly 
> assumes a non-null value for {{termAtt}}.






[jira] [Commented] (LUCENE-8610) NPE in TermsHashPerField.add() for TokenStreams with lazily instantiated token Attributes

2018-12-14 Thread Michael Gibney (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16722043#comment-16722043
 ] 

Michael Gibney commented on LUCENE-8610:


Changed to "minor wish"; this patch still might be a good idea, but I 
encountered it in practice because I was using 
{{solr.PreAnalyzedField$PreAnalyzedTokenizer}} incorrectly. 
{{PreAnalyzedTokenizer}} (and its token Attributes) is not designed to be 
reused _at all_ at index time. Caching concerns only apply to TokenStream 
reuse, so now that I've corrected my use of {{PreAnalyzedTokenizer}}, this 
patch could be viewed as a solution in search of a problem.

If this patch still has any merit, it would be because:
 1. there might be TokenStreams that lazily instantiate token Attributes and 
_are_ reused, or
 2. this change would be a prerequisite for potentially modifying 
{{PreAnalyzedTokenizer}} to enable reuse, thus avoiding creation of a 
{{PreAnalyzedTokenizer}} (and all associated token Attributes) for every field 
value.

I'm fine with just closing this issue; but again it's a pretty minor change 
that won't hurt anything, and could in some cases make indexing more robust. Or 
at least clarify whether it's acceptable for index-time TokenStreams to lazily 
instantiate token Attributes ...

> NPE in TermsHashPerField.add() for TokenStreams with lazily instantiated 
> token Attributes
> -
>
> Key: LUCENE-8610
> URL: https://issues.apache.org/jira/browse/LUCENE-8610
> Project: Lucene - Core
>  Issue Type: Wish
>  Components: core/index
>Affects Versions: 7.4, master (8.0)
>Reporter: Michael Gibney
>Priority: Minor
> Attachments: LUCENE-8610.patch
>
>
> {{DefaultIndexingChain.invert(...)}} calls 
> {{invertState.setAttributeSource(stream)}} before {{stream.incrementToken()}} 
> is called.
> For instances of {{stream}} that lazily instantiate token attributes (e.g., 
> as {{solr.PreAnalyzedField$PreAnalyzedTokenizer}} does upon the first call to 
> {{incrementToken()}} that returns {{true}}), this can result in caching a 
> {{null}} value in {{invertState.termAttribute}} for a given {{stream}} 
> instance. 
> Subsequent calls that reuse the same {{stream}} instance (reusing 
> {{TokenStreamComponents}}) for field values with at least 1 token will call 
> {{termHashPerField.start(...)}} which sets {{termsHashPerField.termAtt}} from 
> the {{null}} value cached in the {{FieldInvertState.termAttribute}}. An NPE 
> is thrown when {{termsHashPerField.add()}} reasonably but incorrectly assumes 
> a non-null value for {{termAtt}}.






[jira] [Updated] (LUCENE-8610) NPE in TermsHashPerField.add() for TokenStreams with lazily instantiated token Attributes

2018-12-14 Thread Michael Gibney (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Gibney updated LUCENE-8610:
---
Issue Type: Wish  (was: Bug)

> NPE in TermsHashPerField.add() for TokenStreams with lazily instantiated 
> token Attributes
> -
>
> Key: LUCENE-8610
> URL: https://issues.apache.org/jira/browse/LUCENE-8610
> Project: Lucene - Core
>  Issue Type: Wish
>  Components: core/index
>Affects Versions: 7.4, master (8.0)
>Reporter: Michael Gibney
>Priority: Minor
> Attachments: LUCENE-8610.patch
>
>
> {{DefaultIndexingChain.invert(...)}} calls 
> {{invertState.setAttributeSource(stream)}} before {{stream.incrementToken()}} 
> is called.
> For instances of {{stream}} that lazily instantiate token attributes (e.g., 
> as {{solr.PreAnalyzedField$PreAnalyzedTokenizer}} does upon the first call to 
> {{incrementToken()}} that returns {{true}}), this can result in caching a 
> {{null}} value in {{invertState.termAttribute}} for a given {{stream}} 
> instance. 
> Subsequent calls that reuse the same {{stream}} instance (reusing 
> {{TokenStreamComponents}}) for field values with at least 1 token will call 
> {{termHashPerField.start(...)}} which sets {{termsHashPerField.termAtt}} from 
> the {{null}} value cached in the {{FieldInvertState.termAttribute}}. An NPE 
> is thrown when {{termsHashPerField.add()}} reasonably but incorrectly assumes 
> a non-null value for {{termAtt}}.






[jira] [Updated] (LUCENE-8610) NPE in TermsHashPerField.add() for TokenStreams with lazily instantiated token Attributes

2018-12-14 Thread Michael Gibney (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Gibney updated LUCENE-8610:
---
Priority: Minor  (was: Major)

> NPE in TermsHashPerField.add() for TokenStreams with lazily instantiated 
> token Attributes
> -
>
> Key: LUCENE-8610
> URL: https://issues.apache.org/jira/browse/LUCENE-8610
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/index
>Affects Versions: 7.4, master (8.0)
>Reporter: Michael Gibney
>Priority: Minor
> Attachments: LUCENE-8610.patch
>
>
> {{DefaultIndexingChain.invert(...)}} calls 
> {{invertState.setAttributeSource(stream)}} before {{stream.incrementToken()}} 
> is called.
> For instances of {{stream}} that lazily instantiate token attributes (e.g., 
> as {{solr.PreAnalyzedField$PreAnalyzedTokenizer}} does upon the first call to 
> {{incrementToken()}} that returns {{true}}), this can result in caching a 
> {{null}} value in {{invertState.termAttribute}} for a given {{stream}} 
> instance. 
> Subsequent calls that reuse the same {{stream}} instance (reusing 
> {{TokenStreamComponents}}) for field values with at least 1 token will call 
> {{termHashPerField.start(...)}} which sets {{termsHashPerField.termAtt}} from 
> the {{null}} value cached in the {{FieldInvertState.termAttribute}}. An NPE 
> is thrown when {{termsHashPerField.add()}} reasonably but incorrectly assumes 
> a non-null value for {{termAtt}}.






[jira] [Updated] (LUCENE-8610) NPE in TermsHashPerField.add() for TokenStreams with lazily instantiated token Attributes

2018-12-14 Thread Michael Gibney (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Gibney updated LUCENE-8610:
---
Attachment: (was: LUCENE-8610.patch)

> NPE in TermsHashPerField.add() for TokenStreams with lazily instantiated 
> token Attributes
> -
>
> Key: LUCENE-8610
> URL: https://issues.apache.org/jira/browse/LUCENE-8610
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/index
>Affects Versions: 7.4, master (8.0)
>Reporter: Michael Gibney
>Priority: Major
>
> {{DefaultIndexingChain.invert(...)}} calls 
> {{invertState.setAttributeSource(stream)}} before {{stream.incrementToken()}} 
> is called.
> For instances of {{stream}} that lazily instantiate token attributes (e.g., 
> as {{solr.PreAnalyzedField$PreAnalyzedTokenizer}} does upon the first call to 
> {{incrementToken()}} that returns {{true}}), this can result in caching a 
> {{null}} value in {{invertState.termAttribute}} for a given {{stream}} 
> instance. 
> Subsequent calls that reuse the same {{stream}} instance (reusing 
> {{TokenStreamComponents}}) for field values with at least 1 token will call 
> {{termHashPerField.start(...)}} which sets {{termsHashPerField.termAtt}} from 
> the {{null}} value cached in the {{FieldInvertState.termAttribute}}. An NPE 
> is thrown when {{termsHashPerField.add()}} reasonably but incorrectly assumes 
> a non-null value for {{termAtt}}.






[jira] [Updated] (LUCENE-8610) NPE in TermsHashPerField.add() for TokenStreams with lazily instantiated token Attributes

2018-12-14 Thread Michael Gibney (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Gibney updated LUCENE-8610:
---
Attachment: LUCENE-8610.patch

> NPE in TermsHashPerField.add() for TokenStreams with lazily instantiated 
> token Attributes
> -
>
> Key: LUCENE-8610
> URL: https://issues.apache.org/jira/browse/LUCENE-8610
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/index
>Affects Versions: 7.4, master (8.0)
>Reporter: Michael Gibney
>Priority: Major
> Attachments: LUCENE-8610.patch
>
>
> {{DefaultIndexingChain.invert(...)}} calls 
> {{invertState.setAttributeSource(stream)}} before {{stream.incrementToken()}} 
> is called.
> For instances of {{stream}} that lazily instantiate token attributes (e.g., 
> as {{solr.PreAnalyzedField$PreAnalyzedTokenizer}} does upon the first call to 
> {{incrementToken()}} that returns {{true}}), this can result in caching a 
> {{null}} value in {{invertState.termAttribute}} for a given {{stream}} 
> instance. 
> Subsequent calls that reuse the same {{stream}} instance (reusing 
> {{TokenStreamComponents}}) for field values with at least 1 token will call 
> {{termHashPerField.start(...)}} which sets {{termsHashPerField.termAtt}} from 
> the {{null}} value cached in the {{FieldInvertState.termAttribute}}. An NPE 
> is thrown when {{termsHashPerField.add()}} reasonably but incorrectly assumes 
> a non-null value for {{termAtt}}.






[jira] [Created] (LUCENE-8610) NPE in TermsHashPerField.add() for TokenStreams with lazily instantiated token Attributes

2018-12-14 Thread Michael Gibney (JIRA)
Michael Gibney created LUCENE-8610:
--

 Summary: NPE in TermsHashPerField.add() for TokenStreams with 
lazily instantiated token Attributes
 Key: LUCENE-8610
 URL: https://issues.apache.org/jira/browse/LUCENE-8610
 Project: Lucene - Core
  Issue Type: Bug
  Components: core/index
Affects Versions: 7.4, master (8.0)
Reporter: Michael Gibney


{{DefaultIndexingChain.invert(...)}} calls 
{{invertState.setAttributeSource(stream)}} before {{stream.incrementToken()}} 
is called.

For instances of {{stream}} that lazily instantiate token attributes (e.g., as 
{{solr.PreAnalyzedField$PreAnalyzedTokenizer}} does upon the first call to 
{{incrementToken()}} that returns {{true}}), this can result in caching a 
{{null}} value in {{invertState.termAttribute}} for a given {{stream}} 
instance. 

Subsequent calls that reuse the same {{stream}} instance (reusing 
{{TokenStreamComponents}}) for field values with at least 1 token will call 
{{termHashPerField.start(...)}} which sets {{termsHashPerField.termAtt}} from 
the {{null}} value cached in the {{FieldInvertState.termAttribute}}. An NPE is 
thrown when {{termsHashPerField.add()}} reasonably but incorrectly assumes a 
non-null value for {{termAtt}}.
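
To make the failure mode concrete, here is a sketch of the lazy pattern at 
issue (a hypothetical stream of my own, not {{PreAnalyzedTokenizer}} itself):
{code:java}
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public final class LazyAttributeTokenizer extends Tokenizer {
  private CharTermAttribute termAtt; // not added at construction time!

  @Override
  public boolean incrementToken() {
    if (termAtt == null) {
      // Too late: a consumer that captured this stream's attributes before
      // the first token (as DefaultIndexingChain.invert(...) does) has
      // already recorded the term attribute as absent.
      termAtt = addAttribute(CharTermAttribute.class);
    }
    clearAttributes();
    termAtt.setEmpty().append("token");
    return true; // sketch only: a real stream must eventually return false
  }
}
{code}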






[jira] [Commented] (LUCENE-8563) Remove k1+1 from the numerator of BM25Similarity

2018-11-15 Thread Michael Gibney (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688455#comment-16688455
 ] 

Michael Gibney commented on LUCENE-8563:


I see; +1 as well.

Seeing the main practical motivation for the change as being "comparable scores 
across queries", this would I think also improve (unboosted) score 
comparability (relevant for dismax) across different fields configured with 
different similarities and different k1 (TF saturation rate). So this might 
ultimately _help_ significantly in cases that paradoxically have the bumpiest 
migration path ...

> Remove k1+1 from the numerator of  BM25Similarity
> -
>
> Key: LUCENE-8563
> URL: https://issues.apache.org/jira/browse/LUCENE-8563
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>
> Our current implementation of BM25 does
> {code:java}
> boost * IDF * (k1+1) * tf / (tf + norm)
> {code}
> As (k1+1) is a constant, it is the same for every term and doesn't modify 
> ordering. It is often omitted and I found out that the "The Probabilistic 
> Relevance Framework: BM25 and Beyond" paper by Robertson (BM25's author) and 
> Zaragova even describes adding (k1+1) to the numerator as a variant whose 
> benefit is to be more comparable with Robertson/Sparck-Jones weighting, which 
> we don't care about.
> {quote}A common variant is to add a (k1 + 1) component to the
>  numerator of the saturation function. This is the same for all
>  terms, and therefore does not affect the ranking produced.
>  The reason for including it was to make the final formula
>  more compatible with the RSJ weight used on its own
> {quote}
> Should we remove it from BM25Similarity as well?
> A side-effect that I'm interested in is that integrating other score 
> contributions (eg. via oal.document.FeatureField) would be a bit easier to 
> reason about. For instance a weight of 3 in FeatureField#newSaturationQuery 
> would have a similar impact as a term whose IDF is 3 (and thus docFreq ~= 5%) 
> rather than a term whose IDF is 3/(k1 + 1).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8563) Remove k1+1 from the numerator of BM25Similarity

2018-11-14 Thread Michael Gibney (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16687440#comment-16687440
 ] 

Michael Gibney commented on LUCENE-8563:


[~jpountz], thanks for pointing out the work on BM25F. I'm interested to take a 
closer look at that.
"Users could multiply their per-field boosts by (k1+1)?" ... thanks, yes! That 
should work in a pinch, though I was so focused on the Similarity that I missed 
the possibility of scaling it externally in this way.

Having k1's presence in the numerator be configurable (either as an extra 
boolean parameter to the (modified) existing BM25Similarity, or something along 
the lines of what [~softwaredoug] suggests) would make sense to me, regardless 
of the benefits of the change (performance or otherwise).
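
As a rough illustration of that externally-scaled workaround (field names and 
boosts hypothetical): with the default k1 = 1.2 the dropped constant is 
k1 + 1 = 2.2, so existing boosts would be multiplied by 2.2 to preserve 
absolute score magnitudes:

{code}
# before (with (k1+1) in the numerator)
qf=title^3 body^1
# after removing (k1+1): multiply each boost by (k1+1) = 2.2
qf=title^6.6 body^2.2
{code}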

> Remove k1+1 from the numerator of  BM25Similarity
> -
>
> Key: LUCENE-8563
> URL: https://issues.apache.org/jira/browse/LUCENE-8563
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>
> Our current implementation of BM25 does
> {code:java}
> boost * IDF * (k1+1) * tf / (tf + norm)
> {code}
> As (k1+1) is a constant, it is the same for every term and doesn't modify 
> ordering. It is often omitted and I found out that the "The Probabilistic 
> Relevance Framework: BM25 and Beyond" paper by Robertson (BM25's author) and 
> Zaragoza even describes adding (k1+1) to the numerator as a variant whose 
> benefit is to be more comparable with Robertson/Sparck-Jones weighting, which 
> we don't care about.
> {quote}A common variant is to add a (k1 + 1) component to the
>  numerator of the saturation function. This is the same for all
>  terms, and therefore does not affect the ranking produced.
>  The reason for including it was to make the final formula
>  more compatible with the RSJ weight used on its own
> {quote}
> Should we remove it from BM25Similarity as well?
> A side-effect that I'm interested in is that integrating other score 
> contributions (eg. via oal.document.FeatureField) would be a bit easier to 
> reason about. For instance a weight of 3 in FeatureField#newSaturationQuery 
> would have a similar impact as a term whose IDF is 3 (and thus docFreq ~= 5%) 
> rather than a term whose IDF is 3/(k1 + 1).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8563) Remove k1+1 from the numerator of BM25Similarity

2018-11-14 Thread Michael Gibney (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686877#comment-16686877
 ] 

Michael Gibney commented on LUCENE-8563:


"assuming a single similarity" -- is this something that we want to assume? If 
every field similarity uses the same k1 param, then sure, relative ordering 
among fields is maintained. But if we're using these scores outside of the 
context of single-similarity, and intend to preserve the ability to adjust the 
k1 param, it's worth noting that this change fundamentally alters the effect of 
the k1 param on absolute scores (and thus also on relative scores across 
similarities).

Namely, removing k1 from the numerator places a hard cap on the score, 
regardless of TF or k1 setting. The concept of saturation is preserved, but 
with no numerator k1, saturation is implemented strictly by depressing scores 
(with respect to the hard cap, by varying amounts according to TF) as k1 
increases. The model with k1 in the numerator strikes me as being more 
flexible, both depressing scores for lower TF _and increasing_ scores for 
higher TF, around an inflection point determined by length norms and the value 
of b.

I'm sure this change would be appropriate for some scenarios, but it's a 
fundamental change that could in some cases have significant downstream 
consequences, with no easy way (as far as I can tell) to maintain existing 
behavior.
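
A small self-contained illustration of that distinction (not Lucene code; the 
length norm is fixed at 1 to isolate the saturation term):

{code:java}
public class Bm25Saturation {
  // (k1+1) in the numerator: approaches (k1 + 1) as tf grows
  static double withK1(double tf, double k1) {
    return (k1 + 1) * tf / (tf + k1);
  }

  // no (k1+1): approaches a hard cap of 1, for any k1
  static double withoutK1(double tf, double k1) {
    return tf / (tf + k1);
  }

  public static void main(String[] args) {
    for (double k1 : new double[] {0.5, 1.2, 2.0}) {
      System.out.printf("k1=%.1f tf=10: with=%.3f without=%.3f%n",
          k1, withK1(10, k1), withoutK1(10, k1));
    }
    // with:    1.429, 1.964, 2.500 -- higher k1 *increases* high-TF scores
    // without: 0.952, 0.893, 0.833 -- higher k1 only depresses toward the cap
  }
}
{code}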

> Remove k1+1 from the numerator of  BM25Similarity
> -
>
> Key: LUCENE-8563
> URL: https://issues.apache.org/jira/browse/LUCENE-8563
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>
> Our current implementation of BM25 does
> {code:java}
> boost * IDF * (k1+1) * tf / (tf + norm)
> {code}
> As (k1+1) is a constant, it is the same for every term and doesn't modify 
> ordering. It is often omitted and I found out that the "The Probabilistic 
> Relevance Framework: BM25 and Beyond" paper by Robertson (BM25's author) and 
> Zaragoza even describes adding (k1+1) to the numerator as a variant whose 
> benefit is to be more comparable with Robertson/Sparck-Jones weighting, which 
> we don't care about.
> {quote}A common variant is to add a (k1 + 1) component to the
>  numerator of the saturation function. This is the same for all
>  terms, and therefore does not affect the ranking produced.
>  The reason for including it was to make the final formula
>  more compatible with the RSJ weight used on its own
> {quote}
> Should we remove it from BM25Similarity as well?
> A side-effect that I'm interested in is that integrating other score 
> contributions (eg. via oal.document.FeatureField) would be a bit easier to 
> reason about. For instance a weight of 3 in FeatureField#newSaturationQuery 
> would have a similar impact as a term whose IDF is 3 (and thus docFreq ~= 5%) 
> rather than a term whose IDF is 3/(k1 + 1).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-12952) TFIDF scorer uses max docs instead of num docs when using Edismax

2018-11-01 Thread Michael Gibney (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16671583#comment-16671583
 ] 

Michael Gibney commented on SOLR-12952:
---

[~moshebla], other than using TLOG/PULL replica types (as [~elyograg] 
suggests), you could look into [statsCache (distributed 
IDF)|https://lucene.apache.org/solr/guide/7_5/distributed-requests.html#configuring-statscache-distributed-idf],
 which basically does some post-calculation adjustment to normalize IDF across 
different shards.
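
For example (per the linked ref guide page), in solrconfig.xml:

{code:xml}
<!-- one of: LocalStatsCache (default), ExactStatsCache,
     ExactSharedStatsCache, LRUStatsCache -->
<statsCache class="org.apache.solr.search.stats.ExactStatsCache"/>
{code}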

> TFIDF scorer uses max docs instead of num docs when using Edismax
> -
>
> Key: SOLR-12952
> URL: https://issues.apache.org/jira/browse/SOLR-12952
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: mosh
>Priority: Major
>
> I have recently noticed some odd behavior while using the edismax query 
> parser.
>  The scores returned by documents seem to be affected by deleted documents, 
> which have yet to be merged and completely removed from the index.
>  This causes different replicas to return different scores for the same query.
>  Is this a bug, or am I missing something?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-12243) Edismax missing phrase queries when phrases contain multiterm synonyms

2018-10-30 Thread Michael Gibney (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16669095#comment-16669095
 ] 

Michael Gibney commented on SOLR-12243:
---

[~thetaphi], that makes sense to me.

"The fix would be to improve SpanNear to produce the same query like 
PhraseQuery for certain slop values," see LUCENE-8544.

> Edismax missing phrase queries when phrases contain multiterm synonyms
> --
>
> Key: SOLR-12243
> URL: https://issues.apache.org/jira/browse/SOLR-12243
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: query parsers
>Affects Versions: 7.1
> Environment: RHEL, MacOS X
> Do not believe this is environment-specific.
>Reporter: Elizabeth Haubert
>Assignee: Uwe Schindler
>Priority: Major
> Attachments: SOLR-12243.patch, SOLR-12243.patch, SOLR-12243.patch, 
> SOLR-12243.patch, SOLR-12243.patch, SOLR-12243.patch, multiword-synonyms.txt, 
> schema.xml, solrconfig.xml
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> synonyms.txt:
> {code}
> allergic, hypersensitive
> aspirin, acetylsalicylic acid
> dog, canine, canis familiris, k 9
> rat, rattus
> {code}
> request handler:
> {code:xml}
> 
>  
> 
>  edismax
>   0.4
>  title^100
>  title~20^5000
>  title~11
>  title~22^1000
>  text
>  
>  3<-1 6<-3 9<30%
>  *:*
>  25
> 
> 
> {code}
> Phrase queries (pf, pf2, pf3) containing "dog" or "aspirin"  against the 
> above list will not be generated.
> "allergic reaction dog" will generate pf2: "allergic reaction", but not 
> pf:"allergic reaction dog", pf2: "reaction dog", or pf3: "allergic reaction 
> dog"
> "aspirin dose in rats" will generate pf3: "dose ? rats" but not pf2: "aspirin 
> dose" or pf3:"aspirin dose ?"
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-12243) Edismax missing phrase queries when phrases contain multiterm synonyms

2018-10-30 Thread Michael Gibney (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16668907#comment-16668907
 ] 

Michael Gibney commented on SOLR-12243:
---

I share [~ehaubert]'s intuition that "adding top-level clauses comes up as more 
expensive than the embedded spans", including the caveat: "but suppose would 
need to add a test to say that for sure".

I would characterize the permutation problem as substantially _different_ as a 
result of the recent Lucene fixes. The fixes (LUCENE-8531) essentially reverted 
LUCENE-7638, an optimization that explicitly set out to address the potential 
for combinatoric explosion. A {{SpanQuery}}-based implementation still has to 
deal with the possibility of combinatoric _matching_ in documents, but the 
{{SpanQuery}} implementation handles graph "branching" natively (as nested 
Spans), as opposed to running a full, separate, top-level, {{PhraseQuery}} for 
each possible combination of terms.
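
A hedged sketch of the structural difference (terms borrowed from the synonym 
example below; not the parser's literal output):

{code:java}
import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanOrQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

public class SynonymSpans {
  // One nested span query handles the "aspirin"/"acetylsalicylic acid" branch
  // natively, where the enumerated alternative runs a separate top-level
  // PhraseQuery per path ("aspirin dose", "acetylsalicylic acid dose", ...),
  // multiplying with each additional branching position.
  public static SpanQuery build() {
    return new SpanNearQuery(new SpanQuery[] {
        new SpanOrQuery(
            new SpanTermQuery(new Term("text", "aspirin")),
            new SpanNearQuery(new SpanQuery[] {
                new SpanTermQuery(new Term("text", "acetylsalicylic")),
                new SpanTermQuery(new Term("text", "acid"))
            }, 0, true)),
        new SpanTermQuery(new Term("text", "dose"))
    }, 0, true);
  }
}
{code}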

> Edismax missing phrase queries when phrases contain multiterm synonyms
> --
>
> Key: SOLR-12243
> URL: https://issues.apache.org/jira/browse/SOLR-12243
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: query parsers
>Affects Versions: 7.1
> Environment: RHEL, MacOS X
> Do not believe this is environment-specific.
>Reporter: Elizabeth Haubert
>Assignee: Uwe Schindler
>Priority: Major
> Attachments: SOLR-12243.patch, SOLR-12243.patch, SOLR-12243.patch, 
> SOLR-12243.patch, SOLR-12243.patch, SOLR-12243.patch, multiword-synonyms.txt, 
> schema.xml, solrconfig.xml
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> synonyms.txt:
> {code}
> allergic, hypersensitive
> aspirin, acetylsalicylic acid
> dog, canine, canis familiris, k 9
> rat, rattus
> {code}
> request handler:
> {code:xml}
> 
>  
> 
>  edismax
>   0.4
>  title^100
>  title~20^5000
>  title~11
>  title~22^1000
>  text
>  
>  3<-1 6<-3 9<30%
>  *:*
>  25
> 
> 
> {code}
> Phrase queries (pf, pf2, pf3) containing "dog" or "aspirin"  against the 
> above list will not be generated.
> "allergic reaction dog" will generate pf2: "allergic reaction", but not 
> pf:"allergic reaction dog", pf2: "reaction dog", or pf3: "allergic reaction 
> dog"
> "aspirin dose in rats" will generate pf3: "dose ? rats" but not pf2: "aspirin 
> dose" or pf3:"aspirin dose ?"
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8509) NGramTokenizer, TrimFilter and WordDelimiterGraphFilter in combination can produce backwards offsets

2018-10-29 Thread Michael Gibney (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16667406#comment-16667406
 ] 

Michael Gibney commented on LUCENE-8509:


I'd echo [~dsmiley]'s comment over at LUCENE-8516 – "I don't see the big deal 
in a token filter doing tokenization. I see it has certain challenges but don't 
think it's fundamentally wrong".

A special case of the "not-so-crazy" idea proposed above would have WDGF remain 
a {{TokenFilter}}, but require it to be configured to take input directly from 
a {{Tokenizer}} (as opposed to more general {{TokenStream}}). I think this 
would be functionally equivalent to the change proposed at LUCENE-8516. This 
special case would obviate the need for tracking whether there exists a 1:1 
correspondence between input offsets and token text, because such 
correspondence should (?) always exist immediately after the {{Tokenizer}}. 
This approach (or the slightly more general/elaborate "not-so-crazy" approach 
described above) might also address [~rcmuir]'s observation at LUCENE-8516 that 
the {{WordDelimiterTokenizer}} could be viewed as "still a tokenfilter in 
disguise".

As a side note, the configuration referenced in the title and description of 
this issue doesn't illustrate the more general problem particularly well, 
because the problem with this configuration could be equally well addressed by 
causing {{TrimFilter}} to update offsets, or (I think with no effect on 
intended behavior) by simply reordering filters so that {{TrimFilter}} comes 
after WDGF, as sketched below.
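
For concreteness, a hypothetical Solr analyzer chain illustrating that 
reordering (factory attributes abbreviated):

{code:xml}
<!-- WDGF runs before TrimFilter, so WDGF itself sees the leading space and
     adjusts offsets when splitting; TrimFilter then has nothing left to strip
     that would leave offsets stale -->
<analyzer>
  <tokenizer class="solr.NGramTokenizerFactory" minGramSize="3" maxGramSize="3"/>
  <filter class="solr.WordDelimiterGraphFilterFactory"/>
  <filter class="solr.TrimFilterFactory"/>
</analyzer>
{code}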

> NGramTokenizer, TrimFilter and WordDelimiterGraphFilter in combination can 
> produce backwards offsets
> 
>
> Key: LUCENE-8509
> URL: https://issues.apache.org/jira/browse/LUCENE-8509
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Attachments: LUCENE-8509.patch
>
>
> Discovered by an elasticsearch user and described here: 
> https://github.com/elastic/elasticsearch/issues/33710
> The ngram tokenizer produces tokens "a b" and " bb" (note the space at the 
> beginning of the second token).  The WDGF takes the first token and splits it 
> into two, adjusting the offsets of the second token, so we get "a"[0,1] and 
> "b"[2,3].  The trim filter removes the leading space from the second token, 
> leaving offsets unchanged, so WDGF sees "bb"[1,4]; because the leading space 
> has already been stripped, WDGF sees no need to adjust offsets, and emits the 
> token as-is, resulting in the start offsets of the tokenstream being [0, 2, 
> 1], and the IndexWriter rejecting it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (SOLR-12922) Facet parser plugin for json.facet aka custom facet types

2018-10-26 Thread Michael Gibney (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16665299#comment-16665299
 ] 

Michael Gibney edited comment on SOLR-12922 at 10/26/18 3:48 PM:
-

 "more people who can implement own facet handling rather than those who can 
build patched Solr" – indeed!

"wide variety or user specific facet handling logic" – sure, but I guess was 
just curious about more specific use cases. Not necessary, but helpful to avoid 
requiring everyone to do their own imagining of use cases.

I guess I can also weigh in on use cases here ... this issue piqued my interest 
because I was interested in the ability to dynamically vary the configuration 
of subfacets based on attributes of the parent bucket. A concrete example would 
be:
{code:java}
json.facet={
  subject_level_0: {
type: terms,
field: subject_f,
facet: {
  subject_level_1: {
type: terms,
prefix: $parent,
field: subject_f,
limit: 5
  }
}
  }
}
{code}
Note that "$parent" for the "prefix" attribute of the subfacet is _not_ 
supported syntax, but is intended to denote the value of the term in the parent 
bucket. The idea is to support hierarchical browsing over fields whose values 
are hierarchical in nature (e.g., "United States", "United States – History", 
"United States – History – 1783-1815", etc.).

So "$parent" is not supported syntax, but this plugin architecture would make 
it possible to create a subclass of {{FacetFieldParser}} whose {{parse()}} 
method would return a subclass of {{FacetField}} whose 
{{createFacetProcessor(FacetContext fcontext)}} method could parse the parent 
bucket value out of the {{FacetContext.filter}} and return a {{FacetProcessor}} 
with a contextually-determined prefix ... or something like that.

More generally though, I could imagine wanting to do other types of dynamic 
(parameterized and/or contextual) facet configuration, some to support very 
specialized use cases, that would be much more straightforwardly and 
sustainably implemented with this type of plugin architecture.


was (Author: mgibney):
 "more people who can implement own facet handling rather than those who can 
build patched Solr" – indeed!

"wide variety or user specific facet handling logic" – sure, but I guess was 
just curious about more specific use cases. Not necessary, but helpful to avoid 
requiring everyone to do their own imagining of use cases.

I guess I can also weigh in on use cases here ... this issue piqued my interest 
because I was interested in the ability to dynamically vary the configuration 
of subfacets based on attributes of the parent bucket. A concrete example would 
be:
{code:java}
json.facet={
  subject_level_0: {
type: terms,
field: subject_f,
facet: {
  subject_level_1: {
type: terms,
prefix: $parent,
field: subject_f,
limit: 5
  }
}
  }
}
{code}
Note that "$parent" for the "prefix" attribute of the subfacet is _not_ 
supported syntax, but is intended to denote the value of the term in the parent 
bucket. The idea is to support hierarchical browsing over fields whose values 
are hierarchical in nature (e.g., "United States", "United States--History", 
"United States--History--1783-1815", etc.).

So "$parent" is not supported syntax, but this plugin architecture would make 
it possible to create a subclass of {{FacetFieldParser}} whose {{parse()}} 
method would return a subclass of {{FacetField}} whose 
{{createFacetProcessor(FacetContext fcontext)}} method could parse the parent 
bucket value out of the {{FacetContext.filter}} and return a {{FacetProcessor}} 
with a contextually-determined prefix ... or something like that.

More generally though, I could imagine wanting to do other types of dynamic 
(parameterized and/or contextual) facet configuration, some to support very 
specialized use cases, that would be much more straightforwardly and 
sustainably implemented with this type of plugin architecture.

> Facet parser plugin for json.facet aka custom facet types
> -
>
> Key: SOLR-12922
> URL: https://issues.apache.org/jira/browse/SOLR-12922
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Facet Module
>Reporter: Mikhail Khludnev
>Priority: Minor
> Attachments: SOLR-12922.patch, SOLR-12922.patch
>
>
> Why not introduce a plugin for json facet parsers? Attaching a draft patch; 
> it just demonstrates the thing. Test fails, iirc. Opinions?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, 

[jira] [Commented] (SOLR-12922) Facet parser plugin for json.facet aka custom facet types

2018-10-26 Thread Michael Gibney (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16665299#comment-16665299
 ] 

Michael Gibney commented on SOLR-12922:
---

 "more people who can implement own facet handling rather than those who can 
build patched Solr" – indeed!

"wide variety or user specific facet handling logic" – sure, but I guess was 
just curious about more specific use cases. Not necessary, but helpful to avoid 
requiring everyone to do their own imagining of use cases.

I guess I can also weigh in on use cases here ... this issue piqued my interest 
because I was interested in the ability to dynamically vary the configuration 
of subfacets based on attributes of the parent bucket. A concrete example would 
be:
{code:java}
json.facet={
  subject_level_0: {
type: terms,
field: subject_f,
facet: {
  subject_level_1: {
type: terms,
prefix: $parent,
field: subject_f,
limit: 5
  }
}
  }
}
{code}
Note that "$parent" for the "prefix" attribute of the subfacet is _not_ 
supported syntax, but is intended to denote the value of the term in the parent 
bucket. The idea is to support hierarchical browsing over fields whose values 
are hierarchical in nature (e.g., "United States", "United States--History", 
"United States--History--1783-1815", etc.).

So "$parent" is not supported syntax, but this plugin architecture would make 
it possible to create a subclass of {{FacetFieldParser}} whose {{parse()}} 
method would return a subclass of {{FacetField}} whose 
{{createFacetProcessor(FacetContext fcontext)}} method could parse the parent 
bucket value out of the {{FacetContext.filter}} and return a {{FacetProcessor}} 
with a contextually-determined prefix ... or something like that.

More generally though, I could imagine wanting to do other types of dynamic 
(parameterized and/or contextual) facet configuration, some to support very 
specialized use cases, that would be much more straightforwardly and 
sustainably implemented with this type of plugin architecture.

> Facet parser plugin for json.facet aka custom facet types
> -
>
> Key: SOLR-12922
> URL: https://issues.apache.org/jira/browse/SOLR-12922
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Facet Module
>Reporter: Mikhail Khludnev
>Priority: Minor
> Attachments: SOLR-12922.patch, SOLR-12922.patch
>
>
> Why not introduce a plugin for json facet parsers? Attaching a draft patch; 
> it just demonstrates the thing. Test fails, iirc. Opinions?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-12922) Facet parser plugin for json.facet

2018-10-25 Thread Michael Gibney (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16664234#comment-16664234
 ] 

Michael Gibney commented on SOLR-12922:
---

This is interesting. Could the behavior of the proof-of-concept 
{{FuncRangeFacetParser}} be achieved with stock range faceting? Regardless, 
it's a good proof-of-concept; but curiosity has me wondering about additional 
use cases that might help illustrate some of the potential uses/implications of 
the proposed plugin extension.

Also, I just uploaded a slightly modified patch with a passing test.
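
(For reference, stock range faceting along these lines, with a hypothetical 
{{price}} field:)

{code:java}
json.facet={
  prices: {
    type: range,
    field: price,
    start: 0,
    end: 100,
    gap: 20
  }
}
{code}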

> Facet parser plugin for json.facet
> --
>
> Key: SOLR-12922
> URL: https://issues.apache.org/jira/browse/SOLR-12922
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Facet Module
>Reporter: Mikhail Khludnev
>Priority: Minor
> Attachments: SOLR-12922.patch, SOLR-12922.patch
>
>
> Why not introduce a plugin for json facet parsers? Attaching a draft patch; 
> it just demonstrates the thing. Test fails, iirc. Opinions?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-12922) Facet parser plugin for json.facet

2018-10-25 Thread Michael Gibney (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-12922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Gibney updated SOLR-12922:
--
Attachment: SOLR-12922.patch

> Facet parser plugin for json.facet
> --
>
> Key: SOLR-12922
> URL: https://issues.apache.org/jira/browse/SOLR-12922
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Facet Module
>Reporter: Mikhail Khludnev
>Priority: Minor
> Attachments: SOLR-12922.patch, SOLR-12922.patch
>
>
> Why not introduce a plugin for json facet parsers? Attaching a draft patch; 
> it just demonstrates the thing. Test fails, iirc. Opinions?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-8544) In SpanNearQuery, add support for inOrder semantics equivalent to that of (Multi)PhraseQuery

2018-10-24 Thread Michael Gibney (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662390#comment-16662390
 ] 

Michael Gibney edited comment on LUCENE-8544 at 10/25/18 2:57 AM:
--

I can't immediately speak to the possibility of adding this functionality to 
the existing implementation of {{SpanNearQuery}}, but building on a pending 
patch for LUCENE-7398, I think it might actually be pretty straightforward.

The above-referenced patch makes {{NearSpansOrdered}} aware of indexed 
{{PositionLengthAttribute}}. In a positionLength-aware context, it wasn't clear 
to me how to port the {{NearSpansOrdered}} changes to {{NearSpansUnordered}}; 
there were a number of ways to interpret the task, all of which looked pretty 
complicated and/or messy and/or difficult-verging-on-impossible to implement in 
a performant way (and at a higher level, they all seemed a bit semantically 
weird).

But positionLength-aware implementation of {{(Multi)PhraseQuery}} semantics in 
the context of the above-referenced patch should be much simpler: given that 
you have a fixed clause ordering, it just requires supporting negative offsets 
in calculation of slop/edit distance.


was (Author: mgibney):
I can't immediately speak to the possibility of adding this functionality to 
the existing implementation of {{SpanNearQuery}}, but building on an 
outstanding patch for 
[LUCENE-7398|https://issues.apache.org/jira/browse/LUCENE-7398?focusedCommentId=16630529#comment-16630529],
 I think it might actually be pretty straightforward.

The above-referenced patch makes {{NearSpansOrdered}} aware of indexed 
{{PositionLengthAttribute}}. In a positionLength-aware context, it wasn't clear 
to me how to port the {{NearSpansOrdered}} changes to {{NearSpansUnordered}}; 
there were a number of ways to interpret the task, all of which looked pretty 
complicated and/or messy and/or difficult-verging-on-impossible to implement in 
a performant way (and at a higher level, they all seemed a bit semantically 
weird).

But positionLength-aware implementation of {{(Multi)PhraseQuery}} semantics in 
the context of the above-referenced patch should be much simpler: given that 
you have a fixed clause ordering, it just requires supporting negative offsets 
in calculation of slop/edit distance.

> In SpanNearQuery, add support for inOrder semantics equivalent to that of 
> (Multi)PhraseQuery
> 
>
> Key: LUCENE-8544
> URL: https://issues.apache.org/jira/browse/LUCENE-8544
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Reporter: Michael Gibney
>Priority: Minor
>
> As discussed in LUCENE-8531, the semantics of phrase search differs among 
> {{(Multi)PhraseQuery}}, {{SpanNearQuery (inOrder=true)}}, and {{SpanNearQuery 
> (inOrder=false)}}:
>  * {{(Multi)PhraseQuery}}: incorporates the concept of order, and allows 
> negative offsets in calculating slop/edit distance
>  * {{SpanNearQuery (inOrder=true)}}: incorporates the concept of order, and 
> does _not_ allow negative offsets in calculating slop/edit distance
>  * {{SpanNearQuery (inOrder=false)}}: does not incorporate the concept of 
> order at all
> This issue concerns the possibility of adjusting {{SpanNearQuery}} to be 
> configurable to support semantics equivalent to that of 
> {{(Multi)PhraseQuery}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8509) NGramTokenizer, TrimFilter and WordDelimiterGraphFilter in combination can produce backwards offsets

2018-10-24 Thread Michael Gibney (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16663115#comment-16663115
 ] 

Michael Gibney commented on LUCENE-8509:


> The trim filter removes the leading space from the second token, leaving 
> offsets unchanged, so WDGF sees "bb"[1,4]; 

If I understand correctly what [~dsmiley] is saying, then to put it another 
way: doesn't this look more like an issue with {{TrimFilter}}? If WDGF sees as 
input from {{TrimFilter}} "bb"[1,4] (instead of " bb"[1,4] or "bb"[2,4]), then 
it's handling the input correctly, but the input is wrong.
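
Concretely, using the offsets from the description (a plain-text trace, not 
code):

{code}
ngram output:          " bb"  offsets [1,4]
TrimFilter (today):    "bb"   offsets [1,4]   (text trimmed, offsets stale)
offset-updating trim:  "bb"   offsets [2,4]
{code}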

"because tokenization splits offsets and WDGF is playing the role of a 
tokenizer" -- this behavior is notably different from what 
{{SynonymGraphFilter}} does (adding externally-specified alternate 
representations of input tokens). Offsets are really only meaningful with 
respect to input, and new tokens introduced by WDGF are directly derived from 
input, while new tokens introduced by {{SynonymGraphFilter}} are not and thus 
can _only_ inherit offsets of the input token.

> NGramTokenizer, TrimFilter and WordDelimiterGraphFilter in combination can 
> produce backwards offsets
> 
>
> Key: LUCENE-8509
> URL: https://issues.apache.org/jira/browse/LUCENE-8509
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Attachments: LUCENE-8509.patch
>
>
> Discovered by an elasticsearch user and described here: 
> https://github.com/elastic/elasticsearch/issues/33710
> The ngram tokenizer produces tokens "a b" and " bb" (note the space at the 
> beginning of the second token).  The WDGF takes the first token and splits it 
> into two, adjusting the offsets of the second token, so we get "a"[0,1] and 
> "b"[2,3].  The trim filter removes the leading space from the second token, 
> leaving offsets unchanged, so WDGF sees "bb"[1,4]; because the leading space 
> has already been stripped, WDGF sees no need to adjust offsets, and emits the 
> token as-is, resulting in the start offsets of the tokenstream being [0, 2, 
> 1], and the IndexWriter rejecting it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-8531) QueryBuilder hard-codes inOrder=true for generated sloppy span near queries

2018-10-24 Thread Michael Gibney (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662397#comment-16662397
 ] 

Michael Gibney edited comment on LUCENE-8531 at 10/24/18 3:06 PM:
--

Thanks [~thetaphi], and I agree that it would be a separate issue (^"Would it 
be worth opening a new issue to consider introducing the ability to 
specifically request construction of {{SpanNearQuery}} and/or {{inOrder=true}} 
behavior?"). I've created LUCENE-8543, so discussion can move there if anyone's 
interested.

I also created LUCENE-8544, proposing the addition of support for 
{{(Multi)PhraseQuery}} phrase semantics in {{SpanNearQuery}}. I think it should 
be achievable, at least in the context of a proposed patch for 
[LUCENE-7398|https://issues.apache.org/jira/browse/LUCENE-7398?focusedCommentId=16630529#comment-16630529].


was (Author: mgibney):
Thanks [~thetaphi], and I agree that it would be a separate issue (^"Would it 
be worth opening a new issue to consider introducing the ability to 
specifically request construction of {{SpanNearQuery}} and/or {{inOrder=true}} 
behavior?"). I've created LUCENE-8543, so discussion can move there if anyone's 
interested.

I also created LUCENE-8544, proposing the addition of support for 
{{(Multi)PhraseQuery}} phrase semantics in {{SpanNearQuery}}. I think it should 
be achievable, at least in the context of a proposed patch for LUCENE-7398.

> QueryBuilder hard-codes inOrder=true for generated sloppy span near queries
> ---
>
> Key: LUCENE-8531
> URL: https://issues.apache.org/jira/browse/LUCENE-8531
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/queryparser
>Reporter: Steve Rowe
>Assignee: Steve Rowe
>Priority: Major
> Fix For: 7.6, master (8.0)
>
> Attachments: LUCENE-8531.patch
>
>
> QueryBuilder.analyzeGraphPhrase() generates SpanNearQuery-s with passed-in 
> phraseSlop, but hard-codes inOrder ctor param as true.
> Before multi-term synonym support and graph token streams introduced the 
> possibility of generating SpanNearQuery-s, QueryBuilder generated 
> (Multi)PhraseQuery-s, which always interpret slop as allowing reordering 
> edits.  Solr's eDismax query parser generates phrase queries when its 
> pf/pf2/pf3 params are specified, and when multi-term synonyms are used with a 
> graph-aware synonym filter, SpanNearQuery-s are generated that require 
> clauses to be in order; unlike with (Multi)PhraseQuery-s, reordering edits 
> are not allowed, so this is a kind of regression.  See SOLR-12243 for edismax 
> pf/pf2/pf3 context.  (Note that the patch on SOLR-12243 also addresses 
> another problem that blocks eDismax from generating queries *at all* under 
> the above-described circumstances.)
> I propose adding a new analyzeGraphPhrase() method that allows configuration 
> of inOrder, which would allow eDismax to specify inOrder=false.  The existing 
> analyzeGraphPhrase() method would remain with its hard-coded inOrder=true, so 
> existing client behavior would remain unchanged.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8531) QueryBuilder hard-codes inOrder=true for generated sloppy span near queries

2018-10-24 Thread Michael Gibney (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662397#comment-16662397
 ] 

Michael Gibney commented on LUCENE-8531:


Thanks [~thetaphi], and I agree that it would be a separate issue (^"Would it 
be worth opening a new issue to consider introducing the ability to 
specifically request construction of {{SpanNearQuery}} and/or {{inOrder=true}} 
behavior?"). I've created LUCENE-8543, so discussion can move there if anyone's 
interested.

I also created LUCENE-8544, proposing the addition of support for 
{{(Multi)PhraseQuery}} phrase semantics in {{SpanNearQuery}}. I think it should 
be achievable, at least in the context of a proposed patch for LUCENE-7398.

 

 

> QueryBuilder hard-codes inOrder=true for generated sloppy span near queries
> ---
>
> Key: LUCENE-8531
> URL: https://issues.apache.org/jira/browse/LUCENE-8531
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/queryparser
>Reporter: Steve Rowe
>Assignee: Steve Rowe
>Priority: Major
> Fix For: 7.6, master (8.0)
>
> Attachments: LUCENE-8531.patch
>
>
> QueryBuilder.analyzeGraphPhrase() generates SpanNearQuery-s with passed-in 
> phraseSlop, but hard-codes inOrder ctor param as true.
> Before multi-term synonym support and graph token streams introduced the 
> possibility of generating SpanNearQuery-s, QueryBuilder generated 
> (Multi)PhraseQuery-s, which always interpret slop as allowing reordering 
> edits.  Solr's eDismax query parser generates phrase queries when its 
> pf/pf2/pf3 params are specified, and when multi-term synonyms are used with a 
> graph-aware synonym filter, SpanNearQuery-s are generated that require 
> clauses to be in order; unlike with (Multi)PhraseQuery-s, reordering edits 
> are not allowed, so this is a kind of regression.  See SOLR-12243 for edismax 
> pf/pf2/pf3 context.  (Note that the patch on SOLR-12243 also addresses 
> another problem that blocks eDismax from generating queries *at all* under 
> the above-described circumstances.)
> I propose adding a new analyzeGraphPhrase() method that allows configuration 
> of inOrder, which would allow eDismax to specify inOrder=false.  The existing 
> analyzeGraphPhrase() method would remain with its hard-coded inOrder=true, so 
> existing client behavior would remain unchanged.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-8531) QueryBuilder hard-codes inOrder=true for generated sloppy span near queries

2018-10-24 Thread Michael Gibney (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662397#comment-16662397
 ] 

Michael Gibney edited comment on LUCENE-8531 at 10/24/18 3:05 PM:
--

Thanks [~thetaphi], and I agree that it would be a separate issue (^"Would it 
be worth opening a new issue to consider introducing the ability to 
specifically request construction of {{SpanNearQuery}} and/or {{inOrder=true}} 
behavior?"). I've created LUCENE-8543, so discussion can move there if anyone's 
interested.

I also created LUCENE-8544, proposing the addition of support for 
{{(Multi)PhraseQuery}} phrase semantics in {{SpanNearQuery}}. I think it should 
be achievable, at least in the context of a proposed patch for LUCENE-7398.


was (Author: mgibney):
Thanks [~thetaphi], and I agree that it would be a separate issue (^"Would it 
be worth opening a new issue to consider introducing the ability to 
specifically request construction of {{SpanNearQuery}} and/or {{inOrder=true}} 
behavior?"). I've created LUCENE-8543, so discussion can move there if anyone's 
interested.

I also created LUCENE-8544, proposing the addition of support for 
{{(Multi)PhraseQuery}} phrase semantics in {{SpanNearQuery}}. I think it should 
be achievable, at least in the context of a proposed patch for LUCENE-7398.

 

 

> QueryBuilder hard-codes inOrder=true for generated sloppy span near queries
> ---
>
> Key: LUCENE-8531
> URL: https://issues.apache.org/jira/browse/LUCENE-8531
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/queryparser
>Reporter: Steve Rowe
>Assignee: Steve Rowe
>Priority: Major
> Fix For: 7.6, master (8.0)
>
> Attachments: LUCENE-8531.patch
>
>
> QueryBuilder.analyzeGraphPhrase() generates SpanNearQuery-s with passed-in 
> phraseSlop, but hard-codes inOrder ctor param as true.
> Before multi-term synonym support and graph token streams introduced the 
> possibility of generating SpanNearQuery-s, QueryBuilder generated 
> (Multi)PhraseQuery-s, which always interpret slop as allowing reordering 
> edits.  Solr's eDismax query parser generates phrase queries when its 
> pf/pf2/pf3 params are specified, and when multi-term synonyms are used with a 
> graph-aware synonym filter, SpanNearQuery-s are generated that require 
> clauses to be in order; unlike with (Multi)PhraseQuery-s, reordering edits 
> are not allowed, so this is a kind of regression.  See SOLR-12243 for edismax 
> pf/pf2/pf3 context.  (Note that the patch on SOLR-12243 also addresses 
> another problem that blocks eDismax from generating queries *at all* under 
> the above-described circumstances.)
> I propose adding a new analyzeGraphPhrase() method that allows configuration 
> of inOrder, which would allow eDismax to specify inOrder=false.  The existing 
> analyzeGraphPhrase() method would remain with its hard-coded inOrder=true, so 
> existing client behavior would remain unchanged.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-8544) In SpanNearQuery, add support for inOrder semantics equivalent to that of (Multi)PhraseQuery

2018-10-24 Thread Michael Gibney (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662390#comment-16662390
 ] 

Michael Gibney edited comment on LUCENE-8544 at 10/24/18 2:57 PM:
--

I can't immediately speak to the possibility of adding this functionality to 
the existing implementation of {{SpanNearQuery}}, but building on an 
outstanding patch for 
[LUCENE-7398|https://issues.apache.org/jira/browse/LUCENE-7398?focusedCommentId=16630529#comment-16630529],
 I think it might actually be pretty straightforward.

The above-referenced patch makes {{NearSpansOrdered}} aware of indexed 
{{PositionLengthAttribute}}. In a positionLength-aware context, it wasn't clear 
to me how to port the {{NearSpansOrdered}} changes to {{NearSpansUnordered}}; 
there were a number of ways to interpret the task, all of which looked pretty 
complicated and/or messy and/or difficult-verging-on-impossible to implement in 
a performant way (and at a higher level, they all seemed a bit semantically 
weird).

But positionLength-aware implementation of {{(Multi)PhraseQuery}} semantics in 
the context of the above-referenced patch should be much simpler: given that 
you have a fixed clause ordering, it just requires supporting negative offsets 
in calculation of slop/edit distance.


was (Author: mgibney):
I can't immediately speak to the possibility of adding this functionality to 
the existing implementation of {{SpanNearQuery}}, but building on an 
outstanding patch for LUCENE-7398, I think it might actually be pretty 
straightforward.

The above-referenced patch makes {{NearSpansOrdered}} aware of indexed 
{{PositionLengthAttribute}}. In a positionLength-aware context, it wasn't clear 
to me how to port the {{NearSpansOrdered}} changes to {{NearSpansUnordered}}; 
there were a number of ways to interpret the task, all of which looked pretty 
complicated and/or messy and/or difficult-verging-on-impossible to implement in 
a performant way (and at a higher level, they all seemed a bit semantically 
weird).

But positionLength-aware implementation of {{(Multi)PhraseQuery}} semantics in 
the context of the above-referenced patch should be much simpler: given that 
you have a fixed clause ordering, it just requires supporting negative offsets 
in calculation of slop/edit distance.

> In SpanNearQuery, add support for inOrder semantics equivalent to that of 
> (Multi)PhraseQuery
> 
>
> Key: LUCENE-8544
> URL: https://issues.apache.org/jira/browse/LUCENE-8544
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Reporter: Michael Gibney
>Priority: Minor
>
> As discussed in LUCENE-8531, the semantics of phrase search differs among 
> {{(Multi)PhraseQuery}}, {{SpanNearQuery (inOrder=true)}}, and {{SpanNearQuery 
> (inOrder=false)}}:
>  * {{(Multi)PhraseQuery}}: incorporates the concept of order, and allows 
> negative offsets in calculating slop/edit distance
>  * {{SpanNearQuery (inOrder=true)}}: incorporates the concept of order, and 
> does _not_ allow negative offsets in calculating slop/edit distance
>  * {{SpanNearQuery (inOrder=false)}}: does not incorporate the concept of 
> order at all
> This issue concerns the possibility of adjusting {{SpanNearQuery}} to be 
> configurable to support semantics equivalent to that of 
> {{(Multi)PhraseQuery}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8544) In SpanNearQuery, add support for inOrder semantics equivalent to that of (Multi)PhraseQuery

2018-10-24 Thread Michael Gibney (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662390#comment-16662390
 ] 

Michael Gibney commented on LUCENE-8544:


I can't immediately speak to the possibility of adding this functionality to 
the existing implementation of {{SpanNearQuery}}, but building on an 
outstanding patch for LUCENE-7398, I think it might actually be pretty 
straightforward.

The above-referenced patch makes {{NearSpansOrdered}} aware of indexed 
{{PositionLengthAttribute}}. In a positionLength-aware context, it wasn't clear 
to me how to port the {{NearSpansOrdered}} changes to {{NearSpansUnordered}}; 
there were a number of ways to interpret the task, all of which looked pretty 
complicated and/or messy and/or difficult-verging-on-impossible to implement in 
a performant way (and at a higher level, they all seemed a bit semantically 
weird).

But positionLength-aware implementation of {{(Multi)PhraseQuery}} semantics in 
the context of the above-referenced patch should be much simpler: given that 
you have a fixed clause ordering, it just requires supporting negative offsets 
in calculation of slop/edit distance.
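
A minimal illustration of the "negative offsets" distinction (field name 
hypothetical):

{code:java}
import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseQuery;

public class TransposedPhrase {
  // Against indexed text "a b", the reversed query "b a" matches with
  // slop >= 2: the transposition is a negative-offset edit that PhraseQuery
  // permits but SpanNearQuery(inOrder=true) never will, regardless of slop.
  public static PhraseQuery build() {
    return new PhraseQuery(2, "field", "b", "a");
  }
}
{code}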

> In SpanNearQuery, add support for inOrder semantics equivalent to that of 
> (Multi)PhraseQuery
> 
>
> Key: LUCENE-8544
> URL: https://issues.apache.org/jira/browse/LUCENE-8544
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Reporter: Michael Gibney
>Priority: Minor
>
> As discussed in LUCENE-8531, the semantics of phrase search differs among 
> {{(Multi)PhraseQuery}}, {{SpanNearQuery (inOrder=true)}}, and {{SpanNearQuery 
> (inOrder=false)}}:
>  * {{(Multi)PhraseQuery}}: incorporates the concept of order, and allows 
> negative offsets in calculating slop/edit distance
>  * {{SpanNearQuery (inOrder=true)}}: incorporates the concept of order, and 
> does _not_ allow negative offsets in calculating slop/edit distance
>  * {{SpanNearQuery (inOrder=false)}}: does not incorporate the concept of 
> order at all
> This issue concerns the possibility of adjusting {{SpanNearQuery}} to be 
> configurable to support semantics equivalent to that of 
> {{(Multi)PhraseQuery}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-8544) In SpanNearQuery, add support for inOrder semantics equivalent to that of (Multi)PhraseQuery

2018-10-24 Thread Michael Gibney (JIRA)
Michael Gibney created LUCENE-8544:
--

 Summary: In SpanNearQuery, add support for inOrder semantics 
equivalent to that of (Multi)PhraseQuery
 Key: LUCENE-8544
 URL: https://issues.apache.org/jira/browse/LUCENE-8544
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/search
Reporter: Michael Gibney


As discussed in LUCENE-8531, the semantics of phrase search differs among 
{{(Multi)PhraseQuery}}, {{SpanNearQuery (inOrder=true)}}, and {{SpanNearQuery 
(inOrder=false)}}:
 * {{(Multi)PhraseQuery}}: incorporates the concept of order, and allows 
negative offsets in calculating slop/edit distance
 * {{SpanNearQuery (inOrder=true)}}: incorporates the concept of order, and 
does _not_ allow negative offsets in calculating slop/edit distance
 * {{SpanNearQuery (inOrder=false)}}: does not incorporate the concept of order 
at all

This issue concerns the possibility of adjusting {{SpanNearQuery}} to be 
configurable to support semantics equivalent to that of {{(Multi)PhraseQuery}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-8543) Add QueryBuilder support for explicitly building SpanNearQuery and/or inOrder=true

2018-10-24 Thread Michael Gibney (JIRA)
Michael Gibney created LUCENE-8543:
--

 Summary: Add QueryBuilder support for explicitly building 
SpanNearQuery and/or inOrder=true
 Key: LUCENE-8543
 URL: https://issues.apache.org/jira/browse/LUCENE-8543
 Project: Lucene - Core
  Issue Type: New Feature
  Components: core/queryparser
Reporter: Michael Gibney


{{QueryBuilder}} has historically built phrases according to the semantics of 
{{(Multi)PhraseQuery}} (which incorporates the concept of order, but allows for 
negative offsets in calculating slop/edit distance).

LUCENE-8531 corrected a bug that substituted {{SpanNearQuery (inOrder=true)}} 
implementation for graph phrase queries despite the fact that for {{slop > 0}} 
the semantics of {{SpanNearQuery (inOrder=[true|false])}} differ from the 
semantics of {{(Multi)PhraseQuery}}.

Inspired by (but not related to) LUCENE-8531, this issue considers the 
likelihood that there are some common use cases for which {{SpanNearQuery}} 
semantics may be preferable to the semantics of {{PhraseQuery}}. The 
distinction between the two is clearer for the {{inOrder=true}} case of 
{{SpanNearQuery}}, which disallows negative offsets in calculating slop/edit 
distance.

The logic for building {{SpanNearQuery}} is already present in 
{{QueryBuilder}}; perhaps {{QueryBuilder}} could expose that logic so that it 
can be leveraged in cases that explicitly desire {{SpanNearQuery}} (and 
associated semantics).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-8531) QueryBuilder hard-codes inOrder=true for generated sloppy span near queries

2018-10-23 Thread Michael Gibney (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16661087#comment-16661087
 ] 

Michael Gibney edited comment on LUCENE-8531 at 10/23/18 8:34 PM:
--

> I think we should keep the default behavior as is. You can still override 
> QueryBuilder#analyzeGraphPhrase to apply a different logic on your side if 
> you want.

Certainly agreed the default behavior should be left as-is. I'm content with 
the flexibility to override, but my suggestion was based on a sense that the 
desire to support {{inOrder=true}} could be a pretty common use case.

The API does specify "phrase", but with a lower-case "p"; does this 
necessarily imply that exclusively {{PhraseQuery}} semantics _should_ be 
supported? It's 
the de facto case that {{PhraseQuery}} semantics _have been_ supported, so it 
definitely makes sense for that to continue to be the default – but I don't 
think it'd be unreasonable to add configurable stock support for 
{{inOrder=true}}. If such support were to be added, {{QueryBuilder}} would seem 
like a logical place to do it, and since the logic necessary to implement is 
already here (in {{analyzeGraphPhrase}}), it should be a trivial addition.

I'm thinking something along the lines of splitting the {{SpanNearQuery}} part 
of {{analyzeGraphPhrase()}} (everything after the "{{if (phraseSlop > 0)}}" 
short-circuit) into its own method. Even if split into a protected method, this 
would allow any override of {{analyzeGraphPhrase()}} to more cleanly leverage 
the existing logic for building {{SpanNearQuery}}.

I'm just explaining my thinking here; I guess the decision ultimately depends 
on how general a use case folks consider {{inOrder=true}} to be.


was (Author: mgibney):
> I think we should keep the default behavior as is. You can still override 
> QueryBuilder#analyzeGraphPhrase to apply a different logic on your side if 
> you want.

Certainly agreed the default behavior should be left as-is. I'm content with 
the flexibility to override, but my suggestion was based on a sense that the 
desire to support {{inOrder=true}} could be a pretty common use case.

The API does specify "phrase", but with a lower-case "p"; does this 
necessarily imply that exclusively {{PhraseQuery}} semantics _should_ be 
supported? It's 
the de facto case that {{PhraseQuery}} semantics _have been_ supported, so it 
definitely makes sense for that to continue to be the default – but I don't 
think it'd be unreasonable to add configurable stock support for 
{{inOrder=true}}. If such support were to be added, {{QueryBuilder}} would seem 
like a logical place to do it, and since the logic necessary to implement is 
already here (in {{analyzeGraphPhrase}}), it should be a trivial addition.

I'm thinking something along the lines of splitting the {{SpanNearQuery}} part 
of {{analyzeGraphPhrase()}} (everything after the "{{if (phraseSlop > 0)}}" 
short-circuit) into its own method. Even if split into a protected method, this 
would allow any override of {{analyzeGraphPhrase}} to more cleanly leverage the 
existing logic for building {{SpanNearQuery}}.

I'm just explaining my thinking here; I guess the decision ultimately depends 
on how general a use case folks consider {{inOrder=true}} to be.

> QueryBuilder hard-codes inOrder=true for generated sloppy span near queries
> ---
>
> Key: LUCENE-8531
> URL: https://issues.apache.org/jira/browse/LUCENE-8531
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/queryparser
>Reporter: Steve Rowe
>Assignee: Steve Rowe
>Priority: Major
> Fix For: 7.6, master (8.0)
>
> Attachments: LUCENE-8531.patch
>
>
> QueryBuilder.analyzeGraphPhrase() generates SpanNearQuery-s with passed-in 
> phraseSlop, but hard-codes inOrder ctor param as true.
> Before multi-term synonym support and graph token streams introduced the 
> possibility of generating SpanNearQuery-s, QueryBuilder generated 
> (Multi)PhraseQuery-s, which always interpret slop as allowing reordering 
> edits.  Solr's eDismax query parser generates phrase queries when its 
> pf/pf2/pf3 params are specified, and when multi-term synonyms are used with a 
> graph-aware synonym filter, SpanNearQuery-s are generated that require 
> clauses to be in order; unlike with (Multi)PhraseQuery-s, reordering edits 
> are not allowed, so this is a kind of regression.  See SOLR-12243 for edismax 
> pf/pf2/pf3 context.  (Note that the patch on SOLR-12243 also addresses 
> another problem that blocks eDismax from generating queries *at all* under 
> the above-described circumstances.)
> I propose adding a new analyzeGraphPhrase() method that allows configuration 
> of inOrder, which would allow eDismax to specify inOrder=false.  

[jira] [Commented] (LUCENE-8531) QueryBuilder hard-codes inOrder=true for generated sloppy span near queries

2018-10-23 Thread Michael Gibney (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16661087#comment-16661087
 ] 

Michael Gibney commented on LUCENE-8531:


> I think we should keep the default behavior as is. You can still override 
> QueryBuilder#analyzeGraphPhrase to apply a different logic on your side if 
> you want.

Certainly agreed the default behavior should be left as-is. I'm content with 
the flexibility to override, but my suggestion was based on a sense that the 
desire to support {{inOrder=true}} could be a pretty common use case.

The API does specify "phrase", but with a lower-case "p"; does this 
necessarily imply that exclusively {{PhraseQuery}} semantics _should_ be 
supported? It's 
the de facto case that {{PhraseQuery}} semantics _have been_ supported, so it 
definitely makes sense for that to continue to be the default – but I don't 
think it'd be unreasonable to add configurable stock support for 
{{inOrder=true}}. If such support were to be added, {{QueryBuilder}} would seem 
like a logical place to do it, and since the logic necessary to implement is 
already here (in {{analyzeGraphPhrase}}), it should be a trivial addition.

I'm thinking something along the lines of splitting the {{SpanNearQuery}} part 
of {{analyzeGraphPhrase()}} (everything after the "{{if (phraseSlop > 0)}}" 
short-circuit) into its own method. Even if split into a protected method, this 
would allow any override of {{analyzeGraphPhrase}} to more cleanly leverage the 
existing logic for building {{SpanNearQuery}}.

I'm just explaining my thinking here; I guess the decision ultimately depends 
on how general a use case folks consider {{inOrder=true}} to be.

> QueryBuilder hard-codes inOrder=true for generated sloppy span near queries
> ---
>
> Key: LUCENE-8531
> URL: https://issues.apache.org/jira/browse/LUCENE-8531
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/queryparser
>Reporter: Steve Rowe
>Assignee: Steve Rowe
>Priority: Major
> Fix For: 7.6, master (8.0)
>
> Attachments: LUCENE-8531.patch
>
>
> QueryBuilder.analyzeGraphPhrase() generates SpanNearQuery-s with passed-in 
> phraseSlop, but hard-codes inOrder ctor param as true.
> Before multi-term synonym support and graph token streams introduced the 
> possibility of generating SpanNearQuery-s, QueryBuilder generated 
> (Multi)PhraseQuery-s, which always interpret slop as allowing reordering 
> edits.  Solr's eDismax query parser generates phrase queries when its 
> pf/pf2/pf3 params are specified, and when multi-term synonyms are used with a 
> graph-aware synonym filter, SpanNearQuery-s are generated that require 
> clauses to be in order; unlike with (Multi)PhraseQuery-s, reordering edits 
> are not allowed, so this is a kind of regression.  See SOLR-12243 for edismax 
> pf/pf2/pf3 context.  (Note that the patch on SOLR-12243 also addresses 
> another problem that blocks eDismax from generating queries *at all* under 
> the above-described circumstances.)
> I propose adding a new analyzeGraphPhrase() method that allows configuration 
> of inOrder, which would allow eDismax to specify inOrder=false.  The existing 
> analyzeGraphPhrase() method would remain with its hard-coded inOrder=true, so 
> existing client behavior would remain unchanged.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8531) QueryBuilder hard-codes inOrder=true for generated sloppy span near queries

2018-10-23 Thread Michael Gibney (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16660880#comment-16660880
 ] 

Michael Gibney commented on LUCENE-8531:


I recognize that this was a bug (in that using {{SpanNearQuery}} with 
{{inOrder=true}} and {{slop > 0}} changed the behavior, rather than simply the 
implementation, of the built query).

That said, there surely are potential use cases for the {{inOrder=true}} 
behavior, which is supported by {{SpanNearQuery}} but not by 
{{(Multi)PhraseQuery}}. Would it be worth opening a new issue to consider 
introducing the ability to specifically request construction of 
{{SpanNearQuery}} and/or {{inOrder=true}} behavior? The work that went into 
building {{SpanNearQuery}} for phrases (commit 
[96e8f0a0afe|https://github.com/apache/lucene-solr/commit/96e8f0a0afeb68e2d07ec1dda362894f0b94333d])
 is still useful and relevant, even if the result isn't backward-compatible for 
the case where {{slop > 0}}.
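
To illustrate the semantic difference in question, a small sketch (field and 
terms assumed): with {{slop >= 2}}, the {{PhraseQuery}} tolerates a 
transposition, while the ordered {{SpanNearQuery}} never matches the terms 
reversed, no matter the slop:

{code:java}
import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

public class InOrderDemo {
  public static void main(String[] args) {
    // PhraseQuery treats slop as an edit distance, and a transposition costs
    // two moves, so this also matches a document containing "world hello".
    PhraseQuery phrase = new PhraseQuery(2, "body", "hello", "world");

    // SpanNearQuery with inOrder=true matches "hello ... world" with up to
    // two intervening positions, but never "world hello".
    SpanQuery spans = new SpanNearQuery(new SpanQuery[] {
        new SpanTermQuery(new Term("body", "hello")),
        new SpanTermQuery(new Term("body", "world"))}, 2, true);

    System.out.println(phrase + "  vs  " + spans);
  }
}
{code}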

> QueryBuilder hard-codes inOrder=true for generated sloppy span near queries
> ---
>
> Key: LUCENE-8531
> URL: https://issues.apache.org/jira/browse/LUCENE-8531
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/queryparser
>Reporter: Steve Rowe
>Assignee: Steve Rowe
>Priority: Major
> Fix For: 7.6, master (8.0)
>
> Attachments: LUCENE-8531.patch
>
>
> QueryBuilder.analyzeGraphPhrase() generates SpanNearQuery-s with passed-in 
> phraseSlop, but hard-codes inOrder ctor param as true.
> Before multi-term synonym support and graph token streams introduced the 
> possibility of generating SpanNearQuery-s, QueryBuilder generated 
> (Multi)PhraseQuery-s, which always interpret slop as allowing reordering 
> edits.  Solr's eDismax query parser generates phrase queries when its 
> pf/pf2/pf3 params are specified, and when multi-term synonyms are used with a 
> graph-aware synonym filter, SpanNearQuery-s are generated that require 
> clauses to be in order; unlike with (Multi)PhraseQuery-s, reordering edits 
> are not allowed, so this is a kind of regression.  See SOLR-12243 for edismax 
> pf/pf2/pf3 context.  (Note that the patch on SOLR-12243 also addresses 
> another problem that blocks eDismax from generating queries *at all* under 
> the above-described circumstances.)
> I propose adding a new analyzeGraphPhrase() method that allows configuration 
> of inOrder, which would allow eDismax to specify inOrder=false.  The existing 
> analyzeGraphPhrase() method would remain with its hard-coded inOrder=true, so 
> existing client behavior would remain unchanged.






[jira] [Comment Edited] (LUCENE-7398) Nested Span Queries are buggy

2018-09-27 Thread Michael Gibney (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16630529#comment-16630529
 ] 

Michael Gibney edited comment on LUCENE-7398 at 9/27/18 3:01 PM:
-

I have a branch containing a candidate fix for this issue: 
[LUCENE-7398/master|https://github.com/magibney/lucene-solr/tree/LUCENE-7398/master]

It includes support for complete graph-based matching, configurable to include:
 # all valid top-level {{startPosition}}s
 # all valid match lengths (in the {{startPosition - endPosition}} sense)
 # all valid match {{width}}s (in the slop sense)
 # all redundant matches (different {{Term}}s, same {{startPosition}}, 
{{endPosition}}, and {{width}})
 # all possible valid combinations of subclause positions
 # all possible valid combinations of subclause positions

Option 1 is appropriate for top-level matching and document matching (and is 
complete for that use case); options 2/3 may be used in subclauses to guarantee 
complete matching of parent {{Spans}}; option 4 results in very thorough 
scoring. Option 5 would be an unusual use case; but I think there are some 
applications for full combinatoric matching, and the option was well supported 
by the implementation, so it is included for the sake of completeness.

The candidate implementation models the match graph as a kind of 2-dimensional 
queue that supports random-access seek and arbitrary node removal (a toy 
sketch of such a structure follows the list of posts below). A more thorough 
explanation would be unwieldy in a comment, so I wrote [three 
posts|https://michaelgibney.net/lucene/graph/], which respectively:
 # [Provide some 
background|https://michaelgibney.net/2018/09/lucene-graph-queries-1/] on the 
problem associated with LUCENE-7398 (this post is heavily informed by the 
discussion on this issue)
 # [Describe the candidate 
implementation|https://michaelgibney.net/2018/09/lucene-graph-queries-2/] in 
some detail (also includes information on how to configure/test/evaluate)
 # [Anticipate some possible 
consequences/applications|https://michaelgibney.net/2018/09/lucene-graph-queries-3/]
 of new functionality that would be enabled by this (or other equivalent) fix
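
As a toy sketch only (my own illustration, not code from the branch), the 
"2-dimensional queue" idea amounts to nodes that are doubly linked along two 
axes, so that traversal, seek, and removal can proceed along either dimension:

{code:java}
// Hypothetical node for a 2-dimensional linked structure: one chain per
// dimension (e.g., one over startPositions, one over endPositions for a
// given startPosition). Unlinking is O(1) and leaves both chains intact.
final class BiLinkedNode<T> {
  T value;
  BiLinkedNode<T> prevRow, nextRow; // neighbors along the first dimension
  BiLinkedNode<T> prevCol, nextCol; // neighbors along the second dimension

  void unlink() {
    if (prevRow != null) prevRow.nextRow = nextRow;
    if (nextRow != null) nextRow.prevRow = prevRow;
    if (prevCol != null) prevCol.nextCol = nextCol;
    if (nextCol != null) nextCol.prevCol = prevCol;
  }
}
{code}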

Some notes:
 # The branch contains (and passes) all tests proposed so far in association 
with this issue (and also quite a few additional tests)
 # The candidate implementation is made more complete and performant by the 
addition of some extra information in the index (e.g., {{positionLength}}). 
This extra information is currently stored using {{Payload}}s (a minimal 
sketch of that encoding approach follows this list), though for 
{{positionLength}} at least, there has been some discussion of integrating it 
more directly in the index (see LUCENE-4312, LUCENE-3843)
 # Some version of this code has been running in production for several months, 
and has given no indication of instability, even running every user phrase 
query (both explicit and {{pf}}) as a graph query.
 # To facilitate evaluation, the fix is integrated in 
[master|https://github.com/magibney/lucene-solr/tree/LUCENE-7398/master], 
[branch_7x|https://github.com/magibney/lucene-solr/tree/LUCENE-7398/branch_7x], 
[branch_7_5|https://github.com/magibney/lucene-solr/tree/LUCENE-7398/branch_7_5],
 and 
[branch_7_4|https://github.com/magibney/lucene-solr/tree/LUCENE-7398/branch_7_4].
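
A minimal sketch (my own illustration, not code from the branch) of the 
payload-encoding approach mentioned in note 2, using only the standard token 
attribute APIs:

{code:java}
import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionLengthAttribute;
import org.apache.lucene.util.BytesRef;

// Hypothetical filter: copies each token's positionLength into its payload
// so that the value survives into the index (one unsigned byte is enough
// for typical position lengths).
public final class PositionLengthPayloadFilter extends TokenFilter {
  private final PositionLengthAttribute posLenAtt = addAttribute(PositionLengthAttribute.class);
  private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);

  public PositionLengthPayloadFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    payloadAtt.setPayload(new BytesRef(new byte[] {(byte) posLenAtt.getPositionLength()}));
    return true;
  }
}
{code}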


was (Author: mgibney):
I have a branch containing a candidate fix for this issue: 
[LUCENE-7398/master|https://github.com/magibney/lucene-solr/tree/LUCENE-7398/master]

It includes support for complete graph-based matching, configurable to include:
 # all valid top-level {{startPosition}}s
 # all valid match lengths (in the {{startPosition - endPosition}} sense)
 # all valid match {{width}}s (in the slop sense)
 # all redundant matches (different {{Term}}s, same {{startPosition}}, 
{{endPosition}}, and {{width}})
 # all possible valid combinations of subclause positions
 # all possible valid combinations of subclause positions

Option 1 is appropriate for top-level matching and document matching (and is 
complete for that use case); options 2/3 may be used in subclauses to guarantee 
complete matching of parent {{Spans}}; option 4 results in very thorough 
scoring. Option 5 would be an unusual use case; but I think there are some 
applications for full combinatoric matching, and the option was well supported 
by the implementation, so it is included for the sake of completeness.

The candidate implementation models the match graph as a kind of 2-dimensional 
queue that supports random-access seek and arbitrary node removal. A more 
thorough explanation would be unwieldy in a comment, so I wrote [three 
posts|https://michaelgibney.net/lucene/graph/], which respectively:
 # [Provide some 
background|https://michaelgibney.net/2018/09/lucene-graph-queries-1/] on the 
problem associated with LUCENE-7398 (this post is heavily informed by the 
discussion on this issue)
 # [Describe the candidate 
implementation|https://michaelgibney.net/2018/09/lucene-graph-queries-2/] in 
some detail (also includes information on how to configure/test/evaluate)
 # [Anticipate some possible 
consequences/applications|https://michaelgibney.net/2018/09/lucene-graph-queries-3/]
 of new functionality that would be enabled by this (or other equivalent) fix

[jira] [Commented] (LUCENE-7398) Nested Span Queries are buggy

2018-09-27 Thread Michael Gibney (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16630529#comment-16630529
 ] 

Michael Gibney commented on LUCENE-7398:


I have a branch containing a candidate fix for this issue: 
[LUCENE-7398/master|https://github.com/magibney/lucene-solr/tree/LUCENE-7398/master]

It includes support for complete graph-based matching, configurable to include:
 # all valid top-level {{startPosition}}s
 # all valid match lengths (in the {{startPosition - endPosition}} sense)
 # all valid match {{width}}s (in the slop sense)
 # all redundant matches (different {{Term}}s, same {{startPosition}}, 
{{endPosition}}, and {{width}})
 # all possible valid combinations of subclause positions
 # all possible valid combinations of subclause positions

Option 1 is appropriate for top-level matching and document matching (and is 
complete for that use case); options 2/3 may be used in subclauses to guarantee 
complete matching of parent {{Spans}}; option 4 results in very thorough 
scoring. Option 5 would be an unusual use case; but I think there are some 
applications for full combinatoric matching, and the option was well supported 
by the implementation, so it is included for the sake of completeness.

The candidate implementation models the match graph as a kind of 2-dimensional 
queue that supports random-access seek and arbitrary node removal. A more 
thorough explanation would be unwieldy in a comment, so I wrote [three 
posts|https://michaelgibney.net/lucene/graph/], which respectively:
 # [Provide some 
background|https://michaelgibney.net/2018/09/lucene-graph-queries-1/] on the 
problem associated with LUCENE-7398 (this post is heavily informed by the 
discussion on this issue)
 # [Describe the candidate 
implementation|https://michaelgibney.net/2018/09/lucene-graph-queries-2/] in 
some detail (also includes information on how to configure/test/evaluate)
 # [Anticipate some possible 
consequences/applications|https://michaelgibney.net/2018/09/lucene-graph-queries-3/]
 of new functionality that would be enabled by this (or other equivalent) fix

Some notes:
 # The branch contains (and passes) all tests proposed so far in association 
with this issue (and also quite a few additional tests)
 # The candidate implementation is made more complete and performant by the 
addition of some extra information in the index (e.g., {{positionLength}}). 
This extra information is currently stored using {{Payload}}s, though for 
{{positionLength}} at least, there has been some discussion of integrating it 
more directly in the index (see LUCENE-4312)
 # Some version of this code has been running in production for several months, 
and has given no indication of instability, even running every user phrase 
query (both explicit and {{pf}}) as a graph query.
 # To facilitate evaluation, the fix is integrated in 
[master|https://github.com/magibney/lucene-solr/tree/LUCENE-7398/master], 
[branch_7x|https://github.com/magibney/lucene-solr/tree/LUCENE-7398/branch_7x], 
[branch_7_5|https://github.com/magibney/lucene-solr/tree/LUCENE-7398/branch_7_5],
 and 
[branch_7_4|https://github.com/magibney/lucene-solr/tree/LUCENE-7398/branch_7_4].

> Nested Span Queries are buggy
> -
>
> Key: LUCENE-7398
> URL: https://issues.apache.org/jira/browse/LUCENE-7398
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Affects Versions: 5.5
>Reporter: Christoph Goller
>Assignee: Alan Woodward
>Priority: Critical
> Attachments: LUCENE-7398-20160814.patch, LUCENE-7398-20160924.patch, 
> LUCENE-7398-20160925.patch, LUCENE-7398.patch, LUCENE-7398.patch, 
> LUCENE-7398.patch, TestSpanCollection.java
>
>
> Example for a nested SpanQuery that is not working:
> Document: Human Genome Organization , HUGO , is trying to coordinate gene 
> mapping research worldwide.
> Query: spanNear([body:coordinate, spanOr([spanNear([body:gene, body:mapping], 
> 0, true), body:gene]), body:research], 0, true)
> The query should match "coordinate gene mapping research" as well as 
> "coordinate gene research". It does not match  "coordinate gene mapping 
> research" with Lucene 5.5 or 6.1, it did however match with Lucene 4.10.4. It 
> probably stopped working with the changes on SpanQueries in 5.3. I will 
> attach a unit test that shows the problem.






[jira] [Commented] (LUCENE-7848) QueryBuilder.analyzeGraphPhrase does not handle gaps correctly

2018-09-26 Thread Michael Gibney (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629705#comment-16629705
 ] 

Michael Gibney commented on LUCENE-7848:


A patch equivalent to the [^LUCENE-7848-delimOnly-token-offset.patch] of 
14/Jul/2017 has been merged with LUCENE-8395. I think the remaining problems 
related to this issue are more directly addressed by LUCENE-7398.

> QueryBuilder.analyzeGraphPhrase does not handle gaps correctly
> --
>
> Key: LUCENE-7848
> URL: https://issues.apache.org/jira/browse/LUCENE-7848
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 6.5, 6.6
>Reporter: Jim Ferenczi
>Priority: Major
> Attachments: LUCENE-7848-branching-spanOr.patch, 
> LUCENE-7848-delimOnly-token-offset.patch, LUCENE-7848.patch, 
> LUCENE-7848.patch, capture-3.png
>
>
> Position increments greater than 1 are ignored when the query builder creates 
> a graph phrase query. 
> Instead it should use SpanNearQuery.addGap for pos incr > 1.






[jira] [Commented] (SOLR-7798) Improve robustness of ExpandComponent

2018-02-23 Thread Michael Gibney (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16374861#comment-16374861
 ] 

Michael Gibney commented on SOLR-7798:
--

Sorry, yes! I see. The test case was from [~joergr]'s patch (July 2015). I 
incorporated an updated version of the test case (along with a new commit 
message), and pushed a new commit to [PR 
325|https://github.com/apache/lucene-solr/pull/325]. Feel free to use this as 
you see fit – I'm happy to squash-rebase against master if you like. Thanks!

> Improve robustness of ExpandComponent
> -
>
> Key: SOLR-7798
> URL: https://issues.apache.org/jira/browse/SOLR-7798
> Project: Solr
>  Issue Type: Improvement
>  Components: SearchComponents - other
>Reporter: Jörg Rathlev
>Assignee: Joel Bernstein
>Priority: Minor
> Attachments: expand-component.patch, expand-npe.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The {{ExpandComponent}} causes a {{NullPointerException}} if accidentally 
> used without prior collapsing of results.
> If there are multiple documents in the result which have the same term value 
> in the expand field, the size of the {{ordBytes}}/{{groupSet}} differs from 
> the {{count}} value, and the {{getGroupQuery}} method creates an incompletely 
> filled {{bytesRef}} array, which later causes a {{NullPointerException}} when 
> trying to sort the terms.
> The attached patch extends the test to demonstrate the error, and modifies 
> the {{getGroupQuery}} methods to create the array based on the size of the 
> input maps.






[jira] [Commented] (SOLR-7798) Improve robustness of ExpandComponent

2018-02-22 Thread Michael Gibney (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16373938#comment-16373938
 ] 

Michael Gibney commented on SOLR-7798:
--

It looks like the randomness comes from [Line 58 of 
TestExpandComponent|https://github.com/apache/lucene-solr/blob/master/solr/core/src/test/org/apache/solr/handler/component/TestExpandComponent.java#L58],
 when "hint=top_fc" is randomly specified; the problem arises when it's 
specified for a field with no SortedDocValues (the {{null}} comes from 
[here|https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/uninverting/UninvertingReader.java#L349]).
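
For anyone trying to reproduce this directly, something along these lines (a 
hypothetical sketch; the field name is assumed, and it would need to be a 
field without docValues) should exercise the same code path as the randomized 
"hint=top_fc":

{code:java}
import org.apache.solr.common.params.ModifiableSolrParams;
import org.apache.solr.common.params.SolrParams;

public class CollapseTopFcRepro {
  // Collapsing with hint=top_fc on a field that has no SortedDocValues
  // should hit the same null from UninvertingReader.
  public static SolrParams params() {
    return new ModifiableSolrParams()
        .add("q", "*:*")
        .add("fq", "{!collapse field=group_s hint=top_fc}");
  }
}
{code}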

> Improve robustness of ExpandComponent
> -
>
> Key: SOLR-7798
> URL: https://issues.apache.org/jira/browse/SOLR-7798
> Project: Solr
>  Issue Type: Improvement
>  Components: SearchComponents - other
>Reporter: Jörg Rathlev
>Assignee: Joel Bernstein
>Priority: Minor
> Attachments: expand-component.patch, expand-npe.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The {{ExpandComponent}} causes a {{NullPointerException}} if accidentally 
> used without prior collapsing of results.
> If there are multiple documents in the result which have the same term value 
> in the expand field, the size of the {{ordBytes}}/{{groupSet}} differs from 
> the {{count}} value, and the {{getGroupQuery}} method creates an incompletely 
> filled {{bytesRef}} array, which later causes a {{NullPointerException}} when 
> trying to sort the terms.
> The attached patch extends the test to demonstrate the error, and modifies 
> the {{getGroupQuery}} methods to create the array based on the size of the 
> input maps.






[jira] [Commented] (SOLR-7798) Improve robustness of ExpandComponent

2018-02-22 Thread Michael Gibney (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16373889#comment-16373889
 ] 

Michael Gibney commented on SOLR-7798:
--

I was able to reproduce an NPE with the above command, but the test only threw 
it intermittently, and it also reproduces on master (364b680afaf9). I've 
included the stack trace to make sure we're talking about the same issue.
{code:java}
[junit4] 2> 2778 ERROR (searcherExecutor-7-thread-1) [ ] o.a.s.s.LRUCache Error 
during auto-warming of 
key:org.apache.solr.search.QueryResultKey@7ce6ad2e:java.lang.RuntimeException: 
java.lang.NullPointerException
[junit4] 2> at 
org.apache.solr.search.CollapsingQParserPlugin$CollapsingPostFilter.getFilterCollector(CollapsingQParserPlugin.java:378)
[junit4] 2> at 
org.apache.solr.search.SolrIndexSearcher.getProcessedFilter(SolrIndexSearcher.java:1084)
[junit4] 2> at 
org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:1540)
[junit4] 2> at 
org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1416)
[junit4] 2> at 
org.apache.solr.search.SolrIndexSearcher.access$000(SolrIndexSearcher.java:90)
[junit4] 2> at 
org.apache.solr.search.SolrIndexSearcher$3.regenerateItem(SolrIndexSearcher.java:575)
[junit4] 2> at org.apache.solr.search.LRUCache.warm(LRUCache.java:297)
[junit4] 2> at 
org.apache.solr.search.SolrIndexSearcher.warm(SolrIndexSearcher.java:2146)
[junit4] 2> at 
org.apache.solr.core.SolrCore.lambda$getSearcher$16(SolrCore.java:2258)
[junit4] 2> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[junit4] 2> at 
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:188)
[junit4] 2> at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
[junit4] 2> at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
[junit4] 2> at java.lang.Thread.run(Thread.java:745)
[junit4] 2> Caused by: java.lang.NullPointerException
[junit4] 2> at 
org.apache.solr.search.CollapsingQParserPlugin$OrdScoreCollector.<init>(CollapsingQParserPlugin.java:514)
[junit4] 2> at 
org.apache.solr.search.CollapsingQParserPlugin$CollectorFactory.getCollector(CollapsingQParserPlugin.java:1331)
[junit4] 2> at 
org.apache.solr.search.CollapsingQParserPlugin$CollapsingPostFilter.getFilterCollector(CollapsingQParserPlugin.java:367)
[junit4] 2> ... 13 more
{code}

> Improve robustness of ExpandComponent
> -
>
> Key: SOLR-7798
> URL: https://issues.apache.org/jira/browse/SOLR-7798
> Project: Solr
>  Issue Type: Improvement
>  Components: SearchComponents - other
>Reporter: Jörg Rathlev
>Assignee: Joel Bernstein
>Priority: Minor
> Attachments: expand-component.patch, expand-npe.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The {{ExpandComponent}} causes a {{NullPointerException}} if accidentally 
> used without prior collapsing of results.
> If there are multiple documents in the result which have the same term value 
> in the expand field, the size of the {{ordBytes}}/{{groupSet}} differs from 
> the {{count}} value, and the {{getGroupQuery}} method creates an incompletely 
> filled {{bytesRef}} array, which later causes a {{NullPointerException}} when 
> trying to sort the terms.
> The attached patch extends the test to demonstrate the error, and modifies 
> the {{getGroupQuery}} methods to create the array based on the size of the 
> input maps.






[jira] [Commented] (SOLR-7798) Improve robustness of ExpandComponent

2018-02-22 Thread Michael Gibney (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16373489#comment-16373489
 ] 

Michael Gibney commented on SOLR-7798:
--

I was indeed developing on master. Which test case is failing for you? I'm 
getting a couple of failed tests on master (364b680afaf9) that seem to be 
unrelated to the ExpandComponent changes:
{code}
 [junit4] Tests with failures [seed: 2723B1A9FC179033]:
 [junit4]   - 
org.apache.solr.handler.component.DummyCustomParamSpellChecker.initializationError
 [junit4]   - 
org.apache.solr.handler.component.ResourceSharingTestComponent.initializationError
{code}

but when I try to selectively run the ExpandComponent test I'm getting:

{code}
ant -Dtests.class="org.apache.solr.handler.component.TestExpandComponent" test
...
 [junit4] Tests summary: 0 suites, 0 tests
{code}

No errors ... but I guess something is still amiss.
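
(If memory serves, the randomized runner only picks up {{-Dtests.class}} when 
invoked from the module that owns the test, so something like the following, 
run from the {{solr/core}} directory, should find the suite; treat this as a 
guess rather than a confirmed fix:)

{code}
cd solr/core
ant test -Dtests.class="org.apache.solr.handler.component.TestExpandComponent"
{code}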


> Improve robustness of ExpandComponent
> -
>
> Key: SOLR-7798
> URL: https://issues.apache.org/jira/browse/SOLR-7798
> Project: Solr
>  Issue Type: Improvement
>  Components: SearchComponents - other
>Reporter: Jörg Rathlev
>Assignee: Joel Bernstein
>Priority: Minor
> Attachments: expand-component.patch, expand-npe.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The {{ExpandComponent}} causes a {{NullPointerException}} if accidentally 
> used without prior collapsing of results.
> If there are multiple documents in the result which have the same term value 
> in the expand field, the size of the {{ordBytes}}/{{groupSet}} differs from 
> the {{count}} value, and the {{getGroupQuery}} method creates an incompletely 
> filled {{bytesRef}} array, which later causes a {{NullPointerException}} when 
> trying to sort the terms.
> The attached patch extends the test to demonstrate the error, and modifies 
> the {{getGroupQuery}} methods to create the array based on the size of the 
> input maps.






[jira] [Commented] (SOLR-7798) Improve robustness of ExpandComponent

2018-02-22 Thread Michael Gibney (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16373197#comment-16373197
 ] 

Michael Gibney commented on SOLR-7798:
--

Right, sounds good! Thanks for the explanation of the 200 ceiling. Just 
submitted [PR 325|https://github.com/apache/lucene-solr/pull/325].

> Improve robustness of ExpandComponent
> -
>
> Key: SOLR-7798
> URL: https://issues.apache.org/jira/browse/SOLR-7798
> Project: Solr
>  Issue Type: Improvement
>  Components: SearchComponents - other
>Reporter: Jörg Rathlev
>Assignee: Joel Bernstein
>Priority: Minor
> Attachments: expand-component.patch, expand-npe.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The {{ExpandComponent}} causes a {{NullPointerException}} if accidentally 
> used without prior collapsing of results.
> If there are multiple documents in the result which have the same term value 
> in the expand field, the size of the {{ordBytes}}/{{groupSet}} differs from 
> the {{count}} value, and the {{getGroupQuery}} method creates an incompletely 
> filled {{bytesRef}} array, which later causes a {{NullPointerException}} when 
> trying to sort the terms.
> The attached patch extends the test to demonstrate the error, and modifies 
> the {{getGroupQuery}} methods to create the array based on the size of the 
> input maps.






[jira] [Comment Edited] (SOLR-7798) Improve robustness of ExpandComponent

2018-02-22 Thread Michael Gibney (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16372962#comment-16372962
 ] 

Michael Gibney edited comment on SOLR-7798 at 2/22/18 3:46 PM:
---

Thanks, [~joel.bernstein]. I'm happy to prep a PR, but would you mind first 
confirming that {{count}} (and its associated ceiling of 200) is intended to 
represent the number of matching collapse _values_, as opposed to the number of 
result documents associated with those values? Assuming that's the case, is 
there any reason to continue tracking {{count}} externally (as opposed to 
simply relying on {{ordBytes.size()}}, as in [~joergr]'s 
[^expand-component.patch])?


was (Author: mgibney):
Thanks, [~joel.bernstein]. I'm happy to prep a PR, but would you mind first 
confirming that {{count}} (and its associated ceiling of 200) is intended to 
represent the number of matching collapse _values_, as opposed to the number of 
result documents associated with those values? Assuming that's the case, is 
there any reason to continue trac{{king }}{{count}} externally (as opposed to 
simply relying on {{ordBytes.size(), as in [~joergr]'s 
[^expand-component.patch] patch}})?

> Improve robustness of ExpandComponent
> -
>
> Key: SOLR-7798
> URL: https://issues.apache.org/jira/browse/SOLR-7798
> Project: Solr
>  Issue Type: Improvement
>  Components: SearchComponents - other
>Reporter: Jörg Rathlev
>Assignee: Joel Bernstein
>Priority: Minor
> Attachments: expand-component.patch, expand-npe.patch
>
>
> The {{ExpandComponent}} causes a {{NullPointerException}} if accidentally 
> used without prior collapsing of results.
> If there are multiple documents in the result which have the same term value 
> in the expand field, the size of the {{ordBytes}}/{{groupSet}} differs from 
> the {{count}} value, and the {{getGroupQuery}} method creates an incompletely 
> filled {{bytesRef}} array, which later causes a {{NullPointerException}} when 
> trying to sort the terms.
> The attached patch extends the test to demonstrate the error, and modifies 
> the {{getGroupQuery}} methods to create the array based on the size of the 
> input maps.






[jira] [Commented] (SOLR-7798) Improve robustness of ExpandComponent

2018-02-22 Thread Michael Gibney (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16372962#comment-16372962
 ] 

Michael Gibney commented on SOLR-7798:
--

Thanks, [~joel.bernstein]. I'm happy to prep a PR, but would you mind first 
confirming that {{count}} (and its associated ceiling of 200) is intended to 
represent the number of matching collapse _values_, as opposed to the number of 
result documents associated with those values? Assuming that's the case, is 
there any reason to continue tracking {{count}} externally (as opposed to 
simply relying on {{ordBytes.size()}}, as in [~joergr]'s 
[^expand-component.patch])?

> Improve robustness of ExpandComponent
> -
>
> Key: SOLR-7798
> URL: https://issues.apache.org/jira/browse/SOLR-7798
> Project: Solr
>  Issue Type: Improvement
>  Components: SearchComponents - other
>Reporter: Jörg Rathlev
>Assignee: Joel Bernstein
>Priority: Minor
> Attachments: expand-component.patch, expand-npe.patch
>
>
> The {{ExpandComponent}} causes a {{NullPointerException}} if accidentally 
> used without prior collapsing of results.
> If there are multiple documents in the result which have the same term value 
> in the expand field, the size of the {{ordBytes}}/{{groupSet}} differs from 
> the {{count}} value, and the {{getGroupQuery}} method creates an incompletely 
> filled {{bytesRef}} array, which later causes a {{NullPointerException}} when 
> trying to sort the terms.
> The attached patch extends the test to demonstrate the error, and modifies 
> the {{getGroupQuery}} methods to create the array based on the size of the 
> input maps.






[jira] [Commented] (SOLR-7798) Improve robustness of ExpandComponent

2018-02-21 Thread Michael Gibney (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16371599#comment-16371599
 ] 

Michael Gibney commented on SOLR-7798:
--

Although [~joergr]'s initial description mentions an NPE in ExpandComponent "if 
_accidentally_ used without prior collapsing of results" (italics mine), there 
are applications of ExpandComponent that _intentionally_ do not involve prior 
collapsing of results on the expand field. For example, I'm using cached Join 
queries to implement tiered deduplication of the search domain across multiple 
document sources, but do not wish to deduplicate documents against other 
documents from the same source (and specifically wish to deduplicate the search 
domain, as opposed to the set of results). The approach is described in a bit 
more detail [here|https://github.com/upenn-libraries/solr-source-deduplication] 
(bullet points 3, 4, and 7 are particularly relevant).

[^expand-component.patch] looks good to me, as I can't see a reason why 
{{count}} is being tracked separately, rather than relying on 
{{ordBytes.size()}}. The only potential issue I see with it is that where 
{{count}} is used to determine whether {{groupQuery}} is initialized, {{count}} 
now represents a different concept than {{ordBytes.size()}}. I'm not sure what 
the desired behavior would be (or for that matter, what the explanation is for 
the magic "200" ceiling on {{count)}}.

I've uploaded an alternative, [^expand-npe.patch], which differs only in that 
it leaves the separate tracking of {{count}} in place (though I don't think it 
needs to), and in that it checks for duplication when adding each ord to 
groupBits/groupSet, thereby avoiding an unnecessary {{BytesRef.deepCopyOf()}} 
in the (normally rare) case where duplicate terms are encountered.
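
To make the sizing point concrete, here's a sketch of the shape of the fix 
(names follow the discussion here, and {{TermInSetQuery}} merely stands in for 
whatever query type the component actually builds; this is not the verbatim 
Solr source):

{code:java}
import java.util.Map;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermInSetQuery;
import org.apache.lucene.util.BytesRef;

public class GroupQuerySketch {
  // Size the term array from the map itself rather than from an externally
  // tracked count, so duplicate term values can never leave trailing null
  // slots that later NPE when the terms are sorted.
  static Query getGroupQuery(String field, Map<Integer, BytesRef> ordBytes) {
    BytesRef[] groupBytes = new BytesRef[ordBytes.size()];
    int index = 0;
    for (BytesRef term : ordBytes.values()) {
      groupBytes[index++] = term;
    }
    return new TermInSetQuery(field, groupBytes);
  }
}
{code}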

> Improve robustness of ExpandComponent
> -
>
> Key: SOLR-7798
> URL: https://issues.apache.org/jira/browse/SOLR-7798
> Project: Solr
>  Issue Type: Improvement
>  Components: SearchComponents - other
>Reporter: Jörg Rathlev
>Priority: Minor
> Attachments: expand-component.patch, expand-npe.patch
>
>
> The {{ExpandComponent}} causes a {{NullPointerException}} if accidentally 
> used without prior collapsing of results.
> If there are multiple documents in the result which have the same term value 
> in the expand field, the size of the {{ordBytes}}/{{groupSet}} differs from 
> the {{count}} value, and the {{getGroupQuery}} method creates an incompletely 
> filled {{bytesRef}} array, which later causes a {{NullPointerException}} when 
> trying to sort the terms.
> The attached patch extends the test to demonstrate the error, and modifies 
> the {{getGroupQuery}} methods to create the array based on the size of the 
> input maps.






[jira] [Updated] (SOLR-7798) Improve robustness of ExpandComponent

2018-02-21 Thread Michael Gibney (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-7798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Gibney updated SOLR-7798:
-
Attachment: expand-npe.patch

> Improve robustness of ExpandComponent
> -
>
> Key: SOLR-7798
> URL: https://issues.apache.org/jira/browse/SOLR-7798
> Project: Solr
>  Issue Type: Improvement
>  Components: SearchComponents - other
>Reporter: Jörg Rathlev
>Priority: Minor
> Attachments: expand-component.patch, expand-npe.patch
>
>
> The {{ExpandComponent}} causes a {{NullPointerException}} if accidentally 
> used without prior collapsing of results.
> If there are multiple documents in the result which have the same term value 
> in the expand field, the size of the {{ordBytes}}/{{groupSet}} differs from 
> the {{count}} value, and the {{getGroupQuery}} method creates an incompletely 
> filled {{bytesRef}} array, which later causes a {{NullPointerException}} when 
> trying to sort the terms.
> The attached patch extends the test to demonstrate the error, and modifies 
> the {{getGroupQuery}} methods to create the array based on the size of the 
> input maps.






[jira] [Updated] (LUCENE-7848) QueryBuilder.analyzeGraphPhrase does not handle gaps correctly

2017-07-14 Thread Michael Gibney (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Gibney updated LUCENE-7848:
---
Attachment: LUCENE-7848-delimOnly-token-offset.patch

I think the remaining problem is that WordDelimiterGraphFilter is swallowing 
delim-only tokens and leaving a gap even when PRESERVE_ORIGINAL is true. 
[^LUCENE-7848-delimOnly-token-offset.patch] fixes this (and addresses the 
problematic gaps).
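
A small self-contained way to see the swallowed token (analyzer setup assumed; 
the point is that without the patch "--" emits nothing, and "b" arrives with a 
position increment of 2, i.e. a gap where the delimiter-only token was):

{code:java}
import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.miscellaneous.WordDelimiterGraphFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

public class DelimOnlyDemo {
  public static void main(String[] args) throws Exception {
    int flags = WordDelimiterGraphFilter.GENERATE_WORD_PARTS
        | WordDelimiterGraphFilter.PRESERVE_ORIGINAL;
    WhitespaceTokenizer tokenizer = new WhitespaceTokenizer();
    tokenizer.setReader(new StringReader("a -- b"));
    TokenStream ts = new WordDelimiterGraphFilter(tokenizer, flags, null);
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    PositionIncrementAttribute posIncr = ts.addAttribute(PositionIncrementAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      // Prints each surviving token and its position increment.
      System.out.println(term + " +" + posIncr.getPositionIncrement());
    }
    ts.end();
    ts.close();
  }
}
{code}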

> QueryBuilder.analyzeGraphPhrase does not handle gaps correctly
> --
>
> Key: LUCENE-7848
> URL: https://issues.apache.org/jira/browse/LUCENE-7848
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 6.5, 6.6
>Reporter: Jim Ferenczi
> Attachments: capture-3.png, LUCENE-7848-branching-spanOr.patch, 
> LUCENE-7848-delimOnly-token-offset.patch, LUCENE-7848.patch, LUCENE-7848.patch
>
>
> Position increments greater than 1 are ignored when the query builder creates 
> a graph phrase query. 
> Instead it should use SpanNearQuery.addGap for pos incr > 1.






[jira] [Comment Edited] (LUCENE-7848) QueryBuilder.analyzeGraphPhrase does not handle gaps correctly

2017-07-12 Thread Michael Gibney (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16083965#comment-16083965
 ] 

Michael Gibney edited comment on LUCENE-7848 at 7/12/17 5:36 PM:
-

"Could be a bug somewhere in span queries."^ -- I think the remaining problem 
here is that only one branch (the shortest) of a SpanOrQuery is evaluated, at 
which point the "spanOr" is designated a match (or not) of the 
width/positionEnd of the shortest branch. When the branches of a "spanOr" 
differ in length (as they will as a matter of course for uses of GraphFilters 
such as in the above test), the shorter branch is evaluated, but if a longer 
branch is also a match, it affects the offset of subsequent tokens, and the 
enclosing "spanNear" sees a larger-than-expected slop, and fails to match. 

[^LUCENE-7848-branching-spanOr.patch] adjusts SpanOrQuery to support repeated 
calls to nextStartPosition() which return the same startPosition, but different 
endPositions. The subSpan clauses of the "spanOr" are popped off the 
priorityQueue, retained, and restored upon exhaustion of subSpans (when it's 
time to move on to the next potential match). Some corresponding changes were 
necessary to make NearSpansOrdered aware of the new "spanOr" behavior, and 
conditionally evaluate as many branches of "spanOr" clauses as necessary to 
match (or not) on the full "nearSpan".

There may be other modifications needed in code that can call the modified 
"spanOr" and would need to be aware of its new behavior, but with this patch 
applied, all the tests in TestWordDelimiterGraphFilter pass (including the 
new testLucene7848()). 

EDIT: the original patch had a bug; it was re-uploaded a few hours after being 
initially posted.
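
For a concrete picture of the failure mode, here is the structure from 
LUCENE-7398's example (field name taken from that description) expressed with 
the span API; a sketch, not a test:

{code:java}
import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanOrQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

public class BranchingSpanOrDemo {
  public static void main(String[] args) {
    // spanOr over branches of different lengths: "gene" vs "gene mapping".
    SpanQuery shortBranch = new SpanTermQuery(new Term("body", "gene"));
    SpanQuery longBranch = new SpanNearQuery(new SpanQuery[] {
        new SpanTermQuery(new Term("body", "gene")),
        new SpanTermQuery(new Term("body", "mapping"))}, 0, true);
    SpanQuery or = new SpanOrQuery(shortBranch, longBranch);

    // Ordered zero-slop spanNear around the spanOr: if only the short
    // branch's endPosition is reported, "coordinate gene mapping research"
    // looks like it has slop 1 and fails to match.
    SpanQuery outer = new SpanNearQuery(new SpanQuery[] {
        new SpanTermQuery(new Term("body", "coordinate")),
        or,
        new SpanTermQuery(new Term("body", "research"))}, 0, true);
    System.out.println(outer);
  }
}
{code}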


was (Author: mgibney):
"Could be a bug somewhere in span queries."^ -- I think the remaining problem 
here is that only one branch (the shortest) of a SpanOrQuery is evaluated, at 
which point the "spanOr" is designated a match (or not) of the 
width/positionEnd of the shortest branch. When the branches of a "spanOr" 
differ in length (as they will as a matter of course for uses of GraphFilters 
such as in the above test), the shorter branch is evaluated, but if a longer 
branch is also a match, it affects the offset of subsequent tokens, and the 
enclosing "spanNear" sees a larger-than-expected slop, and fails to match. 

[^LUCENE-7848-branching-spanOr.patch] adjusts SpanOrQuery to support repeated 
calls to nextStartPosition() which return the same startPosition, but different 
endPositions. The subSpan clauses of the "spanOr" are popped off the 
priorityQueue, retained, and restored upon exhaustion of subSpans (when it's 
time to move on to the next potential match). Some corresponding changes were 
necessary to make NearSpansOrdered aware of the new "spanOr" behavior, and 
conditionally evaluate as many branches of "spanOr" clauses as necessary to 
match (or not) on the full "nearSpan".

There may be other modifications needed in code that can call the modified 
"spanOr" and would need to be aware of its new behavior, but with this patch 
applied, all the tests in the TestWordDelimiterGraphFilter pass (including the 
new testLucene7848()). 

> QueryBuilder.analyzeGraphPhrase does not handle gaps correctly
> --
>
> Key: LUCENE-7848
> URL: https://issues.apache.org/jira/browse/LUCENE-7848
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 6.5, 6.6
>Reporter: Jim Ferenczi
> Attachments: capture-3.png, LUCENE-7848-branching-spanOr.patch, 
> LUCENE-7848.patch, LUCENE-7848.patch
>
>
> Position increments greater than 1 are ignored when the query builder creates 
> a graph phrase query. 
> Instead it should use SpanNearQuery.addGap for pos incr > 1.






[jira] [Updated] (LUCENE-7848) QueryBuilder.analyzeGraphPhrase does not handle gaps correctly

2017-07-12 Thread Michael Gibney (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Gibney updated LUCENE-7848:
---
Attachment: (was: LUCENE-7848-branching-spanOr.patch)

> QueryBuilder.analyzeGraphPhrase does not handle gaps correctly
> --
>
> Key: LUCENE-7848
> URL: https://issues.apache.org/jira/browse/LUCENE-7848
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 6.5, 6.6
>Reporter: Jim Ferenczi
> Attachments: capture-3.png, LUCENE-7848-branching-spanOr.patch, 
> LUCENE-7848.patch, LUCENE-7848.patch
>
>
> Position increments greater than 1 are ignored when the query builder creates 
> a graph phrase query. 
> Instead it should use SpanNearQuery.addGap for pos incr > 1.






[jira] [Updated] (LUCENE-7848) QueryBuilder.analyzeGraphPhrase does not handle gaps correctly

2017-07-12 Thread Michael Gibney (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Gibney updated LUCENE-7848:
---
Attachment: LUCENE-7848-branching-spanOr.patch

> QueryBuilder.analyzeGraphPhrase does not handle gaps correctly
> --
>
> Key: LUCENE-7848
> URL: https://issues.apache.org/jira/browse/LUCENE-7848
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 6.5, 6.6
>Reporter: Jim Ferenczi
> Attachments: capture-3.png, LUCENE-7848-branching-spanOr.patch, 
> LUCENE-7848.patch, LUCENE-7848.patch
>
>
> Position increments greater than 1 are ignored when the query builder creates 
> a graph phrase query. 
> Instead it should use SpanNearQuery.addGap for pos incr > 1.





