date:20221221

[jira] [Commented] (SOLR-15859) Add handler to dump filter cache

2022-12-21 Thread Shawn Heisey (Jira)



[ 
https://issues.apache.org/jira/browse/SOLR-15859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17651088#comment-17651088
 ] 

Shawn Heisey commented on SOLR-15859:
-

{quote}I think it might be possible (and preferable?) to implement this as a 
custom {{SolrCache}} implementation that wraps {{{}solr.CaffeineCache>{}}}. I think [~ben.manes] was alluding to something like 
this "MetadataWrapper" approach in his comment above.
{quote}
I have no idea how to go from those statements to actual usable code.  And I 
don't want to ask you to write it for me, I'd like to do that myself.  But if 
you can come up with back-of-the-envelope pseudocode very quickly that I can 
flesh out into actual code, that would be appreciated.

If what you're describing would involve changes to the way that specific caches 
(like filterCache) get implemented, then I'm REALLY going to be out of my 
depth.  I once tried to look at that and got completely lost trying to follow 
the code.  In much the same way as what happened when I tried to understand 
SolrCloud cluster management with the overseer.

> Add handler to dump filter cache
> 
>
> Key: SOLR-15859
> URL: https://issues.apache.org/jira/browse/SOLR-15859
> Project: Solr
>  Issue Type: Improvement
>Reporter: Andy Lester
>Assignee: Shawn Heisey
>Priority: Major
>  Labels: FQ, cache, filtercache, metrics
> Attachments: cacheinfo-1.patch, cacheinfo-2.patch, cacheinfo.patch, 
> fix_92_startup.patch
>
>
> It would be very helpful to be able to inspect the contents of the 
> filterCache.
> I'd like to be able to query something like
> {{/admin/caches?type=filter&nentries=1000&sort=numHits+DESC}}
> nentries would be allowed to be -1 to get everything.
> It would be nice to see these data items for each entry. I don't know which 
> are available, but I'm thinking blue sky here:
>  * cache key, exactly as stored
>  * Timestamp when the entry was inserted
>  * Whether the insertion of the entry evicted another entry, and if so which 
> one
>  * Timestamp of when this entry was last hit
>  * Number of hits on this entry forever
>  * Number of hits on this entry over some time period
>  * Number of documents matched by the filter
>  * Number of bytes of memory used by the filter
> These are the sorts of questions I'd like to be able answer:
>  * "I just did a query that I expect will have added a cache entry. Did it?"
>  * "Are my queries hitting existing cache entries?"
>  * "How big should I set my filterCache size? Should I limit it by number of 
> entries or RAM usage?"
>  * "Which of my FQs are getting used the most? These are the ones I want in 
> my firstSearcher queries." (I currently determine this by processing my old 
> solr logs)
>  * "Which filters give me the most bang for the buck in terms of RAM usage?"
>  * "I have filter X and filter Y, but would it be beneficial if I made a 
> filter X AND Y?"
>  * "Which FQs are used more at certain times of the day? (Assuming I take 
> regular snapshots throughout the day)"
> I imagine a response might look like:
> {{{}}
> {{  "responseHeader": {}}
> {{    "status": 0,}}
> {{    "QTime": 961}}
> {{  },}}
> {{  "response": {}}
> {{    "numFound": 12104,}}
> {{    "filterCacheKeys": {}}
> {{      [}}
> {{        "language:eng": {}}
> {{          "inserted": "2021-12-04T07:34:16Z",}}
> {{          "lastHit": "2021-12-04T18:17:43Z",}}
> {{          "numHits": 15065,}}
> {{          "numHitsInPastHour": 2319,}}
> {{          "evictedKey": "agelevel:4 shippable:Y",}}
> {{          "numRecordsMatchedByFilter": 24328753,}}
> {{          "bytesUsed": 3041094}}
> {{        }}}
> {{      ],}}
> {{      [}}
> {{        "is_set:N": {}}
> {{          ...}}
> {{        }}}
> {{      ],}}
> {{      [}}
> {{        "language:spa": {}}
> {{          ...}}
> {{        }}}
> {{      ]}}
> {{    }}}
> {{}}}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For additional commands, e-mail: issues-h...@solr.apache.org

[GitHub] [solr] dsmiley commented on a diff in pull request #1215: DocRouter: strengthen abstraction

2022-12-21 Thread GitBox



dsmiley commented on code in PR #1215:
URL: https://github.com/apache/solr/pull/1215#discussion_r105508


##
solr/core/src/java/org/apache/solr/handler/admin/SplitOp.java:
##
@@ -263,8 +263,9 @@ private void handleGetRanges(CoreAdminHandler.CallInfo it, 
String coreName) thro
 DocCollection collection = clusterState.getCollection(collectionName);
 String sliceName = 
parentCore.getCoreDescriptor().getCloudDescriptor().getShardId();
 Slice slice = collection.getSlice(sliceName);
-DocRouter router =
-collection.getRouter() != null ? collection.getRouter() : 
DocRouter.DEFAULT;
+CompositeIdRouter router =

Review Comment:
   Correcting myself.  This method, `handleGetRanges`, is only called for 
splitByPrefix (as its javadocs say), which is what depends on CompositeIdRouter.
   
   I suppose if there was some weird/unexpected code path (maybe in the 
future), a ClassCastException wouldn't be particularly unfriendly?



##
solr/core/src/java/org/apache/solr/cloud/api/collections/MigrateCmd.java:
##
@@ -253,7 +252,7 @@ private void migrateKey(
 SHARD_ID_PROP,
 sourceSlice.getName(),
 "routeKey",
-SolrIndexSplitter.getRouteKey(splitKey) + "!",
+sourceRouter.getRouteKeyNoSuffix(splitKey) + "!",

Review Comment:
   Line 108 checks for CompositeIdRouter and throws a friendly exception if it 
isn't.



##
solr/core/src/java/org/apache/solr/update/SolrIndexSplitter.java:
##
@@ -765,18 +766,11 @@ static FixedBitSet[] split(
 return docSets;
   }
 
-  public static String getRouteKey(String idString) {
-int idx = idString.indexOf(CompositeIdRouter.SEPARATOR);
-if (idx <= 0) return null;
-String part1 = idString.substring(0, idx);
-int commaIdx = part1.indexOf(CompositeIdRouter.bitsSeparator);
-if (commaIdx > 0 && commaIdx + 1 < part1.length()) {
-  char ch = part1.charAt(commaIdx + 1);
-  if (ch >= '0' && ch <= '9') {
-part1 = part1.substring(0, commaIdx);
-  }
+  private static void checkRouterSupportsSplitKey(HashBasedRouter hashRouter, 
String splitKey) {

Review Comment:
   SolrIndexSplitterTest tests the "plain" (hash) router.  The test continues 
to pass.  It's only the "splitKey" feature of shard splitting that requires 
CompositeIdRouter.  The exception message tries to clarify that the 
expectation/requirement is tied to splitKey.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For additional commands, e-mail: issues-h...@solr.apache.org

[GitHub] [solr] noblepaul commented on pull request #1242: SOLR-16580: Avoid making copies of DocCollection for PRS updates

2022-12-21 Thread GitBox



noblepaul commented on PR #1242:
URL: https://github.com/apache/solr/pull/1242#issuecomment-1362371622

   >Avoid making copies of DocCollections when copyWith is called (related to 
PRS updates)
   
   Yes
   
   > Avoid fetching PRS states until the state is actually queried (the Lazy 
PRS provider part)
   
   This behavior is not changed. But what is changed is all states are 
**always** queried just in time. Prior to this , the states were copied into 
the object when it was constructed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For additional commands, e-mail: issues-h...@solr.apache.org

[GitHub] [solr] noblepaul commented on a diff in pull request #1242: SOLR-16580: Avoid making copies of DocCollection for PRS updates

2022-12-21 Thread GitBox



noblepaul commented on code in PR #1242:
URL: https://github.com/apache/solr/pull/1242#discussion_r1055057270


##
solr/solrj/src/java/org/apache/solr/common/cloud/DocCollection.java:
##
@@ -139,30 +138,10 @@ public static String getCollectionPathRoot(String coll) {
* only a replica is updated
*/
   public DocCollection copyWith(PerReplicaStates newPerReplicaStates) {

Review Comment:
   when per-replica states change the slices remain the exactly the same



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For additional commands, e-mail: issues-h...@solr.apache.org

[GitHub] [solr] patsonluk commented on a diff in pull request #1242: SOLR-16580: Avoid making copies of DocCollection for PRS updates

2022-12-21 Thread GitBox



patsonluk commented on code in PR #1242:
URL: https://github.com/apache/solr/pull/1242#discussion_r1054994623


##
solr/solrj/src/java/org/apache/solr/common/cloud/DocCollection.java:
##
@@ -139,30 +138,10 @@ public static String getCollectionPathRoot(String coll) {
* only a replica is updated
*/
   public DocCollection copyWith(PerReplicaStates newPerReplicaStates) {

Review Comment:
   Based on the old code, the modified replicas would be read from the 
newPerReplicaStates provided to construct a list as `modifiedShards` (which 
each Slice  value instead contains a map of replicas, which info relies on the 
input `newPerReplicaStates`).
   
   Since we are not constructing a new DocCollection here and instead returning 
the same `this` instance, how can we ensure the such DocCollection instance 
getSlice calls (`getSlices`, `getSliceMap` etc) returns the updated 
slice/replica info from `newPerReplicaStates`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For additional commands, e-mail: issues-h...@solr.apache.org

[GitHub] [solr] patsonluk commented on a diff in pull request #1242: SOLR-16580: Avoid making copies of DocCollection for PRS updates

2022-12-21 Thread GitBox



patsonluk commented on code in PR #1242:
URL: https://github.com/apache/solr/pull/1242#discussion_r1054994623


##
solr/solrj/src/java/org/apache/solr/common/cloud/DocCollection.java:
##
@@ -139,30 +138,10 @@ public static String getCollectionPathRoot(String coll) {
* only a replica is updated
*/
   public DocCollection copyWith(PerReplicaStates newPerReplicaStates) {

Review Comment:
   Based on the old code, the modified replicas would be read from the 
newPerReplicaStates provided to construct a list as `modifiedShards` (which 
each Slice  value instead contains a map of replicas, which info relies on the 
input `newPerReplicaStates`).
   
   Since we are not constructing a new DocCollection here and instead returning 
the same `this` instance, how can we ensure the such DocCollection instance 
getSlice calls return the updated replica info from `newPerReplicaStates`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For additional commands, e-mail: issues-h...@solr.apache.org

[GitHub] [solr] noblepaul commented on pull request #1215: DocRouter: strengthen abstraction

2022-12-21 Thread GitBox



noblepaul commented on PR #1215:
URL: https://github.com/apache/solr/pull/1215#issuecomment-1362270740

   ```To me this PR does not add a new type of DocRouter, as it is centered 
around CompositeIdRouter```
   
   I was confused by the original description.  thought this was trying to 
introduce a new `DocRouter` by enhancing `CompositeIdRouter`
   
   This PR is about  moving all the logic of routing /splitting into  
`CompositeIdRouter`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For additional commands, e-mail: issues-h...@solr.apache.org

[GitHub] [solr] noblepaul commented on a diff in pull request #1215: DocRouter: strengthen abstraction

2022-12-21 Thread GitBox



noblepaul commented on code in PR #1215:
URL: https://github.com/apache/solr/pull/1215#discussion_r1054986541


##
solr/core/src/java/org/apache/solr/update/SolrIndexSplitter.java:
##
@@ -765,18 +766,11 @@ static FixedBitSet[] split(
 return docSets;
   }
 
-  public static String getRouteKey(String idString) {
-int idx = idString.indexOf(CompositeIdRouter.SEPARATOR);
-if (idx <= 0) return null;
-String part1 = idString.substring(0, idx);
-int commaIdx = part1.indexOf(CompositeIdRouter.bitsSeparator);
-if (commaIdx > 0 && commaIdx + 1 < part1.length()) {
-  char ch = part1.charAt(commaIdx + 1);
-  if (ch >= '0' && ch <= '9') {
-part1 = part1.substring(0, commaIdx);
-  }
+  private static void checkRouterSupportsSplitKey(HashBasedRouter hashRouter, 
String splitKey) {

Review Comment:
   the other one is PLAIN router. I don't think Split is possible for that



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For additional commands, e-mail: issues-h...@solr.apache.org

[GitHub] [solr] noblepaul commented on a diff in pull request #1242: SOLR-16580: Avoid making copies of DocCollection for PRS updates

2022-12-21 Thread GitBox



noblepaul commented on code in PR #1242:
URL: https://github.com/apache/solr/pull/1242#discussion_r1054985346


##
solr/solrj/src/java/org/apache/solr/common/cloud/ClusterState.java:
##
@@ -261,9 +259,10 @@ private static DocCollection collectionFromObjects(
   if (log.isDebugEnabled()) {
 log.debug("a collection {} has per-replica state", name);
   }
-  // this collection has replica states stored outside
-  ReplicaStatesProvider rsp = REPLICASTATES_PROVIDER.get();
-  if (rsp instanceof StatesProvider) ((StatesProvider) 
rsp).isPerReplicaState = true;
+} else {
+  // prior to this call, PRS provider is set. We should unset it before
+  // deserializing the replicas and slices
+  DocCollection.clearReplicaStateProvider();

Review Comment:
   I'm aware of this problem . The ideal solution would be to pass on the 
`PrsSupplier` with the constructor. I'm trying to do that



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For additional commands, e-mail: issues-h...@solr.apache.org

[GitHub] [solr] noblepaul commented on a diff in pull request #1242: SOLR-16580: Avoid making copies of DocCollection for PRS updates

2022-12-21 Thread GitBox



noblepaul commented on code in PR #1242:
URL: https://github.com/apache/solr/pull/1242#discussion_r1054984613


##
solr/solrj/src/java/org/apache/solr/common/cloud/DocCollection.java:
##
@@ -139,30 +138,10 @@ public static String getCollectionPathRoot(String coll) {
* only a replica is updated
*/
   public DocCollection copyWith(PerReplicaStates newPerReplicaStates) {

Review Comment:
   Wy would slices change when PRS change?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For additional commands, e-mail: issues-h...@solr.apache.org

[GitHub] [solr-operator] HoustonPutman opened a new pull request, #509: Fix non-recurring backups

2022-12-21 Thread GitBox



HoustonPutman opened a new pull request, #509:
URL: https://github.com/apache/solr-operator/pull/509

   https://github.com/apache/solr-operator/pull/455 introduced a bug for 
non-recurring backups that was unearthed while working on 
https://github.com/apache/solr-operator/pull/507
   
   Basically we need to only update the `NextScheduledTimestamp` if 
`recurrence` is enabled.
   
   I also restructured the logic to hopefully make it more clear when backup 
logic should be run.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For additional commands, e-mail: issues-h...@solr.apache.org

[jira] [Commented] (SOLR-15859) Add handler to dump filter cache

2022-12-21 Thread Shawn Heisey (Jira)



[ 
https://issues.apache.org/jira/browse/SOLR-15859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17650996#comment-17650996
 ] 

Shawn Heisey commented on SOLR-15859:
-

[~magibney] I'm not going to rule anything out that hasn't been carefully 
considered.  I fully admit that I am playing in a sandbox that has more 
complexity than I am used to thinking about, so I doubt I can actually get this 
done without some collaboration.

Anything that reduces or eliminates the amount of synchronization that I have 
to worry about, especially if it actually makes the code simpler, is very 
welcome.  I never feel confident about code for a threaded environment where I 
don't put some thought into thread safety issues, so I think I have a tendency 
to overthink it.

I don't really mind if the cache dumper knows at least a little bit about the 
internals it is dealing with, but the more that can be abstracted, the better.  
I'm hoping to get to a point where it only knows about SolrCacheBase and 
doesn't care about CaffeineCache.  But obviously some work will be required in 
the cache implementation to make that abstraction available.

I expect where the dumper will be most connected to other Solr/Lucene internals 
is knowing how to dump each specific cache -- filterCache is very different 
than queryResultCache.

What I envision with the dumper is initially making it an experimental feature. 
 I think it might be useful to have a section of the ref guide dedicated to 
experimental features, where the API and internals of the feature may radically 
change from release to release if a better approach is found.  Maybe treat it a 
little bit like the ASF does when incubating new projects.

> Add handler to dump filter cache
> 
>
> Key: SOLR-15859
> URL: https://issues.apache.org/jira/browse/SOLR-15859
> Project: Solr
>  Issue Type: Improvement
>Reporter: Andy Lester
>Assignee: Shawn Heisey
>Priority: Major
>  Labels: FQ, cache, filtercache, metrics
> Attachments: cacheinfo-1.patch, cacheinfo-2.patch, cacheinfo.patch, 
> fix_92_startup.patch
>
>
> It would be very helpful to be able to inspect the contents of the 
> filterCache.
> I'd like to be able to query something like
> {{/admin/caches?type=filter&nentries=1000&sort=numHits+DESC}}
> nentries would be allowed to be -1 to get everything.
> It would be nice to see these data items for each entry. I don't know which 
> are available, but I'm thinking blue sky here:
>  * cache key, exactly as stored
>  * Timestamp when the entry was inserted
>  * Whether the insertion of the entry evicted another entry, and if so which 
> one
>  * Timestamp of when this entry was last hit
>  * Number of hits on this entry forever
>  * Number of hits on this entry over some time period
>  * Number of documents matched by the filter
>  * Number of bytes of memory used by the filter
> These are the sorts of questions I'd like to be able answer:
>  * "I just did a query that I expect will have added a cache entry. Did it?"
>  * "Are my queries hitting existing cache entries?"
>  * "How big should I set my filterCache size? Should I limit it by number of 
> entries or RAM usage?"
>  * "Which of my FQs are getting used the most? These are the ones I want in 
> my firstSearcher queries." (I currently determine this by processing my old 
> solr logs)
>  * "Which filters give me the most bang for the buck in terms of RAM usage?"
>  * "I have filter X and filter Y, but would it be beneficial if I made a 
> filter X AND Y?"
>  * "Which FQs are used more at certain times of the day? (Assuming I take 
> regular snapshots throughout the day)"
> I imagine a response might look like:
> {{{}}
> {{  "responseHeader": {}}
> {{    "status": 0,}}
> {{    "QTime": 961}}
> {{  },}}
> {{  "response": {}}
> {{    "numFound": 12104,}}
> {{    "filterCacheKeys": {}}
> {{      [}}
> {{        "language:eng": {}}
> {{          "inserted": "2021-12-04T07:34:16Z",}}
> {{          "lastHit": "2021-12-04T18:17:43Z",}}
> {{          "numHits": 15065,}}
> {{          "numHitsInPastHour": 2319,}}
> {{          "evictedKey": "agelevel:4 shippable:Y",}}
> {{          "numRecordsMatchedByFilter": 24328753,}}
> {{          "bytesUsed": 3041094}}
> {{        }}}
> {{      ],}}
> {{      [}}
> {{        "is_set:N": {}}
> {{          ...}}
> {{        }}}
> {{      ],}}
> {{      [}}
> {{        "language:spa": {}}
> {{          ...}}
> {{        }}}
> {{      ]}}
> {{    }}}
> {{}}}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For additional commands, e-mail: issues-h...@solr.apache.org

[jira] [Commented] (SOLR-15859) Add handler to dump filter cache

2022-12-21 Thread Michael Gibney (Jira)



[ 
https://issues.apache.org/jira/browse/SOLR-15859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17650990#comment-17650990
 ] 

Michael Gibney commented on SOLR-15859:
---

I think it might be possible (and preferable?) to implement this as a custom 
{{SolrCache}} implementation that wraps {{solr.CaffeineCache>}}. I think [~ben.manes] was alluding to something like this 
"MetadataWrapper" approach in his [comment above|#comment-17633401].

I've actually done something similar, and it can work quite well. It can be a 
bit tricky, but I think the "per-entry stats" part would be pretty 
straightforward done this way, and I really like the idea of implementing this 
functionality without modifying the hot path of what's currently the 
default/only cache implementation bundled with Solr. I think the only necessary 
modification to the existing {{solr.CaffeineCache}} class would be to provide a 
hook to actually dump the values, e.g., add them to a provided map, or 
something (so as not to actually expose the internals)?

I do think the functionality you're pursuing with this could be useful. One 
benefit of implementing as I'm suggesting above, I think this functionality 
would be almost entirely pluggable (as in, plugins) -- aside from some 
interface for actually dumping a snapshot of the contents of the cache, which I 
suspect would indeed need a public method added to {{solr.CaffeineCache}}.

I would definitely recommend avoiding top-level {{synchronized (cache)}} -- and 
I don't think that would be necessary if pursuing the "wrapping" approach.

Maybe a more tightly-scoped change that ignores for now the request handler and 
stats tracking, and instead focuses on figuring out a clean (if perhaps 
experimental?) method/interface for dumping the contents of 
{{solr.CaffeineCache}}? I suspect that would be easier to merge with 
confidence, and would open the door to iterate on different ways of achieving 
some of the more nuanced functionality.

> Add handler to dump filter cache
> 
>
> Key: SOLR-15859
> URL: https://issues.apache.org/jira/browse/SOLR-15859
> Project: Solr
>  Issue Type: Improvement
>Reporter: Andy Lester
>Assignee: Shawn Heisey
>Priority: Major
>  Labels: FQ, cache, filtercache, metrics
> Attachments: cacheinfo-1.patch, cacheinfo-2.patch, cacheinfo.patch, 
> fix_92_startup.patch
>
>
> It would be very helpful to be able to inspect the contents of the 
> filterCache.
> I'd like to be able to query something like
> {{/admin/caches?type=filter&nentries=1000&sort=numHits+DESC}}
> nentries would be allowed to be -1 to get everything.
> It would be nice to see these data items for each entry. I don't know which 
> are available, but I'm thinking blue sky here:
>  * cache key, exactly as stored
>  * Timestamp when the entry was inserted
>  * Whether the insertion of the entry evicted another entry, and if so which 
> one
>  * Timestamp of when this entry was last hit
>  * Number of hits on this entry forever
>  * Number of hits on this entry over some time period
>  * Number of documents matched by the filter
>  * Number of bytes of memory used by the filter
> These are the sorts of questions I'd like to be able answer:
>  * "I just did a query that I expect will have added a cache entry. Did it?"
>  * "Are my queries hitting existing cache entries?"
>  * "How big should I set my filterCache size? Should I limit it by number of 
> entries or RAM usage?"
>  * "Which of my FQs are getting used the most? These are the ones I want in 
> my firstSearcher queries." (I currently determine this by processing my old 
> solr logs)
>  * "Which filters give me the most bang for the buck in terms of RAM usage?"
>  * "I have filter X and filter Y, but would it be beneficial if I made a 
> filter X AND Y?"
>  * "Which FQs are used more at certain times of the day? (Assuming I take 
> regular snapshots throughout the day)"
> I imagine a response might look like:
> {{{}}
> {{  "responseHeader": {}}
> {{    "status": 0,}}
> {{    "QTime": 961}}
> {{  },}}
> {{  "response": {}}
> {{    "numFound": 12104,}}
> {{    "filterCacheKeys": {}}
> {{      [}}
> {{        "language:eng": {}}
> {{          "inserted": "2021-12-04T07:34:16Z",}}
> {{          "lastHit": "2021-12-04T18:17:43Z",}}
> {{          "numHits": 15065,}}
> {{          "numHitsInPastHour": 2319,}}
> {{          "evictedKey": "agelevel:4 shippable:Y",}}
> {{          "numRecordsMatchedByFilter": 24328753,}}
> {{          "bytesUsed": 3041094}}
> {{        }}}
> {{      ],}}
> {{      [}}
> {{        "is_set:N": {}}
> {{          ...}}
> {{        }}}
> {{      ],}}
> {{      [}}
> {{        "language:spa": {}}
> {{          ...}}
> {{        }}}
> {{      ]}}
> {{    }}}
> {{}}}



--
This message was sent by Atlassian Jira
(v8.20.10#820

[jira] [Commented] (SOLR-15859) Add handler to dump filter cache

2022-12-21 Thread Shawn Heisey (Jira)



[ 
https://issues.apache.org/jira/browse/SOLR-15859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17650984#comment-17650984
 ] 

Shawn Heisey commented on SOLR-15859:
-

[~ben.manes] Thank you for all the patience you have shown.  I don't know if 
you remember, but you once helped me with a non-Solr question I had, where you 
pointed me at the Striped class in Guava ... which is exactly what I needed and 
extremely cool.

https://github.com/google/guava/wiki/StripedExplained

Do you think you could help me with the most efficient way to implement the 
extra stats in CaffeineCache that I need to make this cache dumper give really 
useful info, so I can be sure it uses the least memory possible and is 
completely bulletproof?  I'm willing to put in the work writing it, I just need 
a little bit of a nudge to find the right way to go about it.

> Add handler to dump filter cache
> 
>
> Key: SOLR-15859
> URL: https://issues.apache.org/jira/browse/SOLR-15859
> Project: Solr
>  Issue Type: Improvement
>Reporter: Andy Lester
>Assignee: Shawn Heisey
>Priority: Major
>  Labels: FQ, cache, filtercache, metrics
> Attachments: cacheinfo-1.patch, cacheinfo-2.patch, cacheinfo.patch, 
> fix_92_startup.patch
>
>
> It would be very helpful to be able to inspect the contents of the 
> filterCache.
> I'd like to be able to query something like
> {{/admin/caches?type=filter&nentries=1000&sort=numHits+DESC}}
> nentries would be allowed to be -1 to get everything.
> It would be nice to see these data items for each entry. I don't know which 
> are available, but I'm thinking blue sky here:
>  * cache key, exactly as stored
>  * Timestamp when the entry was inserted
>  * Whether the insertion of the entry evicted another entry, and if so which 
> one
>  * Timestamp of when this entry was last hit
>  * Number of hits on this entry forever
>  * Number of hits on this entry over some time period
>  * Number of documents matched by the filter
>  * Number of bytes of memory used by the filter
> These are the sorts of questions I'd like to be able answer:
>  * "I just did a query that I expect will have added a cache entry. Did it?"
>  * "Are my queries hitting existing cache entries?"
>  * "How big should I set my filterCache size? Should I limit it by number of 
> entries or RAM usage?"
>  * "Which of my FQs are getting used the most? These are the ones I want in 
> my firstSearcher queries." (I currently determine this by processing my old 
> solr logs)
>  * "Which filters give me the most bang for the buck in terms of RAM usage?"
>  * "I have filter X and filter Y, but would it be beneficial if I made a 
> filter X AND Y?"
>  * "Which FQs are used more at certain times of the day? (Assuming I take 
> regular snapshots throughout the day)"
> I imagine a response might look like:
> {{{}}
> {{  "responseHeader": {}}
> {{    "status": 0,}}
> {{    "QTime": 961}}
> {{  },}}
> {{  "response": {}}
> {{    "numFound": 12104,}}
> {{    "filterCacheKeys": {}}
> {{      [}}
> {{        "language:eng": {}}
> {{          "inserted": "2021-12-04T07:34:16Z",}}
> {{          "lastHit": "2021-12-04T18:17:43Z",}}
> {{          "numHits": 15065,}}
> {{          "numHitsInPastHour": 2319,}}
> {{          "evictedKey": "agelevel:4 shippable:Y",}}
> {{          "numRecordsMatchedByFilter": 24328753,}}
> {{          "bytesUsed": 3041094}}
> {{        }}}
> {{      ],}}
> {{      [}}
> {{        "is_set:N": {}}
> {{          ...}}
> {{        }}}
> {{      ],}}
> {{      [}}
> {{        "language:spa": {}}
> {{          ...}}
> {{        }}}
> {{      ]}}
> {{    }}}
> {{}}}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For additional commands, e-mail: issues-h...@solr.apache.org

[GitHub] [solr] alessandrobenedetti commented on pull request #1245: SOLR-16567: KnnQueryParser support for both pre-filters and post-filter

2022-12-21 Thread GitBox



alessandrobenedetti commented on PR #1245:
URL: https://github.com/apache/solr/pull/1245#issuecomment-1362008815

   here a rough pull request just to give you the idea @dsmiley :
   https://github.com/apache/solr/pull/1246/files


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For additional commands, e-mail: issues-h...@solr.apache.org

[GitHub] [solr] alessandrobenedetti opened a new pull request, #1246: Jira/solr 16567 tentative

2022-12-21 Thread GitBox



alessandrobenedetti opened a new pull request, #1246:
URL: https://github.com/apache/solr/pull/1246

   just for @dsmiley to have the idea of what I meant


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For additional commands, e-mail: issues-h...@solr.apache.org

[GitHub] [solr] alessandrobenedetti commented on pull request #1245: SOLR-16567: KnnQueryParser support for both pre-filters and post-filter

2022-12-21 Thread GitBox



alessandrobenedetti commented on PR #1245:
URL: https://github.com/apache/solr/pull/1245#issuecomment-1361889301

   Ok, I'll produce another branch with the example code assuming Lucene 
changes are there.
   
   So what I am trying to accomplish:
   1) Knn Query has a Query filter as a constructor parameter (and instance 
variable). This filter is meant to be a pre-filter(in Approximate Nearest 
Neighbour search means a filter that is executed before the Top K nearest 
neighbors are returned).
   It is used internally in the approximate nearest neighbor search Lucene code 
to only accept certain neighbors from the graph(along with the alive bitSet)
   2) In Apache Solr we need to make sure that all filter queries except 
explicit post-filters are processed and set in the Knn Query.
   3) But parsing happens before the Searcher will process the filters. So if 
are able to modify the Lucene KnnQuery in 
org.apache.solr.search.QueryUtils#combineQueryAndFilter (or create a new one), 
we are done.
   
   Right now we process the filters at parsing time (ad do it again in the 
Searcher).
   Potentially we could process and remove the filters from the request in the 
query parser, but it seems nasty to me, hence my idea of modifying the 
combineQueryAndFilter.
   Because that method has the responsibility of building a new Query, 
combining the main Query and all filters(except post-filters).
   So it seems the perfect place for implementing the custom logic for the 
KnnQuery, that behaves differently when you combine it with filters.
   
   Hope this helps with context @dsmiley !
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For additional commands, e-mail: issues-h...@solr.apache.org

[GitHub] [solr] dsmiley commented on pull request #1245: SOLR-16567: KnnQueryParser support for both pre-filters and post-filter

2022-12-21 Thread GitBox



dsmiley commented on PR #1245:
URL: https://github.com/apache/solr/pull/1245#issuecomment-1361863063

   A special case nearly anywhere (except directly in KNN oriented code of 
course) is a design/maintenance issue.  Some special cases like MatchAllDocs 
are understandable but a check for KNN in QueryUtils... eh... :-/Maybe you 
could show in a new PR what this would look like so I could see.  Perhaps when 
I understand better what you are trying to accomplish, I'll see a better 
solution.
   
   > (me:) Are you trying to basically move certain FQs out of their top level 
position and into/embedded in a particular parsed query?
   
   Could you respond to that please?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For additional commands, e-mail: issues-h...@solr.apache.org

[GitHub] [solr] alessandrobenedetti commented on pull request #1245: SOLR-16567: KnnQueryParser support for both pre-filters and post-filter

2022-12-21 Thread GitBox



alessandrobenedetti commented on PR #1245:
URL: https://github.com/apache/solr/pull/1245#issuecomment-1361848344

   And yes, in the workaround I can use the getProcessedFilters and I will, but 
once we have the Lucene side it will go away.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For additional commands, e-mail: issues-h...@solr.apache.org

[GitHub] [solr] alessandrobenedetti commented on pull request #1245: SOLR-16567: KnnQueryParser support for both pre-filters and post-filter

2022-12-21 Thread GitBox



alessandrobenedetti commented on PR #1245:
URL: https://github.com/apache/solr/pull/1245#issuecomment-1361847032

   Hi @dsmiley ,
   once we have the Lucene changes, the idea is to change this method:
   org.apache.solr.search.QueryUtils#combineQueryAndFilter
   with an additional if clause:
   `else {
   return new BooleanQuery.Builder()
   .add(scoreQuery, Occur.MUST)
   .add(filterQuery, Occur.FILTER)
   .build();
 }`
 
 will become:
 `else if{scoreQuery instanceof KnnVectorQuery}{
 return new KnnVectorQuery(scoreQuery,filterQuery);} 
 else{
   return new BooleanQuery.Builder()
   .add(scoreQuery, Occur.MUST)
   .add(filterQuery, Occur.FILTER)
   .build();
 }`
   
   Just to give an idea, the final code will look different as we will have to 
create a new instance of KnnVectorQuery using the getters of the old one.
   With this change, we will be able to simplify the KnnQueryParser removing 
all the stuff for pre-filters and post-filters.
   GetProcessedFilters will be called just once as usual and we'll get the 
benefit of caching and post-0filter separation automatically.
   SolrIndexSearcher won't be touched at all.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For additional commands, e-mail: issues-h...@solr.apache.org

[jira] [Commented] (SOLR-16567) java.lang.StackOverflowError when combining KnnQParser and FunctionRangeQParser

2022-12-21 Thread Alessandro Benedetti (Jira)



[ 
https://issues.apache.org/jira/browse/SOLR-16567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17650966#comment-17650966
 ] 

Alessandro Benedetti commented on SOLR-16567:
-

Just a mistake from intellj! Removed the branch from the upstream repo! 


> java.lang.StackOverflowError when combining KnnQParser and 
> FunctionRangeQParser
> ---
>
> Key: SOLR-16567
> URL: https://issues.apache.org/jira/browse/SOLR-16567
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: query
>Affects Versions: 9.1
> Environment: Solr Cloud with `solr:9.1` Docker image
>Reporter: Gabriel Magno
>Priority: Major
> Attachments: create_example-solr_9_0.sh, create_example-solr_9_1.sh, 
> error_full.txt, response-error.json, run_query.sh
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Hello there!
> I had a Solr 9.0 cluster running, using the new Dense Vector feature. 
> Recently I have migrated to Solr 9.1. Most of the things are working fine, 
> except for a special case I have here.
> *Error Description*
> The problem happens when I try making an Edismax query with a KNN sub-query 
> and a Function Range filter. For example, I try making this query.
>  * defType=edismax
>  * df=name
>  * q=the
>  * similarity_vector=\{!knn f=vector topK=10}[1.1,2.2,3.3,4.4]
>  * {!frange l=0.99}$similarity_vector
> In other words, I want all the documents matching the term "the" in the 
> "name" field, and I filter to return only documents having a vector 
> similarity of at least 0.99. This query was working fine on Solr 9.0, but on 
> Solr 9.1, I get his error:
>  
> {code:java}
> java.lang.RuntimeException: java.lang.StackOverflowErrorat 
> org.apache.solr.servlet.HttpSolrCall.sendError(HttpSolrCall.java:840)at 
> org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:641)at 
> org.apache.solr.servlet.SolrDispatchFilter.dispatch(SolrDispatchFilter.java:250)
> at 
> org.apache.solr.servlet.SolrDispatchFilter.lambda/usr/bin/zsh(SolrDispatchFilter.java:218)
> at 
> org.apache.solr.servlet.ServletUtils.traceHttpRequestExecution2(ServletUtils.java:257)
> at 
> org.apache.solr.servlet.ServletUtils.rateLimitRequest(ServletUtils.java:227)  
>   at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:213)
> at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:195)
> at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:201) 
>... (manually supressed for brevity)at 
> java.base/java.lang.Thread.run(Unknown Source)Caused by: 
> java.lang.StackOverflowErrorat 
> org.apache.solr.search.StrParser.getId(StrParser.java:172)at 
> org.apache.solr.search.StrParser.getId(StrParser.java:168)at 
> org.apache.solr.search.QueryParsing.parseLocalParams(QueryParsing.java:100)   
>  at 
> org.apache.solr.search.QueryParsing.parseLocalParams(QueryParsing.java:65)
> at org.apache.solr.search.QParser.getParser(QParser.java:364)at 
> org.apache.solr.search.QParser.getParser(QParser.java:334)at 
> org.apache.solr.search.QParser.getParser(QParser.java:321)at 
> org.apache.solr.search.QueryUtils.parseFilterQueries(QueryUtils.java:244)
> at 
> org.apache.solr.search.neural.KnnQParser.getFilterQuery(KnnQParser.java:93)   
>  at org.apache.solr.search.neural.KnnQParser.parse(KnnQParser.java:83)at 
> org.apache.solr.search.QParser.getQuery(QParser.java:188)at 
> org.apache.solr.search.FunctionQParser.parseValueSource(FunctionQParser.java:384)
> at org.apache.solr.search.FunctionQParser.parse(FunctionQParser.java:94)  
>   at org.apache.solr.search.QParser.getQuery(QParser.java:188)at 
> org.apache.solr.search.FunctionRangeQParserPlugin.parse(FunctionRangeQParserPlugin.java:53)
> at org.apache.solr.search.QParser.getQuery(QParser.java:188)at 
> org.apache.solr.search.QueryUtils.parseFilterQueries(QueryUtils.java:246)
> at 
> org.apache.solr.search.neural.KnnQParser.getFilterQuery(KnnQParser.java:93)   
>  at org.apache.solr.search.neural.KnnQParser.parse(KnnQParser.java:83)at 
> org.apache.solr.search.QParser.getQuery(QParser.java:188)at 
> org.apache.solr.search.FunctionQParser.parseValueSource(FunctionQParser.java:384)
> at org.apache.solr.search.FunctionQParser.parse(FunctionQParser.java:94)  
>   at org.apache.solr.search.QParser.getQuery(QParser.java:188)at 
> org.apache.solr.search.FunctionRangeQParserPlugin.parse(FunctionRangeQParserPlugin.java:53)
> at org.apache.solr.search.QParser.getQuery(QParser.java:188)... 
> (manually supressed for brevity){code}
>  
> The backtrace is much bigger, I'm attaching the raw Solr response in JS

[jira] [Commented] (SOLR-16594) eDismax should use startOffset when converting per-field to per-term queries

2022-12-21 Thread Rudi Seitz (Jira)



[ 
https://issues.apache.org/jira/browse/SOLR-16594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17650967#comment-17650967
 ] 

Rudi Seitz commented on SOLR-16594:
---

This is a rough outline of the code changes that might be needed to implement 
the proposal in this ticket:
 # Create a subclass of org.apache.lucene.index.Term that is capable of holding 
a startOffset. Possibly name it TermWithOffset
 # Update or subclass org.apache.lucene.util.QueryBuilder so that so that 
createFieldQuery() returns a Query that contains one or more TermWithOffset 
instead of simple Terms, where appropriate. This is the place where we iterate 
through the token stream and have access to the offsets to potentially store 
them on the generated Terms.
 # Update org.apache.solr.search.ExtendedDismaxQParser so that 
getAliasedMultiTermQuery() builds clauses based on startOffset instead of the 
current approach of calling allSameQueryStructure() and then doing 
"{color:#808080}Make a dismax query for each clause position in the boolean 
per-field queries"{color}

> eDismax should use startOffset when converting per-field to per-term queries
> 
>
> Key: SOLR-16594
> URL: https://issues.apache.org/jira/browse/SOLR-16594
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: query parsers
>Reporter: Rudi Seitz
>Priority: Major
>
> When parsing a multi-term query that spans multiple fields, edismax sometimes 
> switches from a "term-centric" to a "field-centric" approach. This creates 
> inconsistent semantics for the {{mm}} or "min should match" parameter and may 
> have an impact on scoring. The goal of this ticket is to improve the approach 
> that edismax uses for generating term-centric queries so that edismax would 
> less frequently "give up" and resort to the field-centric approach. 
> Specifically, we propose that edismax should create a dismax query for each 
> distinct startOffset found among the tokens emitted by the field analyzers. 
> Since the relevant code in edismax works with Query objects that contain 
> Terms, and since Terms do not hold the startOffset of the Token from which 
> Term was derived, some plumbing work would need to be done to make the 
> startOffsets available to edismax.
>  
> BACKGROUND:
>  
> If a user searches for "foo bar" with {{{}qf=f1 f2{}}}, a field-centric 
> interpretation of the query would contain a clause for each field:
> {{  (f1:foo f1:bar) (f2:foo f2:bar)}}
> while a term-centric interpretation would contain a clause for each term:
> {{  (f1:foo f2:foo) (f1:bar f2:bar)}}
> The challenge in generating a term-centric query is that we need to take the 
> tokens that emerge from each field's analysis chain and group them according 
> to the terms in the user's original query. However, the tokens that emerge 
> from an analysis chain do not store a reference to their corresponding input 
> terms. For example, if we pass "foo bar" through an ngram analyzer we would 
> get a token stream containing "f", "fo", "foo", "b", "ba", "bar". While it 
> may be obvious to a human that "f", "fo", and "foo" all come from the "foo" 
> input term, and that "b", "ba", and "bar" come from the "bar" input term, 
> there is not always an easy way for edismax to see this connection. When 
> {{{}sow=true{}}}, edismax passes each whitespace-separated term through each 
> analysis chain separately, and therefore edismax "knows" that the output 
> tokens from any given analysis chain are all derived from the single input 
> term that was passed into that chain. However, when {{{}sow=false{}}}, 
> edismax passes the entire multi-term query through each analysis chain as a 
> whole, resulting in multiple output tokens that are not "connected" to their 
> source term.
> Edismax still tries to generate a term-centric query when {{sow=false}} by 
> first generating a boolean query for each field, and then checking whether 
> all of these per-field queries have the same structure. The structure will 
> generally be uniform if each analysis chain emits the same number of tokens 
> for the given input. If one chain has a synonym filter and another doesn’t, 
> this uniformity may depend on whether a synonym rule happened to match a term 
> in the user's input.
> Assuming the per-field boolean queries _do_ have the same structure, edismax 
> reorganizes them into a new boolean query. The new query contains a dismax 
> for each clause position in the original queries. If the original queries are 
> {{(f1:foo f1:bar)}} and {{(f2:foo f2:bar)}} we can see they have two clauses 
> each, so we would get a dismax containing all the first position clauses 
> {{(f1:foo f1:bar)}} and another dismax containing all the

[GitHub] [solr] alessandrobenedetti closed pull request #129: SOLR-15407 untokenized field type with sow=false fix + tests

2022-12-21 Thread GitBox



alessandrobenedetti closed pull request #129: SOLR-15407 untokenized field type 
with sow=false fix + tests
URL: https://github.com/apache/solr/pull/129


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For additional commands, e-mail: issues-h...@solr.apache.org

[jira] [Commented] (SOLR-16567) java.lang.StackOverflowError when combining KnnQParser and FunctionRangeQParser

2022-12-21 Thread David Smiley (Jira)



[ 
https://issues.apache.org/jira/browse/SOLR-16567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17650961#comment-17650961
 ] 

David Smiley commented on SOLR-16567:
-

The above comment by ASF's git integration shows a commit was done to a branch 
{{jira/SOLR-16567}}.  [~abenedetti] feature/bug branches on our main repo is 
generally not needed unless there's going to be broad collaboration.  Even repo 
committers (like me) can contribute to a PR branch you keep on your fork (a 
GitHub feature).  Extraneous branches pollute the view and ultimately need 
grooming.  I see you already have [a fork with a branch there for this 
PR|https://github.com/SeaseLtd/solr/tree/jira/SOLR-16567] so I'm confused why 
this {{jira/SOLR-16567}} branch is here as well (redundant).

> java.lang.StackOverflowError when combining KnnQParser and 
> FunctionRangeQParser
> ---
>
> Key: SOLR-16567
> URL: https://issues.apache.org/jira/browse/SOLR-16567
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: query
>Affects Versions: 9.1
> Environment: Solr Cloud with `solr:9.1` Docker image
>Reporter: Gabriel Magno
>Priority: Major
> Attachments: create_example-solr_9_0.sh, create_example-solr_9_1.sh, 
> error_full.txt, response-error.json, run_query.sh
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Hello there!
> I had a Solr 9.0 cluster running, using the new Dense Vector feature. 
> Recently I have migrated to Solr 9.1. Most of the things are working fine, 
> except for a special case I have here.
> *Error Description*
> The problem happens when I try making an Edismax query with a KNN sub-query 
> and a Function Range filter. For example, I try making this query.
>  * defType=edismax
>  * df=name
>  * q=the
>  * similarity_vector=\{!knn f=vector topK=10}[1.1,2.2,3.3,4.4]
>  * {!frange l=0.99}$similarity_vector
> In other words, I want all the documents matching the term "the" in the 
> "name" field, and I filter to return only documents having a vector 
> similarity of at least 0.99. This query was working fine on Solr 9.0, but on 
> Solr 9.1, I get his error:
>  
> {code:java}
> java.lang.RuntimeException: java.lang.StackOverflowErrorat 
> org.apache.solr.servlet.HttpSolrCall.sendError(HttpSolrCall.java:840)at 
> org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:641)at 
> org.apache.solr.servlet.SolrDispatchFilter.dispatch(SolrDispatchFilter.java:250)
> at 
> org.apache.solr.servlet.SolrDispatchFilter.lambda/usr/bin/zsh(SolrDispatchFilter.java:218)
> at 
> org.apache.solr.servlet.ServletUtils.traceHttpRequestExecution2(ServletUtils.java:257)
> at 
> org.apache.solr.servlet.ServletUtils.rateLimitRequest(ServletUtils.java:227)  
>   at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:213)
> at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:195)
> at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:201) 
>... (manually supressed for brevity)at 
> java.base/java.lang.Thread.run(Unknown Source)Caused by: 
> java.lang.StackOverflowErrorat 
> org.apache.solr.search.StrParser.getId(StrParser.java:172)at 
> org.apache.solr.search.StrParser.getId(StrParser.java:168)at 
> org.apache.solr.search.QueryParsing.parseLocalParams(QueryParsing.java:100)   
>  at 
> org.apache.solr.search.QueryParsing.parseLocalParams(QueryParsing.java:65)
> at org.apache.solr.search.QParser.getParser(QParser.java:364)at 
> org.apache.solr.search.QParser.getParser(QParser.java:334)at 
> org.apache.solr.search.QParser.getParser(QParser.java:321)at 
> org.apache.solr.search.QueryUtils.parseFilterQueries(QueryUtils.java:244)
> at 
> org.apache.solr.search.neural.KnnQParser.getFilterQuery(KnnQParser.java:93)   
>  at org.apache.solr.search.neural.KnnQParser.parse(KnnQParser.java:83)at 
> org.apache.solr.search.QParser.getQuery(QParser.java:188)at 
> org.apache.solr.search.FunctionQParser.parseValueSource(FunctionQParser.java:384)
> at org.apache.solr.search.FunctionQParser.parse(FunctionQParser.java:94)  
>   at org.apache.solr.search.QParser.getQuery(QParser.java:188)at 
> org.apache.solr.search.FunctionRangeQParserPlugin.parse(FunctionRangeQParserPlugin.java:53)
> at org.apache.solr.search.QParser.getQuery(QParser.java:188)at 
> org.apache.solr.search.QueryUtils.parseFilterQueries(QueryUtils.java:246)
> at 
> org.apache.solr.search.neural.KnnQParser.getFilterQuery(KnnQParser.java:93)   
>  at org.apache.solr.search.neural.KnnQParser.parse(KnnQParser.java:83)at 
> org.apache.solr.search.QParser.getQuery(QParser.java:188)at 
> org.apac

[GitHub] [solr] dsmiley commented on pull request #1245: SOLR-16567: KnnQueryParser support for both pre-filters and post-filter

2022-12-21 Thread GitBox



dsmiley commented on PR #1245:
URL: https://github.com/apache/solr/pull/1245#issuecomment-1361766495

   Thanks for your complements.
   
   Can't you simply use SolrIndexSearcher#getProcessedFilter now?
   
   As to your proposal, I am confused as to exactly where you propose inserting 
the logic you provided a snippet of.  If you propose SolrIndexSearcher 
somewhere then I don't like it because it's clearly special casing a specific 
query which is a design problem.  Are you trying to basically *move* certain 
FQs out of their top level position and into/embedded in a particular parsed 
query?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For additional commands, e-mail: issues-h...@solr.apache.org

[GitHub] [solr] patsonluk commented on a diff in pull request #1242: SOLR-16580: Avoid making copies of DocCollection for PRS updates

2022-12-21 Thread GitBox



patsonluk commented on code in PR #1242:
URL: https://github.com/apache/solr/pull/1242#discussion_r1054663494


##
solr/solrj/src/java/org/apache/solr/common/cloud/ClusterState.java:
##
@@ -261,9 +259,10 @@ private static DocCollection collectionFromObjects(
   if (log.isDebugEnabled()) {
 log.debug("a collection {} has per-replica state", name);
   }
-  // this collection has replica states stored outside
-  ReplicaStatesProvider rsp = REPLICASTATES_PROVIDER.get();
-  if (rsp instanceof StatesProvider) ((StatesProvider) 
rsp).isPerReplicaState = true;
+} else {
+  // prior to this call, PRS provider is set. We should unset it before
+  // deserializing the replicas and slices
+  DocCollection.clearReplicaStateProvider();

Review Comment:
   To my understand this is required as otherwise the Provider might interfere 
and overrides the input values here?
   
   I agree that using ThreadLocals could avoid modification of method 
signatures as you pointed out, but I also share similar concern as @hiteshk25 
that it's a bit hard to track code flow with ThreadLocal as it requires 
"internal knowledge" of the code in order to know where things get 
added/modified. (since the method signature/contract no longer suggest the 
"full input", and we might start doing fetching at places that used to be only 
assigning fields locally etc)
   
   This invocation of `clearReplicateStateProvider` could be one of the places 
that could be hard for devs that are not familiar with the ThreadLocal to 
reason.
   
   I do understand the goal of this PR is NOT the removal of threadlocal usage 
😊 , it would be nice though to consider other designs as a replacement of 
Threadlocal (in the future!). That could include bigger refactoring 
(subclassing ClusterState that includes PRS, or adding overloading method etc). 
   
   For the moment, more comments like these to explain the rational could be 
very helpful (which this comment has already done a pretty good job, but 
perhaps it could also mention how the threadlocal PRS provider could override 
the values if not cleared ?)  



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For additional commands, e-mail: issues-h...@solr.apache.org

[GitHub] [solr] patsonluk commented on a diff in pull request #1242: SOLR-16580: Avoid making copies of DocCollection for PRS updates

2022-12-21 Thread GitBox



patsonluk commented on code in PR #1242:
URL: https://github.com/apache/solr/pull/1242#discussion_r1054663494


##
solr/solrj/src/java/org/apache/solr/common/cloud/ClusterState.java:
##
@@ -261,9 +259,10 @@ private static DocCollection collectionFromObjects(
   if (log.isDebugEnabled()) {
 log.debug("a collection {} has per-replica state", name);
   }
-  // this collection has replica states stored outside
-  ReplicaStatesProvider rsp = REPLICASTATES_PROVIDER.get();
-  if (rsp instanceof StatesProvider) ((StatesProvider) 
rsp).isPerReplicaState = true;
+} else {
+  // prior to this call, PRS provider is set. We should unset it before
+  // deserializing the replicas and slices
+  DocCollection.clearReplicaStateProvider();

Review Comment:
   To my understand this is required as otherwise the Provider might interfere 
and overrides the input values here?
   
   I agree that using ThreadLocals could avoid modification of method 
signatures as you pointed out, but I also share similar concern as @hiteshk25 
that it's a bit hard to track code flow with ThreadLocal as it requires 
"internal knowledge" of the code in order to know where things get 
added/modified. (since the method signature/contract no longer suggest the 
"full input", and we might start doing fetching at places that used to be only 
assigning fields locally etc)
   
   This invocation of `clearReplicateStateProvider` could be one of the places 
that could be hard for devs that are not familiar with the ThreadLocal to 
understand.
   
   I do understand the goal of this PR is NOT the removal of threadlocal usage 
😊 , it would be nice though to consider other designs as a replacement of 
Threadlocal (in the future!). That could include bigger refactoring 
(subclassing ClusterState that includes PRS, or adding overloading method etc). 
   
   For the moment, more comments like these to explain the rational could be 
very helpful (which this comment has already done a pretty good job, but 
perhaps it could also mention how the threadlocal PRS provider could override 
the values if not cleared ?)  



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For additional commands, e-mail: issues-h...@solr.apache.org

[GitHub] [solr] patsonluk commented on pull request #1242: SOLR-16580: Avoid making copies of DocCollection for PRS updates

2022-12-21 Thread GitBox



patsonluk commented on PR #1242:
URL: https://github.com/apache/solr/pull/1242#issuecomment-1361754709

   > @justinrsweeney @patsonluk this is still WIP , reviews are welcome
   
   Thanks for the work @noblepaul ! Just want to confirm that there are mainly 
2 goals here:
   1. Avoid making copies of DocCollections when `copyWith` is called (related 
to PRS updates)
   2. Avoid fetching PRS states until the state is actually queried (the Lazy 
PRS provider part)
   
   Do we know the general overhead of the current designs and whether they are 
causing issues? I agree that both changes will for sure reduce resource usage!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For additional commands, e-mail: issues-h...@solr.apache.org

[GitHub] [solr] patsonluk commented on a diff in pull request #1242: SOLR-16580: Avoid making copies of DocCollection for PRS updates

2022-12-21 Thread GitBox



patsonluk commented on code in PR #1242:
URL: https://github.com/apache/solr/pull/1242#discussion_r1054663494


##
solr/solrj/src/java/org/apache/solr/common/cloud/ClusterState.java:
##
@@ -261,9 +259,10 @@ private static DocCollection collectionFromObjects(
   if (log.isDebugEnabled()) {
 log.debug("a collection {} has per-replica state", name);
   }
-  // this collection has replica states stored outside
-  ReplicaStatesProvider rsp = REPLICASTATES_PROVIDER.get();
-  if (rsp instanceof StatesProvider) ((StatesProvider) 
rsp).isPerReplicaState = true;
+} else {
+  // prior to this call, PRS provider is set. We should unset it before
+  // deserializing the replicas and slices
+  DocCollection.clearReplicaStateProvider();

Review Comment:
   To my understand this is required as otherwise the Provider might interfere 
and overrides the input values here?
   
   I agree that using ThreadLocals could avoid modification of method 
signatures as you pointed out, but I also share similar concern as @hiteshk25 
that it's a bit hard to track code flow with ThreadLocal as it requires 
"internal knowledge" of the code in order to know where things get 
added/modified.
   
   This invocation of `clearReplicateStateProvider` could be one of the places 
that could be hard for devs that are not familiar with the ThreadLocal to 
understand.
   
   I do understand the goal of this PR is NOT the removal of threadlocal usage 
😊 , it would be nice though to consider other designs as a replacement of 
Threadlocal (in the future!). That could include bigger refactoring 
(subclassing ClusterState that includes PRS, or adding overloading method etc). 
   
   For the moment, more comments like these to explain the rational could be 
very helpful (which this comment has already done a pretty good job, but 
perhaps it could also mention how the threadlocal PRS provider could override 
the values if not cleared ?)  



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For additional commands, e-mail: issues-h...@solr.apache.org

[GitHub] [solr] patsonluk commented on a diff in pull request #1242: SOLR-16580: Avoid making copies of DocCollection for PRS updates

2022-12-21 Thread GitBox



patsonluk commented on code in PR #1242:
URL: https://github.com/apache/solr/pull/1242#discussion_r1054663494


##
solr/solrj/src/java/org/apache/solr/common/cloud/ClusterState.java:
##
@@ -261,9 +259,10 @@ private static DocCollection collectionFromObjects(
   if (log.isDebugEnabled()) {
 log.debug("a collection {} has per-replica state", name);
   }
-  // this collection has replica states stored outside
-  ReplicaStatesProvider rsp = REPLICASTATES_PROVIDER.get();
-  if (rsp instanceof StatesProvider) ((StatesProvider) 
rsp).isPerReplicaState = true;
+} else {
+  // prior to this call, PRS provider is set. We should unset it before
+  // deserializing the replicas and slices
+  DocCollection.clearReplicaStateProvider();

Review Comment:
   To my understand this is required as otherwise the Provider might interfere 
and overrides the input values here?
   
   I agree that using ThreadLocals could avoid modification of method 
signatures as you pointed out, but I also share similar concern as @hiteshk25 
that it's a bit hard to track code flow with ThreadLocal as it requires 
"internal knowledge" of the code in order to know where things get 
added/modified.
   
   This invocation of `clearReplicateStateProvider` could be one of the places 
that could be hard for dev that are not familiar with the ThreadLocal to 
understand.
   
   I do understand the goal of this PR is NOT the removal of threadlocal usage 
😊 , it would be nice though to consider other designs as a replacement of 
Threadlocal. That probably would include bigger changes (subclassing 
ClusterState that includes PRS, or adding overloading method etc). For the 
moment, more comments to explain the rational would be very helpful (which this 
comment also does a pretty good job, but perhaps it could also mention how the 
threadlocal PRS provider could override the values if not cleared ?)  



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For additional commands, e-mail: issues-h...@solr.apache.org

[GitHub] [solr] patsonluk commented on a diff in pull request #1242: SOLR-16580: Avoid making copies of DocCollection for PRS updates

2022-12-21 Thread GitBox



patsonluk commented on code in PR #1242:
URL: https://github.com/apache/solr/pull/1242#discussion_r1054654494


##
solr/solrj/src/java/org/apache/solr/common/cloud/DocCollection.java:
##
@@ -139,30 +138,10 @@ public static String getCollectionPathRoot(String coll) {
* only a replica is updated
*/
   public DocCollection copyWith(PerReplicaStates newPerReplicaStates) {

Review Comment:
   I assume we will need to modify `getSlices()` so it would return the correct 
"view" of slices from replica states too? Anyway, this is probably still WIP ! 
😊 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For additional commands, e-mail: issues-h...@solr.apache.org

[GitHub] [solr] alessandrobenedetti commented on pull request #1245: SOLR-16567: KnnQueryParser support for both pre-filters and post-filter

2022-12-21 Thread GitBox



alessandrobenedetti commented on PR #1245:
URL: https://github.com/apache/solr/pull/1245#issuecomment-1361661204

   Ok, I updated the PR, this is what I have done:
   1) workaround to fix the bug and renames suggested by @dsmiley 
   2) opened a pull request in Lucene to implement getters in the 
KnnVectorQuery:  https://github.com/apache/lucene/pull/12029/files
   
   Unless any additional good ideas, I would go with this now.
   Then as soon as the Lucene side is sorted out and in Solr, I would implement 
the optimal approach, removing all the redundant code and just managing the 
filters in the org.apache.solr.search.QueryUtils#combineQueryAndFilter ( it 
will be so easy and clean, it's a shame I can't do it immediately)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For additional commands, e-mail: issues-h...@solr.apache.org

[jira] [Commented] (SOLR-16567) java.lang.StackOverflowError when combining KnnQParser and FunctionRangeQParser

2022-12-21 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/SOLR-16567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17650929#comment-17650929
 ] 

ASF subversion and git services commented on SOLR-16567:


Commit e683af1c7bdece1d7100852323c0a32b12bfd566 in solr's branch 
refs/heads/jira/SOLR-16567 from Elia Porciani
[ https://gitbox.apache.org/repos/asf?p=solr.git;h=e683af1c7bd ]

SOLR-16567: KnnQueryParser support for both pre-filters and post-filters(cost>0)


> java.lang.StackOverflowError when combining KnnQParser and 
> FunctionRangeQParser
> ---
>
> Key: SOLR-16567
> URL: https://issues.apache.org/jira/browse/SOLR-16567
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: query
>Affects Versions: 9.1
> Environment: Solr Cloud with `solr:9.1` Docker image
>Reporter: Gabriel Magno
>Priority: Major
> Attachments: create_example-solr_9_0.sh, create_example-solr_9_1.sh, 
> error_full.txt, response-error.json, run_query.sh
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Hello there!
> I had a Solr 9.0 cluster running, using the new Dense Vector feature. 
> Recently I have migrated to Solr 9.1. Most of the things are working fine, 
> except for a special case I have here.
> *Error Description*
> The problem happens when I try making an Edismax query with a KNN sub-query 
> and a Function Range filter. For example, I try making this query.
>  * defType=edismax
>  * df=name
>  * q=the
>  * similarity_vector=\{!knn f=vector topK=10}[1.1,2.2,3.3,4.4]
>  * {!frange l=0.99}$similarity_vector
> In other words, I want all the documents matching the term "the" in the 
> "name" field, and I filter to return only documents having a vector 
> similarity of at least 0.99. This query was working fine on Solr 9.0, but on 
> Solr 9.1, I get his error:
>  
> {code:java}
> java.lang.RuntimeException: java.lang.StackOverflowErrorat 
> org.apache.solr.servlet.HttpSolrCall.sendError(HttpSolrCall.java:840)at 
> org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:641)at 
> org.apache.solr.servlet.SolrDispatchFilter.dispatch(SolrDispatchFilter.java:250)
> at 
> org.apache.solr.servlet.SolrDispatchFilter.lambda/usr/bin/zsh(SolrDispatchFilter.java:218)
> at 
> org.apache.solr.servlet.ServletUtils.traceHttpRequestExecution2(ServletUtils.java:257)
> at 
> org.apache.solr.servlet.ServletUtils.rateLimitRequest(ServletUtils.java:227)  
>   at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:213)
> at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:195)
> at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:201) 
>... (manually supressed for brevity)at 
> java.base/java.lang.Thread.run(Unknown Source)Caused by: 
> java.lang.StackOverflowErrorat 
> org.apache.solr.search.StrParser.getId(StrParser.java:172)at 
> org.apache.solr.search.StrParser.getId(StrParser.java:168)at 
> org.apache.solr.search.QueryParsing.parseLocalParams(QueryParsing.java:100)   
>  at 
> org.apache.solr.search.QueryParsing.parseLocalParams(QueryParsing.java:65)
> at org.apache.solr.search.QParser.getParser(QParser.java:364)at 
> org.apache.solr.search.QParser.getParser(QParser.java:334)at 
> org.apache.solr.search.QParser.getParser(QParser.java:321)at 
> org.apache.solr.search.QueryUtils.parseFilterQueries(QueryUtils.java:244)
> at 
> org.apache.solr.search.neural.KnnQParser.getFilterQuery(KnnQParser.java:93)   
>  at org.apache.solr.search.neural.KnnQParser.parse(KnnQParser.java:83)at 
> org.apache.solr.search.QParser.getQuery(QParser.java:188)at 
> org.apache.solr.search.FunctionQParser.parseValueSource(FunctionQParser.java:384)
> at org.apache.solr.search.FunctionQParser.parse(FunctionQParser.java:94)  
>   at org.apache.solr.search.QParser.getQuery(QParser.java:188)at 
> org.apache.solr.search.FunctionRangeQParserPlugin.parse(FunctionRangeQParserPlugin.java:53)
> at org.apache.solr.search.QParser.getQuery(QParser.java:188)at 
> org.apache.solr.search.QueryUtils.parseFilterQueries(QueryUtils.java:246)
> at 
> org.apache.solr.search.neural.KnnQParser.getFilterQuery(KnnQParser.java:93)   
>  at org.apache.solr.search.neural.KnnQParser.parse(KnnQParser.java:83)at 
> org.apache.solr.search.QParser.getQuery(QParser.java:188)at 
> org.apache.solr.search.FunctionQParser.parseValueSource(FunctionQParser.java:384)
> at org.apache.solr.search.FunctionQParser.parse(FunctionQParser.java:94)  
>   at org.apache.solr.search.QParser.getQuery(QParser.java:188)at 
> org.apache.solr.search.FunctionRangeQParserPlugin.parse(FunctionRangeQParse

[jira] [Updated] (SOLR-16594) eDismax should use startOffset when converting per-field to per-term queries

2022-12-21 Thread Rudi Seitz (Jira)



 [ 
https://issues.apache.org/jira/browse/SOLR-16594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rudi Seitz updated SOLR-16594:
--
Description: 
When parsing a multi-term query that spans multiple fields, edismax sometimes 
switches from a "term-centric" to a "field-centric" approach. This creates 
inconsistent semantics for the {{mm}} or "min should match" parameter and may 
have an impact on scoring. The goal of this ticket is to improve the approach 
that edismax uses for generating term-centric queries so that edismax would 
less frequently "give up" and resort to the field-centric approach. 
Specifically, we propose that edismax should create a dismax query for each 
distinct startOffset found among the tokens emitted by the field analyzers. 
Since the relevant code in edismax works with Query objects that contain Terms, 
and since Terms do not hold the startOffset of the Token from which Term was 
derived, some plumbing work would need to be done to make the startOffsets 
available to edismax.

 

BACKGROUND:

 

If a user searches for "foo bar" with {{{}qf=f1 f2{}}}, a field-centric 
interpretation of the query would contain a clause for each field:

{{  (f1:foo f1:bar) (f2:foo f2:bar)}}

while a term-centric interpretation would contain a clause for each term:

{{  (f1:foo f2:foo) (f1:bar f2:bar)}}

The challenge in generating a term-centric query is that we need to take the 
tokens that emerge from each field's analysis chain and group them according to 
the terms in the user's original query. However, the tokens that emerge from an 
analysis chain do not store a reference to their corresponding input terms. For 
example, if we pass "foo bar" through an ngram analyzer we would get a token 
stream containing "f", "fo", "foo", "b", "ba", "bar". While it may be obvious 
to a human that "f", "fo", and "foo" all come from the "foo" input term, and 
that "b", "ba", and "bar" come from the "bar" input term, there is not always 
an easy way for edismax to see this connection. When {{{}sow=true{}}}, edismax 
passes each whitespace-separated term through each analysis chain separately, 
and therefore edismax "knows" that the output tokens from any given analysis 
chain are all derived from the single input term that was passed into that 
chain. However, when {{{}sow=false{}}}, edismax passes the entire multi-term 
query through each analysis chain as a whole, resulting in multiple output 
tokens that are not "connected" to their source term.

Edismax still tries to generate a term-centric query when {{sow=false}} by 
first generating a boolean query for each field, and then checking whether all 
of these per-field queries have the same structure. The structure will 
generally be uniform if each analysis chain emits the same number of tokens for 
the given input. If one chain has a synonym filter and another doesn’t, this 
uniformity may depend on whether a synonym rule happened to match a term in the 
user's input.

Assuming the per-field boolean queries _do_ have the same structure, edismax 
reorganizes them into a new boolean query. The new query contains a dismax for 
each clause position in the original queries. If the original queries are 
{{(f1:foo f1:bar)}} and {{(f2:foo f2:bar)}} we can see they have two clauses 
each, so we would get a dismax containing all the first position clauses 
{{(f1:foo f1:bar)}} and another dismax containing all the second position 
clauses {{{}(f2:foo f2:bar){}}}.

We can see that edismax is using clause position as a heuristic to reorganize 
the per-field boolean queries into per-term ones, even though it doesn't know 
for sure which clauses inside those per-field boolean queries are related to 
which input terms. We propose that a better way of reorganizing the per-field 
boolean queries is to create a dismax for each distinct startOffset seen among 
the tokens in the token streams emitted by each field analyzer. The startOffset 
of a token (rather, a PackedTokenAttributeImpl) is "the position of the first 
character corresponding to this token in the source text".

We propose that startOffset is a resonable way of matching output tokens up 
with the input terms that gave rise to them. For example, if we pass "foo bar" 
through an ngram analysis chain we see that the foo-related tokens all have 
startOffset=0 while the bar-related tokens all have startOffset=4. Likewise, 
tokens that are generated via synonym expansion have a startOffset that points 
to the beginning of the matching input term. For example, if the query "GB" 
generates "GB gib gigabyte gigabytes" via synonym expansion, all of those four 
tokens would have startOffset=0.

Here's an example of how the proposed edismax logic would work. Let's say a 
user searches for "foo bar" across two fields, f1 and f2, where f1 uses a 
standard text analysis chain while f2 generates ngrams. We would get 
field-centric queries {{(f1:foo f1:bar)}} and ({{{}f

[jira] [Updated] (SOLR-16594) eDismax should use startOffset when converting per-field to per-term queries

2022-12-21 Thread Rudi Seitz (Jira)



 [ 
https://issues.apache.org/jira/browse/SOLR-16594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rudi Seitz updated SOLR-16594:
--
Description: 
When parsing a multi-term query that spans multiple fields, edismax sometimes 
switches from a "term-centric" to a "field-centric" approach. This creates 
inconsistent semantics for the {{mm}} or "min should match" parameter and may 
have an impact on scoring. The goal of this ticket is to improve the approach 
that edismax uses for generating term-centric queries so that edismax would 
less frequently "give up" and resort to the field-centric approach. 
Specifically, we propose that edismax should create a dismax query for each 
distinct startOffset found among the tokens emitted by the field analyzers. 
Since the relevant code in edismax works with Query objects that contain Terms, 
and since Terms do not hold the startOffset of the Token from which Term was 
derived, some plumbing work would need to be done to make the startOffsets 
available to edismax.

 

BACKGROUND:

 

If a user searches for "foo bar" with {{{}qf=f1 f2{}}}, a field-centric 
interpretation of the query would contain a clause for each field:

{{  (f1:foo f1:bar) (f2:foo f2:bar)}}

while a term-centric interpretation would contain a clause for each term:

{{  (f1:foo f2:foo) (f1:bar f2:bar)}}

The challenge in generating a term-centric query is that we need to take the 
tokens that emerge from each field's analysis chain and group them according to 
the terms in the user's original query. However, the tokens that emerge from an 
analysis chain do not store a reference to their corresponding input terms. For 
example, if we pass "foo bar" through an ngram analyzer we would get a token 
stream containing "f", "fo", "foo", "b", "ba", "bar". While it may be obvious 
to a human that "f", "fo", and "foo" all come from the "foo" input term, and 
that "b", "ba", and "bar" come from the "bar" input term, there is not always 
an easy way for edismax to see this connection. When {{{}sow=true{}}}, edismax 
passes each whitespace-separated term through each analysis chain separately, 
and therefore edismax "knows" that the output tokens from any given analysis 
chain are all derived from the single input term that was passed into that 
chain. However, when {{{}sow=false{}}}, edismax passes the entire multi-term 
query through each analysis chain as a whole, resulting in multiple output 
tokens that are not "connected" to their source term.

Edismax still tries to generate a term-centric query when {{sow=false}} by 
first generating a boolean query for each field, and then checking whether all 
of these per-field queries have the same structure. The structure will 
generally be uniform if each analysis chain emits the same number of tokens for 
the given input. If one chain has a synonym filter and another doesn’t, this 
uniformity may depend on whether a synonym rule happened to match a term in the 
user's input.

Assuming the per-field boolean queries _do_ have the same structure, edismax 
reorganizes them into a new boolean query. The new query contains a dismax for 
each clause position in the original queries. If the original queries are 
{{(f1:foo f1:bar) }}and{{ (f2:foo f2:bar)}} we can see they have two clauses 
each, so we would get a dismax containing all the first position clauses 
{{(f1:foo f1:bar)}} and another dismax containing all the second position 
clauses {{{}(f2:foo f2:bar){}}}.

We can see that edismax is using clause position as a heuristic to reorganize 
the per-field boolean queries into per-term ones, even though it doesn't know 
for sure which clauses inside those per-field boolean queries are related to 
which input terms. We propose that a better way of reorganizing the per-field 
boolean queries is to create a dismax for each distinct startOffset seen among 
the tokens in the token streams emitted by each field analyzer. The startOffset 
of a token (rather, a PackedTokenAttributeImpl) is "the position of the first 
character corresponding to this token in the source text".

We propose that startOffset is a resonable way of matching output tokens up 
with the input terms that gave rise to them. For example, if we pass "foo bar" 
through an ngram analysis chain we see that the foo-related tokens all have 
startOffset=0 while the bar-related tokens all have startOffset=4. Likewise, 
tokens that are generated via synonym expansion have a startOffset that points 
to the beginning of the matching input term. For example, if the query "GB" 
generates "GB gib gigabyte gigabytes" via synonym expansion, all of those four 
tokens would have startOffset=0.

Here's an example of how the proposed edismax logic would work. Let's say a 
user searches for "foo bar" across two fields, f1 and f2, where f1 uses a 
standard text analysis chain while f2 generates ngrams. We would get 
field-centric queries {{(f1:foo f1:bar)}} and ({{{}f

[jira] [Resolved] (SOLR-16585) All docs query with any nonzero positive start value throws NPE with "this.docs is null"

2022-12-21 Thread Michael Gibney (Jira)



 [ 
https://issues.apache.org/jira/browse/SOLR-16585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Gibney resolved SOLR-16585.
---
Fix Version/s: main (10.0)
   9.2
   Resolution: Fixed

> All docs query with any nonzero positive start value throws NPE with 
> "this.docs is null"
> 
>
> Key: SOLR-16585
> URL: https://issues.apache.org/jira/browse/SOLR-16585
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: query
>Affects Versions: 9.1
>Reporter: Shawn Heisey
>Assignee: Michael Gibney
>Priority: Major
> Fix For: main (10.0), 9.2, 9.1.1
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> An all docs query that has a nonzero positive value in the start parameter 
> will throw an NPE.  Below is a slightly redacted query sent by the admin UI 
> and  the exception.  This is from 9.2.0-SNAPSHOT installed as a service on 
> Ubuntu, a user reported the problem on solr-user with the  9.1.0 docker image.
> {code:none}
> http://server:port/solr/corename/select?indent=true&q.op=OR&q=*%3A*&rows=10&start=1&useParams={code}
> {code:none}
> java.lang.NullPointerException: Cannot invoke 
> "org.apache.solr.search.DocList.iterator()" because "this.docs" is null at 
> org.apache.solr.response.DocsStreamer.(DocsStreamer.java:74) at 
> org.apache.solr.response.ResultContext.getProcessedDocuments(ResultContext.java:55)
>  at 
> org.apache.solr.response.TextResponseWriter.writeDocuments(TextResponseWriter.java:246)
>  at 
> org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:196)
>  at org.apache.solr.common.util.TextWriter.writeVal(TextWriter.java:47) at 
> org.apache.solr.response.XMLWriter.writeResponse(XMLWriter.java:117) at 
> org.apache.solr.response.XMLResponseWriter.write(XMLResponseWriter.java:30) 
> at 
> org.apache.solr.response.QueryResponseWriterUtil.writeQueryResponse(QueryResponseWriterUtil.java:71)
>  at org.apache.solr.servlet.HttpSolrCall.writeResponse(HttpSolrCall.java:980) 
> at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:585) at 
> org.apache.solr.servlet.SolrDispatchFilter.dispatch(SolrDispatchFilter.java:251)
>  at 
> org.apache.solr.servlet.SolrDispatchFilter.lambda$doFilter$0(SolrDispatchFilter.java:219)
>  at 
> org.apache.solr.servlet.ServletUtils.traceHttpRequestExecution2(ServletUtils.java:257)
>  at 
> org.apache.solr.servlet.ServletUtils.rateLimitRequest(ServletUtils.java:227) 
> at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:214)
>  at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:196)
>  at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:210) at 
> org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1635)
>  at 
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:527) at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:131) 
> at 
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:578) 
> at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:122)
>  at 
> org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:223)
>  at 
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1571)
>  at 
> org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:221)
>  at 
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1383)
>  at 
> org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:176)
>  at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:484) 
> at 
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1544)
>  at 
> org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:174)
>  at 
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1305)
>  at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:129) 
> at 
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:149)
>  at 
> org.eclipse.jetty.server.handler.InetAccessHandler.handle(InetAccessHandler.java:228)
>  at 
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:141)
>  at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:122)
>  at 
> org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:301)
>  at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:122)
>  at 
> org.eclipse.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:822)
>  at 
> org.eclip

[jira] [Updated] (SOLR-16585) All docs query with any nonzero positive start value throws NPE with "this.docs is null"

2022-12-21 Thread Michael Gibney (Jira)



 [ 
https://issues.apache.org/jira/browse/SOLR-16585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Gibney updated SOLR-16585:
--
Fix Version/s: 9.1.1

> All docs query with any nonzero positive start value throws NPE with 
> "this.docs is null"
> 
>
> Key: SOLR-16585
> URL: https://issues.apache.org/jira/browse/SOLR-16585
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: query
>Affects Versions: 9.1
>Reporter: Shawn Heisey
>Assignee: Michael Gibney
>Priority: Major
> Fix For: 9.1.1
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> An all docs query that has a nonzero positive value in the start parameter 
> will throw an NPE.  Below is a slightly redacted query sent by the admin UI 
> and  the exception.  This is from 9.2.0-SNAPSHOT installed as a service on 
> Ubuntu, a user reported the problem on solr-user with the  9.1.0 docker image.
> {code:none}
> http://server:port/solr/corename/select?indent=true&q.op=OR&q=*%3A*&rows=10&start=1&useParams={code}
> {code:none}
> java.lang.NullPointerException: Cannot invoke 
> "org.apache.solr.search.DocList.iterator()" because "this.docs" is null at 
> org.apache.solr.response.DocsStreamer.(DocsStreamer.java:74) at 
> org.apache.solr.response.ResultContext.getProcessedDocuments(ResultContext.java:55)
>  at 
> org.apache.solr.response.TextResponseWriter.writeDocuments(TextResponseWriter.java:246)
>  at 
> org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:196)
>  at org.apache.solr.common.util.TextWriter.writeVal(TextWriter.java:47) at 
> org.apache.solr.response.XMLWriter.writeResponse(XMLWriter.java:117) at 
> org.apache.solr.response.XMLResponseWriter.write(XMLResponseWriter.java:30) 
> at 
> org.apache.solr.response.QueryResponseWriterUtil.writeQueryResponse(QueryResponseWriterUtil.java:71)
>  at org.apache.solr.servlet.HttpSolrCall.writeResponse(HttpSolrCall.java:980) 
> at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:585) at 
> org.apache.solr.servlet.SolrDispatchFilter.dispatch(SolrDispatchFilter.java:251)
>  at 
> org.apache.solr.servlet.SolrDispatchFilter.lambda$doFilter$0(SolrDispatchFilter.java:219)
>  at 
> org.apache.solr.servlet.ServletUtils.traceHttpRequestExecution2(ServletUtils.java:257)
>  at 
> org.apache.solr.servlet.ServletUtils.rateLimitRequest(ServletUtils.java:227) 
> at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:214)
>  at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:196)
>  at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:210) at 
> org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1635)
>  at 
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:527) at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:131) 
> at 
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:578) 
> at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:122)
>  at 
> org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:223)
>  at 
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1571)
>  at 
> org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:221)
>  at 
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1383)
>  at 
> org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:176)
>  at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:484) 
> at 
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1544)
>  at 
> org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:174)
>  at 
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1305)
>  at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:129) 
> at 
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:149)
>  at 
> org.eclipse.jetty.server.handler.InetAccessHandler.handle(InetAccessHandler.java:228)
>  at 
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:141)
>  at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:122)
>  at 
> org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:301)
>  at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:122)
>  at 
> org.eclipse.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:822)
>  at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:122)
>

[GitHub] [solr] magibney commented on pull request #1236: SOLR-16585: Fix NPE in MatchAllDocs pagination

2022-12-21 Thread GitBox



magibney commented on PR #1236:
URL: https://github.com/apache/solr/pull/1236#issuecomment-1361455049

   Thanks everyone; committed and backported to `branch_9x` and `branch_9_1`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For additional commands, e-mail: issues-h...@solr.apache.org

[jira] [Commented] (SOLR-16585) All docs query with any nonzero positive start value throws NPE with "this.docs is null"

2022-12-21 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/SOLR-16585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17650906#comment-17650906
 ] 

ASF subversion and git services commented on SOLR-16585:


Commit 8093e782eeef212f2d978aaf79e2a8d0aacba6bb in solr's branch 
refs/heads/branch_9_1 from Michael Gibney
[ https://gitbox.apache.org/repos/asf?p=solr.git;h=8093e782eee ]

SOLR-16585: Fix NPE in MatchAllDocs pagination (#1236)

(cherry picked from commit ced26f7132a4162dd7eaa96de2c87712bd8525fa)


> All docs query with any nonzero positive start value throws NPE with 
> "this.docs is null"
> 
>
> Key: SOLR-16585
> URL: https://issues.apache.org/jira/browse/SOLR-16585
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: query
>Affects Versions: 9.1
>Reporter: Shawn Heisey
>Assignee: Michael Gibney
>Priority: Major
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> An all docs query that has a nonzero positive value in the start parameter 
> will throw an NPE.  Below is a slightly redacted query sent by the admin UI 
> and  the exception.  This is from 9.2.0-SNAPSHOT installed as a service on 
> Ubuntu, a user reported the problem on solr-user with the  9.1.0 docker image.
> {code:none}
> http://server:port/solr/corename/select?indent=true&q.op=OR&q=*%3A*&rows=10&start=1&useParams={code}
> {code:none}
> java.lang.NullPointerException: Cannot invoke 
> "org.apache.solr.search.DocList.iterator()" because "this.docs" is null at 
> org.apache.solr.response.DocsStreamer.(DocsStreamer.java:74) at 
> org.apache.solr.response.ResultContext.getProcessedDocuments(ResultContext.java:55)
>  at 
> org.apache.solr.response.TextResponseWriter.writeDocuments(TextResponseWriter.java:246)
>  at 
> org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:196)
>  at org.apache.solr.common.util.TextWriter.writeVal(TextWriter.java:47) at 
> org.apache.solr.response.XMLWriter.writeResponse(XMLWriter.java:117) at 
> org.apache.solr.response.XMLResponseWriter.write(XMLResponseWriter.java:30) 
> at 
> org.apache.solr.response.QueryResponseWriterUtil.writeQueryResponse(QueryResponseWriterUtil.java:71)
>  at org.apache.solr.servlet.HttpSolrCall.writeResponse(HttpSolrCall.java:980) 
> at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:585) at 
> org.apache.solr.servlet.SolrDispatchFilter.dispatch(SolrDispatchFilter.java:251)
>  at 
> org.apache.solr.servlet.SolrDispatchFilter.lambda$doFilter$0(SolrDispatchFilter.java:219)
>  at 
> org.apache.solr.servlet.ServletUtils.traceHttpRequestExecution2(ServletUtils.java:257)
>  at 
> org.apache.solr.servlet.ServletUtils.rateLimitRequest(ServletUtils.java:227) 
> at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:214)
>  at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:196)
>  at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:210) at 
> org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1635)
>  at 
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:527) at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:131) 
> at 
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:578) 
> at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:122)
>  at 
> org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:223)
>  at 
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1571)
>  at 
> org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:221)
>  at 
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1383)
>  at 
> org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:176)
>  at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:484) 
> at 
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1544)
>  at 
> org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:174)
>  at 
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1305)
>  at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:129) 
> at 
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:149)
>  at 
> org.eclipse.jetty.server.handler.InetAccessHandler.handle(InetAccessHandler.java:228)
>  at 
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:141)
>  at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:122)
>  at 
> org.eclipse.je

[jira] [Commented] (SOLR-16585) All docs query with any nonzero positive start value throws NPE with "this.docs is null"

2022-12-21 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/SOLR-16585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17650905#comment-17650905
 ] 

ASF subversion and git services commented on SOLR-16585:


Commit ced26f7132a4162dd7eaa96de2c87712bd8525fa in solr's branch 
refs/heads/branch_9x from Michael Gibney
[ https://gitbox.apache.org/repos/asf?p=solr.git;h=ced26f7132a ]

SOLR-16585: Fix NPE in MatchAllDocs pagination (#1236)

(cherry picked from commit bfccca2837e3f1625145454e75e2d602689f3781)


> All docs query with any nonzero positive start value throws NPE with 
> "this.docs is null"
> 
>
> Key: SOLR-16585
> URL: https://issues.apache.org/jira/browse/SOLR-16585
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: query
>Affects Versions: 9.1
>Reporter: Shawn Heisey
>Assignee: Michael Gibney
>Priority: Major
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> An all docs query that has a nonzero positive value in the start parameter 
> will throw an NPE.  Below is a slightly redacted query sent by the admin UI 
> and  the exception.  This is from 9.2.0-SNAPSHOT installed as a service on 
> Ubuntu, a user reported the problem on solr-user with the  9.1.0 docker image.
> {code:none}
> http://server:port/solr/corename/select?indent=true&q.op=OR&q=*%3A*&rows=10&start=1&useParams={code}
> {code:none}
> java.lang.NullPointerException: Cannot invoke 
> "org.apache.solr.search.DocList.iterator()" because "this.docs" is null at 
> org.apache.solr.response.DocsStreamer.(DocsStreamer.java:74) at 
> org.apache.solr.response.ResultContext.getProcessedDocuments(ResultContext.java:55)
>  at 
> org.apache.solr.response.TextResponseWriter.writeDocuments(TextResponseWriter.java:246)
>  at 
> org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:196)
>  at org.apache.solr.common.util.TextWriter.writeVal(TextWriter.java:47) at 
> org.apache.solr.response.XMLWriter.writeResponse(XMLWriter.java:117) at 
> org.apache.solr.response.XMLResponseWriter.write(XMLResponseWriter.java:30) 
> at 
> org.apache.solr.response.QueryResponseWriterUtil.writeQueryResponse(QueryResponseWriterUtil.java:71)
>  at org.apache.solr.servlet.HttpSolrCall.writeResponse(HttpSolrCall.java:980) 
> at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:585) at 
> org.apache.solr.servlet.SolrDispatchFilter.dispatch(SolrDispatchFilter.java:251)
>  at 
> org.apache.solr.servlet.SolrDispatchFilter.lambda$doFilter$0(SolrDispatchFilter.java:219)
>  at 
> org.apache.solr.servlet.ServletUtils.traceHttpRequestExecution2(ServletUtils.java:257)
>  at 
> org.apache.solr.servlet.ServletUtils.rateLimitRequest(ServletUtils.java:227) 
> at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:214)
>  at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:196)
>  at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:210) at 
> org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1635)
>  at 
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:527) at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:131) 
> at 
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:578) 
> at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:122)
>  at 
> org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:223)
>  at 
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1571)
>  at 
> org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:221)
>  at 
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1383)
>  at 
> org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:176)
>  at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:484) 
> at 
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1544)
>  at 
> org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:174)
>  at 
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1305)
>  at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:129) 
> at 
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:149)
>  at 
> org.eclipse.jetty.server.handler.InetAccessHandler.handle(InetAccessHandler.java:228)
>  at 
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:141)
>  at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:122)
>  at 
> org.eclipse.jet

[jira] [Commented] (SOLR-16585) All docs query with any nonzero positive start value throws NPE with "this.docs is null"

2022-12-21 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/SOLR-16585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17650903#comment-17650903
 ] 

ASF subversion and git services commented on SOLR-16585:


Commit bfccca2837e3f1625145454e75e2d602689f3781 in solr's branch 
refs/heads/main from Michael Gibney
[ https://gitbox.apache.org/repos/asf?p=solr.git;h=bfccca2837e ]

SOLR-16585: Fix NPE in MatchAllDocs pagination (#1236)



> All docs query with any nonzero positive start value throws NPE with 
> "this.docs is null"
> 
>
> Key: SOLR-16585
> URL: https://issues.apache.org/jira/browse/SOLR-16585
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: query
>Affects Versions: 9.1
>Reporter: Shawn Heisey
>Assignee: Michael Gibney
>Priority: Major
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> An all docs query that has a nonzero positive value in the start parameter 
> will throw an NPE.  Below is a slightly redacted query sent by the admin UI 
> and  the exception.  This is from 9.2.0-SNAPSHOT installed as a service on 
> Ubuntu, a user reported the problem on solr-user with the  9.1.0 docker image.
> {code:none}
> http://server:port/solr/corename/select?indent=true&q.op=OR&q=*%3A*&rows=10&start=1&useParams={code}
> {code:none}
> java.lang.NullPointerException: Cannot invoke 
> "org.apache.solr.search.DocList.iterator()" because "this.docs" is null at 
> org.apache.solr.response.DocsStreamer.(DocsStreamer.java:74) at 
> org.apache.solr.response.ResultContext.getProcessedDocuments(ResultContext.java:55)
>  at 
> org.apache.solr.response.TextResponseWriter.writeDocuments(TextResponseWriter.java:246)
>  at 
> org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:196)
>  at org.apache.solr.common.util.TextWriter.writeVal(TextWriter.java:47) at 
> org.apache.solr.response.XMLWriter.writeResponse(XMLWriter.java:117) at 
> org.apache.solr.response.XMLResponseWriter.write(XMLResponseWriter.java:30) 
> at 
> org.apache.solr.response.QueryResponseWriterUtil.writeQueryResponse(QueryResponseWriterUtil.java:71)
>  at org.apache.solr.servlet.HttpSolrCall.writeResponse(HttpSolrCall.java:980) 
> at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:585) at 
> org.apache.solr.servlet.SolrDispatchFilter.dispatch(SolrDispatchFilter.java:251)
>  at 
> org.apache.solr.servlet.SolrDispatchFilter.lambda$doFilter$0(SolrDispatchFilter.java:219)
>  at 
> org.apache.solr.servlet.ServletUtils.traceHttpRequestExecution2(ServletUtils.java:257)
>  at 
> org.apache.solr.servlet.ServletUtils.rateLimitRequest(ServletUtils.java:227) 
> at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:214)
>  at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:196)
>  at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:210) at 
> org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1635)
>  at 
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:527) at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:131) 
> at 
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:578) 
> at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:122)
>  at 
> org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:223)
>  at 
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1571)
>  at 
> org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:221)
>  at 
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1383)
>  at 
> org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:176)
>  at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:484) 
> at 
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1544)
>  at 
> org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:174)
>  at 
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1305)
>  at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:129) 
> at 
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:149)
>  at 
> org.eclipse.jetty.server.handler.InetAccessHandler.handle(InetAccessHandler.java:228)
>  at 
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:141)
>  at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:122)
>  at 
> org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:301)
>  at 
>

[GitHub] [solr] magibney merged pull request #1236: SOLR-16585: Fix NPE in MatchAllDocs pagination

2022-12-21 Thread GitBox



magibney merged PR #1236:
URL: https://github.com/apache/solr/pull/1236


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For additional commands, e-mail: issues-h...@solr.apache.org

[jira] [Comment Edited] (SOLR-16594) eDismax should use startOffset when converting per-field to per-term queries

2022-12-21 Thread Rudi Seitz (Jira)



[ 
https://issues.apache.org/jira/browse/SOLR-16594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17650895#comment-17650895
 ] 

Rudi Seitz edited comment on SOLR-16594 at 12/21/22 2:37 PM:
-

Steps to reproduce inconsistent {{mm}} behavior caused by term-centric to 
field-centric shift. Tested in Solr 9.1.

Create a collection using the default schema and index the following documents:

{{"id":"1", "field1_ws":"XY GB"}}
{{"id":"2", "field1_ws":"XY", "field2_ws":"GB", "field2_txt":"GB"}}
{{"id":"3", "field1_ws":"XY GC"}}
{{"id":"4", "field1_ws":"XY", "field2_ws":"GC", "field2_txt":"GC"}}

Note that default schema contains a synonym rule for GB which will be applied 
in _txt fields:

{{GB,gib,gigabyte,gigabytes}}

Now try the following edismax query for "GB MB" with "minimum should match" set 
to 100%:

{{q=XY GB}}
{{mm=100%}}
{{qf=field1_ws field2_ws}}
{{defType=edismax}}

{{[http://localhost:8983/solr/test/select?defType=edismax&indent=true&mm=100%25&q.op=OR&q=XY%20GB&qf=field1_ws%20field2_ws]}}

Notice that BOTH document 1 and document 2 are returned. This is because 
edismax is generating a term-centric query which allows the terms "XY" and "GB" 
to match in any of the qf fields.

Now add the txt version of field2 to the qf:

{{qf=field1_ws field2_ws field2_txt}}

{{[http://localhost:8983/solr/test/select?defType=edismax&indent=true&mm=100%25&q.op=OR&q=XY%20GB&qf=field1_ws%20field2_ws%20field2_txt]}}

Rerun the query and notice that ONLY document 1 is returned. This is because 
field2_txt expands synonyms, which leads to a different number of tokens from 
the ws fields, which causes edismax to generate a field-centric query, which 
requires that the terms "XY" and "GB" must both match in _one_ of the provided 
qf fields. It is counterintuitive that expanding the range of the search to 
include more fields actually _reduces_ recall here, but not elsewhere:

Repeat this experiment with {{q=XY GC}}

In this case, notice that BOTH documents 3 and 4 are returned for both versions 
of qf – there is no change in recall when we add field2_txt to qf. That is 
because there is no synonym rule for GC, so even though ws and txt fields have 
"incompatible" analysis chains they happen to generate the same number of 
tokens for this particular query and edismax is able to stay with the 
term-centric approach.

In these experiments we have been assuming the default {{{}sow=false{}}}. If we 
set {{sow=true}} we would see that the term-centric approach is used throughout 
and there is no change in behavior when we add field2_txt to qf, whether we are 
searching for "XY GB" or "XY GC".

 

 

 


was (Author: JIRAUSER297477):
Steps to reproduce inconsistent {{mm}} behavior caused by term-centric to 
field-centric shift. Tested in Solr 9.1.

Create a collection using the default schema and index the following documents:

{{"id":"1", "field1_ws":"XY GB"}}
{{"id":"2", "field1_ws":"XY", "field2_ws":"GB", "field2_txt":"GB"}}
{{"id":"3", "field1_ws":"XY GC"}}
{{"id":"4", "field1_ws":"XY", "field2_ws":"GC", "field2_txt":"GC"}}

Note that default schema contains a synonym rule for GB which will be applied 
in _txt fields:

{{GB,gib,gigabyte,gigabytes}}

Now try the following edismax query for "GB MB" with "minimum should match" set 
to 100%:

{{q=XY GB}}
{{mm=100%}}
{{qf=field1_ws field2_ws}}
{{defType=edismax}}

{{[http://localhost:8983/solr/test/select?defType=edismax&indent=true&mm=100%25&q.op=OR&q=XY%20GB&qf=field1_ws%20field2_ws]}}

Notice that BOTH document 1 and document 2 are returned. This is because 
edismax is generating a term-centric query which allows the terms "XY" and "GB" 
to match in any of the qf fields.

Now add the txt version of field2 to the qf:

{{qf=field1_ws field2_ws field2_txt}}

{{[http://localhost:8983/solr/test/select?defType=edismax&indent=true&mm=100%25&q.op=OR&q=XY%20GB&qf=field1_ws%20field2_ws%20field2_txt]}}

Rerun the query and notice that ONLY document 1 is returned. This is because 
field2_txt expands synonyms, which leads to a different number of tokens from 
the ws fields, which causes edismax to generate a field-centric query, which 
requires that the terms "XY" and "GB" must both match in _one_ of the provided 
qf fields. It is counterintuitive that expanding the range of the search to 
include more fields actually _reduces_ recall here, but not elsewhere:

Repeat this experiment with {{q=XY GC}}

In this case, notice that BOTH documents are returned for both versions of qf – 
there is no change when we add field2_txt to qf. That is because there is no 
synonym rule for GC, so even though ws and txt fields have "incompatible" 
analysis chains they happen to generate the same number of tokens for this 
particular query and edismax is able to stay with the term-centric approach.

In these experiments we have been assuming the default {{{}sow=false{}}}. If we 
set {{sow=true}} we would

[jira] [Comment Edited] (SOLR-16594) eDismax should use startOffset when converting per-field to per-term queries

2022-12-21 Thread Rudi Seitz (Jira)



[ 
https://issues.apache.org/jira/browse/SOLR-16594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17650895#comment-17650895
 ] 

Rudi Seitz edited comment on SOLR-16594 at 12/21/22 2:34 PM:
-

Steps to reproduce inconsistent {{mm}} behavior caused by term-centric to 
field-centric shift. Tested in Solr 9.1.

Create a collection using the default schema and index the following documents:

{{"id":"1", "field1_ws":"XY GB"}}
{{"id":"2", "field1_ws":"XY", "field2_ws":"GB", "field2_txt":"GB"}}
{{"id":"3", "field1_ws":"XY GC"}}
{{"id":"4", "field1_ws":"XY", "field2_ws":"GC", "field2_txt":"GC"}}

Note that default schema contains a synonym rule for GB which will be applied 
in _txt fields:

{{GB,gib,gigabyte,gigabytes}}

Now try the following edismax query for "GB MB" with "minimum should match" set 
to 100%:

{{q=XY GB}}
{{mm=100%}}
{{qf=field1_ws field2_ws}}
{{defType=edismax}}

{{[http://localhost:8983/solr/test/select?defType=edismax&indent=true&mm=100%25&q.op=OR&q=XY%20GB&qf=field1_ws%20field2_ws]}}

Notice that BOTH document 1 and document 2 are returned. This is because 
edismax is generating a term-centric query which allows the terms "XY" and "GB" 
to match in any of the qf fields.

Now add the txt version of field2 to the qf:

{{qf=field1_ws field2_ws field2_txt}}

{{[http://localhost:8983/solr/test/select?defType=edismax&indent=true&mm=100%25&q.op=OR&q=XY%20GB&qf=field1_ws%20field2_ws%20field2_txt]}}

Rerun the query and notice that ONLY document 1 is returned. This is because 
field2_txt expands synonyms, which leads to a different number of tokens from 
the ws fields, which causes edismax to generate a field-centric query, which 
requires that the terms "XY" and "GB" must both match in _one_ of the provided 
qf fields. It is counterintuitive that expanding the range of the search to 
include more fields actually _reduces_ recall here, but not elsewhere:

Repeat this experiment with {{q=XY GC}}

In this case, notice that BOTH documents are returned for both versions of qf – 
there is no change when we add field2_txt to qf. That is because there is no 
synonym rule for GC, so even though ws and txt fields have "incompatible" 
analysis chains they happen to generate the same number of tokens for this 
particular query and edismax is able to stay with the term-centric approach.

In these experiments we have been assuming the default {{{}sow=false{}}}. If we 
set {{sow=true}} we would see that the term-centric approach is used throughout 
and there is no change in behavior when we add field2_txt to qf, whether we are 
searching for "XY GB" or "XY GC".

 

 

 


was (Author: JIRAUSER297477):
Steps to reproduce inconsistent {{mm}} behavior caused by term-centric to 
field-centric shift. Tested in Solr 9.1.

Create a collection using the default schema and index the following documents:

 

{{{"id":"1", "field1_ws":"XY GB"}}}

{{{}{"id":"2", "field1_ws":"XY", "field2_ws":"GB", 
"field2_txt":"GB"}{}}}{{{}{}}}

{{{"id":"3", "field1_ws":"XY GC"}}}

{{{"id":"4", "field1_ws":"XY", "field2_ws":"GC", "field2_txt":"GC"}}}

 

Note that default schema contains a synonym rule for GB which will be applied 
in _txt fields:

{{GB,gib,gigabyte,gigabytes}}

Now try the following edismax query for "GB MB" with "minimum should match" set 
to 100%:

{{q=XY GB}}
{{mm=100%}}
{{qf=field1_ws field2_ws}}
{{defType=edismax}}

{{[http://localhost:8983/solr/test/select?defType=edismax&indent=true&mm=100%25&q.op=OR&q=XY%20GB&qf=field1_ws%20field2_ws]}}

Notice that BOTH document 1 and document 2 are returned. This is because 
edismax is generating a term-centric query which allows the terms "XY" and "GB" 
to match in any of the qf fields.

Now add the txt version of field2 to the qf:

{{qf=field1_ws field2_ws field2_txt}}

{{[http://localhost:8983/solr/test/select?defType=edismax&indent=true&mm=100%25&q.op=OR&q=XY%20GB&qf=field1_ws%20field2_ws%20field2_txt]}}

Rerun the query and notice that ONLY document 1 is returned. This is because 
field2_txt expands synonyms, which leads to a different number of tokens from 
the ws fields, which causes edismax to generate a field-centric query, which 
requires that the terms "XY" and "GB" must both match in _one_ of the provided 
qf fields. It is counterintuitive that expanding the range of the search to 
include more fields actually _reduces_ recall here, but not elsewhere:

Repeat this experiment with {{q=XY GC}}

In this case, notice that BOTH documents are returned for both versions of qf – 
there is no change when we add field2_txt to qf. That is because there is no 
synonym rule for GC, so even though ws and txt fields have "incompatible" 
analysis chains they happen to generate the same number of tokens for this 
particular query and edismax is able to stay with the term-centric approach.

In these experiments we have been assuming the default {{{}sow=false{}}}. If we 
set {{sow=true

[GitHub] [solr-operator] tiimbz commented on issue #483: Servicemonitor for prometheus exporter is referring to cluster port instead of metrics pod port

2022-12-21 Thread GitBox



tiimbz commented on issue #483:
URL: https://github.com/apache/solr-operator/issues/483#issuecomment-1361389034

   Looking at the code, it looks like the `prometheus.io/port` value is set 
from `ExtSolrMetricsPort`, not `SolrMetricsPort` which would have fixed the 
problem.
   
   Any attempts to overwrite this by using custom `serviceAnnotations` is not 
working, as custom annotations can only supplement the default ones, not 
overwrite them: 
https://github.com/apache/solr-operator/blob/main/controllers/util/prometheus_exporter_util.go#L400


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For additional commands, e-mail: issues-h...@solr.apache.org

[jira] [Comment Edited] (SOLR-16594) eDismax should use startOffset when converting per-field to per-term queries

2022-12-21 Thread Rudi Seitz (Jira)



[ 
https://issues.apache.org/jira/browse/SOLR-16594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17650895#comment-17650895
 ] 

Rudi Seitz edited comment on SOLR-16594 at 12/21/22 2:32 PM:
-

Steps to reproduce inconsistent {{mm}} behavior caused by term-centric to 
field-centric shift. Tested in Solr 9.1.

Create a collection using the default schema and index the following documents:

 

{{{"id":"1", "field1_ws":"XY GB"}}}

{{{}{"id":"2", "field1_ws":"XY", "field2_ws":"GB", 
"field2_txt":"GB"}{}}}{{{}{}}}

{{{"id":"3", "field1_ws":"XY GC"}}}

{{{"id":"4", "field1_ws":"XY", "field2_ws":"GC", "field2_txt":"GC"}}}

 

Note that default schema contains a synonym rule for GB which will be applied 
in _txt fields:

{{GB,gib,gigabyte,gigabytes}}

Now try the following edismax query for "GB MB" with "minimum should match" set 
to 100%:

{{q=XY GB}}
{{mm=100%}}
{{qf=field1_ws field2_ws}}
{{defType=edismax}}

{{[http://localhost:8983/solr/test/select?defType=edismax&indent=true&mm=100%25&q.op=OR&q=XY%20GB&qf=field1_ws%20field2_ws]}}

Notice that BOTH document 1 and document 2 are returned. This is because 
edismax is generating a term-centric query which allows the terms "XY" and "GB" 
to match in any of the qf fields.

Now add the txt version of field2 to the qf:

{{qf=field1_ws field2_ws field2_txt}}

{{[http://localhost:8983/solr/test/select?defType=edismax&indent=true&mm=100%25&q.op=OR&q=XY%20GB&qf=field1_ws%20field2_ws%20field2_txt]}}

Rerun the query and notice that ONLY document 1 is returned. This is because 
field2_txt expands synonyms, which leads to a different number of tokens from 
the ws fields, which causes edismax to generate a field-centric query, which 
requires that the terms "XY" and "GB" must both match in _one_ of the provided 
qf fields. It is counterintuitive that expanding the range of the search to 
include more fields actually _reduces_ recall here, but not elsewhere:

Repeat this experiment with {{q=XY GC}}

In this case, notice that BOTH documents are returned for both versions of qf – 
there is no change when we add field2_txt to qf. That is because there is no 
synonym rule for GC, so even though ws and txt fields have "incompatible" 
analysis chains they happen to generate the same number of tokens for this 
particular query and edismax is able to stay with the term-centric approach.

In these experiments we have been assuming the default {{{}sow=false{}}}. If we 
set {{sow=true}} we would see that the term-centric approach is used throughout 
and there is no change in behavior when we add field2_txt to qf, whether we are 
searching for "XY GB" or "XY GC".

 

 

 


was (Author: JIRAUSER297477):
Steps to reproduce inconsistent {{mm}} behavior caused by term-centric to 
field-centric shift. Tested in Solr 9.1.

Create collection using the default schema and index the following documents:

{{{"id":"1", "field1_ws":"XY GB"}}}
{{{"id":"2", "field1_ws":"XY", "field2_ws":"GB", "field2_txt":"GB"}}}
{{{"id":"3", "field1_ws":"XY GC"}}}
{{{"id":"4", "field1_ws":"XY", "field2_ws":"GC", "field2_txt":"GC"}}}

Note that default schema contains a synonym rule for GB which will be applied 
in _txt fields:

{{GB,gib,gigabyte,gigabytes}}

Now try the following edismax query for "GB MB" with "minimum should match" set 
to 100%:

{{q=XY GB}}
{{mm=100%}}
{{qf=field1_ws field2_ws}}
{{defType=edismax}}

{{http://localhost:8983/solr/test/select?defType=edismax&indent=true&mm=100%25&q.op=OR&q=XY%20GB&qf=field1_ws%20field2_ws}}

Notice that BOTH document 1 and document 2 are returned. This is because 
edismax is generating a term-centric query which allows the terms "XY" and "GB" 
to match in any of the qf fields.

Now add the txt version of field2 to the qf:

{{qf=field1_ws field2_ws field2_txt}}

{{http://localhost:8983/solr/test/select?defType=edismax&indent=true&mm=100%25&q.op=OR&q=XY%20GB&qf=field1_ws%20field2_ws%20field2_txt}}

Rerun the query and notice that ONLY document 1 is returned. This is because 
field2_txt expands synonyms, which leads to a different number of tokens from 
the _ws fields, which causes edismax to generate a field-centric query, which 
requires that the terms "XY" and "GB" must both match in _one_ of the provided 
qf fields. It is counterintuitive that expanding the range of the search to 
include more fields actually _reduces_ recall here, but not elsewhere:

Repeat this experiment with {{q=XY GC}}

In this case, notice that BOTH documents are returned for both versions of qf – 
there is no change when we add field2_txt to qf. That is because there is no 
synonym rule for GC, so even though _ws and _txt fields have "incompatible" 
analysis chains they happen to generate the same number of tokens for this 
particular query and edismax is able to stay with the term-centric approach.

In these experiments we have been assuming the default {{{}sow=false{}}}. If we 
set {{sow

[jira] [Commented] (SOLR-16594) eDismax should use startOffset when converting per-field to per-term queries

2022-12-21 Thread Rudi Seitz (Jira)



[ 
https://issues.apache.org/jira/browse/SOLR-16594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17650895#comment-17650895
 ] 

Rudi Seitz commented on SOLR-16594:
---

Steps to reproduce inconsistent {{mm}} behavior caused by term-centric to 
field-centric shift. Tested in Solr 9.1.

Create collection using the default schema and index the following documents:

{{{"id":"1", "field1_ws":"XY GB"}}}
{{{"id":"2", "field1_ws":"XY", "field2_ws":"GB", "field2_txt":"GB"}}}
{{{"id":"3", "field1_ws":"XY GC"}}}
{{{"id":"4", "field1_ws":"XY", "field2_ws":"GC", "field2_txt":"GC"}}}

Note that default schema contains a synonym rule for GB which will be applied 
in _txt fields:

{{GB,gib,gigabyte,gigabytes}}

Now try the following edismax query for "GB MB" with "minimum should match" set 
to 100%:

{{q=XY GB}}
{{mm=100%}}
{{qf=field1_ws field2_ws}}
{{defType=edismax}}

{{http://localhost:8983/solr/test/select?defType=edismax&indent=true&mm=100%25&q.op=OR&q=XY%20GB&qf=field1_ws%20field2_ws}}

Notice that BOTH document 1 and document 2 are returned. This is because 
edismax is generating a term-centric query which allows the terms "XY" and "GB" 
to match in any of the qf fields.

Now add the txt version of field2 to the qf:

{{qf=field1_ws field2_ws field2_txt}}

{{http://localhost:8983/solr/test/select?defType=edismax&indent=true&mm=100%25&q.op=OR&q=XY%20GB&qf=field1_ws%20field2_ws%20field2_txt}}

Rerun the query and notice that ONLY document 1 is returned. This is because 
field2_txt expands synonyms, which leads to a different number of tokens from 
the _ws fields, which causes edismax to generate a field-centric query, which 
requires that the terms "XY" and "GB" must both match in _one_ of the provided 
qf fields. It is counterintuitive that expanding the range of the search to 
include more fields actually _reduces_ recall here, but not elsewhere:

Repeat this experiment with {{q=XY GC}}

In this case, notice that BOTH documents are returned for both versions of qf – 
there is no change when we add field2_txt to qf. That is because there is no 
synonym rule for GC, so even though _ws and _txt fields have "incompatible" 
analysis chains they happen to generate the same number of tokens for this 
particular query and edismax is able to stay with the term-centric approach.

In these experiments we have been assuming the default {{{}sow=false{}}}. If we 
set {{sow=true}} we would see that the term-centric approach is used throughout 
and there is no change in behavior when we add field2_txt to qf, whether we are 
searching for "XY GB" or "XY GC".

 

 

 

> eDismax should use startOffset when converting per-field to per-term queries
> 
>
> Key: SOLR-16594
> URL: https://issues.apache.org/jira/browse/SOLR-16594
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: query parsers
>Reporter: Rudi Seitz
>Priority: Major
>
> When parsing a multi-term query that spans multiple fields, edismax sometimes 
> switches from a "term-centric" to a "field-centric" approach. This creates 
> inconsistent semantics for the {{mm}} or "min should match" parameter and may 
> have an impact on scoring. The goal of this ticket is to improve the approach 
> that edismax uses for generating term-centric queries so that edismax would 
> less frequently "give up" and resort to the field-centric approach. 
> Specifically, we propose that edismax should create a dismax query for each 
> distinct startOffset found among the tokens emitted by the field analyzers. 
> Since the relevant code in edismax works with Query objects that contain 
> Terms, and since Terms do not hold the startOffset of the Token from which 
> Term was derived, some plumbing work would need to be done to make the 
> startOffsets available to edismax.
>  
> BACKGROUND:
>  
> If a user searches for "foo bar" with {{{}qf=f1 f2{}}}, a field-centric 
> interpretation of the query would contain a clause for each field:
> {{  (f1:foo f1:bar) (f2:foo f2:bar)}}
> while a term-centric interpretation would contain a clause for each term:
> {{  (f1:foo f2:foo) (f1:bar f2:bar)}}
> The challenge in generating a term-centric query is that we need to take the 
> tokens that emerge from each field's analysis chain and group them according 
> to the terms in the user's original query. However, the tokens that emerge 
> from an analysis chain do not store a reference to their corresponding input 
> terms. For example, if we pass "foo bar" through an ngram analyzer we would 
> get a token stream containing "f", "fo", "foo", "b", "ba", "bar". While it 
> may be obvious to a human that "f", "fo", and "foo" all come from the "foo" 
> input term, and that "b", "ba", and "bar" come from t

[GitHub] [solr-operator] tiimbz commented on issue #483: Servicemonitor for prometheus exporter is referring to cluster port instead of metrics pod port

2022-12-21 Thread GitBox



tiimbz commented on issue #483:
URL: https://github.com/apache/solr-operator/issues/483#issuecomment-1361368911

   We are having the same issue. The `prometheus.io/port` annotation is set to 
port `80`, which doesn't correspond with the port of the pod. This causes 
Prometheus to fail to scrape the service endpoint.
   
   We've also bypassed the problem by enabling scraping of the pods directly: 
   
   ```
 customKubeOptions:
   podOptions:
 annotations:
   prometheus.io/port: "8080"
   prometheus.io/path: /metrics
   prometheus.io/scrape: "true"
   prometheus.io/scheme: http
   ```
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For additional commands, e-mail: issues-h...@solr.apache.org

[jira] [Commented] (SOLR-16556) Solr stream expression: Implement Page Streaming Decorator to allow results to be displayed with pagination.

2022-12-21 Thread Maulin (Jira)



[ 
https://issues.apache.org/jira/browse/SOLR-16556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17650816#comment-17650816
 ] 

Maulin commented on SOLR-16556:
---

Hi [~jbernste] ,

Thanks for reviewing this and providing feedback.

Your understanding is correct about the start parameter. tuples will start 
flowing from the start param.

But line 229 has nothing to do with the start parameter. This for loop (line 
229 to 232) polls top rows records.

If you notice its using poll method (line 230) to poll records from "top" 
priority queue.


Here is the logic.

    /*   1. Read the stream and add N (rows+start) tuples into priority Queue.
     *   2. If new tuple from stream is greater than Tuple in 'top' priority 
Queue, replace tuple in priority Queue by new tuple.
     *   3. Add required (specified by rows param) into 'topList' Queue.
     */

Please let me know if you need more clarification on this.

Regards,

Maulin  

> Solr stream expression: Implement Page Streaming Decorator to allow results 
> to be displayed with pagination.
> 
>
> Key: SOLR-16556
> URL: https://issues.apache.org/jira/browse/SOLR-16556
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: streaming expressions
>Reporter: Maulin
>Priority: Major
>  Labels: Streamingexpression, decorator, paging
> Attachments: Page Decorator Performance Reading.xlsx
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Solr stream expression: Implement Page Streaming Decorator to allow results 
> to be displayed with pagination.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For additional commands, e-mail: issues-h...@solr.apache.org

[GitHub] [solr] bruno-roustant commented on pull request #1215: DocRouter: strengthen abstraction

2022-12-21 Thread GitBox



bruno-roustant commented on PR #1215:
URL: https://github.com/apache/solr/pull/1215#issuecomment-1361142463

   @noblepaul I don't fully understand your point RE _"Introducing a new type 
of Router needs more reviews"_.
   To me this PR does not add a new type of DocRouter, as it is centered around 
CompositeIdRouter. It makes it clearer that only CompositeIdRouter is supported 
for split operations in SplitOp, SolrIndexSplitter, MigrateCmd. I see no new 
code, only some refactoring to move key parsing logic inside CompositeIdRouter. 
Do I miss something?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For additional commands, e-mail: issues-h...@solr.apache.org

[GitHub] [solr] bruno-roustant commented on a diff in pull request #1215: DocRouter: strengthen abstraction

2022-12-21 Thread GitBox



bruno-roustant commented on code in PR #1215:
URL: https://github.com/apache/solr/pull/1215#discussion_r1054213209


##
solr/core/src/java/org/apache/solr/cloud/api/collections/MigrateCmd.java:
##
@@ -253,7 +252,7 @@ private void migrateKey(
 SHARD_ID_PROP,
 sourceSlice.getName(),
 "routeKey",
-SolrIndexSplitter.getRouteKey(splitKey) + "!",
+sourceRouter.getRouteKeyNoSuffix(splitKey) + "!",

Review Comment:
   Above, in the existing code, there is a cast to CompositeIdRouter. It is not 
clear to me if it is known that the router must be of type CompositeIdRouter. 
Should we add a check like checkRouterSupportsSplitKey()?



##
solr/core/src/java/org/apache/solr/update/SolrIndexSplitter.java:
##
@@ -765,18 +766,11 @@ static FixedBitSet[] split(
 return docSets;
   }
 
-  public static String getRouteKey(String idString) {
-int idx = idString.indexOf(CompositeIdRouter.SEPARATOR);
-if (idx <= 0) return null;
-String part1 = idString.substring(0, idx);
-int commaIdx = part1.indexOf(CompositeIdRouter.bitsSeparator);
-if (commaIdx > 0 && commaIdx + 1 < part1.length()) {
-  char ch = part1.charAt(commaIdx + 1);
-  if (ch >= '0' && ch <= '9') {
-part1 = part1.substring(0, commaIdx);
-  }
+  private static void checkRouterSupportsSplitKey(HashBasedRouter hashRouter, 
String splitKey) {

Review Comment:
   The expectation is much clearer with this method.
   I'm not familiar with the other DocRouter. Does this mean that split is not 
supported at all with other DocRouter types?



##
solr/core/src/java/org/apache/solr/handler/admin/SplitOp.java:
##
@@ -263,8 +263,9 @@ private void handleGetRanges(CoreAdminHandler.CallInfo it, 
String coreName) thro
 DocCollection collection = clusterState.getCollection(collectionName);
 String sliceName = 
parentCore.getCoreDescriptor().getCloudDescriptor().getShardId();
 Slice slice = collection.getSlice(sliceName);
-DocRouter router =
-collection.getRouter() != null ? collection.getRouter() : 
DocRouter.DEFAULT;
+CompositeIdRouter router =

Review Comment:
   As I understand, the router was 'expected' to be a CompositeIdRouter, even 
if not clear here, because of the code below manipulating the terms and 
expecting to find a CompositeIdRouter.SEPARATOR.
   It becomes clearer.
   Should we add a check like checkRouterSupportsSplitKey() with a clearer 
exception?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For additional commands, e-mail: issues-h...@solr.apache.org

[GitHub] [solr] alessandrobenedetti commented on pull request #1245: SOLR-16567: KnnQueryParser support for both pre-filters and post-filter

2022-12-21 Thread GitBox



alessandrobenedetti commented on PR #1245:
URL: https://github.com/apache/solr/pull/1245#issuecomment-1361107904

   > I'm looking closer at KnnQParser; ...
   
   David, your help has been invaluable!
   You are absolutely right and this is an oversight of mine when I did the 
original review of the pre-filtering work(this naming is quite common in the 
neural search community).
   
   I think that I have found the perfect place for this fix and it literally 
would require few lines of code, rather than the complicate methods that are in 
place now:
   
   org.apache.solr.search.QueryUtils#combineQueryAndFilter
   `
   ...
   } else if(scoreQuery instanceof KnnVectorQuery){
   (KnnVectorQuery)scoreQuery.setFilter(filterQuery);
 }
 else {
   return new BooleanQuery.Builder()
   .add(scoreQuery, Occur.MUST)
   .add(filterQuery, Occur.FILTER)
   .build();
 }
   `
   Basically when we combine the query with the filter, we manage the thing 
differently for KNN queries, and the filter(excluding post filters) is set in 
the Lucene KnnVectorQuery.
   
   So far so good, elegant and minimal code change, filters are processed ones 
, everyone is happy...
   BUT
   currently KnnVectorQuery in Lucene has no getters and setters and has all 
variable as final!!
   This is extremely annoying but we are where we are (to be honest, for a 
library class I would have gone with private variables with getters/setters 
since the beginning).
   
   Also, now Lucene is a separate project, so I basically should do the change 
in Lucene, then wait for a Lucene release, include it in Solr ect ect 
   
   So, long story short, my suggestion:
   1) we proceed with the current hack, renaming where necessary to make it 
nicer, but no massive change
   2) contribute the change Lucene side, probably removing final, and adding 
getters/setters or if the community disagree, just adding getters so that It's 
possible to create a new class from the input one
   3) once Lucene releases and we have the code in Solr, do the nice and clean 
implementation
   
   let me know what do you think!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For additional commands, e-mail: issues-h...@solr.apache.org

[GitHub] [solr] bszabo97 commented on pull request #1196: SOLR-11029 Create a v2 API equivalent for DELETENODE API

2022-12-21 Thread GitBox



bszabo97 commented on PR #1196:
URL: https://github.com/apache/solr/pull/1196#issuecomment-1361092894

   Hello @gerlowskija 
   
   Thanks for the heads up and for the great description of what should be 
changed in the tests. I have added a commit which changes the test and 
implementation according to your suggestions.
   If you think there is anything around the performance blocker in which I can 
help with I am more than happy to do so! 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For additional commands, e-mail: issues-h...@solr.apache.org

53 matches

Mail list logo