[GitHub] [solr] cpoerschke commented on a change in pull request #166: SOLR-15450 add PrefetchingFieldValueFeature

GitBox Mon, 14 Jun 2021 01:08:54 -0700


cpoerschke commented on a change in pull request #166:
URL: https://github.com/apache/solr/pull/166#discussion_r650443475




##########
File path: 
solr/contrib/ltr/src/java/org/apache/solr/ltr/feature/PrefetchingFieldValueFeature.java
##########
@@ -0,0 +1,149 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.solr.ltr.feature;
+
+import com.google.common.annotations.VisibleForTesting;
+import org.apache.lucene.document.Document;
+import org.apache.lucene.index.LeafReaderContext;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.search.IndexSearcher;
+import org.apache.lucene.search.Query;
+import org.apache.solr.request.SolrQueryRequest;
+import org.apache.solr.schema.SchemaField;
+import org.apache.solr.search.SolrDocumentFetcher;
+import org.apache.solr.search.SolrIndexSearcher;
+
+import java.io.IOException;
+import java.util.LinkedHashMap;
+import java.util.Map;
+import java.util.Set;
+
+/**
+ * This feature returns the value of a field in the current document.
+ * The field must have stored="true" or docValues="true" properties.
+ * Example configuration:
+ * <pre>{
+  "name":  "rawHits",
+  "class": "org.apache.solr.ltr.feature.FieldValueFeature",
+  "params": {
+      "field": "hits"
+  }
+}</pre>
+ */
+public class PrefetchingFieldValueFeature extends FieldValueFeature {
+  // used to store all fields from all PrefetchingFieldValueFeatures
+  private Set<String> prefetchFields;
+
+  public void setField(String field) {
+    this.field = field;
+  }
+
+  public void setPrefetchFields(Set<String> fields) {
+    prefetchFields = fields;
+  }
+
+  @Override
+  public LinkedHashMap<String,Object> paramsToMap() {
+    final LinkedHashMap<String,Object> params = defaultParamsToMap();
+    params.put("field", field);
+    return params;
+  }
+
+  @Override
+  protected void validate() throws FeatureException {
+    if (field == null || field.isEmpty()) {
+      throw new FeatureException(getClass().getSimpleName()+
+          ": field must be provided");
+    }
+  }
+
+  public PrefetchingFieldValueFeature(String name, Map<String,Object> params) {
+    super(name, params);
+  }
+
+  @Override
+  public FeatureWeight createWeight(IndexSearcher searcher, boolean 
needsScores,
+      SolrQueryRequest request, Query originalQuery, Map<String,String[]> efi)
+          throws IOException {
+    return new PrefetchingFieldValueFeatureWeight(searcher, request, 
originalQuery, efi);
+  }
+
+  @VisibleForTesting
+  public Set<String> getPrefetchFields(){
+    return prefetchFields;
+  }
+
+  public class PrefetchingFieldValueFeatureWeight extends 
FieldValueFeatureWeight {
+    private final SchemaField schemaField;
+    private final SolrDocumentFetcher docFetcher;
+
+    public PrefetchingFieldValueFeatureWeight(IndexSearcher searcher,
+        SolrQueryRequest request, Query originalQuery, Map<String,String[]> 
efi) {
+      super(searcher, request, originalQuery, efi);
+      if (searcher instanceof SolrIndexSearcher) {
+        schemaField = ((SolrIndexSearcher) 
searcher).getSchema().getFieldOrNull(field);
+      } else {
+        schemaField = null;
+      }
+      this.docFetcher = request.getSearcher().getDocFetcher();
+    }
+
+    /**
+     * Return a FeatureScorer that work with stored fields and makes use of 
the cache if the configured field is stored
+     * and has no docValues.
+     * Otherwise, delegate the work to the FieldValueFeature.
+     *
+     * @param context the segment this FeatureScorer is working with
+     * @return FeatureScorer for the current segment and field
+     * @throws IOException as defined by abstract class Feature
+     */
+    @Override
+    public FeatureScorer scorer(LeafReaderContext context) throws IOException {
+      if (schemaField != null && !schemaField.stored() && 
schemaField.hasDocValues()) {
+        return super.scorer(context);
+      }
+      return new PrefetchingFieldValueFeatureScorer(this, context,
+          DocIdSetIterator.all(DocIdSetIterator.NO_MORE_DOCS));
+    }
+
+    /**
+     * A FeatureScorer that reads the stored value for a field
+     * docFetcher does not request a single field but all the prefetchFields 
to improve performance through caching
+     */
+    public class PrefetchingFieldValueFeatureScorer extends 
FieldValueFeatureScorer {
+
+      public PrefetchingFieldValueFeatureScorer(FeatureWeight weight,
+          LeafReaderContext context, DocIdSetIterator itr) {
+        super(weight, context, itr);
+      }
+
+      @Override
+      public float score() throws IOException {
+        try {
+          final Document document = docFetcher.doc(context.docBase + 
itr.docID(), prefetchFields);
+
+          return super.parseStoredFieldValue(document.getField(field));
+        } catch (final IOException e) {
+          throw new FeatureException(
+              e.toString() + ": " +
+                  "Unable to extract feature for "
+                  + name, e);

Review comment:
       Hi @tomglk! I really like your idea of including the prefetchFields in 
the stored representation that one gets when downloading from the feature store 
or when inspecting in ZooKeeper.
   
   In 
https://github.com/cpoerschke/solr/commit/df2864a95e1a83ee446781042fe119981551a932
 I explored splitting PrefetchingFieldValueFeature.prefetchFields into three 
(configured, computed, effective) and 'self' not being part of the 
params-to-map prefetch fields. But I am not yet 100% convinced by that 
splitting approach and the commit also includes some small others bits mixed 
in. What do you think?
   
   On your (not) overriding `equals` point, that's a good one. I think it's 
okay to map `null` to `empty set` in `paramsToMap` like you did.

##########
File path: 
solr/contrib/ltr/src/java/org/apache/solr/ltr/feature/PrefetchingFieldValueFeature.java
##########
@@ -0,0 +1,134 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.solr.ltr.feature;
+
+import com.google.common.annotations.VisibleForTesting;
+import org.apache.lucene.document.Document;
+import org.apache.lucene.index.LeafReaderContext;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.search.IndexSearcher;
+import org.apache.lucene.search.Query;
+import org.apache.solr.common.util.StrUtils;
+import org.apache.solr.request.SolrQueryRequest;
+import org.apache.solr.search.SolrDocumentFetcher;
+
+import java.io.IOException;
+import java.util.Map;
+import java.util.Set;
+
+/**
+ * This feature returns the value of a field in the current document.
+ * The field must have stored="true" or docValues="true" properties.
+ * Example configuration:
+ * <pre>{
+  "name":  "rawHits",
+  "class": "org.apache.solr.ltr.feature.PrefetchingFieldValueFeature",
+  "params": {
+      "field": "hits"
+  }
+}</pre>
+ */
+public class PrefetchingFieldValueFeature extends FieldValueFeature {
+  // used to store all fields from all PrefetchingFieldValueFeatures
+  private Set<String> prefetchFields;
+  // can be used for debugging to only fetch the field this features uses
+  public static final String DISABLE_PREFETCHING_FIELD_VALUE_FEATURE = 
"disablePrefetchingFieldValueFeature";
+
+  public void setPrefetchFields(Set<String> fields) {
+    prefetchFields = fields;
+  }
+
+  @VisibleForTesting
+  public Set<String> getPrefetchFields(){
+    return prefetchFields;
+  }
+
+  public PrefetchingFieldValueFeature(String name, Map<String,Object> params) {
+    super(name, params);
+  }
+
+  @Override
+  public FeatureWeight createWeight(IndexSearcher searcher, boolean 
needsScores,
+      SolrQueryRequest request, Query originalQuery, Map<String,String[]> efi)
+          throws IOException {
+    return new PrefetchingFieldValueFeatureWeight(searcher, request, 
originalQuery, efi);
+  }
+
+  public class PrefetchingFieldValueFeatureWeight extends 
FieldValueFeatureWeight {
+    private final SolrDocumentFetcher docFetcher;
+    private final Boolean disablePrefetching;
+
+    public PrefetchingFieldValueFeatureWeight(IndexSearcher searcher,
+        SolrQueryRequest request, Query originalQuery, Map<String,String[]> 
efi) {
+      super(searcher, request, originalQuery, efi);
+
+      disablePrefetching = 
request.getParams().getBool(DISABLE_PREFETCHING_FIELD_VALUE_FEATURE, false);

Review comment:
       Hi @tomglk,
   
   All good points here re: the benefits of a disable prefetching mechanism and 
the disadvantages of try-catch-retry logic. Reading your arguments here (thanks 
for the details and structure!) and the configured/computed/effective work on 
https://github.com/cpoerschke/solr/commit/df2864a95e1a83ee446781042fe119981551a932
 generated some more insights and ideas, I will try to organise them into 
logical chunks below but in short: I agree and can see now that and how a 
prefetch disabling mechanism will be useful.
   
   > ... Your intention is something like this (in a subclass of PFVF) ...
   
   Yes, "in a subclass of PFVF" was the intention and then the details of the 
subclass could suit the particular use case i.e. could be boolean or could be 
non-boolean in some cases. But (spoiler alert) re: the next 'logical chunk' 
updates below, it seems to me now that there is a general (non-subclass) case 
for a non-boolean parameter.

##########
File path: 
solr/contrib/ltr/src/java/org/apache/solr/ltr/feature/PrefetchingFieldValueFeature.java
##########
@@ -0,0 +1,134 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.solr.ltr.feature;
+
+import com.google.common.annotations.VisibleForTesting;
+import org.apache.lucene.document.Document;
+import org.apache.lucene.index.LeafReaderContext;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.search.IndexSearcher;
+import org.apache.lucene.search.Query;
+import org.apache.solr.common.util.StrUtils;
+import org.apache.solr.request.SolrQueryRequest;
+import org.apache.solr.search.SolrDocumentFetcher;
+
+import java.io.IOException;
+import java.util.Map;
+import java.util.Set;
+
+/**
+ * This feature returns the value of a field in the current document.
+ * The field must have stored="true" or docValues="true" properties.
+ * Example configuration:
+ * <pre>{
+  "name":  "rawHits",
+  "class": "org.apache.solr.ltr.feature.PrefetchingFieldValueFeature",
+  "params": {
+      "field": "hits"
+  }
+}</pre>
+ */
+public class PrefetchingFieldValueFeature extends FieldValueFeature {
+  // used to store all fields from all PrefetchingFieldValueFeatures
+  private Set<String> prefetchFields;
+  // can be used for debugging to only fetch the field this features uses
+  public static final String DISABLE_PREFETCHING_FIELD_VALUE_FEATURE = 
"disablePrefetchingFieldValueFeature";
+
+  public void setPrefetchFields(Set<String> fields) {
+    prefetchFields = fields;
+  }
+
+  @VisibleForTesting
+  public Set<String> getPrefetchFields(){
+    return prefetchFields;
+  }
+
+  public PrefetchingFieldValueFeature(String name, Map<String,Object> params) {
+    super(name, params);
+  }
+
+  @Override
+  public FeatureWeight createWeight(IndexSearcher searcher, boolean 
needsScores,
+      SolrQueryRequest request, Query originalQuery, Map<String,String[]> efi)
+          throws IOException {
+    return new PrefetchingFieldValueFeatureWeight(searcher, request, 
originalQuery, efi);
+  }
+
+  public class PrefetchingFieldValueFeatureWeight extends 
FieldValueFeatureWeight {
+    private final SolrDocumentFetcher docFetcher;
+    private final Boolean disablePrefetching;
+
+    public PrefetchingFieldValueFeatureWeight(IndexSearcher searcher,
+        SolrQueryRequest request, Query originalQuery, Map<String,String[]> 
efi) {
+      super(searcher, request, originalQuery, efi);
+
+      disablePrefetching = 
request.getParams().getBool(DISABLE_PREFETCHING_FIELD_VALUE_FEATURE, false);

Review comment:
       Here's the 'logical chunks' mentioned above:
   
   * index corruption is (hopefully) a rare source of errors, and in such a 
scenario rethrowing the exception with the details would seem most useful.
   
   * missing-ness is a probably more likely potential source of errors: how 
would the document fetcher behave if one of the fields does not exist in the 
schema or it exists but is not stored and not docValues? or if the field exists 
but is optional and the document being scored has no value for the field? 
perhaps the test coverage already has the answers to this, i haven't looked, 
just thinking out aloud here.
   
   * next then moving on to potential sources of missing-ness:
     * `[effective]prefetchFields` is computed from the 
PrefetchingFieldValueFeature itself and its fellow PrefetchingFieldValueFeature 
in the same feature store
     * _if_ a feature is removed from the store and _if_ that did not propagate 
the new joint prefetchFields to the features remaining in the store then that 
would potentially be inefficient i.e. a field is unnecessarily prefetched but 
it would not be a problem as such since the prefetch is for the underlying 
field rather than the removed feature
       * aside: from what i recall we don't support removal of features from a 
store
       * aside: re-computation of the joint prefetchFields happens when a 
feature is uploaded to a feature store, the upload could be for an established 
actively used feature store. concurrent read-use of 
`PrefetchingFieldValueFeature.prefetchFields` is not a problem but one thread 
reading and another updating seems potentially problematic
   
   * when a feature is added to the store, at that point missing-ness could 
arise e.g. due to a typo in the field name or due to destination mistakes i.e. 
upload to storeA for collectionA was intended but upload to storeB for 
collectionB happened
     * if/since removal of features is not supported there is no obviously 
quick remedy here (probably deleting the feature store and re-uploading it with 
only the "good" features should work though I'm not entirely sure how any 
models using the store's "good" features would behave in that scenario, it 
probably would work whilst the models are loaded in memory but would not work 
across a reload or restart)
     * if a PrefetchingFieldValueFeature with a field named "bad1" was uploaded 
then operationally a precise "oops, sorry, please ignore that 'bad1' field" 
mechanism would be more targetted than a broader "ouch, let's temporarily turn 
off all prefetching" mechanism
     * the exception thrown could guide towards a (say) 
`ltr.PrefetchingFieldValueFeature.exclusions` string parameter (instead of the 
`disablePrefetchingFieldValueFeature` boolean parameter) and iteratively the 
user can then reduce 
`ltr.PrefetchingFieldValueFeature.exclusions=bad1,bar,foo,good1` down to 
`ltr.PrefetchingFieldValueFeature.exclusions=bad1` meaning that the features 
using the `bar,foo,good1` fields continue to benefit from prefetching
       * aside: when shrinking the `[effective]prefetchFields` based on 
`ltr.PrefetchingFieldValueFeature.exclusions=bad1,bar,foo,good1` the 
implementation should not exclude itself e.g. features using the `good1` field 
must disregard the `good1` element on the 
`ltr.PrefetchingFieldValueFeature.exclusions` list
   
   * Might 
`ltr.PrefetchingFieldValueFeature.exclusions=aaa,bbb,ccc,ddd,eee,fff` have 
other uses too?
     * The effects of a feature store update could be somewhere in between 
'good' and 'bad' e.g.
       * `model1` uses features `aaa,bbb,ccc`
       * we add `fl=[fv]` to extract feature values for old and new features in 
order to build `model2`
       * `fl=[fv]` was previously fine but with the new features latency is 
more problematic
       * we complete feature extraction and stop using `fl=[fv]`
       * `model1` does not use features `ddd,eee,fff` but since all features 
are in the same store `model1`'s features will prefetch fields for 
`ddd,eee,fff` even if `fl=[fv]` is not requested
         * `ltr.PrefetchingFieldValueFeature.exclusions=ddd,eee,fff` can 
eliminate the prefetch
       * we deploy `model2` and if it only uses the `ddd,eee,fff` features then 
`ltr.PrefetchingFieldValueFeature.exclusions=aaa,bbb,ccc` can eliminate 
unnecessary prefetching of the old features' fields
   
   (I don't know if practically additional prefetch fields could work out as 
costly enough to justify the operational efforts of the outlined 
`ltr.PrefetchingFieldValueFeature.exclusions` parameter use but in principle 
the parameter could be used in that way.)
   
   ----
   
   Bit of a long write-up that, sorry, did it sort of get the ideas across or 
maybe some bits could be clarified somehow, what do you think?
   
   Oh and of course `ltr.PrefetchingFieldValueFeature.exclusions` is just a 
tentative name i.e. we can call it something else also and 
`ltr.PrefetchingFieldValueFeature.exclusions=*` equivalent to 
`disablePrefetchingFieldValueFeature=true` may also be convenient.

##########
File path: solr/contrib/ltr/src/java/org/apache/solr/ltr/LTRScoringQuery.java
##########
@@ -223,6 +223,7 @@ public ModelWeight createWeight(IndexSearcher searcher, 
ScoreMode scoreMode, flo
     else{
       createWeightsParallel(searcher, scoreMode.needsScores(), featureWeights, 
features);
     }
+    // TODO: potential prefetchFields computation site?

Review comment:
       Hi @tomglk. I had another idea! This pull request is fun :-)
   
   The idea unexpectedly appeared triggered by the earlier insights:
   
   > ... re-computation of the joint prefetchFields ... could be for an 
established actively used feature store. concurrent ...
   
   > ... Might 
`ltr.PrefetchingFieldValueFeature.exclusions=aaa,bbb,ccc,ddd,eee,fff` have 
other uses too? ...
   
   > ... `disablePrefetchingFieldValueFeature` ... 
`ltr.PrefetchingFieldValueFeature.exclusions` ...
   
   ManagedFeatureStore computes the joint `prefetchFields` approach:
   * (+) the feature store content provides visibility into the feature store's 
overall prefetch fields set
   * (+/-) concurrency needs to be considered when updating the prefetch fields 
of actively used features
   * (-) the prefetch fields used may be 'too wide' i.e. all features in the 
store vs. all features used by the model
   
   LTRScoringQuery computes the joint `prefetchFields` approach -- the new idea:
   * (-) or (+/-) less or different visibility into the overall prefetch fields 
set (no visibility via the feature store but still visibility when an exception 
is thrown)
   * (+) no concurrency concerns since the joint prefetch fields set is 
propagated to not-yet-used PrefetchingFieldValueFeatureWeight objects
   * (+) the used prefetch fields used is 'precise' i.e. all features in the 
store vs. all features used by the model
   
   pseudo code:
   
   ```
   for (FeatureWeight fw : featureWeights) {
     if (fw instanceof PrefetchingFieldValueFeatureWeight) {
       PrefetchingFieldValueFeatureWeight pfvfw = 
(PrefetchingFieldValueFeatureWeight)fw;
       prefetchingFeatureWeights.add(pfvfw);
       prefetchFields.add(pfvfw.getField());
     }
   }
   for (PrefetchingFieldValueFeatureWeight pfvfw : prefetchingFeatureWeights) {
     prefetchFields.addPrefetchFields(prefetchFields);
   }
   ```
   
   What do you think?

##########
File path: solr/contrib/ltr/src/java/org/apache/solr/ltr/LTRScoringQuery.java
##########
@@ -223,6 +223,7 @@ public ModelWeight createWeight(IndexSearcher searcher, 
ScoreMode scoreMode, flo
     else{
       createWeightsParallel(searcher, scoreMode.needsScores(), featureWeights, 
features);
     }
+    // TODO: potential prefetchFields computation site?

Review comment:
       Hi @tomglk. I had another idea! This pull request is fun :-)
   
   The idea unexpectedly appeared triggered by the earlier insights:
   
   > ... re-computation of the joint prefetchFields ... could be for an 
established actively used feature store. concurrent ...
   
   > ... Might 
`ltr.PrefetchingFieldValueFeature.exclusions=aaa,bbb,ccc,ddd,eee,fff` have 
other uses too? ...
   
   > ... `disablePrefetchingFieldValueFeature` ... 
`ltr.PrefetchingFieldValueFeature.exclusions` ...
   
   ManagedFeatureStore computes the joint `prefetchFields` approach:
   * (+) the feature store content provides visibility into the feature store's 
overall prefetch fields set
   * (+/-) concurrency needs to be considered when updating the prefetch fields 
of actively used features
   * (-) the prefetch fields used may be 'too wide' i.e. all features in the 
store vs. all features used by the model
   
   LTRScoringQuery computes the joint `prefetchFields` approach -- the new idea:
   * (-) or (+/-) less or different visibility into the overall prefetch fields 
set (no visibility via the feature store but still visibility when an exception 
is thrown)
   * (+) no concurrency concerns since the joint prefetch fields set is 
propagated to not-yet-used PrefetchingFieldValueFeatureWeight objects
   * (+) the used prefetch fields used is 'precise' i.e. all features in the 
store vs. all features used by the model
   
   pseudo code:
   
   ```
   for (FeatureWeight fw : featureWeights) {
     if (fw instanceof PrefetchingFieldValueFeatureWeight) {
       PrefetchingFieldValueFeatureWeight pfvfw = 
(PrefetchingFieldValueFeatureWeight)fw;
       prefetchingFeatureWeights.add(pfvfw);
       prefetchFields.add(pfvfw.getField());
     }
   }
   for (PrefetchingFieldValueFeatureWeight pfvfw : prefetchingFeatureWeights) {
     pfvfw.addPrefetchFields(prefetchFields);
   }
   ```
   
   What do you think?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For additional commands, e-mail: issues-h...@solr.apache.org

[GitHub] [solr] cpoerschke commented on a change in pull request #166: SOLR-15450 add PrefetchingFieldValueFeature

Reply via email to