aruggero opened a new pull request, #1257: URL: https://github.com/apache/solr/pull/1257
https://issues.apache.org/jira/browse/SOLR-16596

# Description

In some scenarios, a null value for a feature has a different meaning than a zero value. Some models are trained to take this distinction into account (e.g. https://xgboost.readthedocs.io/en/stable/faq.html#how-to-deal-with-missing-values). This contribution adds the ability to differentiate how _MultipleAdditiveTrees_ models handle these two feature values. With the default configuration, a null value and a zero value have the same meaning.

# Solution

An additional "_missing_" branch parameter has been introduced to differentiate the model behavior. It defines the branch to follow when the corresponding feature value is null.

To manage null values, the "_myFeatures.json_" file needs to be modified: a "_defaultValue_" parameter with a "_NaN_" value needs to be added to each feature that can assume a null value.

The model configuration also needs two additional parameters:

- "_isNullSameAsZero_" needs to be defined in the model "_params_" and set to "_false_";
- the "_missing_" parameter needs to be added to each branch where the corresponding feature supports null values. Its value must be either "_left_" or "_right_".

_solr/modules/ltr/src/java/org/apache/solr/ltr/model/MultipleAdditiveTreesModel.java_ has been modified. The _isNullSameAsZero_ variable has been introduced to declare that zeros should be differentiated from nulls. The _missing_ branch has then been added to the tree to define the direction to take when dealing with null values.

# Tests

A new _multipleadditivetreesmodel_features_with_missing_branch.json_ file and two additional _MultipleAdditiveTreesModels_ files (_multipleadditivetreesmodel_with_missing_branch.json_ and _multipleadditivetreesmodel_with_missing_branch_for_interleaving.json_) have been added to test the new capability.
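To illustrate the branch selection described in the Solution section, here is a minimal, self-contained Java sketch. It is not the code from _MultipleAdditiveTreesModel.java_: the class `MissingBranchSketch`, its `Node` type, the `score` method, the feature name `originalScore`, and the threshold and leaf values are all hypothetical. It also assumes that a feature value less than or equal to the threshold goes left, and that a split without a "_missing_" setting falls back to treating null as zero.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch only; this is not the code from MultipleAdditiveTreesModel.java. */
public class MissingBranchSketch {

  /** A tree node: a leaf when both children are null, otherwise a split. */
  static final class Node {
    final String feature;   // feature tested at a split, null for a leaf
    final float threshold;  // split threshold, unused for a leaf
    final String missing;   // "left" or "right": branch to follow for a null feature
    final Node left;
    final Node right;
    final float value;      // leaf output, unused for a split

    private Node(String feature, float threshold, String missing,
                 Node left, Node right, float value) {
      this.feature = feature;
      this.threshold = threshold;
      this.missing = missing;
      this.left = left;
      this.right = right;
      this.value = value;
    }

    static Node leaf(float value) {
      return new Node(null, 0f, null, null, null, value);
    }

    static Node split(String feature, float threshold, String missing, Node left, Node right) {
      return new Node(feature, threshold, missing, left, right, 0f);
    }

    boolean isLeaf() {
      return left == null && right == null;
    }
  }

  /** Walks the tree; a feature absent from the map is treated as null (NaN). */
  static float score(Node root, Map<String, Float> features, boolean isNullSameAsZero) {
    Node node = root;
    while (!node.isLeaf()) {
      Float raw = features.get(node.feature);
      float v = (raw == null) ? Float.NaN : raw;
      if (Float.isNaN(v)) {
        if (!isNullSameAsZero && node.missing != null) {
          // Null is distinct from zero: follow the configured "missing" branch.
          node = "left".equals(node.missing) ? node.left : node.right;
          continue;
        }
        v = 0f; // Default behavior (and this sketch's fallback): null means zero.
      }
      // Assumption of this sketch: values <= threshold go left.
      node = (v <= node.threshold) ? node.left : node.right;
    }
    return node.value;
  }

  public static void main(String[] args) {
    Node tree = Node.split("originalScore", 10.0f, "right", Node.leaf(50f), Node.leaf(75f));
    Map<String, Float> noValue = new HashMap<>();
    Map<String, Float> zero = new HashMap<>();
    zero.put("originalScore", 0f);
    System.out.println(score(tree, noValue, false)); // null follows "missing": right leaf
    System.out.println(score(tree, noValue, true));  // null handled as zero: left leaf
    System.out.println(score(tree, zero, false));    // explicit zero: left leaf
  }
}
```

In this sketch, the null feature reaches a different leaf than the zero feature only when `isNullSameAsZero` is `false` and the split declares a "_missing_" branch, which mirrors the two configuration parameters listed above.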
A new test has been added in _solr/modules/ltr/src/test/org/apache/solr/ltr/model/TestMultipleAdditiveTreesModel.java_ to test the new behavior with null values:

- _testMultipleAdditiveTreesWithNulls()_

Additional tests have also been added to check the sparse/dense format behavior when dealing with null values. In _solr/modules/ltr/src/test/org/apache/solr/ltr/response/transform/TestFeatureLoggerTransformer.java_ the new tests are:

- _featureTransformer_shouldWorkInSparseFormat_withNulls()_
- _featureTransformer_shouldWorkInDenseFormat_withNulls()_
- _interleaving_featureTransformer_shouldWorkInSparseFormat_withNulls()_
- _interleaving_featureTransformer_shouldWorkInDenseFormat_withNulls()_

For features with a default value of NaN, zero values should also appear in the sparse format, since zero is no longer the default value for those features.

# Checklist

- [X] I have reviewed the guidelines for [How to Contribute](https://wiki.apache.org/solr/HowToContribute) and my code conforms to the standards described there to the best of my ability.
- [X] I have created a Jira issue and added the issue ID to my pull request title.
- [X] I have given Solr maintainers [access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork) to contribute to my PR branch. (optional but recommended)
- [X] I have developed this patch against the `main` branch.
- [X] I have run `./gradlew check`.
- [X] I have added tests for my changes.
- [X] I have added documentation for the [Reference Guide](https://github.com/apache/solr/tree/main/solr/solr-ref-guide)

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.