This is an automated email from the ASF dual-hosted git repository.
epugh pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/solr.git
The following commit(s) were added to refs/heads/main by this push:
new 23afb28dd9a SOLR-17023: Add documentation and tutorial to new ONNX
model feature (#3663)
23afb28dd9a is described below
commit 23afb28dd9af139dbaba99cc9c17827aebf848a5
Author: Eric Pugh <[email protected]>
AuthorDate: Thu Sep 18 13:22:33 2025 -0400
SOLR-17023: Add documentation and tutorial to new ONNX model feature (#3663)
* Add a tutorial on this new capability.
* Document new feature
---
solr/CHANGES.txt | 2 +
solr/modules/analysis-extras/README.md | 3 +-
.../DocumentCategorizerUpdateProcessorFactory.java | 81 +++-
...ExtractNamedEntitiesUpdateProcessorFactory.java | 29 +-
.../pages/update-request-processors.adoc | 6 +
.../getting-started/getting-started-nav.adoc | 1 +
.../getting-started/pages/solr-tutorial.adoc | 2 +-
.../getting-started/pages/tutorial-opennlp.adoc | 463 +++++++++++++++++++++
.../pages/major-changes-in-solr-10.adoc | 8 +
9 files changed, 556 insertions(+), 39 deletions(-)
diff --git a/solr/CHANGES.txt b/solr/CHANGES.txt
index 6f7621fcad3..ed4924a39ea 100644
--- a/solr/CHANGES.txt
+++ b/solr/CHANGES.txt
@@ -21,6 +21,8 @@ New Features
* SOLR-17780: Add support for scalar quantized dense vectors (Kevin Liang via
Alessandro Benedetti)
+* SOLR-17023: Use Modern NLP Models from Apache OpenNLP with Solr (Jeff
Zemerick, Eric Pugh)
+
Improvements
---------------------
diff --git a/solr/modules/analysis-extras/README.md
b/solr/modules/analysis-extras/README.md
index b30afef5396..44aee8d1c92 100644
--- a/solr/modules/analysis-extras/README.md
+++ b/solr/modules/analysis-extras/README.md
@@ -24,7 +24,8 @@ upon large dependencies/dictionaries.
It includes integration with ICU for multilingual support,
analyzers for Chinese and Polish, and integration with
OpenNLP for multilingual tokenization, part-of-speech tagging
-lemmatization, phrase chunking, and named-entity recognition.
+lemmatization, phrase chunking, and named-entity recognition,
+including the ability to run models sourced from Hugging Face.
Each of the jars below relies upon including
`/modules/analysis-extras/lib/solr-analysis-extras-X.Y.Z.jar`
in the `solrconfig.xml`
diff --git
a/solr/modules/analysis-extras/src/java/org/apache/solr/update/processor/DocumentCategorizerUpdateProcessorFactory.java
b/solr/modules/analysis-extras/src/java/org/apache/solr/update/processor/DocumentCategorizerUpdateProcessorFactory.java
index 090964d3178..27e673a8423 100644
---
a/solr/modules/analysis-extras/src/java/org/apache/solr/update/processor/DocumentCategorizerUpdateProcessorFactory.java
+++
b/solr/modules/analysis-extras/src/java/org/apache/solr/update/processor/DocumentCategorizerUpdateProcessorFactory.java
@@ -54,6 +54,55 @@ import org.apache.solr.util.plugin.SolrCoreAware;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
+/**
+ * Classifies the text found in any matching <code>source</code> field using an OpenNLP
+ * model (<code>modelFile</code>) and stores the resulting classification in a configured
+ * <code>dest</code> field.
+ *
+ * <p>See the <a
+ *
href="https://solr.apache.org/guide/solr/latest/getting-started/tutorial-opennlp.html">Tutorial</a>
+ * for the step-by-step guide.
+ *
+ * <p>The <code>source</code> field(s) can be configured as either:
+ *
+ * <ul>
+ * <li>One or more <code><str></code>
+ * <li>An <code><arr></code> of <code><str></code>
+ * <li>A <code><lst></code> containing {@link
FieldMutatingUpdateProcessor
+ * FieldMutatingUpdateProcessorFactory style selector arguments}
+ * </ul>
+ *
+ * <p>The <code>dest</code> field can be a single <code><str></code>
containing the literal
+ * name of a destination field, or it may be a <code><lst></code>
specifying a regex <code>
+ * pattern</code> and a <code>replacement</code> string. If the pattern +
replacement option is used
+ * the pattern will be matched against all fields matched by the source
selector, and the
+ * replacement string (including any capture groups specified from the pattern) will be evaluated
+ * using {@link Matcher#replaceAll(String)} to generate the literal name of the destination field.
+ *
+ * <p>If the resolved <code>dest</code> field already exists in the document, then the
+ * classifications generated from the <code>source</code> fields will be added to it.
+ *
+ * <p>In the example below:
+ *
+ * <ul>
+ * <li>Classification will be performed on the <code>text</code> field and
added to the <code>
+ * text_sentiment</code> field
+ * </ul>
+ *
+ * <pre class="prettyprint">
+ * <updateRequestProcessorChain name="sentimentClassifier">
+ * <processor
class="solr.processor.DocumentCategorizerUpdateProcessorFactory">
+ * <str name="modelFile">models/sentiment/model.onnx</str>
+ * <str name="vocabFile">models/sentiment/vocab.txt</str>
+ * <str name="source">text</str>
+ * <str name="dest">text_sentiment</str>
+ * </processor>
+ * <processor class="solr.LogUpdateProcessorFactory" />
+ * <processor class="solr.RunUpdateProcessorFactory" />
+ * </updateRequestProcessorChain>
+ * </pre>
+ *
+ * @since 10.0.0
+ */
public class DocumentCategorizerUpdateProcessorFactory extends
UpdateRequestProcessorFactory
implements SolrCoreAware {
@@ -69,16 +118,15 @@ public class DocumentCategorizerUpdateProcessorFactory
extends UpdateRequestProc
private Path solrHome;
private SelectorParams srcInclusions = new SelectorParams();
- private Collection<SelectorParams> srcExclusions = new ArrayList<>();
+ private final Collection<SelectorParams> srcExclusions = new ArrayList<>();
private FieldNameSelector srcSelector = null;
private String model = null;
private String vocab = null;
- private String analyzerFieldType = null;
/**
- * If pattern is null, this this is a literal field name. If pattern is
non-null then this is a
+ * If pattern is null, then this is a literal field name. If pattern is
non-null then this is a
* replacement string that may contain meta-characters (ie: capture group
identifiers)
*
* @see #pattern
@@ -277,11 +325,10 @@ public class DocumentCategorizerUpdateProcessorFactory
extends UpdateRequestProc
throw new SolrException(
SERVER_ERROR, "Init param '" + SOURCE_PARAM + "' child
'exclude' can not be null");
}
- if (!(excObj instanceof NamedList)) {
+ if (!(excObj instanceof NamedList<?> exc)) {
throw new SolrException(
SERVER_ERROR, "Init param '" + SOURCE_PARAM + "' child
'exclude' must be <lst/>");
}
- NamedList<?> exc = (NamedList<?>) excObj;
srcExclusions.add(parseSelectorParams(exc));
if (0 < exc.size()) {
throw new SolrException(
@@ -328,8 +375,7 @@ public class DocumentCategorizerUpdateProcessorFactory
extends UpdateRequestProc
+ "for OpenNLPExtractNamedEntitiesUpdateProcessor for further
details.");
}
- if (d instanceof NamedList) {
- NamedList<?> destList = (NamedList<?>) d;
+ if (d instanceof NamedList<?> destList) {
Object patt = destList.remove(PATTERN_PARAM);
Object replacement = destList.remove(REPLACEMENT_PARAM);
@@ -450,9 +496,7 @@ public class DocumentCategorizerUpdateProcessorFactory
extends UpdateRequestProc
getCategories(),
new AverageClassificationScoringStrategy(),
new InferenceOptions());
- } catch (IOException e) {
- log.warn("Attempted to initialize documentCategorizerDL", e);
- } catch (OrtException e) {
+ } catch (IOException | OrtException e) {
log.warn("Attempted to initialize documentCategorizerDL", e);
}
}
@@ -490,16 +534,15 @@ public class DocumentCategorizerUpdateProcessorFactory
extends UpdateRequestProc
for (Object val : srcFieldValues) {
for (Pair<String, String> entity : classify(val)) {
- SolrInputField destField = null;
+ SolrInputField destField;
// String classification = entity.first();
String classificationValue = entity.second();
- final String resolved = resolvedDest;
- if (doc.containsKey(resolved)) {
- destField = doc.getField(resolved);
+ if (doc.containsKey(resolvedDest)) {
+ destField = doc.getField(resolvedDest);
} else {
- SolrInputField targetField = destMap.get(resolved);
+ SolrInputField targetField = destMap.get(resolvedDest);
if (targetField == null) {
- destField = new SolrInputField(resolved);
+ destField = new SolrInputField(resolvedDest);
} else {
destField = targetField;
}
@@ -507,14 +550,12 @@ public class DocumentCategorizerUpdateProcessorFactory
extends UpdateRequestProc
destField.addValue(classificationValue);
// put it in map to avoid concurrent modification...
- destMap.put(resolved, destField);
+ destMap.put(resolvedDest, destField);
}
}
}
- for (Map.Entry<String, SolrInputField> entry : destMap.entrySet()) {
- doc.put(entry.getKey(), entry.getValue());
- }
+ doc.putAll(destMap);
super.processAdd(cmd);
}
diff --git
a/solr/modules/analysis-extras/src/java/org/apache/solr/update/processor/OpenNLPExtractNamedEntitiesUpdateProcessorFactory.java
b/solr/modules/analysis-extras/src/java/org/apache/solr/update/processor/OpenNLPExtractNamedEntitiesUpdateProcessorFactory.java
index d8e81977523..575368372c0 100644
---
a/solr/modules/analysis-extras/src/java/org/apache/solr/update/processor/OpenNLPExtractNamedEntitiesUpdateProcessorFactory.java
+++
b/solr/modules/analysis-extras/src/java/org/apache/solr/update/processor/OpenNLPExtractNamedEntitiesUpdateProcessorFactory.java
@@ -28,6 +28,7 @@ import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
+import java.util.Objects;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;
@@ -74,7 +75,7 @@ import org.slf4j.LoggerFactory;
* </pre>
*
* <p>See the <a href="https://opennlp.apache.org/models.html">OpenNLP
website</a> for information
- * on downloading pre-trained models. Note that in order to use model files
larger than 1MB on
+ * on downloading pre-trained models. Note that in order to use model files
larger than 1 MB on
* SolrCloud, <a
*
href="https://solr.apache.org/guide/solr/latest/deployment-guide/zookeeper-ensemble.html#increasing-the-file-size-limit">ZooKeeper
* server and client configuration is required</a>.
@@ -186,7 +187,7 @@ public class
OpenNLPExtractNamedEntitiesUpdateProcessorFactory extends UpdateReq
public static final String ENTITY_TYPE = "{EntityType}";
private SelectorParams srcInclusions = new SelectorParams();
- private Collection<SelectorParams> srcExclusions = new ArrayList<>();
+ private final Collection<SelectorParams> srcExclusions = new ArrayList<>();
private FieldNameSelector srcSelector = null;
@@ -194,7 +195,7 @@ public class
OpenNLPExtractNamedEntitiesUpdateProcessorFactory extends UpdateReq
private String analyzerFieldType = null;
/**
- * If pattern is null, this this is a literal field name. If pattern is
non-null then this is a
+ * If pattern is null, then this is a literal field name. If pattern is
non-null then this is a
* replacement string that may contain meta-characters (ie: capture group
identifiers)
*
* @see #pattern
@@ -358,9 +359,8 @@ public class
OpenNLPExtractNamedEntitiesUpdateProcessorFactory extends UpdateReq
//
// source != null && dest != null
- // if we got here we know we had source and dest, now check for the other
two so that we can
- // give a better
- // message than "unexpected"
+ // if we got here we know we have source and dest, now check for the other
two so that we can
+ // give a better message than "unexpected"
if (0 <= args.indexOf(PATTERN_PARAM, 0) || 0 <=
args.indexOf(REPLACEMENT_PARAM, 0)) {
throw new SolrException(
SERVER_ERROR,
@@ -419,7 +419,7 @@ public class
OpenNLPExtractNamedEntitiesUpdateProcessorFactory extends UpdateReq
+ "' contains unexpected child param(s): "
+ selectorConfig);
}
- // consume from the named list so it doesn't interfere with subsequent
processing
+ // consume from the named list, so it doesn't interfere with
subsequent processing
sources.remove(0);
}
}
@@ -537,7 +537,7 @@ public class
OpenNLPExtractNamedEntitiesUpdateProcessorFactory extends UpdateReq
final FieldNameSelector srcSelector = getSourceSelector();
return new UpdateRequestProcessor(next) {
private final NLPNERTaggerOp nerTaggerOp;
- private Analyzer analyzer = null;
+ private final Analyzer analyzer;
{
try {
@@ -590,7 +590,7 @@ public class
OpenNLPExtractNamedEntitiesUpdateProcessorFactory extends UpdateReq
for (Object val : srcFieldValues) {
for (Pair<String, String> entity : extractTypedNamedEntities(val))
{
- SolrInputField destField = null;
+ SolrInputField destField;
String entityName = entity.first();
String entityType = entity.second();
final String resolved = resolvedDest.replace(ENTITY_TYPE,
entityType);
@@ -598,11 +598,8 @@ public class
OpenNLPExtractNamedEntitiesUpdateProcessorFactory extends UpdateReq
destField = doc.getField(resolved);
} else {
SolrInputField targetField = destMap.get(resolved);
- if (targetField == null) {
- destField = new SolrInputField(resolved);
- } else {
- destField = targetField;
- }
+ destField =
+ Objects.requireNonNullElseGet(targetField, () -> new
SolrInputField(resolved));
}
destField.addValue(entityName);
@@ -612,9 +609,7 @@ public class
OpenNLPExtractNamedEntitiesUpdateProcessorFactory extends UpdateReq
}
}
- for (Map.Entry<String, SolrInputField> entry : destMap.entrySet()) {
- doc.put(entry.getKey(), entry.getValue());
- }
+ doc.putAll(destMap);
super.processAdd(cmd);
}
diff --git
a/solr/solr-ref-guide/modules/configuration-guide/pages/update-request-processors.adoc
b/solr/solr-ref-guide/modules/configuration-guide/pages/update-request-processors.adoc
index c763f539cfb..2dd66181b51 100644
---
a/solr/solr-ref-guide/modules/configuration-guide/pages/update-request-processors.adoc
+++
b/solr/solr-ref-guide/modules/configuration-guide/pages/update-request-processors.adoc
@@ -16,6 +16,8 @@
// specific language governing permissions and limitations
// under the License.
+:onnx: https://onnx.ai/
+
Every update request received by Solr is run through a chain of plugins known
as Update Request Processors, or _URPs_.
This can be useful, for example, to add a field to the document being indexed;
to change the value of a particular field; or to drop an update if the incoming
document doesn't fulfill certain criteria.
@@ -430,6 +432,10 @@ The
{solr-javadocs}/modules/analysis-extras/index.html[`analysis-extras`] module
{solr-javadocs}/modules/analysis-extras/org/apache/solr/update/processor/OpenNLPExtractNamedEntitiesUpdateProcessorFactory.html[OpenNLPExtractNamedEntitiesUpdateProcessorFactory]:::
Update document(s) to be indexed with named entities extracted using an
OpenNLP NER model.
Note that in order to use model files larger than 1MB on SolrCloud, you must
xref:deployment-guide:zookeeper-ensemble#increasing-the-file-size-limit[configure
both ZooKeeper server and clients].
+{solr-javadocs}/modules/analysis-extras/org/apache/solr/update/processor/DocumentCategorizerUpdateProcessorFactory.html[DocumentCategorizerUpdateProcessorFactory]:::
Classify text in fields using models. These models must be in {onnx}[ONNX]
format and can be sourced from Hugging Face and run directly in Solr via OpenNLP.
+Learn more by following the
xref:getting-started:tutorial-opennlp.adoc[sentiment analysis tutorial with
OpenNLP and ONNX models].
+
=== Update Processor Factories You Should _Not_ Modify or Remove
These are listed for completeness, but are part of the Solr infrastructure,
particularly SolrCloud.
diff --git
a/solr/solr-ref-guide/modules/getting-started/getting-started-nav.adoc
b/solr/solr-ref-guide/modules/getting-started/getting-started-nav.adoc
index 47bea772000..095f679f93b 100644
--- a/solr/solr-ref-guide/modules/getting-started/getting-started-nav.adoc
+++ b/solr/solr-ref-guide/modules/getting-started/getting-started-nav.adoc
@@ -33,6 +33,7 @@
** xref:tutorial-paramsets.adoc[]
** xref:tutorial-vectors.adoc[]
** xref:tutorial-solrcloud.adoc[]
+** xref:tutorial-opennlp.adoc[]
** xref:tutorial-aws.adoc[]
* xref:solr-admin-ui.adoc[]
diff --git
a/solr/solr-ref-guide/modules/getting-started/pages/solr-tutorial.adoc
b/solr/solr-ref-guide/modules/getting-started/pages/solr-tutorial.adoc
index 771e386a053..4f3a1f333ff 100644
--- a/solr/solr-ref-guide/modules/getting-started/pages/solr-tutorial.adoc
+++ b/solr/solr-ref-guide/modules/getting-started/pages/solr-tutorial.adoc
@@ -29,7 +29,7 @@ The xref:tutorial-films.adoc[second exercise] works with a
different set of data
The xref:tutorial-diy.adoc[third exercise] encourages you to begin to work
with your own data and start a plan for your implementation.
The tutorial also includes other, more advanced, exercises that introduce you
to xref:tutorial-paramsets.adoc[ParamSets],
-xref:tutorial-vectors.adoc[vector search],
xref:tutorial-solrcloud.adoc[SolrCloud], and xref:tutorial-aws.adoc[deploying
Solr to AWS].
+xref:tutorial-vectors.adoc[vector search],
xref:tutorial-opennlp.adoc[sentiment analysis with OpenNLP],
xref:tutorial-solrcloud.adoc[SolrCloud], and xref:tutorial-aws.adoc[deploying
Solr to AWS].
Finally, we'll introduce <<Spatial Queries,spatial search>>, and show you how
to get your Solr instance back into a clean state.
diff --git
a/solr/solr-ref-guide/modules/getting-started/pages/tutorial-opennlp.adoc
b/solr/solr-ref-guide/modules/getting-started/pages/tutorial-opennlp.adoc
new file mode 100644
index 00000000000..491855e1e4a
--- /dev/null
+++ b/solr/solr-ref-guide/modules/getting-started/pages/tutorial-opennlp.adoc
@@ -0,0 +1,463 @@
+= Exercise 7: Sentiment Analysis with OpenNLP
+:experimental:
+:tabs-sync-option:
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+[[exercise-opennlp]]
+== Exercise 7: Using OpenNLP and ONNX Models for Sentiment Analysis in Solr
+
+This tutorial demonstrates how to enhance Solr with advanced Natural Language
Processing (NLP) capabilities through Apache OpenNLP and ONNX.
+You'll learn how to set up a sentiment analysis pipeline that automatically
classifies documents during indexing.
+
+We are going to use the
https://huggingface.co/nlptown/bert-base-multilingual-uncased-sentiment[bert-base-multilingual-uncased-sentiment]
model in this tutorial; however, there are many others you can use.
+
+----
+This is a bert-base-multilingual-uncased model finetuned for sentiment analysis on
product reviews in six languages: English, Dutch, German, French, Spanish, and Italian.
+It predicts the sentiment of the review as a number of stars (between 1 and 5).
+----
+
+=== Step 1: Start Solr with Required Modules
+
+To enable NLP processing in Solr, start Solr with the `analysis-extras` module
and package support:
+
+[,console]
+----
+$ export SOLR_SECURITY_MANAGER_ENABLED=false
+$ bin/solr start -m 4g -Dsolr.modules=analysis-extras -Denable.packages=true
+----
+
+[NOTE]
+====
+We disable the security manager to allow loading of the ONNX runtime. On older
JVMs Solr runs with a security manager that you need to disable; on newer JVMs
it is already disabled.
+====
+
+=== Step 2: Download the Required Model Files
+
+For sentiment analysis, we need two essential files:
+
+1. An ONNX model file that contains the neural network
+2. A vocabulary file that maps tokens to IDs for the model
+
+Let's create a directory for our models and download them:
+
+[,console]
+----
+$ mkdir -p ./downloads/sentiment/
+$ wget -O ./downloads/sentiment/model.onnx
https://huggingface.co/onnx-community/bert-base-multilingual-uncased-sentiment-ONNX/resolve/main/onnx/model_quantized.onnx
+$ wget -O ./downloads/sentiment/vocab.txt
https://huggingface.co/onnx-community/bert-base-multilingual-uncased-sentiment-ONNX/raw/main/vocab.txt
+----
+
+If you do not have `wget` installed, you will need to adjust the above commands
or download the files manually.
+
+.About ONNX Models
+[sidebar]
+****
+ONNX (Open Neural Network Exchange) is an open format for representing machine
learning models.
+It allows models trained in different frameworks (like PyTorch, TensorFlow, or
Hugging Face) to be exported to a standard format that can be used by various
runtime environments.
+Solr gains access to ONNX models via OpenNLP.
+
+The model we're using is a multilingual BERT model fine-tuned for sentiment
classification and quantized for better performance. It produces
classifications on a 5-point scale from "very bad" to "very good".
+
+Learn more about ONNX at https://onnx.ai[onnx.ai^, role="external",
window="_blank"].
+****
+
+=== Step 3: Create a Collection for Sentiment Analysis
+
+Create a new collection for our sentiment analysis experiments:
+
+[,console]
+----
+$ bin/solr create -c sentiment
+----
+
+=== Step 4: Configure the Schema
+
+We need to add fields to our schema to store both the input text and the
sentiment classification results:
+
+[,console]
+----
+$ curl -X POST -H 'Content-type:application/json' --data-binary '{
+ "add-field":{
+ "name":"name",
+ "type":"string",
+ "stored":true }
+}' "http://localhost:8983/solr/sentiment/schema"
+----
+
+[,console]
+----
+$ curl -X POST -H 'Content-type:application/json' --data-binary '{
+ "add-field":{
+ "name":"name_sentiment",
+ "type":"string",
+ "stored":true }
+}' "http://localhost:8983/solr/sentiment/schema"
+----
+
+=== Step 5: Upload the Model Files to Solr's FileStore
+
+Solr's FileStore provides a distributed file storage mechanism for SolrCloud.
Upload our model files there:
+
+[,console]
+----
+$ curl --data-binary @./downloads/sentiment/vocab.txt -X PUT
"http://localhost:8983/api/cluster/filestore/files/models/sentiment/vocab.txt"
+----
+
+[,console]
+----
+$ curl --data-binary @./downloads/sentiment/model.onnx -X PUT
"http://localhost:8983/api/cluster/filestore/files/models/sentiment/model.onnx"
+----
+
+.Understanding Solr's FileStore
+[sidebar]
+****
+Solr's FileStore is a distributed file storage system that replicates files
across the SolrCloud cluster. Files uploaded to the FileStore are accessible by
all Solr nodes, making it ideal for storing resources like models and
vocabularies.
+
+When you reference these files in configuration, you use paths relative to the
FileStore root.
+****
+
+=== Step 6: Configure the Document Categorizer Update Processor
+
+Now we'll configure the update processor that will analyze sentiment during
document indexing:
+
+[,console]
+----
+$ curl -X POST -H 'Content-type:application/json' -d '{
+ "add-updateprocessor": {
+ "name": "sentimentClassifier",
+ "class": "solr.processor.DocumentCategorizerUpdateProcessorFactory",
+ "modelFile": "models/sentiment/model.onnx",
+ "vocabFile": "models/sentiment/vocab.txt",
+ "source": "name",
+ "dest": "name_sentiment"
+ }
+}' "http://localhost:8983/solr/sentiment/config"
+----
+
+This configuration creates an update processor that:
+
+* Takes text from the `name` field
+* Processes it through the sentiment model
+* Stores the sentiment classification in the `name_sentiment` field
+
+.Required Parameters for DocumentCategorizerUpdateProcessorFactory
+[cols="1,4"]
+|===
+|Parameter |Description
+
+|`modelFile`
+|Path to the ONNX model file in the FileStore (required)
+
+|`vocabFile`
+|Path to the vocabulary file in the FileStore (required)
+
+|`source`
+|Field(s) containing text to analyze (required)
+
+|`dest`
+|Field where sentiment results will be stored (required)
+|===
+
+=== Step 7: Index Documents with Sentiment Analysis
+
+Let's index some sample documents to see the sentiment analysis in action:
+
+[,console]
+----
+$ curl -X POST -H 'Content-type:application/json' -d '[
+ {
+ "id":"good",
+ "name": "that was an awesome movie!"
+ },
+ {
+ "id":"bad",
+ "name": "that movie was bad and terrible"
+ }
+]'
"http://localhost:8983/solr/sentiment/update/json?processor=sentimentClassifier&commit=true"
+----
+
+Notice that we specify the processor name with `processor=sentimentClassifier`
in the URL.
+
+=== Step 8: Query and Verify the Results
+
+Query the documents to see the sentiment classifications:
+
+[,console]
+----
+$ curl -X GET "http://localhost:8983/solr/sentiment/select?q=id:good"
+----
+
+You should see the positive review classified as "very good":
+
+[,json]
+----
+{
+ "response":{"numFound":1,"start":0,"docs":[
+ {
+ "id":"good",
+ "name":"that was an awesome movie!",
+ "name_sentiment":"very good",
+ "_version_":1687591998864932864}]
+ }
+}
+----
+
+Check the negative review:
+
+[,console]
+----
+$ curl -X GET "http://localhost:8983/solr/sentiment/select?q=id:bad"
+----
+
+The result should show "very bad" sentiment:
+
+[,json]
+----
+{
+ "response":{"numFound":1,"start":0,"docs":[
+ {
+ "id":"bad",
+ "name":"that movie was bad and terrible",
+ "name_sentiment":"very bad",
+ "_version_":1687591998897568768}]
+ }
+}
+----
+
+=== Advanced Configuration Options
+
+The `DocumentCategorizerUpdateProcessorFactory` supports several advanced
configuration options. Here are some examples from real-world use cases:
+
+==== Processing Multiple Source Fields
+
+You can specify multiple source fields either as separate `source` parameters
or as an array:
+
+[,xml]
+----
+<processor class="solr.processor.DocumentCategorizerUpdateProcessorFactory">
+ <str name="modelFile">models/sentiment/model.onnx</str>
+ <str name="vocabFile">models/sentiment/vocab.txt</str>
+ <str name="source">title</str>
+ <str name="source">content</str>
+ <str name="dest">document_sentiment</str>
+</processor>
+----
+
+Or using JSON configuration:
+
+[,json]
+----
+{
+ "add-updateprocessor": {
+ "name": "multiFieldSentiment",
+ "class": "solr.processor.DocumentCategorizerUpdateProcessorFactory",
+ "modelFile": "models/sentiment/model.onnx",
+ "vocabFile": "models/sentiment/vocab.txt",
+ "source": ["title", "content", "comments"],
+ "dest": "document_sentiment"
+ }
+}
+----
+
+==== Using Field Pattern Matching (Regex)
+
+You can use regular expressions to select fields to process:
+
+[,xml]
+----
+<processor class="solr.processor.DocumentCategorizerUpdateProcessorFactory">
+ <str name="modelFile">models/sentiment/model.onnx</str>
+ <str name="vocabFile">models/sentiment/vocab.txt</str>
+ <lst name="source">
+ <str name="fieldRegex">.*_text$|comments_.*</str>
+ </lst>
+ <str name="dest">sentiment</str>
+</processor>
+----
+
+This will process any field ending with `\_text` or starting with `comments_`.
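As a quick way to see which field names a `fieldRegex` like the one above would select, here is an illustrative Python sketch (not Solr's implementation; Solr performs the equivalent matching in Java, and we assume here that the regular expression must match the entire field name):

```python
import re

# The fieldRegex from the example config above.
field_regex = re.compile(r".*_text$|comments_.*")

fields = ["body_text", "comments_2024", "title", "summary"]
# fullmatch() mirrors matching against the whole field name.
selected = [f for f in fields if field_regex.fullmatch(f)]
print(selected)  # ['body_text', 'comments_2024']
```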
+
+==== Dynamic Destination Field Names
+
+You can dynamically generate destination field names based on source field
patterns:
+
+[,xml]
+----
+<processor class="solr.processor.DocumentCategorizerUpdateProcessorFactory">
+ <str name="modelFile">models/sentiment/model.onnx</str>
+ <str name="vocabFile">models/sentiment/vocab.txt</str>
+ <lst name="source">
+ <str name="fieldRegex">review_\d+_text</str>
+ </lst>
+ <lst name="dest">
+ <str name="pattern">review_(\d+)_text</str>
+ <str name="replacement">review_$1_sentiment</str>
+ </lst>
+</processor>
+----
+
+This would process fields like `review_1_text` and store results in
corresponding fields like `review_1_sentiment`.
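The pattern/replacement resolution can be sketched outside Solr. This is an illustrative Python version, not Solr's implementation (Solr uses Java's `Matcher.replaceAll`); note that Python writes the capture-group reference as `\1` where the XML config uses `$1`:

```python
import re

def resolve_dest(field_name, pattern, replacement):
    """Derive the destination field name from a matching source field."""
    return re.sub(pattern, replacement, field_name)

# Mirrors the config above: review_1_text -> review_1_sentiment
dest = resolve_dest("review_1_text", r"review_(\d+)_text", r"review_\1_sentiment")
print(dest)  # review_1_sentiment
```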
+
+==== Field Selection with Exclusions
+
+You can include certain fields and exclude others:
+
+[,xml]
+----
+<processor class="solr.processor.DocumentCategorizerUpdateProcessorFactory">
+ <str name="modelFile">models/sentiment/model.onnx</str>
+ <str name="vocabFile">models/sentiment/vocab.txt</str>
+ <lst name="source">
+ <str name="fieldRegex">text.*</str>
+ <lst name="exclude">
+      <str name="fieldRegex">text_private_.*</str>
+ </lst>
+ </lst>
+ <str name="dest">sentiment</str>
+</processor>
+----
+
+This selects all fields starting with `text` except those starting with
`text_private_`.
+
+==== Creating a Custom Update Processor Chain
+
+For a permanent configuration, define an update processor chain in
`solrconfig.xml`:
+
+[,xml]
+----
+<updateRequestProcessorChain name="sentiment-analysis-chain">
+ <processor class="solr.processor.DocumentCategorizerUpdateProcessorFactory">
+ <str name="modelFile">models/sentiment/model.onnx</str>
+ <str name="vocabFile">models/sentiment/vocab.txt</str>
+ <str name="source">name</str>
+ <str name="dest">name_sentiment</str>
+ </processor>
+ <processor class="solr.LogUpdateProcessorFactory" />
+ <processor class="solr.RunUpdateProcessorFactory" />
+</updateRequestProcessorChain>
+----
+
+You can then use this chain by default or explicitly reference it when
indexing:
+
+[,console]
+----
+$ curl
"http://localhost:8983/solr/sentiment/update/json?update.chain=sentiment-analysis-chain"
-d '...'
+----
+
+=== Practical Applications of Sentiment Analysis in Solr
+
+==== Faceting by Sentiment
+
+Create facets based on sentiment to understand opinion distribution:
+
+[,console]
+----
+$ curl
"http://localhost:8983/solr/sentiment/select?q=*:*&facet=true&facet.field=name_sentiment"
+----
+
+==== Filtering by Sentiment
+
+Filter search results to show only documents with specific sentiment:
+
+[,console]
+----
+$ curl
"http://localhost:8983/solr/sentiment/select?q=product_type:electronics&fq=name_sentiment:very%20good"
+----
+
+==== Boosting by Sentiment
+
+Boost documents with positive sentiment in search results:
+
+[,console]
+----
+$ curl
"http://localhost:8983/solr/sentiment/select?q=*:*&defType=edismax&bq=name_sentiment:very%20good^5.0"
+----
+
+==== Time-Based Sentiment Analysis
+
+Analyze sentiment trends over time using time-based queries and facets:
+
+[,console]
+----
+$ curl
"http://localhost:8983/solr/sentiment/select?q=*:*&facet=true&facet.range=timestamp&facet.range.start=NOW/DAY-30DAY&facet.range.end=NOW&facet.range.gap=%2B1DAY&facet.pivot=timestamp,name_sentiment"
+----
+
+=== Performance Considerations
+
+When using ONNX models in Solr, consider these performance aspects:
+
+* **Memory Usage**: ONNX models can be memory-intensive. Ensure sufficient
heap space.
+* **Batch Processing**: For large document sets, consider batching updates.
+* **Model Size**: Quantized models (like the one in our example) offer better
performance.
+* **CPU Utilization**: NLP processing is CPU-intensive. Consider CPU resources
when planning deployments. We anticipate leveraging ONNX on the GPU in the
future.
+* **Response Time Impact**: The additional processing increases indexing time
but not query time.
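The batch-processing point above can be sketched simply: split documents into fixed-size chunks before posting them to `/update`. The chunking helper below is generic Python; the batch size and endpoint named in the comments are illustrative assumptions, not Solr requirements:

```python
import json

def batches(docs, size):
    """Yield successive batches of at most `size` documents."""
    for i in range(0, len(docs), size):
        yield docs[i:i + size]

docs = [{"id": str(n), "name": f"review {n}"} for n in range(5)]
sizes = []
for batch in batches(docs, 2):
    payload = json.dumps(batch)  # body for a POST to .../solr/sentiment/update/json
    sizes.append(len(batch))
print(sizes)  # [2, 2, 1]
```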
+
+A pattern that has been demonstrated is to index each document twice.
+The first time, index the document without any sentiment analysis so the basic
data gets into the index quickly and is made available to users.
+The second time, enable the `update.chain` that performs the sentiment
analysis.
+
+=== Going Beyond Sentiment Analysis
+
+The same approach can be extended to other NLP tasks using different models:
+
+* **Named Entity Recognition**: Use
`OpenNLPExtractNamedEntitiesUpdateProcessorFactory` to identify entities
+* **Language Detection**: Use `OpenNLPLangDetectUpdateProcessorFactory` for
automatic language identification
+* **Document Classification**: Use custom models for topic or category
classification
+* **Summarization**: Extract key sentences or generate summaries during
indexing
+
+=== Troubleshooting
+
+==== Common Issues and Solutions
+
+1. **Model Loading Errors**:
+ * Ensure paths to model files are correct
+ * Verify models are properly uploaded to the FileStore
+   * Check that the security manager is disabled so the ONNX runtime can load
+
+2. **Out of Memory Errors**:
+   * Increase JVM heap space with the `-m` parameter
+ * Use quantized models to reduce memory usage
+ * Process documents in smaller batches
+
+3. **Unexpected Classifications**:
+ * Check that text preprocessing matches model expectations
+ * Ensure vocabulary file corresponds to the model
+ * Consider text normalization in your schema definition
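For the vocabulary check above, recall that BERT-style `vocab.txt` files store one token per line, with a token's ID given by its zero-based line number (an assumption about WordPiece vocabulary files generally, not a Solr API). A minimal sketch of loading and spot-checking one:

```python
import os
import tempfile

def load_vocab(path):
    """Map each token to its ID (its zero-based line number)."""
    with open(path, encoding="utf-8") as f:
        return {line.rstrip("\n"): i for i, line in enumerate(f)}

# Write a tiny demo vocab and load it back.
path = os.path.join(tempfile.gettempdir(), "vocab_demo.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("[PAD]\n[UNK]\n[CLS]\n[SEP]\nmovie\n")

vocab = load_vocab(path)
print(vocab["[CLS]"], vocab["movie"])  # 2 4
```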
+
+=== Conclusion
+
+In this tutorial, you learned how to:
+
+1. Configure Solr with OpenNLP and ONNX runtime
+2. Load and use a pre-trained sentiment analysis model
+3. Set up a document categorizer update processor
+4. Process documents with automatic sentiment classification
+5. Use advanced configuration options for complex scenarios
+6. Apply sentiment analysis in practical search applications
+
+This integration demonstrates how Solr can leverage modern NLP capabilities to
enhance search and analytics functionality. By automatically enriching
documents with sentiment information during indexing, you can provide more
nuanced search experiences and gain deeper insights into your text data.
+
+=== Cleaning Up
+
+When you're done with this tutorial, stop Solr:
+
+[,console]
+----
+$ bin/solr stop --all
+----
diff --git
a/solr/solr-ref-guide/modules/upgrade-notes/pages/major-changes-in-solr-10.adoc
b/solr/solr-ref-guide/modules/upgrade-notes/pages/major-changes-in-solr-10.adoc
index f5bdf34f1a7..1821dadbf8e 100644
---
a/solr/solr-ref-guide/modules/upgrade-notes/pages/major-changes-in-solr-10.adoc
+++
b/solr/solr-ref-guide/modules/upgrade-notes/pages/major-changes-in-solr-10.adoc
@@ -110,6 +110,14 @@ HTTP requests to SolrCloud that are for a specific core
must be delivered to the
Previously, SolrCloud would try too hard scanning the cluster's state to look
for it and internally route/proxy it.
If only one node is exposed to a client, and if the client uses the bin/solr
export tool, it probably won't work.
+=== Modern NLP Models from Apache OpenNLP with Solr
+
+Solr now lets you access models encoded in ONNX format, commonly sourced from
Hugging Face.
+The DocumentCategorizerUpdateProcessorFactory lets you perform sentiment
analysis and other classification tasks on fields.
+It is available as part of the `analysis-extras` module.
+
=== Deprecation removals
* The `jaegertracer-configurator` module, which was deprecated in 9.2, is
removed. Users should migrate to the `opentelemetry` module.