This is an automated email from the ASF dual-hosted git repository.
epugh pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/solr.git
The following commit(s) were added to refs/heads/main by this push:
new 23afb28dd9a SOLR-17023: Add documentation and tutorial to new ONNX
model feature (#3663)
23afb28dd9a is described below
commit 23afb28dd9af139dbaba99cc9c17827aebf848a5
Author: Eric Pugh <[email protected]>
AuthorDate: Thu Sep 18 13:22:33 2025 -0400
SOLR-17023: Add documentation and tutorial to new ONNX model feature (#3663)
* Add a tutorial on this new capability.
* Document new feature
---
solr/CHANGES.txt | 2 +
solr/modules/analysis-extras/README.md | 3 +-
.../DocumentCategorizerUpdateProcessorFactory.java | 81 +++-
...ExtractNamedEntitiesUpdateProcessorFactory.java | 29 +-
.../pages/update-request-processors.adoc | 6 +
.../getting-started/getting-started-nav.adoc | 1 +
.../getting-started/pages/solr-tutorial.adoc | 2 +-
.../getting-started/pages/tutorial-opennlp.adoc | 463 +++++++++++++++++++++
.../pages/major-changes-in-solr-10.adoc | 8 +
9 files changed, 556 insertions(+), 39 deletions(-)
diff --git a/solr/CHANGES.txt b/solr/CHANGES.txt
index 6f7621fcad3..ed4924a39ea 100644
--- a/solr/CHANGES.txt
+++ b/solr/CHANGES.txt
@@ -21,6 +21,8 @@ New Features
* SOLR-17780: Add support for scalar quantized dense vectors (Kevin Liang via
Alessandro Benedetti)
+* SOLR-17023: Use Modern NLP Models from Apache OpenNLP with Solr (Jeff
Zemerick, Eric Pugh)
+
Improvements
---------------------
diff --git a/solr/modules/analysis-extras/README.md
b/solr/modules/analysis-extras/README.md
index b30afef5396..44aee8d1c92 100644
--- a/solr/modules/analysis-extras/README.md
+++ b/solr/modules/analysis-extras/README.md
@@ -24,7 +24,8 @@ upon large dependencies/dictionaries.
It includes integration with ICU for multilingual support,
analyzers for Chinese and Polish, and integration with
OpenNLP for multilingual tokenization, part-of-speech tagging
-lemmatization, phrase chunking, and named-entity recognition.
+lemmatization, phrase chunking, and named-entity recognition,
+including the ability to run models sourced from Hugging Face.
Each of the jars below relies upon including
`/modules/analysis-extras/lib/solr-analysis-extras-X.Y.Z.jar`
in the `solrconfig.xml`
diff --git
a/solr/modules/analysis-extras/src/java/org/apache/solr/update/processor/DocumentCategorizerUpdateProcessorFactory.java
b/solr/modules/analysis-extras/src/java/org/apache/solr/update/processor/DocumentCategorizerUpdateProcessorFactory.java
index 090964d3178..27e673a8423 100644
---
a/solr/modules/analysis-extras/src/java/org/apache/solr/update/processor/DocumentCategorizerUpdateProcessorFactory.java
+++
b/solr/modules/analysis-extras/src/java/org/apache/solr/update/processor/DocumentCategorizerUpdateProcessorFactory.java
@@ -54,6 +54,55 @@ import org.apache.solr.util.plugin.SolrCoreAware;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
+/**
+ * Classifies the text found in any matching <code>source</code> field using an OpenNLP
+ * model (<code>modelFile</code>) and stores the resulting classification in a configured
+ * <code>dest</code> field.
+ *
+ * <p>See the <a
+ *
href="https://solr.apache.org/guide/solr/latest/getting-started/tutorial-opennlp.html">Tutorial</a>
+ * for the step-by-step guide.
+ *
+ * <p>The <code>source</code> field(s) can be configured as either:
+ *
+ * <ul>
+ * <li>One or more <code><str></code>
+ * <li>An <code><arr></code> of <code><str></code>
+ * <li>A <code><lst></code> containing {@link
FieldMutatingUpdateProcessor
+ * FieldMutatingUpdateProcessorFactory style selector arguments}
+ * </ul>
+ *
+ * <p>The <code>dest</code> field can be a single <code><str></code>
containing the literal
+ * name of a destination field, or it may be a <code><lst></code>
specifying a regex <code>
+ * pattern</code> and a <code>replacement</code> string. If the pattern +
replacement option is used
+ * the pattern will be matched against all fields matched by the source
selector, and the
+ * replacement string (including any capture groups specified from the pattern) will be evaluated
+ * using {@link Matcher#replaceAll(String)} to generate the literal name of the destination field.
+ *
+ * <p>If the resolved <code>dest</code> field already exists in the document, then the
+ * classifications generated from the <code>source</code> fields will be added to it.
+ *
+ * <p>In the example below:
+ *
+ * <ul>
+ * <li>Classification will be performed on the <code>text</code> field and
added to the <code>
+ * text_sentiment</code> field
+ * </ul>
+ *
+ * <pre class="prettyprint">
+ * <updateRequestProcessorChain name="sentimentClassifier">
+ * <processor
class="solr.processor.DocumentCategorizerUpdateProcessorFactory">
+ * <str name="modelFile">models/sentiment/model.onnx</str>
+ * <str name="vocabFile">models/sentiment/vocab.txt</str>
+ * <str name="source">text</str>
+ * <str name="dest">text_sentiment</str>
+ * </processor>
+ * <processor class="solr.LogUpdateProcessorFactory" />
+ * <processor class="solr.RunUpdateProcessorFactory" />
+ * </updateRequestProcessorChain>
+ * </pre>
+ *
+ * @since 10.0.0
+ */
public class DocumentCategorizerUpdateProcessorFactory extends
UpdateRequestProcessorFactory
implements SolrCoreAware {
@@ -69,16 +118,15 @@ public class DocumentCategorizerUpdateProcessorFactory
extends UpdateRequestProc
private Path solrHome;
private SelectorParams srcInclusions = new SelectorParams();
- private Collection<SelectorParams> srcExclusions = new ArrayList<>();
+ private final Collection<SelectorParams> srcExclusions = new ArrayList<>();
private FieldNameSelector srcSelector = null;
private String model = null;
private String vocab = null;
- private String analyzerFieldType = null;
/**
- * If pattern is null, this this is a literal field name. If pattern is
non-null then this is a
+ * If pattern is null, then this is a literal field name. If pattern is
non-null then this is a
* replacement string that may contain meta-characters (ie: capture group
identifiers)
*
* @see #pattern
@@ -277,11 +325,10 @@ public class DocumentCategorizerUpdateProcessorFactory
extends UpdateRequestProc
throw new SolrException(
SERVER_ERROR, "Init param '" + SOURCE_PARAM + "' child
'exclude' can not be null");
}
- if (!(excObj instanceof NamedList)) {
+ if (!(excObj instanceof NamedList<?> exc)) {
throw new SolrException(
SERVER_ERROR, "Init param '" + SOURCE_PARAM + "' child
'exclude' must be <lst/>");
}
- NamedList<?> exc = (NamedList<?>) excObj;
srcExclusions.add(parseSelectorParams(exc));
if (0 < exc.size()) {
throw new SolrException(
@@ -328,8 +375,7 @@ public class DocumentCategorizerUpdateProcessorFactory
extends UpdateRequestProc
+ "for OpenNLPExtractNamedEntitiesUpdateProcessor for further
details.");
}
- if (d instanceof NamedList) {
- NamedList<?> destList = (NamedList<?>) d;
+ if (d instanceof NamedList<?> destList) {
Object patt = destList.remove(PATTERN_PARAM);
Object replacement = destList.remove(REPLACEMENT_PARAM);
@@ -450,9 +496,7 @@ public class DocumentCategorizerUpdateProcessorFactory
extends UpdateRequestProc
getCategories(),
new AverageClassificationScoringStrategy(),
new InferenceOptions());
- } catch (IOException e) {
- log.warn("Attempted to initialize documentCategorizerDL", e);
- } catch (OrtException e) {
+ } catch (IOException | OrtException e) {
log.warn("Attempted to initialize documentCategorizerDL", e);
}
}
@@ -490,16 +534,15 @@ public class DocumentCategorizerUpdateProcessorFactory
extends UpdateRequestProc
for (Object val : srcFieldValues) {
for (Pair<String, String> entity : classify(val)) {
- SolrInputField destField = null;
+ SolrInputField destField;
// String classification = entity.first();
String classificationValue = entity.second();
- final String resolved = resolvedDest;
- if (doc.containsKey(resolved)) {
- destField = doc.getField(resolved);
+ if (doc.containsKey(resolvedDest)) {
+ destField = doc.getField(resolvedDest);
} else {
- SolrInputField targetField = destMap.get(resolved);
+ SolrInputField targetField = destMap.get(resolvedDest);
if (targetField == null) {
- destField = new SolrInputField(resolved);
+ destField = new SolrInputField(resolvedDest);
} else {
destField = targetField;
}
@@ -507,14 +550,12 @@ public class DocumentCategorizerUpdateProcessorFactory
extends UpdateRequestProc
destField.addValue(classificationValue);
// put it in map to avoid concurrent modification...
- destMap.put(resolved, destField);
+ destMap.put(resolvedDest, destField);
}
}
}
- for (Map.Entry<String, SolrInputField> entry : destMap.entrySet()) {
- doc.put(entry.getKey(), entry.getValue());
- }
+ doc.putAll(destMap);
super.processAdd(cmd);
}
diff --git
a/solr/modules/analysis-extras/src/java/org/apache/solr/update/processor/OpenNLPExtractNamedEntitiesUpdateProcessorFactory.java
b/solr/modules/analysis-extras/src/java/org/apache/solr/update/processor/OpenNLPExtractNamedEntitiesUpdateProcessorFactory.java
index d8e81977523..575368372c0 100644
---
a/solr/modules/analysis-extras/src/java/org/apache/solr/update/processor/OpenNLPExtractNamedEntitiesUpdateProcessorFactory.java
+++
b/solr/modules/analysis-extras/src/java/org/apache/solr/update/processor/OpenNLPExtractNamedEntitiesUpdateProcessorFactory.java
@@ -28,6 +28,7 @@ import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
+import java.util.Objects;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;
@@ -74,7 +75,7 @@ import org.slf4j.LoggerFactory;
* </pre>
*
* <p>See the <a href="https://opennlp.apache.org/models.html">OpenNLP
website</a> for information
- * on downloading pre-trained models. Note that in order to use model files
larger than 1MB on
+ * on downloading pre-trained models. Note that in order to use model files
larger than 1 MB on
* SolrCloud, <a
*
href="https://solr.apache.org/guide/solr/latest/deployment-guide/zookeeper-ensemble.html#increasing-the-file-size-limit">ZooKeeper
* server and client configuration is required</a>.
@@ -186,7 +187,7 @@ public class
OpenNLPExtractNamedEntitiesUpdateProcessorFactory extends UpdateReq
public static final String ENTITY_TYPE = "{EntityType}";
private SelectorParams srcInclusions = new SelectorParams();
- private Collection<SelectorParams> srcExclusions = new ArrayList<>();
+ private final Collection<SelectorParams> srcExclusions = new ArrayList<>();
private FieldNameSelector srcSelector = null;
@@ -194,7 +195,7 @@ public class
OpenNLPExtractNamedEntitiesUpdateProcessorFactory extends UpdateReq
private String analyzerFieldType = null;
/**
- * If pattern is null, this this is a literal field name. If pattern is
non-null then this is a
+ * If pattern is null, then this is a literal field name. If pattern is
non-null then this is a
* replacement string that may contain meta-characters (ie: capture group
identifiers)
*
* @see #pattern
@@ -358,9 +359,8 @@ public class
OpenNLPExtractNamedEntitiesUpdateProcessorFactory extends UpdateReq
//
// source != null && dest != null
- // if we got here we know we had source and dest, now check for the other
two so that we can
- // give a better
- // message than "unexpected"
+ // if we got here we know we have source and dest, now check for the other
two so that we can
+ // give a better message than "unexpected"
if (0 <= args.indexOf(PATTERN_PARAM, 0) || 0 <=
args.indexOf(REPLACEMENT_PARAM, 0)) {
throw new SolrException(
SERVER_ERROR,
@@ -419,7 +419,7 @@ public class
OpenNLPExtractNamedEntitiesUpdateProcessorFactory extends UpdateReq
+ "' contains unexpected child param(s): "
+ selectorConfig);
}
- // consume from the named list so it doesn't interfere with subsequent
processing
+ // consume from the named list, so it doesn't interfere with
subsequent processing
sources.remove(0);
}
}
@@ -537,7 +537,7 @@ public class
OpenNLPExtractNamedEntitiesUpdateProcessorFactory extends UpdateReq
final FieldNameSelector srcSelector = getSourceSelector();
return new UpdateRequestProcessor(next) {
private final NLPNERTaggerOp nerTaggerOp;
- private Analyzer analyzer = null;
+ private final Analyzer analyzer;
{
try {
@@ -590,7 +590,7 @@ public class
OpenNLPExtractNamedEntitiesUpdateProcessorFactory extends UpdateReq
for (Object val : srcFieldValues) {
for (Pair<String, String> entity : extractTypedNamedEntities(val))
{
- SolrInputField destField = null;
+ SolrInputField destField;
String entityName = entity.first();
String entityType = entity.second();
final String resolved = resolvedDest.replace(ENTITY_TYPE,
entityType);
@@ -598,11 +598,8 @@ public class
OpenNLPExtractNamedEntitiesUpdateProcessorFactory extends UpdateReq
destField = doc.getField(resolved);
} else {
SolrInputField targetField = destMap.get(resolved);
- if (targetField == null) {
- destField = new SolrInputField(resolved);
- } else {
- destField = targetField;
- }
+ destField =
+ Objects.requireNonNullElseGet(targetField, () -> new
SolrInputField(resolved));
}
destField.addValue(entityName);
@@ -612,9 +609,7 @@ public class
OpenNLPExtractNamedEntitiesUpdateProcessorFactory extends UpdateReq
}
}
- for (Map.Entry<String, SolrInputField> entry : destMap.entrySet()) {
- doc.put(entry.getKey(), entry.getValue());
- }
+ doc.putAll(destMap);
super.processAdd(cmd);
}
diff --git
a/solr/solr-ref-guide/modules/configuration-guide/pages/update-request-processors.adoc
b/solr/solr-ref-guide/modules/configuration-guide/pages/update-request-processors.adoc
index c763f539cfb..2dd66181b51 100644
---
a/solr/solr-ref-guide/modules/configuration-guide/pages/update-request-processors.adoc
+++
b/solr/solr-ref-guide/modules/configuration-guide/pages/update-request-processors.adoc
@@ -16,6 +16,8 @@
// specific language governing permissions and limitations
// under the License.
+:onnx: https://onnx.ai/
+
Every update request received by Solr is run through a chain of plugins known
as Update Request Processors, or _URPs_.
This can be useful, for example, to add a field to the document being indexed;
to change the value of a particular field; or to drop an update if the incoming
document doesn't fulfill certain criteria.
@@ -430,6 +432,10 @@ The
{solr-javadocs}/modules/analysis-extras/index.html[`analysis-extras`] module
{solr-javadocs}/modules/analysis-extras/org/apache/solr/update/processor/OpenNLPExtractNamedEntitiesUpdateProcessorFactory.html[OpenNLPExtractNamedEntitiesUpdateProcessorFactory]:::
Update document(s) to be indexed with named entities extracted using an
OpenNLP NER model.
Note that in order to use model files larger than 1MB on SolrCloud, you must
xref:deployment-guide:zookeeper-ensemble#increasing-the-file-size-limit[configure
both ZooKeeper server and clients].
+{solr-javadocs}/modules/analysis-extras/org/apache/solr/update/processor/DocumentCategorizerUpdateProcessorFactory.html[DocumentCategorizerUpdateProcessorFactory]:::
Classify text in fields using models. These models must be in {onnx}[ONNX]
format and can be sourced from Hugging Face and run directly in Solr via OpenNLP.
+Learn more by following the
xref:getting-started:tutorial-opennlp.adoc[sentiment analysis tutorial with
OpenNLP and ONNX models].
+
=== Update Processor Factories You Should _Not_ Modify or Remove
These are listed for completeness, but are part of the Solr infrastructure,
particularly SolrCloud.
diff --git
a/solr/solr-ref-guide/modules/getting-started/getting-started-nav.adoc
b/solr/solr-ref-guide/modules/getting-started/getting-started-nav.adoc
index 47bea772000..095f679f93b 100644
--- a/solr/solr-ref-guide/modules/getting-started/getting-started-nav.adoc
+++ b/solr/solr-ref-guide/modules/getting-started/getting-started-nav.adoc
@@ -33,6 +33,7 @@
** xref:tutorial-paramsets.adoc[]
** xref:tutorial-vectors.adoc[]
** xref:tutorial-solrcloud.adoc[]
+** xref:tutorial-opennlp.adoc[]
** xref:tutorial-aws.adoc[]
* xref:solr-admin-ui.adoc[]
diff --git
a/solr/solr-ref-guide/modules/getting-started/pages/solr-tutorial.adoc
b/solr/solr-ref-guide/modules/getting-started/pages/solr-tutorial.adoc
index 771e386a053..4f3a1f333ff 100644
--- a/solr/solr-ref-guide/modules/getting-started/pages/solr-tutorial.adoc
+++ b/solr/solr-ref-guide/modules/getting-started/pages/solr-tutorial.adoc
@@ -29,7 +29,7 @@ The xref:tutorial-films.adoc[second exercise] works with a
different set of data
The xref:tutorial-diy.adoc[third exercise] encourages you to begin to work
with your own data and start a plan for your implementation.
The tutorial also includes other, more advanced, exercises that introduce you
to xref:tutorial-paramsets.adoc[ParamSets],
-xref:tutorial-vectors.adoc[vector search],
xref:tutorial-solrcloud.adoc[SolrCloud], and xref:tutorial-aws.adoc[deploying
Solr to AWS].
+xref:tutorial-vectors.adoc[vector search],
xref:tutorial-opennlp.adoc[sentiment analysis with OpenNLP],
xref:tutorial-solrcloud.adoc[SolrCloud], and xref:tutorial-aws.adoc[deploying
Solr to AWS].
Finally, we'll introduce <<Spatial Queries,spatial search>>, and show you how
to get your Solr instance back into a clean state.
diff --git
a/solr/solr-ref-guide/modules/getting-started/pages/tutorial-opennlp.adoc
b/solr/solr-ref-guide/modules/getting-started/pages/tutorial-opennlp.adoc
new file mode 100644
index 00000000000..491855e1e4a
--- /dev/null
+++ b/solr/solr-ref-guide/modules/getting-started/pages/tutorial-opennlp.adoc
@@ -0,0 +1,463 @@
+= Exercise 7: Sentiment Analysis with OpenNLP
+:experimental:
+:tabs-sync-option:
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+[[exercise-opennlp]]
+== Exercise 7: Using OpenNLP and ONNX Models for Sentiment Analysis in Solr
+
+This tutorial demonstrates how to enhance Solr with advanced Natural Language
Processing (NLP) capabilities through Apache OpenNLP and ONNX.
+You'll learn how to set up a sentiment analysis pipeline that automatically
classifies documents during indexing.
+
+We are going to use the
https://huggingface.co/nlptown/bert-base-multilingual-uncased-sentiment[bert-base-multilingual-uncased-sentiment]
model in this tutorial; however, there are many others you can use.
+
+----
+This is a bert-base-multilingual-uncased model finetuned for sentiment analysis on
product reviews in six languages: English, Dutch, German, French, Spanish, and Italian.
+It predicts the sentiment of the review as a number of stars (between 1 and 5).
+----
+
+=== Step 1: Start Solr with Required Modules
+
+To enable NLP processing in Solr, start Solr with the `analysis-extras` module
and package support:
+
+[,console]
+----
+$ export SOLR_SECURITY_MANAGER_ENABLED=false
+$ bin/solr start -m 4g -Dsolr.modules=analysis-extras -Denable.packages=true
+----
+
+[NOTE]
+====
+We disable the security manager to allow loading of the ONNX runtime. On older
JVMs Solr runs with a security manager that you need to disable; on newer JVMs
it is already disabled.
+====
+
+=== Step 2: Download the Required Model Files
+
+For sentiment analysis, we need two essential files:
+
+1. An ONNX model file that contains the neural network
+2. A vocabulary file that maps tokens to IDs for the model
+
+Let's create a directory for our models and download them:
+
+[,console]
+----
+$ mkdir -p ./downloads/sentiment/
+$ wget -O ./downloads/sentiment/model.onnx
https://huggingface.co/onnx-community/bert-base-multilingual-uncased-sentiment-ONNX/resolve/main/onnx/model_quantized.onnx
+$ wget -O ./downloads/sentiment/vocab.txt
https://huggingface.co/onnx-community/bert-base-multilingual-uncased-sentiment-ONNX/raw/main/vocab.txt
+----
+
+If you do not have `wget` installed, you will need to adjust the above commands
or download the files manually.
+
+.About ONNX Models
+[sidebar]
+****
+ONNX (Open Neural Network Exchange) is an open format for representing machine
learning models.
+It allows models trained in different frameworks (like PyTorch, TensorFlow, or
Hugging Face) to be exported to a standard format that can be used by various
runtime environments.
+Solr gains access to ONNX models via OpenNLP.
+
+The model we're using is a multilingual BERT model fine-tuned for sentiment
classification and quantized for better performance. It produces
classifications on a 5-point scale from "very bad" to "very good".
+
+Learn more about ONNX at https://onnx.ai[onnx.ai^, role="external",
window="_blank"].
+****
+
+=== Step 3: Create a Collection for Sentiment Analysis
+
+Create a new collection for our sentiment analysis experiments:
+
+[,console]
+----
+$ bin/solr create -c sentiment
+----
+
+=== Step 4: Configure the Schema
+
+We need to add fields to our schema to store both the input text and the
sentiment classification results:
+
+[,console]
+----
+$ curl -X POST -H 'Content-type:application/json' --data-binary '{
+ "add-field":{
+ "name":"name",
+ "type":"string",
+ "stored":true }
+}' "http://localhost:8983/solr/sentiment/schema"
+----
+
+[,console]
+----
+$ curl -X POST -H 'Content-type:application/json' --data-binary '{
+ "add-field":{
+ "name":"name_sentiment",
+ "type":"string",
+ "stored":true }
+}' "http://localhost:8983/solr/sentiment/schema"
+----
+
+=== Step 5: Upload the Model Files to Solr's FileStore
+
+Solr's FileStore provides a distributed file storage mechanism for SolrCloud.
Upload our model files there:
+
+[,console]
+----
+$ curl --data-binary @./downloads/sentiment/vocab.txt -X PUT
"http://localhost:8983/api/cluster/filestore/files/models/sentiment/vocab.txt"
+----
+
+[,console]
+----
+$ curl --data-binary @./downloads/sentiment/model.onnx -X PUT
"http://localhost:8983/api/cluster/filestore/files/models/sentiment/model.onnx"
+----
+
+.Understanding Solr's FileStore
+[sidebar]
+****
+Solr's FileStore is a distributed file storage system that replicates files
across the SolrCloud cluster. Files uploaded to the FileStore are accessible by
all Solr nodes, making it ideal for storing resources like models and
vocabularies.
+
+When you reference these files in configuration, you use paths relative to the
FileStore root.
+****
+
+=== Step 6: Configure the Document Categorizer Update Processor
+
+Now we'll configure the update processor that will analyze sentiment during
document indexing:
+
+[,console]
+----
+$ curl -X POST -H 'Content-type:application/json' -d '{
+ "add-updateprocessor": {
+ "name": "sentimentClassifier",
+ "class": "solr.processor.DocumentCategorizerUpdateProcessorFactory",
+ "modelFile": "models/sentiment/model.onnx",
+ "vocabFile": "models/sentiment/vocab.txt",
+ "source": "name",
+ "dest": "name_sentiment"
+ }
+}' "http://localhost:8983/solr/sentiment/config"
+----
+
+This configuration creates an update processor that:
+
+* Takes text from the `name` field
+* Processes it through the sentiment model
+* Stores the sentiment classification in the `name_sentiment` field
+
+.Required Parameters for DocumentCategorizerUpdateProcessorFactory
+[cols="1,4"]
+|===
+|Parameter |Description
+
+|`modelFile`
+|Path to the ONNX model file in the FileStore (required)
+
+|`vocabFile`
+|Path to the vocabulary file in the FileStore (required)
+
+|`source`
+|Field(s) containing text to analyze (required)
+
+|`dest`
+|Field where sentiment results will be stored (required)
+|===
+
+=== Step 7: Index Documents with Sentiment Analysis
+
+Let's index some sample documents to see the sentiment analysis in action:
+
+[,console]
+----
+$ curl -X POST -H 'Content-type:application/json' -d '[
+ {
+ "id":"good",
+ "name": "that was an awesome movie!"
+ },
+ {
+ "id":"bad",
+ "name": "that movie was bad and terrible"
+ }
+]'
"http://localhost:8983/solr/sentiment/update/json?processor=sentimentClassifier&commit=true"
+----
+
+Notice that we specify the processor name with `processor=sentimentClassifier`
in the URL.
+
+=== Step 8: Query and Verify the Results
+
+Query the documents to see the sentiment classifications:
+
+[,console]
+----
+$ curl -X GET "http://localhost:8983/solr/sentiment/select?q=id:good"
+----
+
+You should see the positive review classified as "very good":
+
+[,json]
+----
+{
+ "response":{"numFound":1,"start":0,"docs":[
+ {
+ "id":"good",
+ "name":"that was an awesome movie!",
+ "name_sentiment":"very good",
+ "_version_":1687591998864932864}]
+ }
+}
+----
+
+Check the negative review:
+
+[,console]
+----
+$ curl -X GET "http://localhost:8983/solr/sentiment/select?q=id:bad"
+----
+
+The result should show "very bad" sentiment:
+
+[,json]
+----
+{
+ "response":{"numFound":1,"start":0,"docs":[
+ {
+ "id":"bad",
+ "name":"that movie was bad and terrible",
+ "name_sentiment":"very bad",
+ "_version_":1687591998897568768}]
+ }
+}
+----
+
+=== Advanced Configuration Options
+
+The `DocumentCategorizerUpdateProcessorFactory` supports several advanced
configuration options. Here are some examples from real-world use cases:
+
+==== Processing Multiple Source Fields
+
+You can specify multiple source fields either as separate `source` parameters
or as an array:
+
+[,xml]
+----
+<processor class="solr.processor.DocumentCategorizerUpdateProcessorFactory">
+ <str name="modelFile">models/sentiment/model.onnx</str>
+ <str name="vocabFile">models/sentiment/vocab.txt</str>
+ <str name="source">title</str>
+ <str name="source">content</str>
+ <str name="dest">document_sentiment</str>
+</processor>
+----
+
+Or using JSON configuration:
+
+[,json]
+----
+{
+ "add-updateprocessor": {
+ "name": "multiFieldSentiment",
+ "class": "solr.processor.DocumentCategorizerUpdateProcessorFactory",
+ "modelFile": "models/sentiment/model.onnx",
+ "vocabFile": "models/sentiment/vocab.txt",
+ "source": ["title", "content", "comments"],
+ "dest": "document_sentiment"
+ }
+}
+----
+
+==== Using Field Pattern Matching (Regex)
+
+You can use regular expressions to select fields to process:
+
+[,xml]
+----
+<processor class="solr.processor.DocumentCategorizerUpdateProcessorFactory">
+ <str name="modelFile">models/sentiment/model.onnx</str>
+ <str name="vocabFile">models/sentiment/vocab.txt</str>
+ <lst name="source">
+ <str name="fieldRegex">.*_text$|comments_.*</str>
+ </lst>
+ <str name="dest">sentiment</str>
+</processor>
+----
+
+This will process any field ending with `\_text` or starting with `comments_`.
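As a quick way to see which field names a `fieldRegex` like the one above would select, here is an illustrative Python sketch (not Solr's implementation; Solr performs the equivalent matching in Java, and we assume here that the regular expression must match the entire field name):

```python
import re

# The fieldRegex from the example config above.
field_regex = re.compile(r".*_text$|comments_.*")

fields = ["body_text", "comments_2024", "title", "summary"]
# fullmatch() mirrors matching against the whole field name.
selected = [f for f in fields if field_regex.fullmatch(f)]
print(selected)  # ['body_text', 'comments_2024']
```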
+
+==== Dynamic Destination Field Names
+
+You can dynamically generate destination field names based on source field
patterns:
+
+[,xml]
+----
+<processor class="solr.processor.DocumentCategorizerUpdateProcessorFactory">
+ <str name="modelFile">models/sentiment/model.onnx</str>
+ <str name="vocabFile">models/sentiment/vocab.txt</str>
+ <lst name="source">
+ <str name="fieldRegex">review_\d+_text</str>
+ </lst>
+ <lst name="dest">
+ <str name="pattern">review_(\d+)_text</str>
+ <str name="replacement">review_$1_sentiment</str>
+ </lst>
+</processor>
+----
+
+This would process fields like `review_1_text` and store results in
corresponding fields like `review_1_sentiment`.
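The pattern/replacement resolution can be sketched outside Solr. This is an illustrative Python version, not Solr's implementation (Solr uses Java's `Matcher.replaceAll`); note that Python writes the capture-group reference as `\1` where the XML config uses `$1`:

```python
import re

def resolve_dest(field_name, pattern, replacement):
    """Derive the destination field name from a matching source field."""
    return re.sub(pattern, replacement, field_name)

# Mirrors the config above: review_1_text -> review_1_sentiment
dest = resolve_dest("review_1_text", r"review_(\d+)_text", r"review_\1_sentiment")
print(dest)  # review_1_sentiment
```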
+
+==== Field Selection with Exclusions
+
+You can include certain fields and exclude others:
+
+[,xml]
+----
+<processor class="solr.processor.DocumentCategorizerUpdateProcessorFactory">
+ <str name="modelFile">models/sentiment/model.onnx</str>
+ <str name="vocabFile">models/sentiment/vocab.txt</str>
+ <lst name="source">
+ <str name="fieldRegex">text.*</str>
+ <lst name="exclude">
+      <str name="fieldRegex">text_private_.*</str>
+ </lst>
+ </lst>
+ <str name="dest">sentiment</str>
+</processor>
+----
+
+This selects all fields starting with `text` except those starting with
`text_private_`.
+
+==== Creating a Custom Update Processor Chain
+
+For a permanent configuration, define an update processor chain in
`solrconfig.xml`:
+
+[,xml]
+----
+<updateRequestProcessorChain name="sentiment-analysis-chain">
+ <processor class="solr.processor.DocumentCategorizerUpdateProcessorFactory">
+ <str name="modelFile">models/sentiment/model.onnx</str>
+ <str name="vocabFile">models/sentiment/vocab.txt</str>
+ <str name="source">name</str>
+ <str name="dest">name_sentiment</str>
+ </processor>
+ <processor class="solr.LogUpdateProcessorFactory" />
+ <processor class="solr.RunUpdateProcessorFactory" />
+</updateRequestProcessorChain>
+----
+
+You can then use this chain by default or explicitly reference it when
indexing:
+
+[,console]
+----
+$ curl
"http://localhost:8983/solr/sentiment/update/json?update.chain=sentiment-analysis-chain"
-d '...'
+----
+
+=== Practical Applications of Sentiment Analysis in Solr
+
+==== Faceting by Sentiment
+
+Create facets based on sentiment to understand opinion distribution:
+
+[,console]
+----
+$ curl
"http://localhost:8983/solr/sentiment/select?q=*:*&facet=true&facet.field=name_sentiment"
+----
+
+==== Filtering by Sentiment
+
+Filter search results to show only documents with specific sentiment:
+
+[,console]
+----
+$ curl
"http://localhost:8983/solr/sentiment/select?q=product_type:electronics&fq=name_sentiment:very%20good"
+----
+
+==== Boosting by Sentiment
+
+Boost documents with positive sentiment in search results:
+
+[,console]
+----
+$ curl
"http://localhost:8983/solr/sentiment/select?q=*:*&defType=edismax&bq=name_sentiment:very%20good^5.0"
+----
+
+==== Time-Based Sentiment Analysis
+
+Analyze sentiment trends over time using time-based queries and facets:
+
+[,console]
+----
+$ curl
"http://localhost:8983/solr/sentiment/select?q=*:*&facet=true&facet.range=timestamp&facet.range.start=NOW/DAY-30DAY&facet.range.end=NOW&facet.range.gap=%2B1DAY&facet.pivot=timestamp,name_sentiment"
+----
+
+=== Performance Considerations
+
+When using ONNX models in Solr, consider these performance aspects:
+
+* **Memory Usage**: ONNX models can be memory-intensive. Ensure sufficient
heap space.
+* **Batch Processing**: For large document sets, consider batching updates.
+* **Model Size**: Quantized models (like the one in our example) offer better
performance.
+* **CPU Utilization**: NLP processing is CPU-intensive. Consider CPU resources
when planning deployments. We anticipate leveraging ONNX on the GPU in the
future.
+* **Response Time Impact**: The additional processing increases indexing time
but not query time.
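The batch-processing point above can be sketched simply: split documents into fixed-size chunks before posting them to `/update`. The chunking helper below is generic Python; the batch size and endpoint named in the comments are illustrative assumptions, not Solr requirements:

```python
import json

def batches(docs, size):
    """Yield successive batches of at most `size` documents."""
    for i in range(0, len(docs), size):
        yield docs[i:i + size]

docs = [{"id": str(n), "name": f"review {n}"} for n in range(5)]
sizes = []
for batch in batches(docs, 2):
    payload = json.dumps(batch)  # body for a POST to .../solr/sentiment/update/json
    sizes.append(len(batch))
print(sizes)  # [2, 2, 1]
```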
+
+A pattern that has been demonstrated is to index each document twice.
+The first time, index the document without any sentiment analysis so the basic
data gets into the index quickly and is made available to users.
+The second time, enable the `update.chain` that performs the sentiment
analysis.
+
+=== Going Beyond Sentiment Analysis
+
+The same approach can be extended to other NLP tasks using different models:
+
+* **Named Entity Recognition**: Use
`OpenNLPExtractNamedEntitiesUpdateProcessorFactory` to identify entities
+* **Language Detection**: Use `OpenNLPLangDetectUpdateProcessorFactory` for
automatic language identification
+* **Document Classification**: Use custom models for topic or category
classification
+* **Summarization**: Extract key sentences or generate summaries during
indexing
+
+=== Troubleshooting
+
+==== Common Issues and Solutions
+
+1. **Model Loading Errors**:
+ * Ensure paths to model files are correct
+ * Verify models are properly uploaded to the FileStore
+   * Check that the security manager is disabled so the ONNX runtime can load
+
+2. **Out of Memory Errors**:
+   * Increase JVM heap space with the `-m` parameter
+ * Use quantized models to reduce memory usage
+ * Process documents in smaller batches
+
+3. **Unexpected Classifications**:
+ * Check that text preprocessing matches model expectations
+ * Ensure vocabulary file corresponds to the model
+ * Consider text normalization in your schema definition
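For the vocabulary check above, recall that BERT-style `vocab.txt` files store one token per line, with a token's ID given by its zero-based line number (an assumption about WordPiece vocabulary files generally, not a Solr API). A minimal sketch of loading and spot-checking one:

```python
import os
import tempfile

def load_vocab(path):
    """Map each token to its ID (its zero-based line number)."""
    with open(path, encoding="utf-8") as f:
        return {line.rstrip("\n"): i for i, line in enumerate(f)}

# Write a tiny demo vocab and load it back.
path = os.path.join(tempfile.gettempdir(), "vocab_demo.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("[PAD]\n[UNK]\n[CLS]\n[SEP]\nmovie\n")

vocab = load_vocab(path)
print(vocab["[CLS]"], vocab["movie"])  # 2 4
```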
+
+=== Conclusion
+
+In this tutorial, you learned how to:
+
+1. Configure Solr with OpenNLP and ONNX runtime
+2. Load and use a pre-trained sentiment analysis model
+3. Set up a document categorizer update processor
+4. Process documents with automatic sentiment classification
+5. Use advanced configuration options for complex scenarios
+6. Apply sentiment analysis in practical search applications
+
+This integration demonstrates how Solr can leverage modern NLP capabilities to
enhance search and analytics functionality. By automatically enriching
documents with sentiment information during indexing, you can provide more
nuanced search experiences and gain deeper insights into your text data.
+
+=== Cleaning Up
+
+When you're done with this tutorial, stop Solr:
+
+[,console]
+----
+$ bin/solr stop --all
+----
diff --git
a/solr/solr-ref-guide/modules/upgrade-notes/pages/major-changes-in-solr-10.adoc
b/solr/solr-ref-guide/modules/upgrade-notes/pages/major-changes-in-solr-10.adoc
index f5bdf34f1a7..1821dadbf8e 100644
---
a/solr/solr-ref-guide/modules/upgrade-notes/pages/major-changes-in-solr-10.adoc
+++
b/solr/solr-ref-guide/modules/upgrade-notes/pages/major-changes-in-solr-10.adoc
@@ -110,6 +110,14 @@ HTTP requests to SolrCloud that are for a specific core
must be delivered to the
Previously, SolrCloud would try too hard scanning the cluster's state to look
for it and internally route/proxy it.
If only one node is exposed to a client, and if the client uses the bin/solr
export tool, it probably won't work.
+=== Modern NLP Models from Apache OpenNLP with Solr
+
+Solr now lets you access models encoded in ONNX format, commonly sourced from
Hugging Face.
+The DocumentCategorizerUpdateProcessorFactory lets you perform sentiment
analysis and other classification tasks on fields.
+It is available as part of the `analysis-extras` module.
+
=== Deprecation removals
* The `jaegertracer-configurator` module, which was deprecated in 9.2, is
removed. Users should migrate to the `opentelemetry` module.