[ https://issues.apache.org/jira/browse/SOLR-9258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15534646#comment-15534646 ]
ASF subversion and git services commented on SOLR-9258:
-------------------------------------------------------

Commit 5adb8f1bd5905f6749e57b7e27d467a4f36c56b2 in lucene-solr's branch refs/heads/branch_6x from [~joel.bernstein]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=5adb8f1 ]

SOLR-9258: Update CHANGES.txt

> Optimizing, storing and deploying AI models with Streaming Expressions
> ----------------------------------------------------------------------
>
>                 Key: SOLR-9258
>                 URL: https://issues.apache.org/jira/browse/SOLR-9258
>             Project: Solr
>          Issue Type: New Feature
>      Security Level: Public (Default Security Level. Issues are Public)
>            Reporter: Joel Bernstein
>            Assignee: Joel Bernstein
>             Fix For: 6.2
>
>         Attachments: ModelCache.java, ModelCache.java, SOLR-9258.patch,
> SOLR-9258.patch, SOLR-9258.patch, SOLR-9258.patch, SOLR-9258.patch,
> SOLR-9258.patch
>
>
> This ticket describes a framework for *optimizing*, *storing* and *deploying*
> AI models within the Streaming Expression framework.
>
> *Optimizing*
>
> [~caomanhdat] has contributed SOLR-9252, which provides *Streaming
> Expressions* for both feature selection and optimization of a logistic
> regression text classifier. SOLR-9252 also provides a great working example
> of *optimization* of a machine learning model using an in-place parallel
> iterative algorithm.
>
> *Storing*
>
> Both features and optimized models can be stored in SolrCloud collections
> using the update expression. Using [~caomanhdat]'s example in SOLR-9252, the
> pseudo code for storing features would be:
> {code}
> update(featuresCollection,
>        featuresSelection(collection1,
>                          id="myFeatures",
>                          q="*:*",
>                          field="tv_text",
>                          outcome="out_i",
>                          positiveLabel=1,
>                          numTerms=100))
> {code}
> The id field can be added to the featuresSelection expression so that the
> features can later be retrieved from the collection they are stored in.
>
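For anyone who wants to try sending an expression like the one above to a node, here is a minimal sketch that URL-encodes it into the `expr` parameter of a collection's /stream handler. The base URL and the helper name `stream_url` are assumptions made for illustration, and the expression itself is the ticket's pseudo code, so it may need adjusting before a real Solr build accepts it.

```python
from urllib.parse import urlencode

# The expression from the ticket, kept as pseudo code: the collection and
# field names are the ticket's examples, not a verified schema.
STORE_FEATURES_EXPR = """update(featuresCollection,
    featuresSelection(collection1,
        id="myFeatures",
        q="*:*",
        field="tv_text",
        outcome="out_i",
        positiveLabel=1,
        numTerms=100))"""


def stream_url(base_url: str, collection: str, expr: str) -> str:
    """Build a /stream request URL carrying the expression in `expr`.

    `stream_url` is a hypothetical helper, not part of any Solr client.
    """
    query = urlencode({"expr": expr})
    return f"{base_url.rstrip('/')}/{collection}/stream?{query}"


if __name__ == "__main__":
    # Hypothetical local node; adjust host, port and collection as needed.
    print(stream_url("http://localhost:8983/solr", "collection1",
                     STORE_FEATURES_EXPR))
```

The sketch only builds the URL; actually issuing the request and reading the tuple stream back is left to whatever HTTP client is at hand.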
> *Deploying*
>
> With the introduction of the topic() expression, SolrCloud can be treated as
> a distributed message queue. This messaging capability can be used to deploy
> models and process data through the models.
>
> To implement this approach, a classify() function can be created that uses
> topic() functions to return both the model and the data to be classified.
> The pseudo code looks like this:
> {code}
> classify(topic(models, q="modelID", fl="features, weights"),
>          topic(emails, q="*:*", fl="id, body", rows="500", version="3232323"))
> {code}
> In the example above the classify() function uses the topic() function to
> retrieve the model. Each time there is an update to the model in the index,
> the topic() expression will automatically read the new model.
>
> The topic() function is also used to pull in the data set that is being
> classified. Notice the *version* parameter. This will be added to the topic()
> function to support pulling results from a specific version number (jira
> ticket to follow).
>
> With this approach both the model and the data to process through the model
> are treated as messages in a message queue.
>
> The daemon() function can be used to send the classify() function to Solr,
> where it will be run in the background. The pseudo code looks like this:
> {code}
> daemon(...,
>        update(classifiedEmails,
>               classify(topic(models, q="modelID", fl="features, weights"),
>                        topic(emails, q="*:*", fl="id, body",
>                              rows="500", version="3232323"))))
> {code}
> In this scenario the daemon will run the classify() function repeatedly in
> the background. With each run the topic() functions will re-pull the model if
> it has been updated and pull a new set of emails to be classified. The
> classified emails can be stored in another SolrCloud collection using the
> update() function.
>
> Using this approach emails can be classified in batches. The daemon can
> continue to run even after all the emails have been classified.
> New emails added to the emails collection will then be automatically
> classified when they enter the index.
>
> Classification can be done in parallel once SOLR-9240 is completed. This will
> allow topic() results to be partitioned across worker nodes so they can be
> processed in parallel. The pseudo code for this is:
> {code}
> parallel(workerCollection, workers="20", ...,
>          daemon(...,
>                 update(classifiedEmails,
>                        classify(topic(models, q="modelID", fl="features, weights",
>                                       partitionKeys="none"),
>                                 topic(emails, q="*:*", fl="id, body",
>                                       rows="500", version="3232323",
>                                       partitionKeys="id")))))
> {code}
> The code above sends a daemon to 20 workers, each of which will classify a
> partition of the records pulled by the topic() function.
>
> *AI based alerting*
>
> If the *version* parameter is not supplied to the topic() function, it will
> stream only new content from the topic, rather than starting from an older
> version number.
>
> In this scenario the topic() function behaves like an alert. Pseudo code for
> alerts looks like this:
> {code}
> daemon(...,
>        alert(...,
>              classify(topic(models, q="modelID", fl="features, weights"),
>                       topic(emails, q="*:*", fl="id, body", rows="500"))))
> {code}
> In the example above an alert() function wraps the classify() function and
> takes actions based on the classification of documents. Developers can build
> their own alert functions using the Streaming API and plug them in to provide
> custom actions.
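The ticket does not spell out how classify() or alert() would work internally. As a rough illustration only, the sketch below scores each email with a logistic regression model given as parallel `features` and `weights` lists (the two fields the pseudo code requests from the models topic), then fires a callback for documents above a threshold. The binary term-presence features, the whitespace tokenizer, the missing bias term, the 0.5 threshold and the Python function signatures are all assumptions, not the Solr implementation.

```python
import math
from typing import Callable, Dict, Iterable, List


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def classify(model: Dict[str, list], docs: Iterable[Dict[str, str]]) -> List[Dict]:
    """Attach a probability to each doc (hypothetical stand-in for classify()).

    Assumes binary term-presence features and no bias term.
    """
    weights = dict(zip(model["features"], model["weights"]))
    scored = []
    for doc in docs:
        terms = set(doc["body"].lower().split())
        # Sum the weights of the model terms that occur in the document.
        z = sum(w for term, w in weights.items() if term in terms)
        scored.append({"id": doc["id"], "probability": sigmoid(z)})
    return scored


def alert(classified: Iterable[Dict], action: Callable[[Dict], None],
          threshold: float = 0.5) -> int:
    """Call `action` for each doc scored above `threshold`; return the count."""
    fired = 0
    for doc in classified:
        if doc["probability"] > threshold:
            action(doc)
            fired += 1
    return fired


if __name__ == "__main__":
    # Made-up model and emails, purely for illustration.
    model = {"features": ["free", "meeting"], "weights": [2.0, -1.0]}
    emails = [{"id": "1", "body": "FREE prize inside"},
              {"id": "2", "body": "meeting at noon"}]
    alert(classify(model, emails), lambda d: print("alert:", d["id"]))
```

Passing the action as a callback mirrors the idea of pluggable alert functions: the scoring step stays the same and only the action changes.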