[ https://issues.apache.org/jira/browse/SOLR-9240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15371958#comment-15371958 ]
Joel Bernstein commented on SOLR-9240:
--------------------------------------

This ticket is looking fairly good. I did a round of manual testing, and it worked as expected.

{code}
parallel(workerCollection,
         workers="2",
         sort="_version_ desc",
         daemon(update(updateCollection,
                       batchSize=200,
                       topic(checkpointCollection,
                             topicCollection,
                             q=*:*,
                             id="topic40",
                             fl="id, to , from",
                             partitionKeys="id",
                             initialCheckpoint="0")),
                runInterval="1000",
                id="test3"))
{code}

> Support running the topic() Streaming Expression in parallel mode.
> ------------------------------------------------------------------
>
>                 Key: SOLR-9240
>                 URL: https://issues.apache.org/jira/browse/SOLR-9240
>             Project: Solr
>          Issue Type: Improvement
>            Reporter: Joel Bernstein
>            Assignee: Joel Bernstein
>         Attachments: SOLR-9240.patch, SOLR-9240.patch
>
>
> Currently the topic() function won't run in parallel mode because each worker
> needs to maintain a separate set of checkpoints. The proposed solution for
> this is to append the worker ID to the topic ID, which will cause each worker
> to have its own checkpoints.
> It would be useful to support parallelizing the topic function because it
> will provide a general-purpose approach for processing text in parallel
> across worker nodes.
> For example, this would allow a classify() function to be wrapped around a
> topic() function to classify documents in parallel across worker nodes.
> Sample syntax:
> {code}
> parallel(daemon(update(classify(topic(..., partitionKeys="id")))))
> {code}
> The example above would send a daemon to worker nodes that would classify all
> documents returned by the topic() function. The update function would send
> the output of classify() to a SolrCloud collection for indexing.
> The partitionKeys parameter would ensure that each worker would receive a
> partition of the results returned by the topic() function. This allows the
> classify() function to be run in parallel.
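
To make the checkpoint idea in the description concrete, here is a rough Python sketch of the two mechanisms it relies on: a per-worker topic ID and key-based partitioning. It is not Solr code. The underscore separator in worker_topic_id(), the function names, and the use of CRC32 as the partitioning hash are all assumptions made for illustration; the ticket only says that the worker ID is appended to the topic ID and that partitionKeys splits the results across workers.

```python
import zlib


def worker_topic_id(topic_id: str, worker_id: int) -> str:
    """Derive a per-worker topic ID so each worker keeps its own checkpoints.

    The "_" separator is a hypothetical choice; the ticket only states that
    the worker ID is appended to the topic ID.
    """
    return f"{topic_id}_{worker_id}"


def assign_worker(key: str, num_workers: int) -> int:
    """Map a partition key to a worker index.

    CRC32 is a stand-in for whatever hash Solr applies to partitionKeys.
    """
    return zlib.crc32(key.encode("utf-8")) % num_workers


def partition(docs, key_field: str, num_workers: int):
    """Split documents so each one goes to exactly one worker."""
    buckets = {worker: [] for worker in range(num_workers)}
    for doc in docs:
        worker = assign_worker(str(doc[key_field]), num_workers)
        buckets[worker].append(doc)
    return buckets


if __name__ == "__main__":
    docs = [{"id": str(i)} for i in range(10)]
    for worker, bucket in partition(docs, "id", 2).items():
        ids = [doc["id"] for doc in bucket]
        print(worker_topic_id("topic40", worker), ids)
```

Because the same key always hashes to the same worker, a given document is only ever seen by one worker, and the checkpoints stored under that worker's derived topic ID never collide with another worker's.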