[ https://issues.apache.org/jira/browse/SOLR-9384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15408837#comment-15408837 ]
Cao Manh Dat edited comment on SOLR-9384 at 8/5/16 4:14 AM:
------------------------------------------------------------
Hi Joel,
I think we should support both randomization and full training on large datasets.
For full training on a large dataset, we can split the documents into batches (for
example, by {{docId % batchId}}) and run training on each batch in sequence.
The number of TermEnum seeks will then equal the number of batches.
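
As a rough illustration of the batching idea above, here is a minimal Python sketch. It assumes the modulo example means assigning each document to batch docId % numBatches. The names split_into_batches, train_in_batches and the train_step callback are hypothetical and are not part of Solr.

```python
from typing import Callable, Dict, Iterable, List


def split_into_batches(doc_ids: Iterable[int], num_batches: int) -> Dict[int, List[int]]:
    """Assign each doc id to batch (doc_id % num_batches). Assumed reading of the example."""
    if num_batches <= 0:
        raise ValueError("num_batches must be positive")
    batches: Dict[int, List[int]] = {b: [] for b in range(num_batches)}
    for doc_id in doc_ids:
        batches[doc_id % num_batches].append(doc_id)
    return batches


def train_in_batches(
    doc_ids: Iterable[int],
    num_batches: int,
    train_step: Callable[[int, List[int]], None],
) -> None:
    """Run the (hypothetical) train_step once per batch, in sequence."""
    for batch_id, batch in split_into_batches(doc_ids, num_batches).items():
        # Only the doc vectors of the current batch need to be in memory here.
        train_step(batch_id, batch)
```

With this layout every document is visited exactly once per pass, and peak memory is bounded by the largest batch rather than by the whole training set.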
> Add randomization to the train Streaming Expression to support very large
> training sets
> ---------------------------------------------------------------------------------------
>
> Key: SOLR-9384
> URL: https://issues.apache.org/jira/browse/SOLR-9384
> Project: Solr
> Issue Type: Improvement
> Security Level: Public(Default Security Level. Issues are Public)
> Reporter: Joel Bernstein
>
> The *train* (SOLR-9252) Streaming Expression optimizes a logistic regression
> model on text.
> The initial implementation instantiates a doc vector for each document in the
> training set on each iteration. The doc vectors are held in memory, so the
> size of the training set is limited by memory constraints.
> This ticket will add randomization to the algorithm so that a random set of
> documents from the training set is processed on each iteration.
> This will allow the train Streaming Expression to be run on much larger
> training sets.
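
The randomization described in the issue can be sketched as mini-batch gradient descent for logistic regression, where each iteration draws a fresh random sample of documents. This is only a sketch under assumptions: doc vectors are dense lists of floats, and train_randomized, sample_size and learning_rate are hypothetical names, not the train expression's actual parameters.

```python
import math
import random
from typing import List, Sequence


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_randomized(
    docs: Sequence[Sequence[float]],
    labels: Sequence[int],
    num_features: int,
    iterations: int,
    sample_size: int,
    learning_rate: float = 0.1,
    seed: int = 0,
) -> List[float]:
    """Logistic regression where each iteration sees only a random sample of docs."""
    if sample_size <= 0 or not docs:
        raise ValueError("need a positive sample_size and at least one document")
    rng = random.Random(seed)
    weights = [0.0] * num_features
    k = min(sample_size, len(docs))
    for _ in range(iterations):
        # Only the sampled documents' vectors are touched in this iteration.
        sample = rng.sample(range(len(docs)), k)
        gradient = [0.0] * num_features
        for i in sample:
            score = sum(w * x for w, x in zip(weights, docs[i]))
            error = sigmoid(score) - labels[i]
            for j, x in enumerate(docs[i]):
                gradient[j] += error * x
        for j in range(num_features):
            weights[j] -= learning_rate * gradient[j] / k
    return weights
```

Memory use per iteration then scales with sample_size instead of the full training set, at the cost of noisier gradient estimates.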