[jira] [Comment Edited] (SOLR-9384) Add randomization to the train Streaming Expression to support very large training sets

Cao Manh Dat (JIRA) Thu, 04 Aug 2016 21:15:43 -0700

    [ https://issues.apache.org/jira/browse/SOLR-9384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15408837#comment-15408837 ]


Cao Manh Dat edited comment on SOLR-9384 at 8/5/16 4:14 AM:
------------------------------------------------------------

Hi Joel,

I think we should support both randomization and full training on a large 
dataset. For full training on a large dataset, we can split the documents into 
batches (for example, by {{docId % batchId}}) and run the train in sequence for 
each batch. The number of TermEnum seeks will then equal the number of batches.
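
A minimal sketch of the batching idea, in plain Python rather than Solr code. It reads the {{docId % batchId}} example as "document goes to batch docId % numBatches", which is an assumption, and the names partition, train_on_batch and train are hypothetical. It does not model TermEnum seeks; it only shows the weights being carried from one batch to the next in sequence.

```python
import math


def partition(doc_ids, num_batches):
    """Assign each document to batch doc_id % num_batches (assumed scheme)."""
    batches = [[] for _ in range(num_batches)]
    for doc_id in doc_ids:
        batches[doc_id % num_batches].append(doc_id)
    return batches


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_on_batch(weights, batch, features, labels, lr=0.1):
    """One gradient-descent step of logistic regression over a single batch."""
    grad = [0.0] * len(weights)
    for doc_id in batch:
        x = features[doc_id]
        err = sigmoid(sum(w * xi for w, xi in zip(weights, x))) - labels[doc_id]
        for j, xi in enumerate(x):
            grad[j] += err * xi
    return [w - lr * g / len(batch) for w, g in zip(weights, grad)]


def train(features, labels, num_batches, iterations):
    """Visit every batch in sequence on each iteration, reusing the weights."""
    weights = [0.0] * len(next(iter(features.values())))
    # Skip empty batches so the per-batch average never divides by zero.
    batches = [b for b in partition(sorted(features), num_batches) if b]
    for _ in range(iterations):
        for batch in batches:
            weights = train_on_batch(weights, batch, features, labels)
    return weights
```

Only one batch of doc vectors has to be held in memory at a time, while every document still contributes to the model on every iteration.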


> Add randomization to the train Streaming Expression to support very large 
> training sets
> ---------------------------------------------------------------------------------------
>
>                 Key: SOLR-9384
>                 URL: https://issues.apache.org/jira/browse/SOLR-9384
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Joel Bernstein
>
> The *train* (SOLR-9252) Streaming Expression optimizes a logistic regression 
> model on text.
> The initial implementation instantiates a doc vector for each document in the 
> training set on each iteration. The doc vectors are held in memory, so the 
> size of the training set is limited by memory constraints.
> This ticket will add randomization to the algorithm so that a random set of 
> documents from the training set is processed on each iteration. 
> This will allow the train Streaming Expression to be run on much larger 
> training sets.
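
For comparison, the randomization described in the ticket could look roughly like the following plain-Python sketch. It is not the Solr implementation: train_randomized, sample_size and seed are hypothetical names, and uniform sampling without replacement is an assumption about what "a random set of documents" means.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_randomized(features, labels, sample_size, iterations, lr=0.1, seed=0):
    """Logistic regression where each iteration sees only a random sample."""
    rng = random.Random(seed)  # seeded so runs are repeatable
    doc_ids = sorted(features)
    weights = [0.0] * len(features[doc_ids[0]])
    k = min(sample_size, len(doc_ids))
    for _ in range(iterations):
        # Only k doc vectors need to be in memory for this iteration.
        sample = rng.sample(doc_ids, k)
        grad = [0.0] * len(weights)
        for doc_id in sample:
            x = features[doc_id]
            err = sigmoid(sum(w * xi for w, xi in zip(weights, x))) - labels[doc_id]
            for j, xi in enumerate(x):
                grad[j] += err * xi
        weights = [w - lr * g / k for w, g in zip(weights, grad)]
    return weights
```

Memory use is bounded by sample_size instead of the size of the training set, at the cost of each iteration seeing only part of the data.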


