[jira] [Updated] (SOLR-12178) Improve efficiency of random sampling

Joel Bernstein (JIRA) Tue, 03 Apr 2018 05:09:43 -0700

     [ https://issues.apache.org/jira/browse/SOLR-12178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Joel Bernstein updated SOLR-12178:
----------------------------------
    Description: 
Currently the *random* Streaming Expression performs a distributed random 
sampling using *CloudSolrClient*. This means that a random sample of *N* docs 
from each shard is read into memory on the aggregator node and then a page of 
*N* docs is created from the samples from each shard. Reading all the samples 
from the shards into memory in the aggregator node means the memory consumption 
for random sampling grows as a function of: N*numshards. This clearly limits 
both N and numshards.

This ticket will switch random sampling to an approach similar to the one used 
in CloudSolrStream, where a stream is generated from the shards without 
reading all the documents into memory.
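
To illustrate the difference, here is a minimal Python sketch of a streaming 
merge. It is not Solr's implementation. It assumes each shard can emit its 
sample already ordered by a random key, so the aggregator can merge the shard 
streams lazily and stop after N docs, buffering roughly one pending doc per 
shard instead of N*numshards. The names shard_stream and streamed_sample, and 
the in-memory lists standing in for shards, are hypothetical.

```python
import heapq
import itertools
import random


def shard_stream(docs, n, rng):
    """Hypothetical shard: yield up to n (random_key, doc) pairs in key order."""
    keyed = ((rng.random(), doc) for doc in docs)
    # Selecting the n smallest random keys happens on the shard side.
    yield from heapq.nsmallest(n, keyed, key=lambda pair: pair[0])


def streamed_sample(shards, n, seed=None):
    """Lazily merge the per-shard streams and yield the first n docs."""
    rng = random.Random(seed)
    streams = [shard_stream(docs, n, rng) for docs in shards]
    # heapq.merge keeps one pending item per stream, so the aggregator's
    # buffer grows with the number of shards, not with n * numshards.
    merged = heapq.merge(*streams, key=lambda pair: pair[0])
    for _, doc in itertools.islice(merged, n):
        yield doc


if __name__ == "__main__":
    shards = [[f"shard{s}-doc{i}" for i in range(1000)] for s in range(4)]
    for doc in streamed_sample(shards, 5):
        print(doc)
```

Because every doc gets an independent random key and the merge yields docs in 
global key order, the first N docs form a random sample across all shards. 
Each shard still selects its own N candidates; only the aggregator's 
buffering changes.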

  was:
Currently the *random* Streaming Expression performs a distributed random 
sampling using *CloudSolrClient*. This means that a random sample of *N* docs 
from each shard is read into memory on the aggregator node and then a page of 
*N* docs is created from the samples from from each shard. Reading all the 
samples from the shards into memory in the aggregator node means the memory 
consumption for random sampling grows as a function of: N*numshards. This 
clearly limits both N and numshards.

This ticket will switch random sampling to an approach similar to the one used 
in CloudSolrStream, where a stream is generated from the shards without 
reading all the documents into memory.


> Improve efficiency of random sampling
> -------------------------------------
>
>                 Key: SOLR-12178
>                 URL: https://issues.apache.org/jira/browse/SOLR-12178
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Joel Bernstein
>            Priority: Major
>
> Currently the *random* Streaming Expression performs a distributed random 
> sampling using *CloudSolrClient*. This means that a random sample of *N* docs 
> from each shard is read into memory on the aggregator node and then a page of 
> *N* docs is created from the samples from each shard. Reading all the samples 
> from the shards into memory in the aggregator node means the memory 
> consumption for random sampling grows as a function of: N*numshards. This 
> clearly limits both N and numshards.
> This ticket will switch random sampling to an approach similar to the one 
> used in CloudSolrStream, where a stream is generated from the shards without 
> reading all the documents into memory.


