[ https://issues.apache.org/jira/browse/SOLR-5075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13719537#comment-13719537 ]

Erick Erickson commented on SOLR-5075:
--------------------------------------

FWIW, I was about to say the same thing, but had one comment.

SOLR-4816 (not in 4.4, but soon) should add some efficiencies to SolrJ 
updating; I'd love to see what effect it has in your situation.

One thing: it looks like you're accumulating all the docs from the select in 
one huge batch and indexing them all at once. If that's the case, try submitting 
them, say, 1,000 at a time. I suspect that will not hang, but I also suspect 
it will slow your initial ingest rate, because you'll actually be sending docs 
to Solr rather than just accumulating them all locally.
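To make the suggestion concrete, here is a minimal sketch of the 1,000-at-a-time pattern. It is not SolrJ code: submitBatch() and the String "documents" are stand-ins for App.solrServer.add(...) and SolrInputDocument from your importer, so only the batching logic carries over.

```java
import java.util.ArrayList;
import java.util.List;

public class BatchedImport {
    static final int BATCH_SIZE = 1000;
    static int batchesSent = 0; // counts submissions, for illustration only

    // Stand-in for App.solrServer.add(batch) in the real importer.
    static void submitBatch(List<String> batch) {
        batchesSent++;
    }

    public static void main(String[] args) {
        List<String> batch = new ArrayList<>();
        // The loop stands in for while (results.next()) { ... } in
        // createDocuments(); each String stands in for one document.
        for (int i = 0; i < 5500; i++) {
            batch.add("doc-" + i);
            if (batch.size() >= BATCH_SIZE) {
                submitBatch(batch);   // ship 1,000 docs, then start fresh
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {       // flush the final partial batch
            submitBatch(batch);
        }
        System.out.println(batchesSent); // 5,500 docs -> 6 batches
    }
}
```

The point is that memory use stays bounded by the batch size instead of growing with the whole result set, at the cost of more round trips to Solr.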
                
> SolrCloud commit process is too time consuming, even if documents are light
> ---------------------------------------------------------------------------
>
>                 Key: SOLR-5075
>                 URL: https://issues.apache.org/jira/browse/SOLR-5075
>             Project: Solr
>          Issue Type: Bug
>          Components: Schema and Analysis, SolrCloud
>    Affects Versions: 4.1
>         Environment: SolrCloud 4.1, internal Zookeeper, 16 shards, custom 
> Java importer.
> Server: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz, 32 cores, 192GB RAM, 10TB 
> SSD and 50TB SAS storage
>            Reporter: Radu Ghita
>              Labels: import, solrconfig.xml
>
> We have a client whose business model requires indexing a billion rows from 
> MySQL into Solr each month, in a small time-frame. The documents are very 
> light, but the count is very high and we need to achieve speeds of around 
> 80-100k docs/s. The built-in Solr indexer tops out at 40-50k, and after some 
> hours (~12) it crashes and the speed degrades as hours go by.
> We have therefore developed a custom Java importer that connects directly to 
> MySQL and to SolrCloud via Zookeeper, grabs data from MySQL, creates documents 
> and then imports them into Solr. This helps because we open ~50 threads and 
> the indexing process speeds up. We have optimized the MySQL queries (MySQL 
> was the initial bottleneck) and the speeds we get now are over 100k/s, but 
> as the index grows, Solr takes very long to add documents. I assume something 
> in solrconfig makes Solr stall and even block after 100 million documents 
> indexed.
> Here is the java code that creates documents and then adds to solr server:
> public void createDocuments() throws SQLException, SolrServerException, 
> IOException
>       {
>               App.logger.write("Creating documents..");
>               this.docs = new ArrayList<SolrInputDocument>();
>               App.logger.incrementNumberOfRows(this.size);
>               while(this.results.next())
>               {
>                          
> this.docs.add(this.getDocumentFromResultSet(this.results));
>               }
>               this.statement.close();
>               this.results.close();
>       }
>       
> public void commitDocuments() throws SolrServerException, IOException
> {
>     App.logger.write("Committing..");
>     App.solrServer.add(this.docs); // here it stays very long and then blocks
>     App.logger.incrementNumberOfRows(this.docs.size());
>     this.docs.clear();
> }
> I am also pasting the solrconfig.xml parameters that are relevant to this 
> discussion:
> <maxIndexingThreads>128</maxIndexingThreads>
> <useCompoundFile>false</useCompoundFile>
> <ramBufferSizeMB>10000</ramBufferSizeMB>
> <maxBufferedDocs>1000000</maxBufferedDocs>
> <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
>           <int name="maxMergeAtOnce">20000</int>
>           <int name="segmentsPerTier">1000000</int>
>           <int name="maxMergeAtOnceExplicit">10000</int>
> </mergePolicy>
> <mergeFactor>100</mergeFactor>
> <termIndexInterval>1024</termIndexInterval>
> <autoCommit>
>     <maxTime>15000</maxTime>
>     <maxDocs>1000000</maxDocs>
>     <openSearcher>false</openSearcher>
> </autoCommit>
> <autoSoftCommit>
>     <maxTime>2000000</maxTime>
> </autoSoftCommit>
> Thanks a lot for any answers, and excuse my long text; I'm new to this JIRA. 
> If any other info is needed, please let me know.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org