Hi,

We have a client whose business model requires indexing about a billion
rows each month from MySQL into Solr within a short time frame. The
documents are very light, but there are a great many of them, and we need
to sustain around 80-100k documents/s. The built-in Solr indexer tops out
at 40-50k/s, slows down as the hours go by, and crashes after roughly 12
hours.

We therefore developed a custom Java importer that connects directly to
MySQL and to SolrCloud via ZooKeeper, reads the data from MySQL, builds the
documents and sends them to Solr. This helps because we run ~50 threads in
parallel, which speeds up indexing. We have optimized the MySQL queries
(MySQL was the initial bottleneck) and now get over 100k documents/s, but
as the index grows, Solr takes longer and longer to add documents. I assume
something in solrconfig.xml makes Solr stall, and eventually block, once
about 100 million documents have been indexed.

Here is the Java code that creates the documents and then adds them to the
Solr server:

public void createDocuments() throws SQLException, SolrServerException,
        IOException
{
    App.logger.write("Creating documents..");
    this.docs = new ArrayList<SolrInputDocument>();
    App.logger.incrementNumberOfRows(this.size);
    while (this.results.next())
    {
        this.docs.add(this.getDocumentFromResultSet(this.results));
    }

    // Close the result set before the statement that produced it.
    this.results.close();
    this.statement.close();
}

public void commitDocuments() throws SolrServerException, IOException
{
    App.logger.write("Committing..");
    // Here it stays very long and then blocks.
    App.solrServer.add(this.docs);
    App.logger.incrementNumberOfRows(this.docs.size());
    this.docs.clear();
}
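
A minimal sketch of the kind of threaded, batched submission described
above, in case it helps to reason about the load pattern. It is a
simplified illustration, not our actual importer: DocSink,
BatchedImportSketch, importAll, the batch size and the thread count are all
hypothetical, plain strings stand in for SolrInputDocument, and DocSink
stands in for the App.solrServer.add call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for the Solr client; a real importer would call
// App.solrServer.add(batch) here.
interface DocSink {
    void add(List<String> batch) throws Exception;
}

final class BatchedImportSketch {
    // Splits rows into batches of at most batchSize and sends each batch on
    // a worker thread. Returns the number of rows sent successfully.
    static long importAll(List<String> rows, int batchSize, int threads, DocSink sink)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicLong sent = new AtomicLong();
        for (int start = 0; start < rows.size(); start += batchSize) {
            int end = Math.min(start + batchSize, rows.size());
            // Copy the slice so that each task owns its own batch.
            List<String> batch = new ArrayList<>(rows.subList(start, end));
            pool.submit(() -> {
                try {
                    sink.add(batch);
                    sent.addAndGet(batch.size());
                } catch (Exception e) {
                    System.err.println("batch failed: " + e);
                }
            });
        }
        pool.shutdown();
        // Wait for the queued batches to finish (arbitrary timeout for the sketch).
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return sent.get();
    }

    public static void main(String[] args) throws Exception {
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < 2500; i++) {
            rows.add("row-" + i);
        }
        // A no-op sink; 2500 rows with batchSize 1000 gives three batches.
        long sent = importAll(rows, 1000, 4, batch -> { });
        System.out.println("sent " + sent + " rows");
    }
}
```

The point of the fixed batch size in the sketch is that each add call
carries a bounded number of documents, whatever the size of the MySQL
result set.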

I am also pasting the solrconfig.xml parameters that are relevant to this
discussion:
<maxIndexingThreads>128</maxIndexingThreads>
<useCompoundFile>false</useCompoundFile>
<ramBufferSizeMB>10000</ramBufferSizeMB>
<maxBufferedDocs>1000000</maxBufferedDocs>
<mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
  <int name="maxMergeAtOnce">20000</int>
  <int name="segmentsPerTier">1000000</int>
  <int name="maxMergeAtOnceExplicit">10000</int>
</mergePolicy>
<mergeFactor>100</mergeFactor>
<termIndexInterval>1024</termIndexInterval>
<autoCommit>
  <maxTime>15000</maxTime>
  <maxDocs>1000000</maxDocs>
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <maxTime>2000000</maxTime>
</autoSoftCommit>
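
As a rough sanity check on the commit settings above, the arithmetic can be
sketched as below. It assumes a steady 100k documents/s (our target, not a
measurement) and that autoCommit fires on whichever of maxDocs or maxTime
is reached first; the class and method names are made up for illustration.

```java
final class CommitMathSketch {
    // Seconds needed to reach a document count at a steady rate.
    static double secondsToReach(long docs, long docsPerSecond) {
        return (double) docs / docsPerSecond;
    }

    public static void main(String[] args) {
        long rate = 100_000;                  // assumed steady rate, docs/s
        long autoCommitMaxDocs = 1_000_000;   // <autoCommit><maxDocs>
        long autoCommitMaxTimeMs = 15_000;    // <autoCommit><maxTime>

        double byDocs = secondsToReach(autoCommitMaxDocs, rate);
        double byTime = autoCommitMaxTimeMs / 1000.0;
        // Assumption: the commit fires on whichever limit is hit first.
        System.out.println("hard commit roughly every "
                + Math.min(byDocs, byTime) + " s");

        long totalDocs = 1_000_000_000L;      // one month of rows
        System.out.println("hours for the full import at that rate: "
                + secondsToReach(totalDocs, rate) / 3600.0);
    }
}
```

Under these assumptions maxDocs is reached after 10 seconds, before the
15-second maxTime, so a hard commit would be triggered about every 10
seconds while the import runs at full speed.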

The main problem is on the Solr side: I have run the MySQL queries on their
own and their speed is great, but as time passes the Solr add call takes
far too long and then blocks, even though the server is high-end and has
plenty of resources.
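
To pin down where the time goes, each add call can be timed separately from
the MySQL side. The sketch below is a minimal, self-contained illustration:
Adder is a hypothetical stand-in for App.solrServer.add, the class and
method names are made up, and the sleeping lambda only simulates a slow
call.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

final class AddTimingSketch {
    // Hypothetical stand-in for App.solrServer.add(docs).
    interface Adder {
        void add(List<String> docs) throws Exception;
    }

    // Runs one add call and returns the elapsed wall-clock time in milliseconds.
    static long timeAddMillis(Adder adder, List<String> docs) throws Exception {
        long start = System.nanoTime();
        adder.add(docs);
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    public static void main(String[] args) throws Exception {
        List<String> docs = Arrays.asList("doc-1", "doc-2");
        // Simulated slow add; a real run would pass the Solr client call.
        long ms = timeAddMillis(batch -> Thread.sleep(50), docs);
        System.out.println("add of " + docs.size() + " docs took " + ms + " ms");
    }
}
```

Logging this value per batch, together with the running document count,
would show at what index size the add time starts to climb.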

I'm new to this, so any help would be appreciated. Thanks,
-- 

  Radu Ghita

  Tel:   +40 721 18 18 68

  Fax:  +40 351 81 85 52
