We're indexing around 10M records from a MySQL database into
a single Solr core.

The DataImportHandler needs to join 3 sub-entities to denormalize
the data.

We ran into trouble on our first two attempts, but setting
batchSize="-1" on the dataSource resolved the issues.
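For reference, here's roughly what our data-config.xml looks like
(table and column names changed, but batchSize="-1" is the part that
matters; as I understand it, it makes Solr pass
fetchSize=Integer.MIN_VALUE to the MySQL driver, so rows are streamed
instead of the whole result set being buffered in memory):

  <dataConfig>
    <dataSource type="JdbcDataSource"
                driver="com.mysql.jdbc.Driver"
                url="jdbc:mysql://db-host/ourdb"
                user="solr" password="***"
                batchSize="-1"/>
    <document>
      <entity name="record"
              query="SELECT id, title, body FROM records">
        <!-- each sub-entity denormalizes a one-to-many table
             into the parent document -->
        <entity name="tag"
                query="SELECT tag FROM tags
                       WHERE record_id='${record.id}'"/>
        <!-- ...two more sub-entities like this one... -->
      </entity>
    </document>
  </dataConfig>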

Do you need a lot of complex joins to import your data from MySQL?



-robert

On 4/21/11 8:08 PM, Scott Bigelow wrote:
I've been using Solr for a while now, indexing 2-4 million records
using the DIH to pull data from MySQL, which has been working great.
For a new project, I need to index about 20M records (30 fields), and I
have been running into MySQL disconnects right around the 15M mark.
I've tried several remedies I've found on blogs, changing autoCommit,
batchSize, etc., but none of them seems to have fully resolved the
issue. It got me wondering: Is this the way everyone does it? What
about 100M records, up to 1B? Are those all pulled using DIH and a
single query?

I've used Sphinx in the past, which uses multiple queries to pull out
subsets of records in ranges based on the primary key; does Solr offer
similar functionality? It seems that once a Solr index reaches a
certain size, indexing a batch takes longer than MySQL's
net_write_timeout, so MySQL kills the connection.
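
If DIH can see per-request parameters (I believe they're exposed as
${dataimporter.request.*} in entity queries), I'm imagining slicing the
import by primary key myself; something like this, where lowerBound and
upperBound are parameter names I made up:

  <entity name="record"
          query="SELECT id, title, body FROM records
                 WHERE id &gt;= ${dataimporter.request.lowerBound}
                   AND id &lt; ${dataimporter.request.upperBound}">
    ...
  </entity>

and then running a series of smaller imports with clean=false, so each
slice adds to the index instead of wiping it:

  /dataimport?command=full-import&clean=false&lowerBound=0&upperBound=5000000
  /dataimport?command=full-import&clean=false&lowerBound=5000000&upperBound=10000000

Failing that, I may just raise MySQL's timeout as a stopgap (e.g.
SET GLOBAL net_write_timeout = 3600;).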

Thanks for your help; I really enjoy using Solr and look forward to
indexing even more data!
