Thanks for your response! I think the issue is that the records are being returned TOO fast from MySQL. I can dump them all to CSV in about 30 minutes, but building the Solr index takes hours on the system I'm using. Maybe I just need a more powerful Solr instance, so the import doesn't leave the MySQL connection idle for too long? What about autoCommit -- does that factor into your import strategy?
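For reference, here's roughly what I've been running on my end. The connection details are from my test box and the autoCommit numbers are just my current guesses, so treat it as a sketch of my setup rather than a recommendation:

  <!-- data-config.xml: batchSize="-1" makes the MySQL JDBC driver
       stream rows instead of buffering the whole result set -->
  <dataSource type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/mydb"
              user="solr" password="***"
              batchSize="-1"/>

  <!-- solrconfig.xml: commit periodically during the import instead
       of holding everything for one giant commit at the end -->
  <updateHandler class="solr.DirectUpdateHandler2">
    <autoCommit>
      <maxDocs>100000</maxDocs>
      <maxTime>300000</maxTime> <!-- ms, i.e. every 5 minutes -->
    </autoCommit>
  </updateHandler>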
2011/4/21 Robert Gründler <rob...@dubture.com>:
> We're indexing around 10M records from a MySQL database into
> a single Solr core.
>
> The DataImportHandler needs to join 3 sub-entities to denormalize
> the data.
>
> We ran into some trouble on the first 2 attempts, but setting
> batchSize="-1" on the dataSource resolved the issues.
>
> Do you need a lot of complex joins to import the data from MySQL?
>
>
> -robert
>
>
> On 4/21/11 8:08 PM, Scott Bigelow wrote:
>>
>> I've been using Solr for a while now, indexing 2-4 million records
>> using the DIH to pull data from MySQL, which has been working great.
>> For a new project, I need to index about 20M records (30 fields), and
>> I have been running into issues with MySQL disconnects right around
>> 15M. I've tried several remedies I've found on blogs -- changing
>> autoCommit, batchSize, etc. -- and none of them seems to have fully
>> resolved the issue. It got me wondering: is this the way everyone
>> does it? What about 100M records, up to 1B; are those all pulled
>> using DIH and a single query?
>>
>> I've used Sphinx in the past, which uses multiple queries to pull out
>> subsets of records ranged on the primary key; does Solr offer similar
>> functionality? It seems that once a Solr index gets to a certain
>> size, indexing a batch takes longer than MySQL's net_write_timeout,
>> so MySQL kills the connection.
>>
>> Thanks for your help. I really enjoy using Solr and I look forward to
>> indexing even more data!
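P.S. On the ranged-fetch question from my original message: the closest thing I've found in DIH so far is passing a range into the query via request parameters and running the import in slices. Something like this is what I'm considering (startId/endId are just parameter names I made up):

  <entity name="item"
          query="SELECT * FROM items
                 WHERE id &gt;= ${dataimporter.request.startId}
                   AND id &lt; ${dataimporter.request.endId}">
    ...
  </entity>

Each slice would then be kicked off with clean=false so the passes append to the index instead of wiping it, e.g.:

  http://localhost:8983/solr/dataimport?command=full-import&clean=false&startId=0&endId=1000000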