Thanks for your response! I think the issue is that the records are being returned TOO fast from MySQL. I can dump them all to CSV in about 30 minutes, but building the Solr index takes hours on the system I'm using. Maybe I just need a more powerful Solr instance, so the import doesn't leave the MySQL connection idle for too long? What about autoCommit -- does that factor into your import strategy?
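For reference, here's roughly what I've been running on my end. The connection details are from my test box and the autoCommit numbers are just my current guesses, so treat it as a sketch of my setup rather than a recommendation:

  <!-- data-config.xml: batchSize="-1" makes the MySQL JDBC driver
       stream rows instead of buffering the whole result set -->
  <dataSource type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/mydb"
              user="solr" password="***"
              batchSize="-1"/>

  <!-- solrconfig.xml: commit periodically during the import instead
       of holding everything for one giant commit at the end -->
  <updateHandler class="solr.DirectUpdateHandler2">
    <autoCommit>
      <maxDocs>100000</maxDocs>
      <maxTime>300000</maxTime> <!-- ms, i.e. every 5 minutes -->
    </autoCommit>
  </updateHandler>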
2011/4/21 Robert Gründler <rob...@dubture.com>:
> We're indexing around 10M records from a MySQL database into
> a single Solr core.
>
> The DataImportHandler needs to join 3 sub-entities to denormalize
> the data.
>
> We ran into some trouble on the first 2 attempts, but setting
> batchSize="-1" on the dataSource resolved the issues.
>
> Do you need a lot of complex joins to import the data from MySQL?
>
>
> -robert
>
>
> On 4/21/11 8:08 PM, Scott Bigelow wrote:
>>
>> I've been using Solr for a while now, indexing 2-4 million records
>> using the DIH to pull data from MySQL, which has been working great.
>> For a new project, I need to index about 20M records (30 fields), and
>> I have been running into issues with MySQL disconnects right around
>> 15M. I've tried several remedies I've found on blogs -- changing
>> autoCommit, batchSize, etc. -- and none of them seems to have fully
>> resolved the issue. It got me wondering: is this the way everyone
>> does it? What about 100M records, up to 1B; are those all pulled
>> using DIH and a single query?
>>
>> I've used Sphinx in the past, which uses multiple queries to pull out
>> subsets of records ranged on the primary key; does Solr offer similar
>> functionality? It seems that once a Solr index gets to a certain
>> size, indexing a batch takes longer than MySQL's net_write_timeout,
>> so MySQL kills the connection.
>>
>> Thanks for your help. I really enjoy using Solr and I look forward to
>> indexing even more data!
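P.S. On the ranged-fetch question from my original message: the closest thing I've found in DIH so far is passing a range into the query via request parameters and running the import in slices. Something like this is what I'm considering (startId/endId are just parameter names I made up):

  <entity name="item"
          query="SELECT * FROM items
                 WHERE id &gt;= ${dataimporter.request.startId}
                   AND id &lt; ${dataimporter.request.endId}">
    ...
  </entity>

Each slice would then be kicked off with clean=false so the passes append to the index instead of wiping it, e.g.:

  http://localhost:8983/solr/dataimport?command=full-import&clean=false&startId=0&endId=1000000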