I've been using Solr for a while now, indexing 2-4 million records with the DIH (DataImportHandler) pulling data from MySQL, and that has been working great. For a new project, I need to index about 20M records (30 fields), and I have been running into MySQL disconnects at around the 15M mark. I've tried several remedies I found on blogs, such as changing autoCommit and batchSize, but none of them seems to have made much of a difference. It got me wondering: is this the way everyone does it? What about 100M to 1B records; are those all pulled using the DIH and a single query?
I've used Sphinx in the past, which uses multiple queries, each pulling a subset of records from a range of primary keys. Does Solr offer similar functionality? It seems that once a Solr index reaches a certain size, indexing a batch takes longer than MySQL's net_write_timeout, so MySQL kills the connection. Thanks for your help. I really enjoy using Solr, and I look forward to indexing even more data!
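
To make the question concrete, here is a rough sketch of the ranged approach I have in mind, written in Python only for illustration. The table name `records`, the key column `id`, and the helper names `key_ranges` and `ranged_query` are all hypothetical; this is not how Sphinx or the DIH implements it, just the general idea of splitting one big query into several bounded ones.

```python
from typing import Iterator, Tuple


def key_ranges(min_id: int, max_id: int, step: int) -> Iterator[Tuple[int, int]]:
    """Yield inclusive (start, end) primary-key ranges covering min_id..max_id."""
    if step <= 0:
        raise ValueError("step must be positive")
    start = min_id
    while start <= max_id:
        # Clamp the last range so it never runs past max_id.
        end = min(start + step - 1, max_id)
        yield start, end
        start = end + 1


def ranged_query(table: str, key: str, start: int, end: int) -> str:
    """Build one bounded SELECT for a single key range.

    The table and key names are hypothetical placeholders; start and end
    are integers, so plain string formatting is acceptable in this sketch.
    """
    return f"SELECT * FROM {table} WHERE {key} BETWEEN {start} AND {end}"


if __name__ == "__main__":
    # Assumed key span of 1..20,000,000, pulled in chunks of 5,000,000.
    for start, end in key_ranges(1, 20_000_000, 5_000_000):
        print(ranged_query("records", "id", start, end))
```

The point of the split is that each query returns a bounded result set, so no single connection has to stay open for the whole 20M-row import. Whether that would actually avoid the net_write_timeout disconnects in my setup is an assumption I have not tested.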