Thanks for the e-mail. I probably should have provided more details, but I was more interested in making sure I was approaching the problem correctly (using DIH with one big SELECT statement for millions of rows) than in solving this specific failure.
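For context, the import runs through a single JdbcDataSource entity. The data-config below is only a rough sketch (the connection details, entity name, and query are placeholders, not my real setup); the batchSize="-1" setting is the usual suggestion for MySQL, since it makes Connector/J stream rows instead of buffering the whole result set in memory:

  <dataConfig>
    <dataSource type="JdbcDataSource"
                driver="com.mysql.jdbc.Driver"
                url="jdbc:mysql://dbhost/mydb"
                user="user"
                password="password"
                batchSize="-1"
                readOnly="true"/>
    <document>
      <!-- one wide query over ~20M rows; the real SELECT is redacted -->
      <entity name="record" query="SELECT ..."/>
    </document>
  </dataConfig>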
Here's a partial stacktrace from this specific problem:

...
Caused by: java.io.EOFException: Can not read response from server. Expected to read 4 bytes, read 0 bytes before connection was unexpectedly lost.
        at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:2539)
        at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2989)
        ... 22 more
Apr 21, 2011 3:53:28 AM org.apache.solr.handler.dataimport.EntityProcessorBase getNext
SEVERE: getNext() failed for query 'REDACTED'
org.apache.solr.handler.dataimport.DataImportHandlerException: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure

The last packet successfully received from the server was 128 milliseconds ago. The last packet sent successfully to the server was 25,273,484 milliseconds ago.
...

A custom indexer -- so that's a fairly common practice? When you're dealing with indexes this large, do you try to avoid full rebuilds when you can? That is, a full rebuild isn't a nightly thing, but something you do in case of a disaster? And is there a difference in performance between an index that was built all at once and one that has had delta inserts and updates applied over a period of months?

Thank you for your insight.

On Thu, Apr 21, 2011 at 4:31 PM, Chris Hostetter <hossman_luc...@fucit.org> wrote:
>
> : For a new project, I need to index about 20M records (30 fields) and I
> : have been running into issues with MySQL disconnects, right around
> : 15M. I've tried several remedies I've found on blogs, changing
>
> If you can provide some concrete error/log messages and the details of
> how you are configuring your datasource, that might help folks provide
> better suggestions -- you've said you run into a problem, but you haven't
> provided any details for people to go on in giving you feedback.
>
> : resolved the issue. It got me wondering: Is this the way everyone does
> : it? What about 100M records up to 1B; are those all pulled using DIH
> : and a single query?
>
> I've only recently started using DIH, and while it definitely has a lot
> of quirks/annoyances, it seems like a pretty good 80/20 solution for
> indexing with Solr -- but that doesn't mean it's perfect for all
> situations.
>
> Writing custom indexer code can certainly make sense in a lot of cases --
> particularly where you already have a data publishing system that you want
> to tie into directly -- the trick is to ensure you have a decent strategy
> for rebuilding the entire index should the need arise (but this is really
> only an issue if your primary indexing solution is incremental -- many use
> cases can be satisfied just fine with a brute force "full rebuild
> periodically" implementation).
>
>
> -Hoss
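P.S. In case it helps frame my question about custom indexers: below is a rough sketch of what I have in mind, assuming SolrJ's CommonsHttpSolrServer (Solr 1.4/3.x) and MySQL Connector/J. The Solr URL, JDBC connection details, table, and field names are all made up for illustration; the fetchSize(Integer.MIN_VALUE) call is the Connector/J way of streaming a result set row by row instead of loading it all into memory.

  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.ResultSet;
  import java.sql.Statement;
  import java.util.ArrayList;
  import java.util.List;

  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  public class BulkIndexer {
      public static void main(String[] args) throws Exception {
          // Hypothetical Solr core URL.
          SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

          Class.forName("com.mysql.jdbc.Driver");
          Connection conn = DriverManager.getConnection(
                  "jdbc:mysql://dbhost/mydb", "user", "password");

          // A forward-only, read-only statement with fetchSize set to
          // Integer.MIN_VALUE makes Connector/J stream rows one at a time
          // rather than buffering the entire result set.
          Statement stmt = conn.createStatement(
                  ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
          stmt.setFetchSize(Integer.MIN_VALUE);

          // Placeholder query and field names.
          ResultSet rs = stmt.executeQuery("SELECT id, title, body FROM records");

          List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
          while (rs.next()) {
              SolrInputDocument doc = new SolrInputDocument();
              doc.addField("id", rs.getLong("id"));
              doc.addField("title", rs.getString("title"));
              doc.addField("body", rs.getString("body"));
              batch.add(doc);

              // Send documents to Solr in chunks to bound memory use and
              // keep individual HTTP requests a reasonable size.
              if (batch.size() >= 1000) {
                  solr.add(batch);
                  batch.clear();
              }
          }
          if (!batch.isEmpty()) {
              solr.add(batch);
          }
          solr.commit();

          rs.close();
          stmt.close();
          conn.close();
      }
  }

The 1000-document chunk size is just a guess at something reasonable; I'd be curious what batch sizes people actually use when pushing 20M+ rows this way.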