Thank you everyone for your help. I ended up getting the index to work using the exact same config file on a (substantially) larger instance.
On Fri, Apr 22, 2011 at 5:46 AM, Erick Erickson <erickerick...@gmail.com> wrote:
> {{{A custom indexer, so that's a fairly common practice? So when you are
> dealing with these large indexes, do you try not to fully rebuild them
> when you can? It's not a nightly thing, but something to do in case of
> a disaster? Is there a difference in the performance of an index that
> was built all at once vs. one that has had delta inserts and updates
> applied over a period of months?}}}
>
> Is it a common practice? Like all of this, "it depends". It's certainly
> easier to let DIH do the work. Sometimes DIH doesn't have all the
> capabilities necessary. Or as Chris said, in the case where you already
> have a system built up and it's easier to just grab the output from
> that and send it to Solr, perhaps with SolrJ and not use DIH. Some people
> are just more comfortable with their own code...
>
> "Do you try not to fully rebuild". It depends on how painful a full rebuild
> is. Some people just like the simplicity of starting over every
> day/week/month.
> But you *have* to be able to rebuild your index in case of disaster, and
> a periodic full rebuild certainly keeps that process up to date.
>
> "Is there a difference...delta inserts...updates...applied over months". Not
> if you do an optimize. When a document is deleted (or updated), it's only
> marked as deleted. The associated data is still in the index. Optimize will
> reclaim that space and compact the segments, perhaps down to one.
> But there's no real operational difference between a newly-rebuilt index
> and one that's been optimized. If you don't delete/update, there's not
> much reason to optimize either....
>
> I'll leave the DIH to others......
>
> Best
> Erick
>
> On Thu, Apr 21, 2011 at 8:09 PM, Scott Bigelow <eph...@gmail.com> wrote:
>> Thanks for the e-mail.
>> I probably should have provided more details,
>> but I was more interested in making sure I was approaching the problem
>> correctly (using DIH, with one big SELECT statement for millions of
>> rows) instead of solving this specific problem. Here's a partial
>> stacktrace from this specific problem:
>>
>> ...
>> Caused by: java.io.EOFException: Can not read response from server.
>> Expected to read 4 bytes, read 0 bytes before connection was
>> unexpectedly lost.
>>        at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:2539)
>>        at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2989)
>>        ... 22 more
>> Apr 21, 2011 3:53:28 AM
>> org.apache.solr.handler.dataimport.EntityProcessorBase getNext
>> SEVERE: getNext() failed for query 'REDACTED'
>> org.apache.solr.handler.dataimport.DataImportHandlerException:
>> com.mysql.jdbc.exceptions.jdbc4.CommunicationsException:
>> Communications link failure
>>
>> The last packet successfully received from the server was 128
>> milliseconds ago. The last packet sent successfully to the server was
>> 25,273,484 milliseconds ago.
>> ...
>>
>>
>> A custom indexer, so that's a fairly common practice? So when you are
>> dealing with these large indexes, do you try not to fully rebuild them
>> when you can? It's not a nightly thing, but something to do in case of
>> a disaster? Is there a difference in the performance of an index that
>> was built all at once vs. one that has had delta inserts and updates
>> applied over a period of months?
>>
>> Thank you for your insight.
>>
>>
>> On Thu, Apr 21, 2011 at 4:31 PM, Chris Hostetter
>> <hossman_luc...@fucit.org> wrote:
>>>
>>> : For a new project, I need to index about 20M records (30 fields) and I
>>> : have been running into issues with MySQL disconnects, right around
>>> : 15M.
>>> : I've tried several remedies I've found on blogs, changing
>>>
>>> if you can provide some concrete error/log messages and the details of how
>>> you are configuring your datasource that might help folks provide better
>>> suggestions -- you've said you run into a problem but you haven't provided
>>> any details for people to go on in giving you feedback.
>>>
>>> : resolved the issue. It got me wondering: Is this the way everyone does
>>> : it? What about 100M records up to 1B; are those all pulled using DIH
>>> : and a single query?
>>>
>>> I've only recently started using DIH, and while it definitely has a lot
>>> of quirks/annoyances, it seems like a pretty good 80/20 solution for
>>> indexing with Solr -- but that doesn't mean it's perfect for all
>>> situations.
>>>
>>> Writing custom indexer code can certainly make sense in a lot of cases --
>>> particularly where you already have a data publishing system that you want
>>> to tie into directly -- the trick is to ensure you have a decent strategy
>>> for rebuilding the entire index should the need arise (but this is really
>>> only an issue if your primary indexing solution is incremental -- many use
>>> cases can be satisfied just fine with a brute force "full rebuild
>>> periodically" implementation).
>>>
>>>
>>> -Hoss
>>>
>>
>
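For anyone reading this thread later and wondering what the "custom indexer" alternative to one big SELECT might look like, here is a minimal sketch. It is not what anyone in the thread actually ran. It pages through the table by primary key, so each query is short-lived and no single connection has to stay open for the whole import. Everything named here is an assumption for illustration: the `records` table, its `id` and `title` columns, the helper names, and the use of sqlite3 as a stand-in for MySQL. The `send` callback is where a real indexer would post each batch to Solr.

```python
import sqlite3
from typing import Callable, Dict, Iterator, List

Doc = Dict[str, str]


def iter_batches(conn: sqlite3.Connection, batch_size: int = 1000) -> Iterator[List[Doc]]:
    """Yield documents in primary-key order, one short query per batch.

    Assumes a hypothetical `records` table with positive integer ids.
    """
    last_id = 0
    while True:
        rows = conn.execute(
            "SELECT id, title FROM records WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            return
        yield [{"id": str(row[0]), "title": row[1]} for row in rows]
        # Resume after the last id seen, so a dropped connection only
        # costs one batch rather than the whole import.
        last_id = rows[-1][0]


def index_all(conn: sqlite3.Connection,
              send: Callable[[List[Doc]], None],
              batch_size: int = 1000) -> int:
    """Pass every batch to `send` and return the number of documents sent."""
    total = 0
    for batch in iter_batches(conn, batch_size):
        send(batch)  # a real indexer would post the batch to Solr here
        total += len(batch)
    return total


def make_demo_db(n: int) -> sqlite3.Connection:
    """Build an in-memory stand-in database with n rows (demo only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, title TEXT)")
    conn.executemany(
        "INSERT INTO records (id, title) VALUES (?, ?)",
        [(i, "record %d" % i) for i in range(1, n + 1)],
    )
    return conn


if __name__ == "__main__":
    sent: List[List[Doc]] = []
    count = index_all(make_demo_db(2500), sent.append, batch_size=1000)
    print(count, "documents in", len(sent), "batches")
```

Because the loop only needs the last id it saw, the same code can serve as the periodic full rebuild that Erick and Hoss recommend keeping available.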