Thanks for the e-mail. I probably should have provided more details, but I was more interested in making sure I was approaching the problem correctly (using DIH with one big SELECT statement for millions of rows) than in solving this specific failure.
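For context, the import runs through a single JdbcDataSource entity. The data-config below is only a rough sketch (the connection details, entity name, and query are placeholders, not my real setup); the batchSize="-1" setting is the usual suggestion for MySQL, since it makes Connector/J stream rows instead of buffering the whole result set in memory:

  <dataConfig>
    <dataSource type="JdbcDataSource"
                driver="com.mysql.jdbc.Driver"
                url="jdbc:mysql://dbhost/mydb"
                user="user"
                password="password"
                batchSize="-1"
                readOnly="true"/>
    <document>
      <!-- one wide query over ~20M rows; the real SELECT is redacted -->
      <entity name="record" query="SELECT ..."/>
    </document>
  </dataConfig>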
Here's a partial stacktrace from this specific problem:

...
Caused by: java.io.EOFException: Can not read response from server. Expected to read 4 bytes, read 0 bytes before connection was unexpectedly lost.
        at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:2539)
        at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2989)
        ... 22 more
Apr 21, 2011 3:53:28 AM org.apache.solr.handler.dataimport.EntityProcessorBase getNext
SEVERE: getNext() failed for query 'REDACTED'
org.apache.solr.handler.dataimport.DataImportHandlerException: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure

The last packet successfully received from the server was 128 milliseconds ago. The last packet sent successfully to the server was 25,273,484 milliseconds ago.
...

A custom indexer -- so that's a fairly common practice? When you're dealing with indexes this large, do you try to avoid full rebuilds when you can? That is, a full rebuild isn't a nightly thing, but something you do in case of a disaster? And is there a difference in performance between an index that was built all at once and one that has had delta inserts and updates applied over a period of months?

Thank you for your insight.

On Thu, Apr 21, 2011 at 4:31 PM, Chris Hostetter <hossman_luc...@fucit.org> wrote:
>
> : For a new project, I need to index about 20M records (30 fields) and I
> : have been running into issues with MySQL disconnects, right around
> : 15M. I've tried several remedies I've found on blogs, changing
>
> If you can provide some concrete error/log messages and the details of
> how you are configuring your datasource, that might help folks provide
> better suggestions -- you've said you run into a problem, but you haven't
> provided any details for people to go on in giving you feedback.
>
> : resolved the issue. It got me wondering: Is this the way everyone does
> : it? What about 100M records up to 1B; are those all pulled using DIH
> : and a single query?
>
> I've only recently started using DIH, and while it definitely has a lot
> of quirks/annoyances, it seems like a pretty good 80/20 solution for
> indexing with Solr -- but that doesn't mean it's perfect for all
> situations.
>
> Writing custom indexer code can certainly make sense in a lot of cases --
> particularly where you already have a data publishing system that you want
> to tie into directly -- the trick is to ensure you have a decent strategy
> for rebuilding the entire index should the need arise (but this is really
> only an issue if your primary indexing solution is incremental -- many use
> cases can be satisfied just fine with a brute force "full rebuild
> periodically" implementation).
>
>
> -Hoss
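P.S. In case it helps frame my question about custom indexers: below is a rough sketch of what I have in mind, assuming SolrJ's CommonsHttpSolrServer (Solr 1.4/3.x) and MySQL Connector/J. The Solr URL, JDBC connection details, table, and field names are all made up for illustration; the fetchSize(Integer.MIN_VALUE) call is the Connector/J way of streaming a result set row by row instead of loading it all into memory.

  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.ResultSet;
  import java.sql.Statement;
  import java.util.ArrayList;
  import java.util.List;

  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  public class BulkIndexer {
      public static void main(String[] args) throws Exception {
          // Hypothetical Solr core URL.
          SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

          Class.forName("com.mysql.jdbc.Driver");
          Connection conn = DriverManager.getConnection(
                  "jdbc:mysql://dbhost/mydb", "user", "password");

          // A forward-only, read-only statement with fetchSize set to
          // Integer.MIN_VALUE makes Connector/J stream rows one at a time
          // rather than buffering the entire result set.
          Statement stmt = conn.createStatement(
                  ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
          stmt.setFetchSize(Integer.MIN_VALUE);

          // Placeholder query and field names.
          ResultSet rs = stmt.executeQuery("SELECT id, title, body FROM records");

          List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
          while (rs.next()) {
              SolrInputDocument doc = new SolrInputDocument();
              doc.addField("id", rs.getLong("id"));
              doc.addField("title", rs.getString("title"));
              doc.addField("body", rs.getString("body"));
              batch.add(doc);

              // Send documents to Solr in chunks to bound memory use and
              // keep individual HTTP requests a reasonable size.
              if (batch.size() >= 1000) {
                  solr.add(batch);
                  batch.clear();
              }
          }
          if (!batch.isEmpty()) {
              solr.add(batch);
          }
          solr.commit();

          rs.close();
          stmt.close();
          conn.close();
      }
  }

The 1000-document chunk size is just a guess at something reasonable; I'd be curious what batch sizes people actually use when pushing 20M+ rows this way.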