Re: Indexing 20M documents from MySQL with DIH

Scott Bigelow Sun, 24 Apr 2011 17:00:33 -0700

Thank you everyone for your help. I ended up getting the index to work
using the exact same config file on a (substantially) larger instance.


On Fri, Apr 22, 2011 at 5:46 AM, Erick Erickson <erickerick...@gmail.com> wrote:
> {{{A custom indexer, so that's a fairly common practice? So when you are
> dealing with these large indexes, do you try not to fully rebuild them
> when you can? It's not a nightly thing, but something to do in case of
> a disaster? Is there a difference in the performance of an index that
> was built all at once vs. one that has had delta inserts and updates
> applied over a period of months?}}}
>
> Is it a common practice? Like all of this, "it depends". It's certainly
> easier to let DIH do the work. Sometimes DIH doesn't have all the
> capabilities necessary. Or as Chris said, in the case where you already
> have a system built up and it's easier to just grab the output from
> that and send it to Solr, perhaps with SolrJ and not use DIH. Some people
> are just more comfortable with their own code...
>
> "Do you try not to fully rebuild". It depends on how painful a full rebuild
> is. Some people just like the simplicity of starting over every
> day/week/month.
> But you *have* to be able to rebuild your index in case of disaster, and
> a periodic full rebuild certainly keeps that process up to date.
>
> "Is there a difference...delta inserts...updates...applied over months". Not
> if you do an optimize. When a document is deleted (or updated), it's only
> marked as deleted. The associated data is still in the index. Optimize will
> reclaim that space and compact the segments, perhaps down to one.
> But there's no real operational difference between a newly-rebuilt index
> and one that's been optimized. If you don't delete/update, there's not
> much reason to optimize either....
>
> I'll leave the DIH to others......
>
> Best
> Erick
>
> On Thu, Apr 21, 2011 at 8:09 PM, Scott Bigelow <eph...@gmail.com> wrote:
>> Thanks for the e-mail. I probably should have provided more details,
>> but I was more interested in making sure I was approaching the problem
>> correctly (using DIH, with one big SELECT statement for millions of
>> rows) instead of solving this specific problem. Here's a partial
>> stacktrace from this specific problem:
>>
>> ...
>> Caused by: java.io.EOFException: Can not read response from server.
>> Expected to read 4 bytes, read 0 bytes before connection was
>> unexpectedly lost.
>>        at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:2539)
>>        at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2989)
>>        ... 22 more
>> Apr 21, 2011 3:53:28 AM
>> org.apache.solr.handler.dataimport.EntityProcessorBase getNext
>> SEVERE: getNext() failed for query 'REDACTED'
>> org.apache.solr.handler.dataimport.DataImportHandlerException:
>> com.mysql.jdbc.exceptions.jdbc4.CommunicationsException:
>> Communications link failure
>>
>> The last packet successfully received from the server was 128
>> milliseconds ago.  The last packet sent successfully to the server was
>> 25,273,484 milliseconds ago.
>> ...
>>
>>
>> A custom indexer, so that's a fairly common practice? So when you are
>> dealing with these large indexes, do you try not to fully rebuild them
>> when you can? It's not a nightly thing, but something to do in case of
>> a disaster? Is there a difference in the performance of an index that
>> was built all at once vs. one that has had delta inserts and updates
>> applied over a period of months?
>>
>> Thank you for your insight.
>>
>>
>> On Thu, Apr 21, 2011 at 4:31 PM, Chris Hostetter
>> <hossman_luc...@fucit.org> wrote:
>>>
>>> : For a new project, I need to index about 20M records (30 fields) and I
>>> : have been running into issues with MySQL disconnects, right around
>>> : 15M. I've tried several remedies I've found on blogs, changing
>>>
>>> if you can provide some concrete error/log messages and the details of how
>>> you are configuring your datasource that might help folks provide better
>>> suggestions -- you've said you run into a problem but you haven't provided
>>> any details for people to go on in giving you feedback.
>>>
>>> : resolved the issue. It got me wondering: Is this the way everyone does
>>> : it? What about 100M records up to 1B; are those all pulled using DIH
>>> : and a single query?
>>>
>>> I've only recently started using DIH, and while it definitely has a lot
>>> of quirks/annoyances, it seems like a pretty good 80/20 solution for
>>> indexing with Solr -- but that doesn't mean it's perfect for all
>>> situations.
>>>
>>> Writing custom indexer code can certainly make sense in a lot of cases --
>>> particularly where you already have a data publishing system that you want
>>> to tie into directly -- the trick is to ensure you have a decent strategy
>>> for rebuilding the entire index should the need arise (but this is really
>>> only an issue if your primary indexing solution is incremental -- many use
>>> cases can be satisfied just fine with a brute force "full rebuild
>>> periodically" implementation).
>>>
>>>
>>> -Hoss
>>>
>>
>
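
On the "one big SELECT for millions of rows" question: the trace above
reports the last packet sent to the server 25,273,484 ms earlier, which is
roughly 7 hours, so the single query appears to have been open for a long
time before the link dropped. One possible alternative, if a custom indexer
is an option, is to page through the table by primary key so that each query
is short-lived. The sketch below is only an illustration under stated
assumptions: the table `docs`, its columns `id` and `title`, and the names
`fetch_batches`, `index_all` and `send_batch` are hypothetical, and sqlite3
stands in for MySQL so the example runs on its own. It is not how DIH works
internally.

```python
import sqlite3
from typing import Callable, Iterator, List, Tuple

Row = Tuple[int, str]  # (id, title); hypothetical schema for illustration


def fetch_batches(conn: sqlite3.Connection,
                  batch_size: int = 1000) -> Iterator[List[Row]]:
    """Yield rows in primary-key order, issuing one short query per batch."""
    last_id = 0  # assumes ids are positive integers
    while True:
        # Keyset paging: resume after the last id seen rather than using
        # OFFSET. sqlite3 uses "?" placeholders; MySQL drivers typically
        # use "%s" instead.
        rows = conn.execute(
            "SELECT id, title FROM docs WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            return
        yield rows
        last_id = rows[-1][0]


def index_all(conn: sqlite3.Connection,
              send_batch: Callable[[List[Row]], None],
              batch_size: int = 1000) -> int:
    """Pass every batch to send_batch and return the number of rows sent.

    send_batch is a placeholder for whatever posts documents to Solr.
    """
    total = 0
    for batch in fetch_batches(conn, batch_size):
        send_batch(batch)
        total += len(batch)
    return total


def make_demo_db(n: int) -> sqlite3.Connection:
    """Build an in-memory table with n rows so the sketch is runnable."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, title TEXT)")
    conn.executemany(
        "INSERT INTO docs (id, title) VALUES (?, ?)",
        [(i, f"doc {i}") for i in range(1, n + 1)],
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    conn = make_demo_db(2500)
    sent: List[List[Row]] = []
    total = index_all(conn, sent.append, batch_size=1000)
    print(total, [len(b) for b in sent])  # prints: 2500 [1000, 1000, 500]
```

Because each batch resumes from the last id seen, a dropped connection costs
at most one batch: the loop can reconnect and continue from `last_id` instead
of restarting the whole import.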

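Erick's point about deletes and optimize can be pictured with a toy model.
This is only a conceptual sketch, not Lucene's actual data structures: the
`Segment` and `ToyIndex` classes and all of their methods are invented for
illustration. It shows a delete or update merely marking the old document,
so the data stays in its segment until an optimize-style merge rewrites the
live documents into a single segment.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Segment:
    """A write-once batch of documents plus a set of delete marks."""
    docs: Dict[int, str]
    deleted: Set[int] = field(default_factory=set)

    def live(self) -> Dict[int, str]:
        return {k: v for k, v in self.docs.items() if k not in self.deleted}


class ToyIndex:
    """Toy model only; real segment files and merge policies are richer."""

    def __init__(self) -> None:
        self.segments: List[Segment] = []

    def add_batch(self, docs: Dict[int, str]) -> None:
        self.segments.append(Segment(dict(docs)))

    def delete(self, doc_id: int) -> None:
        # Only mark the document; its data stays in the segment.
        for seg in self.segments:
            if doc_id in seg.docs:
                seg.deleted.add(doc_id)

    def update(self, doc_id: int, content: str) -> None:
        # An update is modelled as delete-then-add into a new segment.
        self.delete(doc_id)
        self.add_batch({doc_id: content})

    def stored_count(self) -> int:
        return sum(len(s.docs) for s in self.segments)

    def live_count(self) -> int:
        return sum(len(s.live()) for s in self.segments)

    def optimize(self) -> None:
        # Rewrite only live documents into one segment, dropping marked ones.
        merged: Dict[int, str] = {}
        for seg in self.segments:
            merged.update(seg.live())
        self.segments = [Segment(merged)]


if __name__ == "__main__":
    idx = ToyIndex()
    idx.add_batch({1: "a", 2: "b", 3: "c"})
    idx.update(2, "B")
    idx.delete(3)
    print(idx.stored_count(), idx.live_count())  # prints: 4 2
    idx.optimize()
    print(idx.stored_count(), idx.live_count(), len(idx.segments))  # prints: 2 2 1
```

In this model an optimized index and a freshly rebuilt one hold the same live
documents and nothing else, which matches the "no real operational
difference" remark above.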