The custom import I wrote is a Java application that uses the SolrJ
library. Basically, where I had sub-entities in the DIH config, I did
the mappings inside my Java code instead:

1. Identify a subset or "chunk" of the primary IDs to work on (so I
don't have to load everything into memory at once) and put those in a
temp table. I used a modulus on the ID.
2. Select all of the outer entity from the database (joining on the
IDs in the temp table), and load the data from that result set into
new Solr input documents. I keep these in a hash map keyed on the
IDs.
3. Then select all of the inner entity, joining on the IDs from the
temp table. The result set has to include the IDs from step 2. I go
through this result set and load the data into the matching Solr
input documents from step 2.
4. Push that set of input documents to Solr (optionally committing
them), then go back to step 1 with the next chunk. There's a rough
sketch of this loop below.
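
Here's roughly what that loop looks like, written against the
track/artist schema from your data-config.xml. To be clear, this is a
minimal sketch and not my actual code: it assumes the SolrJ 1.4-era
CommonsHttpSolrServer and a MySQL connection, and the JDBC URL,
credentials, and chunk count are made up.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.HashMap;
import java.util.Map;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ChunkedTrackImporter {

    // How many chunks to split the IDs into (made-up value, tune to taste).
    private static final int CHUNKS = 16;

    public static void main(String[] args) throws Exception {
        SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
        Connection db = DriverManager.getConnection(
                "jdbc:mysql://localhost/music", "user", "password");

        for (int chunk = 0; chunk < CHUNKS; chunk++) {
            Statement st = db.createStatement();

            // Step 1: stage this chunk's primary IDs in a temp table,
            // selected with a modulus on the ID.
            st.executeUpdate("DROP TEMPORARY TABLE IF EXISTS chunk_ids");
            st.executeUpdate("CREATE TEMPORARY TABLE chunk_ids AS"
                    + " SELECT id FROM track"
                    + " WHERE deleted = 0 AND MOD(id, " + CHUNKS + ") = " + chunk);

            // Step 2: load the outer entity, keeping the documents in a
            // hash map keyed on the track ID.
            Map<Long, SolrInputDocument> docs = new HashMap<Long, SolrInputDocument>();
            ResultSet rs = st.executeQuery(
                    "SELECT t.id, t.title, l.title AS label"
                  + " FROM track t"
                  + " JOIN chunk_ids c ON c.id = t.id"
                  + " LEFT JOIN label l ON l.id = t.label_id");
            while (rs.next()) {
                long id = rs.getLong("id");
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("sf_unique_id", "Track_" + id);
                doc.addField("sf_meta_id", id);
                doc.addField("sf_meta_class", "Track");
                doc.addField("title_t", rs.getString("title"));
                doc.addField("label_t", rs.getString("label"));
                docs.put(id, doc);
            }
            rs.close();

            // Step 3: ONE query for all of the inner entity rows in this
            // chunk (instead of one query per track, like the sub-entity),
            // folded into the matching documents from step 2.
            rs = st.executeQuery(
                    "SELECT ta.track_id, a.name"
                  + " FROM artist a"
                  + " JOIN track_artist ta ON ta.artist_id = a.id"
                  + " JOIN chunk_ids c ON c.id = ta.track_id");
            while (rs.next()) {
                SolrInputDocument doc = docs.get(rs.getLong("track_id"));
                if (doc != null) {
                    doc.addField("artists_t", rs.getString("name"));
                }
            }
            rs.close();
            st.close();

            // Step 4: push this chunk's documents, then move on.
            if (!docs.isEmpty()) {
                solr.add(docs.values());
                // solr.commit();  // optionally commit per chunk
            }
        }

        solr.commit();
        db.close();
    }
}

The key difference from the DIH sub-entity is step 3: one artist
query per chunk instead of one per track.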

Not sure if this is the absolute best approach, but it's working well
enough for my specific case.

Tim


2010/12/15 Robert Gründler <rob...@dubture.com>:
> I've benchmarked the import already with 500k records, one time without the
> artists subquery, and one time without the join in the main query:
>
>
> Without subquery:                  500k in 3 min 30 sec
>
> Without join and without subquery: 500k in 2 min 30 sec
>
> With subquery and with left join:  320k in 6 min 30 sec
>
>
> So the joins / subqueries are definitely a bottleneck.
>
> How exactly did you implement the custom data import?
>
> In our case, we need to de-normalize the relations of the SQL data for the
> index, so I fear I can't really get rid of the join / subquery.
>
>
> -robert
>
>
>
>
>
> On Dec 15, 2010, at 15:43, Tim Heckman wrote:
>
>> 2010/12/15 Robert Gründler <rob...@dubture.com>:
>>> The data-config.xml looks like this (only 1 entity):
>>>
>>>      <entity name="track"
>>>              query="select t.id as id, t.title as title, l.title as label
>>>                     from track t left join label l on (l.id = t.label_id)
>>>                     where t.deleted = 0"
>>>              transformer="TemplateTransformer">
>>>        <field column="title" name="title_t" />
>>>        <field column="label" name="label_t" />
>>>        <field column="id" name="sf_meta_id" />
>>>        <field column="metaclass" template="Track" name="sf_meta_class"/>
>>>        <field column="metaid" template="${track.id}" name="sf_meta_id"/>
>>>        <field column="uniqueid" template="Track_${track.id}" name="sf_unique_id"/>
>>>
>>>        <entity name="artists"
>>>                query="select a.name as artist
>>>                       from artist a left join track_artist ta on (ta.artist_id = a.id)
>>>                       where ta.track_id=${track.id}">
>>>          <field column="artist" name="artists_t" />
>>>        </entity>
>>>
>>>      </entity>
>>
>> So there's one track entity with an artist sub-entity. My (admittedly
>> rather limited) experience has been that sub-entities, where you have
>> to run a separate query for every row in the parent entity, really
>> slow down data import. For my own purposes, I wrote a custom data
>> import using SolrJ to improve the performance (from 3 hours to 10
>> minutes).
>>
>> Just as a test, how long does it take if you comment out the artists entity?
>
>
