Check out http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201008.mbox/%3c9f8b39cb3b7c6d4594293ea29ccf438b01702...@icq-mail.icq.il.office.aol.com%3e This approach of not using sub entities really improved our load time.
Ephraim Ofir -----Original Message----- From: Robert Gründler [mailto:rob...@dubture.com] Sent: Wednesday, December 15, 2010 4:49 PM To: solr-user@lucene.apache.org Subject: Re: Dataimport performance i've benchmarked the import already with 500k records, one time without the artists subquery, and one time without the join in the main query: Without subquery: 500k in 3 min 30 sec Without join and without subquery: 500k in 2 min 30. With subquery and with left join: 320k in 6 Min 30 so the joins / subqueries are definitely a bottleneck. How exactly did you implement the custom data import? In our case, we need to de-normalize the relations of the sql data for the index, so i fear i can't really get rid of the join / subquery. -robert On Dec 15, 2010, at 15:43 , Tim Heckman wrote: > 2010/12/15 Robert Gründler <rob...@dubture.com>: >> The data-config.xml looks like this (only 1 entity): >> >> <entity name="track" query="select t.id as id, t.title as title, >> l.title as label from track t left join label l on (l.id = t.label_id) where >> t.deleted = 0" transformer="TemplateTransformer"> >> <field column="title" name="title_t" /> >> <field column="label" name="label_t" /> >> <field column="id" name="sf_meta_id" /> >> <field column="metaclass" template="Track" name="sf_meta_class"/> >> <field column="metaid" template="${track.id}" name="sf_meta_id"/> >> <field column="uniqueid" template="Track_${track.id}" >> name="sf_unique_id"/> >> >> <entity name="artists" query="select a.name as artist from artist a >> left join track_artist ta on (ta.artist_id = a.id) where >> ta.track_id=${track.id}"> >> <field column="artist" name="artists_t" /> >> </entity> >> >> </entity> > > So there's one track entity with an artist sub-entity. My (admittedly > rather limited) experience has been that sub-entities, where you have > to run a separate query for every row in the parent entity, really > slow down data import. For my own purposes, I wrote a custom data > import using SolrJ to improve the performance (from 3 hours to 10 > minutes). > > Just as a test, how long does it take if you comment out the artists entity?