See https://issues.apache.org/jira/browse/SOLR-2943 .  You can set up two DIH 
handlers.  The first would query CAT_TABLE and save the result to a 
disk-backed cache, using DIHCacheWriter.  In the second DIH handler, you would 
then change your three child entities to use DIHCacheProcessor to read the 
cached data back.  This is a little complicated to set up, but it lets you 
cache the data just once, and because the cache is disk-backed, it will scale 
to whatever size CAT_TABLE is.  (For some details, see this thread: 
http://lucene.472066.n3.nabble.com/DIH-nested-entities-don-t-work-tt4015514.html)

A simpler method is to specify cacheImpl="SortedMapBackedCache" on the three 
child entities.  (This is the same as using CachedSqlEntityProcessor.)  It 
would generate three in-memory caches, each with the same data.  If CAT_TABLE 
is small, this would be adequate.  
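
As a rough, untested sketch of this simpler approach, the Cat1 entity might 
look something like the following.  It assumes your Solr version supports the 
"cacheKey" and "cacheLookup" attributes.  The per-SKU filter is dropped from 
the query so that the rows are read once into the cache, and SKU is added to 
the SELECT list so it can serve as the cache key.  Cat2 and Cat3 would follow 
the same pattern with CategoryLevel=2 and CategoryLevel=3.

    <!-- Sketch only: assumes cacheKey/cacheLookup are available in your
         Solr version; adjust column names to your schema. -->
    <entity name="Cat1"
            processor="SqlEntityProcessor"
            cacheImpl="SortedMapBackedCache"
            cacheKey="SKU"
            cacheLookup="Product.SKU"
            query="SELECT SKU, CategoryName FROM CAT_TABLE WHERE CategoryLevel=1">
        <field column="CategoryName" name="Category1" />
    </entity>

With this, each child entity should issue one query for the whole import 
rather than one query per Product row.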

A middle ground would be to create a disk-backed cache implementation (or use 
one of those at SOLR-2613 or SOLR-2948) and specify it in "cacheImpl".  It 
would still create three identical caches, but they would be disk-backed and 
could scale beyond what an in-memory cache can handle.

James Dyer
Ingram Content Group
(615) 213-4311

-----Original Message-----
From: O. Olson [mailto:olson_...@yahoo.it] 
Sent: Thursday, May 16, 2013 11:01 AM
To: solr-user@lucene.apache.org
Subject: Speed up import of Hierarchical Data

I am using the DataImportHandler to Query a SQL Server and populate Solr.
Unfortunately, SQL does not have an understanding of hierarchical
relationships, and hence I use Table Joins. The following is an outline of
my table structure: 


PROD_TABLE
-> SKU (Primary Key)
-> Title (varchar)
-> Descr (varchar)

CAT_TABLE
-> SKU (Foreign Key)
-> CategoryLevel (int, i.e. 1, 2, 3 …)
-> CategoryName (varchar)

I specify the SQL Query in the db-data-config.xml file – a snippet of which
looks like: 

<dataConfig>
    <dataSource driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
                url="jdbc:sqlserver://localhost\...."/>
    <document>
        <entity name="Product"
                query="SELECT SKU, Title, Descr FROM PROD_TABLE">
            <field column="SKU" name="SKU" />
            <field column="Title" name="Title" />
            <field column="Descr" name="Descr" />

            <entity name="Cat1"
                    query="SELECT CategoryName from CAT_TABLE where SKU='${Product.SKU}' AND CategoryLevel=1">
                <field column="CategoryName" name="Category1" />
            </entity>
            <entity name="Cat2"
                    query="SELECT CategoryName from CAT_TABLE where SKU='${Product.SKU}' AND CategoryLevel=2">
                <field column="CategoryName" name="Category2" />
            </entity>
            <entity name="Cat3"
                    query="SELECT CategoryName from CAT_TABLE where SKU='${Product.SKU}' AND CategoryLevel=3">
                <field column="CategoryName" name="Category3" />
            </entity>

        </entity>
    </document>
</dataConfig>

It seems that the DataImportHandler sends out three or four queries
for each Product. This results in a very slow import. Is there any way to
speed this up? I would not mind an intermediate step of first extracting the
data from SQL and then loading it into Solr.

Thank you for all your help. 
O. O.




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Speed-up-import-of-Hierarchical-Data-tp4063924.html
Sent from the Solr - User mailing list archive at Nabble.com.
