Amrit Sarkar created SOLR-12854:
-----------------------------------

             Summary: Document steps to improve delta import via DataImportHandler
                 Key: SOLR-12854
                 URL: https://issues.apache.org/jira/browse/SOLR-12854
             Project: Solr
          Issue Type: Bug
      Security Level: Public (Default Security Level. Issues are Public)
          Components: contrib - DataImportHandler
    Affects Versions: 7.5
            Reporter: Amrit Sarkar


Delta imports in DataImportHandler are sometimes slower than full imports, 
because a delta import issues many more queries than a full import and so 
takes longer. This is described in: 
https://wiki.apache.org/solr/DataImportHandlerDeltaQueryViaFullImport
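
For context, a conventional delta setup looks roughly like the fragment below. 
It is only a sketch: the table and column names (item, id, name, 
last_modified) are hypothetical, and only the entity element of 
data-config.xml is shown. deltaQuery returns the changed ids, and 
deltaImportQuery is then run once for each id, which is where the extra 
queries come from.

{code:xml}
<!-- Hypothetical table and columns, for illustration only -->
<entity name="item" pk="id"
        query="SELECT id, name FROM item"
        deltaQuery="SELECT id FROM item
                    WHERE last_modified &gt; '${dataimporter.last_index_time}'"
        deltaImportQuery="SELECT id, name FROM item
                          WHERE id = '${dataimporter.delta.id}'"/>
{code}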

On the mailing list: 
http://lucene.472066.n3.nabble.com/Number-of-requests-spike-up-when-i-do-the-delta-Import-td4338162.html
 a Solr user described a workaround that reportedly works well and improves 
delta import performance: apply the ${dataimporter.last_index_time} filter in 
deltaImportQuery rather than in deltaQuery.

{code}
I found a hacky way to limit the number of 
times deltaImportQuery was executed.

As designed, solr executes deltaQuery to get a list of ids that need to be 
indexed. For each of those, it executes deltaImportQuery, which is typically 
very similar to the full query.

I constructed a deltaQuery to purposely only return 1 row. E.g.

deltaQuery = "SELECT id FROM table WHERE rownum=1" // written for 
Oracle, likely requires a different syntax for other dbs. Also, it occurred 
to me that you could probably include the 
date >= '${dataimporter.last_index_time}' 
filter here so this returns 0 rows if no data has changed

Since deltaImportQuery now only gets called once, I needed to add the filter 
logic to deltaImportQuery to only select the changed rows (that logic is 
normally in deltaQuery). E.g.

deltaImportQuery = [normal import query] WHERE date >= 
'${dataimporter.last_index_time}'
{code}
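
Put together as a DIH entity, the workaround quoted above might look like the 
following sketch. The table and column names (item, id, name, last_modified) 
are hypothetical, the rownum syntax is Oracle-specific as noted in the quoted 
message, and the date filter in deltaQuery is the optional refinement the 
user suggests so that no import runs when nothing has changed.

{code:xml}
<!-- Hypothetical table and columns; rownum is Oracle syntax -->
<entity name="item" pk="id"
        query="SELECT id, name FROM item"
        deltaQuery="SELECT id FROM item
                    WHERE last_modified &gt;= '${dataimporter.last_index_time}'
                      AND rownum = 1"
        deltaImportQuery="SELECT id, name FROM item
                          WHERE last_modified &gt;= '${dataimporter.last_index_time}'"/>
{code}

Because deltaQuery returns at most one row, deltaImportQuery should run at 
most once per delta import, and its own date filter selects all changed rows 
in that single query.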

A number of other users have adopted this strategy and seen DIH delta import 
performance improve, so documenting it as a TIP will help other users too.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

