[ 
https://issues.apache.org/jira/browse/SOLR-3954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13477445#comment-13477445
 ] 

Shawn Heisey commented on SOLR-3954:
------------------------------------

Here's a direct comparison on the same hardware.  Note that when my import gets 
kicked off, there are actually four imports running at once.  One of them is 
small -- during the second test (updateLog off), it imported 687765 rows in 10 
minutes and 08 seconds.  I did not check how long it took during the first 
test.  The other three imports are all nearly 13 million records each.

A du on the completed index directory with 12.9 million records shows 23520900 
KB (about 22.4 GB).

I ran the first test and grabbed stats after an hour.  Then I killed Solr, 
commented out updateLog, started it up again, kicked off the full-import, and 
again grabbed stats after an hour.  Comparing the two shows that indexing is 
about twice as fast with updateLog turned off: roughly 4.17 million documents 
processed in an hour, versus roughly 2.05 million with it turned on.

With updateLog turned on:

{code}
<?xml version="1.0" encoding="UTF-8"?>
<response>

<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">0</int>
</lst>
<lst name="initArgs">
  <lst name="defaults">
    <str name="config">dih-config.xml</str>
  </lst>
</lst>
<str name="status">busy</str>
<str name="importResponse">A command is still running...</str>
<lst name="statusMessages">
  <str name="Time Elapsed">1:0:1.762</str>
  <str name="Total Requests made to DataSource">1</str>
  <str name="Total Rows Fetched">2052096</str>
  <str name="Total Documents Processed">2052095</str>
  <str name="Total Documents Skipped">0</str>
  <str name="Full Dump Started">2012-10-16 14:59:01</str>
</lst>
<str name="WARNING">This response format is experimental.  It is likely to change in the future.</str>
</response>
{code}

With updateLog turned off:

{code}
<?xml version="1.0" encoding="UTF-8"?>
<response>

<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">0</int>
</lst>
<lst name="initArgs">
  <lst name="defaults">
    <str name="config">dih-config.xml</str>
  </lst>
</lst>
<str name="status">busy</str>
<str name="importResponse">A command is still running...</str>
<lst name="statusMessages">
  <str name="Time Elapsed">1:0:0.434</str>
  <str name="Total Requests made to DataSource">1</str>
  <str name="Total Rows Fetched">4167525</str>
  <str name="Total Documents Processed">4167524</str>
  <str name="Total Documents Skipped">0</str>
  <str name="Full Dump Started">2012-10-16 16:05:01</str>
</lst>
<str name="WARNING">This response format is experimental.  It is likely to change in the future.</str>
</response>
{code}
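
As a rough check on the "about twice as fast" figure, the two status responses above can be reduced to a throughput number.  This is only a sketch: the values are copied by hand from the "Time Elapsed" and "Total Documents Processed" fields, the h:m:s format of "Time Elapsed" is assumed from how the output looks, and the helper names are made up for illustration.

```python
# Figures copied by hand from the two DIH status responses above.
RUNS = {
    "updateLog on": {"elapsed": "1:0:1.762", "docs": 2052095},
    "updateLog off": {"elapsed": "1:0:0.434", "docs": 4167524},
}


def elapsed_seconds(text):
    """Convert an 'h:m:s.millis' string to seconds (format assumed, not documented here)."""
    hours, minutes, seconds = text.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def docs_per_second(run):
    """Average indexing throughput for one run."""
    return run["docs"] / elapsed_seconds(run["elapsed"])


rates = {name: docs_per_second(run) for name, run in RUNS.items()}
speedup = rates["updateLog off"] / rates["updateLog on"]

for name, rate in rates.items():
    print(f"{name}: {rate:.0f} docs/sec")
print(f"speedup with updateLog off: {speedup:.2f}x")
```

That works out to roughly 570 documents per second with updateLog on and roughly 1158 with it off, a ratio of about 2.03.  The status output does not say which of the four concurrent imports it covers, so these numbers should be treated as a comparison between the two runs rather than as absolute throughput for the server.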

                
> Option to have updateHandler and DIH skip updateLog
> ---------------------------------------------------
>
>                 Key: SOLR-3954
>                 URL: https://issues.apache.org/jira/browse/SOLR-3954
>             Project: Solr
>          Issue Type: Improvement
>          Components: update
>    Affects Versions: 4.0
>            Reporter: Shawn Heisey
>             Fix For: 4.1
>
>
> The updateLog feature makes updates take longer, likely because of the I/O 
> time required to write the additional information to disk.  It may take as 
> much as three times as long for the indexing portion of the process.  I'm not 
> sure whether it affects the time to commit, but I would imagine that the 
> difference there is small or zero.  When doing incremental updates/deletes on 
> an existing index, the time lag is probably very small and unimportant.
>
> When doing a full reindex (which may happen via DIH), especially if this is 
> done in a build core that is then swapped with a live core, this performance 
> hit is unacceptable.  It seems to make the import take about three times as 
> long.
>
> An option to have an update skip the updateLog would be very useful for these 
> situations.  It should have a method in SolrJ and be exposed in DIH as well.
