[jira] [Commented] (DERBY-6858) Apache Derby simple update statement performance becomes 1500% worse when adding one byte to a column

Mike Matrigali (JIRA) Sat, 20 Feb 2016 17:21:27 -0800

    [ 
https://issues.apache.org/jira/browse/DERBY-6858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15155838#comment-15155838
 ]


Mike Matrigali commented on DERBY-6858:
---------------------------------------

I am coming late to this discussion.  mostly what i looked at were the 2 
different plans presented at the beginning of the issue.  Note that the plan 
info posted in this JIRA is incomplete
and hides important information that I did see tin the link below:
The formatting on this link make easier to read:  
http://stackoverflow.com/questions/34874762/apache-derby-simple-update-statement-performance-becomes-1500-worse-when-adding
I am confused about that TESTINDEX is in the posted plans as i do not see that 
mentioned in other discussions or in the posted java test code.

Also up front i am going to say i have no idea what affect the foreign key 
stuff it going to have, if i was working on this I would eliminate it as much 
as possible to 
identify if that is a problem.  

What jumped out at me was that they were 2 different plans.  Often in derby if 
something is order of magnitude problem it is often an optimizer issue.  
I think the test case is updating every row of the table.  

Note that while the query looks simple, it is actually one of the problem cases 
for a database.  update ChildUpdate set parentName = 'Parent 2' where 
parentName = 'Parent 1' .  This
is updating the same data that it is looking at to perform the update.  Derby 
does not have anyway to track what updates it has done in the table for the 
given transaction so it has 2
options 1) make sure it only visits each row once and does or does not do the 
update or 2) do they updates and track the result in a side table and then 
apply them one by one after
making the list.  option 2 is referred as deferred update, and option 1 is just 
deferred update=false.   

The first "fast" plan looks like it is not using any index and just doing the 
full scan of the table and updating every row.  And if you know 
you are going to change every row that is probably the best you can do.  And it 
turns out that if you pick that plan to find the rows you want to update then
you can also do updates in place (ie. just change the data in the table as you 
visit each row).   The optmizer output says that it is going to do this in the 
line:  deferred: false

The second "slow" plan has chosen to use the index to find the rows it wants to 
look for.  An unfortunate choice as we know it is going to have to visit all 
2500 rows.   I think the index
being used has name as the key.  But if there were 10 million rows and only 1 
matching update to do it would be much faster.  Using this index to do updates 
you must use "deferred" mode because  if you update the key you might visit the 
row again later in the index (at least in general case of updating same column 
you are filtering on).  

So changing all rows using deferred mode vs direct mode is going to always be 
slower.  Changing some subset of the rows may be faster than full table scan, 
that is where optimizer 
comes in.

So some questions for the "real" application:
1) In real application do you always change all the rows with this update?  
Does the test case actually represent the likely real app scenario?
2) what is lifetime of data and size?  ie. do you load a bunch and then number 
and data distribution stays stable?  

Now why the 1byte change affects all of this.   The optmizer uses costs to pick 
plans.  A component of these costs is the size of the data.  My guess is that 
adding one byte made the table have more pages which then changed the cost 
estimate to make the index lookup look better.  The optmizer costing is a known 
problem and optimizer expertise on the project has been
a problem for some time now.  In this case I would not be surprised if the 
extra cost of doing a deferred vs direct update is not calculated at all - it 
may be a different part of the code looks
at the chosen plan and then decides if it can be deferred or not - i don't know.

Some things to play with to understand more what is going on:
1) get sizes of the underlying tables and indexes in the good and bad cases.  
https://db.apache.org/derby/docs/10.12/ref/rrefsyscsdiagspacetable.html
2) see if updating optimizer statistics helps.  Derby will eventually 
automatically do this in background.  But for your test case it seems likely 
you created
    indexes first and loaded data which will not generate stats.  : 
https://db.apache.org/derby/docs/10.8/tuning/ctunperfstatistics.html
3) i need some help from others on this one, but you may be able to add hints 
to your update query  if you want to work around this problem and force the 
scan.
4) you might try writing your query with ? parameters vs hard coded variables.  
This takes a different path through the optimizer.  I am not hopeful on this 
one as
    i think it will lean more toward index scan vs table scan.

> Apache Derby simple update statement performance becomes 1500% worse when 
> adding one byte to a column
> -----------------------------------------------------------------------------------------------------
>
>                 Key: DERBY-6858
>                 URL: https://issues.apache.org/jira/browse/DERBY-6858
>             Project: Derby
>          Issue Type: Bug
>    Affects Versions: 10.11.1.1, 10.12.1.1
>         Environment: windows 7 64 bit
>            Reporter: somebody
>            Priority: Blocker
>         Attachments: repro.java, repro.java, repro.java
>
>
> I have 2 tables as follows:
> ParentUpdate
> name varchar(255)
> value int not null
> primary key: name
> ChildUpdate
> parentName varchar(255)
> name varchar(255)
> value int
> data varchar(1000)
> primary key: name foreign key: parentName to ParentUpdate.name
> When I run the statement "update ChildUpdate set parentName = 'Parent 2' 
> where parentName = 'Parent 1'" with 2500 records in the ChildUpdate table and 
> 1 record in the ParentUpdate table with only a single byte difference in data 
> size in the ChildUpdate table, the performance decreases by 15 times.
> When the ChildUpdate data column has exactly 14 bytes of the same character 
> the runtime of the above query is about 500 milliseconds. When I add one more 
> byte to the data column of ChildUpdate the performance all of a sudden 
> becomes about 7500 milliseconds.
> If i then decrease the data size back to 14 from 15 it's fast again. When i 
> put it back to 15 it's slow again. This is reproducible every time.
> Can you please help me figure out how to get the same fast performance 
> without such seemingly random behaviour.
> The query plans are below for both cases.
>         projection = true
>             constructor time (milliseconds) = 0
>             open time (milliseconds) = 0
>             next time (milliseconds) = 16
>             close time (milliseconds) = 16
>             restriction time (milliseconds) = 0
>             projection time (milliseconds) = 0
>             optimizer estimated row count: 51.50
>             optimizer estimated cost: 796.12
>         Source result set:
>             Table Scan ResultSet for CHILDUPDATE at read committed isolation 
> level using exclusive row locking chosen by the optimizer
>             Number of opens = 1
>             Rows seen = 2500
>             Rows filtered = 0
>             Fetch Size = 1
>                 constructor time (milliseconds) = 0
>                 open time (milliseconds) = 15
>                 next time (milliseconds) = 16
>                 close time (milliseconds) = 16
>                 next time in milliseconds/row = 0
>             scan information:
>                 Bit set of columns fetched={0, 1}
>                 Number of columns fetched=2
>                 Number of pages visited=41
>                 Number of rows qualified=2500
>                 Number of rows visited=2500
>                 Scan type=heap
>                 start position:
>                     null
>                 stop position:
>                     null
>                 qualifiers:
>                     Column[0][0] Id: 0
>                     Operator: =
>                     Ordered nulls: false
>                     Unknown return value: false
>                     Negate comparison result: false
>                 optimizer estimated row count: 51.50
>                 optimizer estimated cost: 796.12
> total time: ~500 milliseconds
> and the slow version
>    Statement Name: 
>     null
> Statement Text: 
>     update ChildUpdate set parentName = 'Parent 2' where parentName = 'Parent 
> 1'
> Parse Time: 0
> Bind Time: 0
> Optimize Time: 0
> Generate Time: 0
> Compile Time: 0
> Execute Time: -1453199485700
> Begin Compilation Timestamp : 2016-01-19 05:31:25.684
> End Compilation Timestamp : 2016-01-19 05:31:25.684
> Begin Execution Timestamp : 2016-01-19 05:31:25.7
> End Execution Timestamp : 2016-01-19 05:31:33.141
> Statement Execution Plan Text: 
> Update ResultSet using row locking:
> deferred: true
> Rows updated = 2500
> Indexes updated = 2
> Execute Time = -1453199485747
>     Normalize ResultSet:
>     Number of opens = 1
>     Rows seen = 2500
>         constructor time (milliseconds) = 0
>         open time (milliseconds) = 0
>         next time (milliseconds) = 47
>         close time (milliseconds) = 0
>         optimizer estimated row count: 51.50
>         optimizer estimated cost: 810.94
>     Source result set:
>         Project-Restrict ResultSet (3):
>         Number of opens = 1
>         Rows seen = 2500
>         Rows filtered = 0
>         restriction = false
>         projection = true
>             constructor time (milliseconds) = 0
>             open time (milliseconds) = 0
>             next time (milliseconds) = 32
>             close time (milliseconds) = 0
>             restriction time (milliseconds) = 0
>             projection time (milliseconds) = 0
>             optimizer estimated row count: 51.50
>             optimizer estimated cost: 810.94
>         Source result set:
>             Project-Restrict ResultSet (2):
>             Number of opens = 1
>             Rows seen = 2500
>             Rows filtered = 0
>             restriction = false
>             projection = true
>                 constructor time (milliseconds) = 0
>                 open time (milliseconds) = 0
>                 next time (milliseconds) = 32
>                 close time (milliseconds) = 0
>                 restriction time (milliseconds) = 0
>                 projection time (milliseconds) = 0
>                 optimizer estimated row count: 51.50
>                 optimizer estimated cost: 810.94
>             Source result set:
>                 Index Scan ResultSet for CHILDUPDATE using index TESTINDEX at 
> read committed isolation level using exclusive row locking chosen by the 
> optimizer
>                 Number of opens = 1
>                 Rows seen = 2500
>                 Rows filtered = 0
>                 Fetch Size = 1
>                     constructor time (milliseconds) = 0
>                     open time (milliseconds) = 0
>                     next time (milliseconds) = 32
>                     close time (milliseconds) = 0
>                     next time in milliseconds/row = 0
>                 scan information:
>                     Bit set of columns fetched={0, 1, 2}
>                     Number of columns fetched=3
>                     Number of deleted rows visited=0
>                     Number of pages visited=42
>                     Number of rows qualified=2500
>                     Number of rows visited=2500
>                     Scan type=btree
>                     Tree height=2
>                     start position:
>                         None
>                     stop position:
>                         None
>                     qualifiers:
>                         Column[0][0] Id: 1
>                         Operator: =
>                         Ordered nulls: false
>                         Unknown return value: false
>                         Negate comparison result: false
>                     optimizer estimated row count: 51.50
>                     optimizer estimated cost: 810.94
> total time: ~7 seconds 500 milliseconds
> please also see post:
> http://stackoverflow.com/questions/34874762/apache-derby-simple-update-statement-performance-becomes-1500-worse-when-adding



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (DERBY-6858) Apache Derby simple update statement performance becomes 1500% worse when adding one byte to a column

Reply via email to