[jira] [Commented] (OAK-8908) RDBBlobStore on SQL Server: bad performance when default collation is of type SQL*

2020-03-20 Thread Julian Reschke (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-8908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17063313#comment-17063313
 ] 

Julian Reschke commented on OAK-8908:
-

Opened OAK-8963 for specifying the collation upon table creation.

This probably does not hurt, but does not seem to help wih the general perf 
issues found for MySQL and SQL Server, see OAK-8926.

> RDBBlobStore on SQL Server: bad performance when default collation is of type 
> SQL*
> --
>
> Key: OAK-8908
> URL: https://issues.apache.org/jira/browse/OAK-8908
> Project: Jackrabbit Oak
>  Issue Type: Technical task
>  Components: rdbmk
>Reporter: Julian Reschke
>Assignee: Julian Reschke
>Priority: Major
> Fix For: 1.28.0
>
> Attachments: OAK-8908-1.6.diff, OAK-8908.diff
>
>
> RDBBlobStore uses a 64-char primary key (digest in hex).
> Unfortunately, this causes performance issues on MS SQL Server, when the 
> collation for that column is of type "SQL*" (see links). These types of 
> collations are deprecated, but still the default for installations on the 
> "EN_US" locale.
> The performance loss can be observed by changing the collation on an existing 
> install, and then enable performance logging on RDBBlobStore.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (OAK-8908) RDBBlobStore on SQL Server: bad performance when default collation is of type SQL*

2020-03-19 Thread Solomon Rutzky (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-8908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17063085#comment-17063085
 ] 

Solomon Rutzky commented on OAK-8908:
-

[~reschke], there will be minimal difference between {{Latin1_General_CI_AS}} 
and {{Latin1_General_BIN}} when dealing with a relatively small number of rows, 
especially if the values are somewhat narrow. This is mainly an issue of scale. 
You won't see much impact in a typical dev or QA environment as they don't 
usually have millions of rows. Of course, to be fair, I don't know much about 
this project and how many rows are expected in a "large" installation. I am 
just sharing my knowledge that for hex data stored as a string, 
{{Latin1_General_BIN}} is ideal (if needing to guarantee support all the way 
back to SQL Server 2000, else one of the other two as noted in Item 2 of my 
first comment above). This holds true for any type of code or ID (when unable 
to store the value in a {{BINARY}} / {{VARBINARY}} datatype).

 

P.S. I did do some testing and learned something unexpected and interesting 
(well, interesting to me at least ;) ): there are two binary collations that 
_do_ have the same performance hit when used on a {{VARCHAR}} column that is 
indexed _and_ that column is filtered on a Unicode literal or variable:
 # {{SQL_Latin1_General_CP437_BIN2}}
 # {{SQL_Latin1_General_CP850_BIN2}}

That is definitely odd, especially given that the regular "_BIN" versions of 
those work just fine:
 # {{SQL_Latin1_General_CP437_BIN}}
 # {{SQL_Latin1_General_CP850_BIN}}

Either way, {{Latin1_General_BIN}} does work as expected: there's no 
performance hit when filtering on a Unicode literal or variable. This could be 
quite helpful to users who are not upgrading to the new code that filters on a 
non-Unicode value (i.e. using your {{ALTER}} script in the first comment 
above). Again, only caveat being if there is a need to mix casing of *0xA - 
0xF* .

 

> RDBBlobStore on SQL Server: bad performance when default collation is of type 
> SQL*
> --
>
> Key: OAK-8908
> URL: https://issues.apache.org/jira/browse/OAK-8908
> Project: Jackrabbit Oak
>  Issue Type: Technical task
>  Components: rdbmk
>Reporter: Julian Reschke
>Assignee: Julian Reschke
>Priority: Major
> Fix For: 1.28.0
>
> Attachments: OAK-8908-1.6.diff, OAK-8908.diff
>
>
> RDBBlobStore uses a 64-char primary key (digest in hex).
> Unfortunately, this causes performance issues on MS SQL Server, when the 
> collation for that column is of type "SQL*" (see links). These types of 
> collations are deprecated, but still the default for installations on the 
> "EN_US" locale.
> The performance loss can be observed by changing the collation on an existing 
> install, and then enable performance logging on RDBBlobStore.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (OAK-8908) RDBBlobStore on SQL Server: bad performance when default collation is of type SQL*

2020-03-17 Thread Julian Reschke (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-8908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17061056#comment-17061056
 ] 

Julian Reschke commented on OAK-8908:
-

We can have DB-specific init statements, that's no problem.

That said, I believe I already tried that; and it didn't seem to have a 
measurable impact (which confused me a lot).



> RDBBlobStore on SQL Server: bad performance when default collation is of type 
> SQL*
> --
>
> Key: OAK-8908
> URL: https://issues.apache.org/jira/browse/OAK-8908
> Project: Jackrabbit Oak
>  Issue Type: Technical task
>  Components: rdbmk
>Reporter: Julian Reschke
>Assignee: Julian Reschke
>Priority: Major
> Fix For: 1.28.0
>
> Attachments: OAK-8908-1.6.diff, OAK-8908.diff
>
>
> RDBBlobStore uses a 64-char primary key (digest in hex).
> Unfortunately, this causes performance issues on MS SQL Server, when the 
> collation for that column is of type "SQL*" (see links). These types of 
> collations are deprecated, but still the default for installations on the 
> "EN_US" locale.
> The performance loss can be observed by changing the collation on an existing 
> install, and then enable performance logging on RDBBlobStore.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (OAK-8908) RDBBlobStore on SQL Server: bad performance when default collation is of type SQL*

2020-03-17 Thread Solomon Rutzky (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-8908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17061054#comment-17061054
 ] 

Solomon Rutzky commented on OAK-8908:
-

Fair enough. And upon further reflection on this, it seems that if the app 
currently deals with this value as a string, and does so with other supported 
database platforms, then switching the datatype to {{BINARY(32)}} or 
{{VARBINARY(32)}} would add more complication than it would be worth. And that 
leaves you with the optimal solution for this project being to force a binary 
collation on the two {{ID}} columns. I haven't had a chance to look at the code 
to see if you have the ability to handle different database platforms 
differently when creating the tables, but if you can, then that DDL should be 
updated to include the {{COLLATE binary_collation_name}} clause on the {{ID}} 
columns so that those columns don't pick up the default collation of the 
database. For existing deployments, the check you added in OAK-8918 should be 
updated to recommend changing the collation to a binary collation (via the 
T-SQL shown in the first comment above, but with a binary collation instead of 
{{Latin1_General_CI_AS}}).

The only possible complication to using a binary collation would be if any 
interaction with those two {{ID}} columns (whether in this app code, scripts, 
other potential tools that might query these tables, etc) is inconsistent in 
the casing of the hex digits *0xA - 0xF* (to be clear: because *a* != *A* when 
using a binary collation – a non-issue when using {{VARBINARY}} / {{BINARY}} 
datatypes, but alas, that's not the best option here).
 

Regarding benchmarks: I can't stop anyone from spending time on that, and I 
typically encourage doing performance tests myself. BUT, in this case, it would 
be safe to skip that step because it's not even theoretically possible that a 
multi-step process (that converts all bytes of the string into a sort key based 
on a multi-part weight per each character) would be faster than a 0-step 
process (that simply compares each byte).
 

Take care,
Solomon...
[https://SqlQuantumLift.com/|https://sqlquantumlift.com/]
[https://SqlQuantumLeap.com/|https://sqlquantumleap.com/]
[https://SQLsharp.com/|https://sqlsharp.com/]

 

> RDBBlobStore on SQL Server: bad performance when default collation is of type 
> SQL*
> --
>
> Key: OAK-8908
> URL: https://issues.apache.org/jira/browse/OAK-8908
> Project: Jackrabbit Oak
>  Issue Type: Technical task
>  Components: rdbmk
>Reporter: Julian Reschke
>Assignee: Julian Reschke
>Priority: Major
> Fix For: 1.28.0
>
> Attachments: OAK-8908-1.6.diff, OAK-8908.diff
>
>
> RDBBlobStore uses a 64-char primary key (digest in hex).
> Unfortunately, this causes performance issues on MS SQL Server, when the 
> collation for that column is of type "SQL*" (see links). These types of 
> collations are deprecated, but still the default for installations on the 
> "EN_US" locale.
> The performance loss can be observed by changing the collation on an existing 
> install, and then enable performance logging on RDBBlobStore.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (OAK-8908) RDBBlobStore on SQL Server: bad performance when default collation is of type SQL*

2020-03-16 Thread Julian Reschke (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-8908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1705#comment-1705
 ] 

Julian Reschke commented on OAK-8908:
-

Thanks for the feedback. Yes, there are multiple things we can do.

But:

1. Benchmarks first
2. We need to at least consider existing deployments

> RDBBlobStore on SQL Server: bad performance when default collation is of type 
> SQL*
> --
>
> Key: OAK-8908
> URL: https://issues.apache.org/jira/browse/OAK-8908
> Project: Jackrabbit Oak
>  Issue Type: Technical task
>  Components: rdbmk
>Reporter: Julian Reschke
>Assignee: Julian Reschke
>Priority: Major
> Fix For: 1.28.0
>
> Attachments: OAK-8908-1.6.diff, OAK-8908.diff
>
>
> RDBBlobStore uses a 64-char primary key (digest in hex).
> Unfortunately, this causes performance issues on MS SQL Server, when the 
> collation for that column is of type "SQL*" (see links). These types of 
> collations are deprecated, but still the default for installations on the 
> "EN_US" locale.
> The performance loss can be observed by changing the collation on an existing 
> install, and then enable performance logging on RDBBlobStore.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (OAK-8908) RDBBlobStore on SQL Server: bad performance when default collation is of type SQL*

2020-03-15 Thread Solomon Rutzky (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-8908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17059698#comment-17059698
 ] 

Solomon Rutzky commented on OAK-8908:
-

Hi there. If this fix is to _not_ send a Unicode string if the column's 
datatype is not Unicode, then certainly this is an improvement. However, there 
are still two better options for improving performance in these scenarios:
 # Ideally, if the data is truly hex / binary data, then it should be stored in 
a {{BINARY}} / {{VARBINARY}} datatype instead of a string datatype. If the 
value is a consistent length/size, then the datatype should be {{BINARY(32)}}, 
else it should be {{VARBINARY(32)}}. This has two benefits:
 ## it cuts the storage size in half (which benefits performance, especially as 
the number of rows increases)
 ## comparison (i.e. filtering) and sorting operations are much faster (why 
suffer through culture-specific linguistic rules that will never be needed?)
 # If, for some reason, a binary datatype cannot be used (perhaps not all 
supported DB options offer a binary datatype), then at the very least a binary 
collation should be used. This has the following benefit: comparison (i.e. 
filtering) and sorting operations are much faster (why suffer through 
culture-specific linguistic rules that will never be needed?). The {{CREATE 
TABLE}} statements should be updated to include a {{COLLATE collation_name}} 
clause on the {{ID}} column so that the collation of the database is 
irrelevant. The particular binary collation to use depends on the minimum 
version of SQL Server that this project supports:
 ## For SQL Server 2000 and newer, use {{Latin1_General_BIN}} (_since you are 
only storing hex values, this option is probably fine_)
 ## For SQL Server 2005 and newer, use {{Latin1_General_BIN2}}
 ## For SQL Server 2008 and newer, use {{Latin1_General_100_BIN2}}

 

Take care,
 Solomon...
 [https://SqlQuantumLift.com/|https://sqlquantumlift.com/]
 [https://SqlQuantumLeap.com/|https://sqlquantumleap.com/]
 [https://SQLsharp.com/|https://sqlsharp.com/]

> RDBBlobStore on SQL Server: bad performance when default collation is of type 
> SQL*
> --
>
> Key: OAK-8908
> URL: https://issues.apache.org/jira/browse/OAK-8908
> Project: Jackrabbit Oak
>  Issue Type: Technical task
>  Components: rdbmk
>Reporter: Julian Reschke
>Assignee: Julian Reschke
>Priority: Major
> Fix For: 1.28.0
>
> Attachments: OAK-8908-1.6.diff, OAK-8908.diff
>
>
> RDBBlobStore uses a 64-char primary key (digest in hex).
> Unfortunately, this causes performance issues on MS SQL Server, when the 
> collation for that column is of type "SQL*" (see links). These types of 
> collations are deprecated, but still the default for installations on the 
> "EN_US" locale.
> The performance loss can be observed by changing the collation on an existing 
> install, and then enable performance logging on RDBBlobStore.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)