[jira] [Commented] (CALCITE-2619) Reduce string literal creation cost by removing charset check

2018-10-10 Thread Julian Hyde (JIRA)


[ 
https://issues.apache.org/jira/browse/CALCITE-2619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16645996#comment-16645996
 ] 

Julian Hyde commented on CALCITE-2619:
--

I think the charset test is useful. Is there any way we can amortize its cost, 
e.g. by caching the charset after it has been looked up, using a special path 
for easy, common charsets, or short-cutting validation when an NlsString is 
copied?
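
To make the "special path for easy, common charsets" idea concrete, here is a minimal sketch. It is not Calcite code: the class {{CharsetFastPath}} and the method {{canRepresent}} are hypothetical names, and it assumes the check only needs to know whether every character of the value is encodable in the given charset. Only JDK calls are used.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration; not part of Calcite. */
final class CharsetFastPath {
  private CharsetFastPath() {}

  /** Returns whether every character of {@code s} can be encoded in {@code charset}. */
  static boolean canRepresent(String s, Charset charset) {
    if (StandardCharsets.ISO_8859_1.equals(charset)) {
      // Fast path: Latin-1 covers exactly U+0000..U+00FF.
      return allCharsAtMost(s, 0xFF);
    }
    if (StandardCharsets.US_ASCII.equals(charset)) {
      // Fast path: ASCII covers exactly U+0000..U+007F.
      return allCharsAtMost(s, 0x7F);
    }
    // General path: let the charset's own encoder decide.
    return charset.newEncoder().canEncode(s);
  }

  private static boolean allCharsAtMost(String s, int max) {
    for (int i = 0; i < s.length(); i++) {
      if (s.charAt(i) > max) {
        return false;
      }
    }
    return true;
  }
}
```

The fast paths avoid allocating an encoder and an output buffer; any other charset still goes through the general check.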

> Reduce string literal creation cost by removing charset check
> -------------------------------------------------------------
>
> Key: CALCITE-2619
> URL: https://issues.apache.org/jira/browse/CALCITE-2619
> Project: Calcite
> Issue Type: Improvement
> Components: core
> Reporter: Ted Xu
> Assignee: Julian Hyde
> Priority: Major
>
> The cost of creating an NlsString is very high, due to its charset check. In 
> some cases, e.g. expression evaluation during partition pruning, NlsString 
> creation accounts for 40%+ of the executor's total overhead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CALCITE-2619) Reduce string literal creation cost by removing charset check

2018-10-11 Thread Ted Xu (JIRA)


[ 
https://issues.apache.org/jira/browse/CALCITE-2619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16646165#comment-16646165
 ] 

Ted Xu commented on CALCITE-2619:
-

Thanks [~julianhyde], I created a pull request at 
[https://github.com/apache/calcite/pull/883], supporting optional unsafe 
creation of NlsString.



[jira] [Commented] (CALCITE-2619) Reduce string literal creation cost by removing charset check

2018-10-11 Thread Julian Hyde (JIRA)


[ 
https://issues.apache.org/jira/browse/CALCITE-2619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16647247#comment-16647247
 ] 

Julian Hyde commented on CALCITE-2619:
--

You've done one of those "let's drill a hole" changes. What about what I 
suggested, keeping the same functionality but making it cheaper?



[jira] [Commented] (CALCITE-2619) Reduce string literal creation cost by removing charset check

2018-10-11 Thread Julian Hyde (JIRA)


[ 
https://issues.apache.org/jira/browse/CALCITE-2619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16647257#comment-16647257
 ] 

Julian Hyde commented on CALCITE-2619:
--

By the way, what is the expensive part in NlsString? Is it Charset.forName, or 
encoder.encode, or SqlUtil.translateCharacterSetName, or something else?



[jira] [Commented] (CALCITE-2619) Reduce string literal creation cost by removing charset check

2018-10-11 Thread Ted Xu (JIRA)


[ 
https://issues.apache.org/jira/browse/CALCITE-2619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16647315#comment-16647315
 ] 

Ted Xu commented on CALCITE-2619:
-

As for the cost distribution, I did a quick test:

 
||Name||CPU time||Invocations||
|org.apache.calcite.sql.SqlUtil.translateCharacterSetName(String)|10.7s (0.1%)|16,089|
|java.nio.charset.CharsetEncoder.encode(java.nio.CharBuffer, java.nio.ByteBuffer, boolean)|1.374s (7.1%)|16,089|

Charset.forName has its own cache so the cost can be ignored.
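
For reference, an encoder-based check of the kind that shows up in the profile can be sketched as follows. This is an illustration under assumptions, not the actual NlsString code: {{EncoderCheck}} and {{isEncodable}} are made-up names, and it assumes the check amounts to a full encode of the value, which is where {{CharsetEncoder.encode}} is invoked.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;

/** Hypothetical illustration of an encoder-based charset check. */
final class EncoderCheck {
  private EncoderCheck() {}

  /** Returns whether {@code value} can be fully encoded in {@code charset}. */
  static boolean isEncodable(String value, Charset charset) {
    // A fresh encoder per call; REPORT makes bad input throw instead of being replaced.
    CharsetEncoder encoder = charset.newEncoder()
        .onMalformedInput(CodingErrorAction.REPORT)
        .onUnmappableCharacter(CodingErrorAction.REPORT);
    try {
      // Encodes the whole value into a newly allocated buffer, which is then discarded.
      encoder.encode(CharBuffer.wrap(value));
      return true;
    } catch (CharacterCodingException e) {
      return false;
    }
  }
}
```

Each call allocates an encoder and an output buffer and walks the entire string, so the cost grows with both the number of literals and their length.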

As for the improvements mentioned above:
 # Caching values that have already been checked: we considered exactly this, 
but looking up a string value in a cache is still very expensive, not to 
mention the memory overhead of the cache.
 # Skipping verification for common charsets: [~julianhyde], can you elaborate 
on this one? Note that in CJK (China, Japan, Korea) countries UTF-8 is commonly 
adopted, and we use UTF-8 as our default charset.
 # Skipping verification on copy: copying an NlsString changes the value, so 
skipping verification is still unsafe.
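
To illustrate the first option, a bounded cache of already-checked values might look like the sketch below. Everything here is hypothetical ({{ValidatedValueCache}}, its capacity parameter, and the {{misses}} counter are made up for illustration); it assumes the result depends on both the charset and the string contents, so both form the key.

```java
import java.nio.charset.Charset;
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical LRU cache of charset-check results; not part of Calcite. */
final class ValidatedValueCache {
  private final Map<Map.Entry<Charset, String>, Boolean> cache;
  int misses; // number of real encoder checks performed, exposed for illustration

  ValidatedValueCache(final int capacity) {
    // An access-ordered LinkedHashMap evicts the least recently used entry.
    this.cache = new LinkedHashMap<Map.Entry<Charset, String>, Boolean>(16, 0.75f, true) {
      @Override protected boolean removeEldestEntry(
          Map.Entry<Map.Entry<Charset, String>, Boolean> eldest) {
        return size() > capacity;
      }
    };
  }

  /** Returns whether {@code value} is encodable in {@code charset}, consulting the cache first. */
  synchronized boolean isEncodable(String value, Charset charset) {
    Map.Entry<Charset, String> key = new SimpleImmutableEntry<>(charset, value);
    Boolean cached = cache.get(key);
    if (cached != null) {
      return cached;
    }
    misses++;
    boolean ok = charset.newEncoder().canEncode(value);
    cache.put(key, ok);
    return ok;
  }
}
```

Note that every lookup still has to hash and compare the string and take a lock, and the cache holds on to the strings it has seen, which is the cost concern raised in the first point.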

 



[jira] [Commented] (CALCITE-2619) Reduce string literal creation cost by removing charset check

2018-10-12 Thread Haisheng Yuan (JIRA)


[ 
https://issues.apache.org/jira/browse/CALCITE-2619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16647560#comment-16647560
 ] 

Haisheng Yuan commented on CALCITE-2619:


[~julianhyde] Once the CharsetEncoder of a Charset passes the encoding test, 
do we need to repeat the charset test every time an NlsString is created with 
the same charset? If not, we can cache the Charset after it passes the test, 
and skip the test whenever the charset is found in the cache.



[jira] [Commented] (CALCITE-2619) Reduce string literal creation cost by removing charset check

2018-10-12 Thread Haisheng Yuan (JIRA)


[ 
https://issues.apache.org/jira/browse/CALCITE-2619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16647565#comment-16647565
 ] 

Haisheng Yuan commented on CALCITE-2619:


I am not worried about the memory overhead of caching charset objects, since 
there is a limited number of charsets.



[jira] [Commented] (CALCITE-2619) Reduce string literal creation cost by removing charset check

2018-10-12 Thread Julian Hyde (JIRA)


[ 
https://issues.apache.org/jira/browse/CALCITE-2619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16648585#comment-16648585
 ] 

Julian Hyde commented on CALCITE-2619:
--

I think we should focus on the cost of validating whether a string is valid in 
UTF-8. (I probably have not seen this issue because I use Latin1 or something 
similar.) Is there a cheap general check?

I did a Google search and the top hit was by my friend [~lemire]: 
[https://lemire.me/blog/2018/05/09/how-quickly-can-you-check-that-a-string-is-valid-unicode-utf-8/]
 



[jira] [Commented] (CALCITE-2619) Reduce string literal creation cost by removing charset check

2018-10-13 Thread Daniel Lemire (JIRA)


[ 
https://issues.apache.org/jira/browse/CALCITE-2619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16649007#comment-16649007
 ] 

Daniel Lemire commented on CALCITE-2619:


I am willing to help. I can help test, design, and benchmark solutions.



[jira] [Commented] (CALCITE-2619) Reduce string literal creation cost by removing charset check

2018-10-14 Thread Ted Xu (JIRA)


[ 
https://issues.apache.org/jira/browse/CALCITE-2619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16649663#comment-16649663
 ] 

Ted Xu commented on CALCITE-2619:
-

Great! Thanks [~julianhyde] and [~lemire]!

[~lemire] I'm assigning this Jira to you; please let me know if there is 
anything I can help with.



[jira] [Commented] (CALCITE-2619) Reduce string literal creation cost by removing charset check

2018-10-14 Thread Julian Hyde (JIRA)


[ 
https://issues.apache.org/jira/browse/CALCITE-2619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16649792#comment-16649792
 ] 

Julian Hyde commented on CALCITE-2619:
--

Probably good enough for our purposes is Guava’s 
[Utf8.isWellFormed|https://github.com/google/guava/blob/581ba1436ebaa54a7f5d0f1db8cc4da0ca72127e/guava/src/com/google/common/base/Utf8.java#L125]
 method. 
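
For readers without Guava at hand, a hand-written well-formedness check over a byte array might look like the sketch below. It is not Guava's implementation: {{Utf8Check}} is a hypothetical name, and the byte ranges follow the well-formed UTF-8 byte sequence table in the Unicode Standard. It includes a leading scan over ASCII bytes, on the assumption that most literals are plain ASCII.

```java
/** Hypothetical plain-JDK UTF-8 well-formedness check; not Guava's code. */
final class Utf8Check {
  private Utf8Check() {}

  /** Returns whether {@code b} is a well-formed UTF-8 byte sequence. */
  static boolean isWellFormed(byte[] b) {
    final int n = b.length;
    int i = 0;
    // Fast path: skip the leading run of ASCII bytes (non-negative as signed bytes).
    while (i < n && b[i] >= 0) {
      i++;
    }
    while (i < n) {
      int b0 = b[i++] & 0xFF;
      if (b0 < 0x80) {
        continue; // single-byte (ASCII)
      }
      int need;      // number of continuation bytes expected
      int lo = 0x80; // allowed range of the first continuation byte
      int hi = 0xBF;
      if (b0 >= 0xC2 && b0 <= 0xDF) {
        need = 1;
      } else if (b0 >= 0xE0 && b0 <= 0xEF) {
        need = 2;
        if (b0 == 0xE0) lo = 0xA0; // rejects overlong 3-byte forms
        if (b0 == 0xED) hi = 0x9F; // rejects encoded surrogates
      } else if (b0 >= 0xF0 && b0 <= 0xF4) {
        need = 3;
        if (b0 == 0xF0) lo = 0x90; // rejects overlong 4-byte forms
        if (b0 == 0xF4) hi = 0x8F; // rejects values above U+10FFFF
      } else {
        return false; // 0x80..0xC1 and 0xF5..0xFF can never start a sequence
      }
      if (n - i < need) {
        return false; // truncated sequence
      }
      for (int k = 0; k < need; k++) {
        int c = b[i++] & 0xFF;
        if (c < lo || c > hi) {
          return false;
        }
        lo = 0x80; // later continuation bytes use the full range
        hi = 0xBF;
      }
    }
    return true;
  }
}
```

This checks bytes, so it only applies once the value is available as a byte array rather than as a Java String.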



[jira] [Commented] (CALCITE-2619) Reduce string literal creation cost by removing charset check

2018-10-15 Thread Daniel Lemire (JIRA)


[ 
https://issues.apache.org/jira/browse/CALCITE-2619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16650255#comment-16650255
 ] 

Daniel Lemire commented on CALCITE-2619:


I'll do some evaluation of the problem (hopefully this week) and report back.



[jira] [Commented] (CALCITE-2619) Reduce string literal creation cost by removing charset check

2018-10-16 Thread Daniel Lemire (JIRA)


[ 
https://issues.apache.org/jira/browse/CALCITE-2619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16652404#comment-16652404
 ] 

Daniel Lemire commented on CALCITE-2619:


As for validating UTF-8 bytes, [~julianhyde] is probably right that Guava is a 
good choice, if one cares about performance in this instance:

[https://lemire.me/blog/2018/10/16/validating-utf-8-bytes-java-edition/]

Note that their function is going to be super fast if the input is ASCII, so 
it is good.

The issue posted here is different, I think. Let me comment on that separately.



[jira] [Commented] (CALCITE-2619) Reduce string literal creation cost by removing charset check

2018-10-16 Thread Daniel Lemire (JIRA)


[ 
https://issues.apache.org/jira/browse/CALCITE-2619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16652430#comment-16652430
 ] 

Daniel Lemire commented on CALCITE-2619:


The problem that I see in the code right now is that you take a Java String as 
input. Now, I would think that all Java strings can be represented as UTF-8 or 
UTF-16. Is that not the case? If it is not the case, I want to know!

 

So I would think that something like this should be helpful...

 

 
{code:java}
// equals() rather than != so that an equal Charset instance also takes the fast path
if (!StandardCharsets.UTF_8.equals(this.charset)
    && !StandardCharsets.UTF_16.equals(this.charset)) {
  // verify
} else {
  // assume that the Java String is valid Unicode
}
{code}
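
On the question of whether every Java String can be represented in UTF-8: a String is a sequence of UTF-16 code units and may contain an unpaired surrogate, which has no UTF-8 encoding. A minimal check using only JDK calls ({{LoneSurrogateDemo}} and {{utf8Encodable}} are made-up names for illustration):

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical demo: asks the JDK's UTF-8 encoder whether it accepts a string. */
final class LoneSurrogateDemo {
  private LoneSurrogateDemo() {}

  /** Returns whether the UTF-8 encoder can encode the whole string. */
  static boolean utf8Encodable(String s) {
    return StandardCharsets.UTF_8.newEncoder().canEncode(s);
  }

  public static void main(String[] args) {
    String paired = "\uD83D\uDE00"; // high + low surrogate: one supplementary code point
    String lone = "\uD83D";         // high surrogate with no low surrogate after it
    System.out.println("paired: " + utf8Encodable(paired));
    System.out.println("lone:   " + utf8Encodable(lone));
  }
}
```

So skipping verification for UTF-8 would be safe only if unpaired surrogates are either acceptable or ruled out elsewhere.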
 



[jira] [Commented] (CALCITE-2619) Reduce string literal creation cost by removing charset check

2018-10-17 Thread Julian Hyde (JIRA)


[ 
https://issues.apache.org/jira/browse/CALCITE-2619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16654385#comment-16654385
 ] 

Julian Hyde commented on CALCITE-2619:
--

It's very possible that the "payload" of an NlsString should be a byte array 
rather than a Java String.



[jira] [Commented] (CALCITE-2619) Reduce string literal creation cost by removing charset check

2018-10-22 Thread Ted Xu (JIRA)


[ 
https://issues.apache.org/jira/browse/CALCITE-2619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16659943#comment-16659943
 ] 

Ted Xu commented on CALCITE-2619:
-

Thanks [~lemire] for the test and blog post. We should move forward by adopting 
Guava.

Daniel, are you going to submit a patch to Calcite? If not, I'd like to 
contribute one.



[jira] [Commented] (CALCITE-2619) Reduce string literal creation cost by removing charset check

2018-10-24 Thread Daniel Lemire (JIRA)


[ 
https://issues.apache.org/jira/browse/CALCITE-2619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16662880#comment-16662880
 ] 

Daniel Lemire commented on CALCITE-2619:


Ping me if you need help reviewing the code. 



[jira] [Commented] (CALCITE-2619) Reduce string literal creation cost by removing charset check

2018-11-08 Thread Julian Hyde (JIRA)


[ 
https://issues.apache.org/jira/browse/CALCITE-2619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16680696#comment-16680696
 ] 

Julian Hyde commented on CALCITE-2619:
--

[~tedxu] Can you develop a patch? Can you also please close [PR 
883|https://github.com/apache/calcite/pull/883], since we're not going the 
'unsafe' route.



[jira] [Commented] (CALCITE-2619) Reduce string literal creation cost by removing charset check

2018-11-08 Thread Ted Xu (JIRA)


[ 
https://issues.apache.org/jira/browse/CALCITE-2619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16680757#comment-16680757
 ] 

Ted Xu commented on CALCITE-2619:
-

[~julianhyde] sorry for the late reply; I have already been working on this 
issue.

However, the change is a bit larger than I expected. I'd like to raise a few 
more issues before I submit the patch:

1. I think the original charset verification can only tell whether a Unicode 
string can be encoded as LATIN1, since the 'value' of an NlsString is a Java 
String. I would change the type of 'value' from String to byte[].

2. Even if the payload of an NlsString is a byte[], we still need to cache an 
encoded String to reduce the encoding cost. I would also like to add a method 
'getValueBytes() : byte[]' for callers that need to skip encoding entirely.

 

 



[jira] [Commented] (CALCITE-2619) Reduce string literal creation cost by removing charset check

2018-11-08 Thread Julian Hyde (JIRA)


[ 
https://issues.apache.org/jira/browse/CALCITE-2619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16680805#comment-16680805
 ] 

Julian Hyde commented on CALCITE-2619:
--

That makes sense. I suggest making the {{byte[]}} value private, and copying it 
on creation, so that no one can mess with it. Also make sure, on creation, that 
the bytes are valid for the charset/encoding.

Consider using an Avatica ByteString, which contains a {{byte[]}} internally 
but is immutable.

You should keep the current constructor, which takes a java.lang.String, for 
"simple" encodings like LATIN1 and UTF16.
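
A rough sketch of the byte-backed shape described above, using only the JDK. This is not the real NlsString and does not use Avatica's ByteString: {{ByteBackedString}} and its accessors are hypothetical, and it assumes that strictly decoding the bytes once on creation is an acceptable way to both validate them and obtain the cached String. The String-based constructor is left out of the sketch.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.util.Arrays;

/** Hypothetical byte-backed string literal; not Calcite's NlsString. */
final class ByteBackedString {
  private final byte[] bytes;   // private defensive copy, never handed out directly
  private final Charset charset;
  private final String decoded; // String form, computed once on creation

  ByteBackedString(byte[] bytes, Charset charset) {
    // Copy on creation so later changes to the caller's array cannot affect us.
    this.bytes = Arrays.copyOf(bytes, bytes.length);
    this.charset = charset;
    try {
      // Strict decode: throws if the bytes are not valid for the charset.
      this.decoded = charset.newDecoder()
          .onMalformedInput(CodingErrorAction.REPORT)
          .onUnmappableCharacter(CodingErrorAction.REPORT)
          .decode(ByteBuffer.wrap(this.bytes))
          .toString();
    } catch (CharacterCodingException e) {
      throw new IllegalArgumentException("bytes are not valid " + charset.name(), e);
    }
  }

  /** Returns a copy of the raw bytes, for callers that want to skip String conversion. */
  byte[] getValueBytes() {
    return bytes.clone();
  }

  /** Returns the cached String form. */
  String getValue() {
    return decoded;
  }

  Charset getCharset() {
    return charset;
  }
}
```

Returning a copy from {{getValueBytes()}} keeps the object immutable at the price of an allocation per call; an immutable wrapper such as the ByteString mentioned above would avoid that copy.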
