[jira] [Commented] (CALCITE-2619) Reduce string literal creation cost by removing charset check
[ https://issues.apache.org/jira/browse/CALCITE-2619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16645996#comment-16645996 ]

Julian Hyde commented on CALCITE-2619:
--------------------------------------

I think the charset test is useful. Is there any way we can amortize its cost, eg caching the charset after it has been looked up, or using a special path for easy, common charsets, or short-cutting validation when a NlsString is copied?

> Reduce string literal creation cost by removing charset check
> -------------------------------------------------------------
>
>                 Key: CALCITE-2619
>                 URL: https://issues.apache.org/jira/browse/CALCITE-2619
>             Project: Calcite
>          Issue Type: Improvement
>          Components: core
>            Reporter: Ted Xu
>            Assignee: Julian Hyde
>            Priority: Major
>
> The cost of creating NlsString is very high, due to its charset check. In
> some cases, e.g., expression evaluate because of Partition Prune, the
> NlsString creation costs 40%+ of total executor's overhead.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
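One way to read the "special path for easy, common charsets" idea is sketched below. It is only an illustration, not Calcite code: the class and method names are hypothetical. It relies on the fact that ISO-8859-1 encodes exactly the code units U+0000 to U+00FF, so a plain scan can stand in for the encoder for that charset, while every other charset still goes through the general JDK check.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: a cheap validity check for ISO-8859-1 with a general fallback. */
final class Latin1FastPath {
  private Latin1FastPath() {}

  /** Returns true if every char lies in U+0000..U+00FF, the range ISO-8859-1 can encode. */
  static boolean fitsLatin1(String value) {
    for (int i = 0; i < value.length(); i++) {
      if (value.charAt(i) > 0xFF) {
        return false;
      }
    }
    return true;
  }

  /** Uses the scan for ISO-8859-1 and the JDK encoder for every other charset. */
  static boolean canEncode(String value, Charset charset) {
    if (StandardCharsets.ISO_8859_1.equals(charset)) {
      return fitsLatin1(value);
    }
    // General path: a fresh encoder, as in the check the issue describes as expensive.
    return charset.newEncoder().canEncode(value);
  }
}
```

The fast path does not help when the default charset is UTF-8, which is the case raised later in the thread.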
[jira] [Commented] (CALCITE-2619) Reduce string literal creation cost by removing charset check
[ https://issues.apache.org/jira/browse/CALCITE-2619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16646165#comment-16646165 ]

Ted Xu commented on CALCITE-2619:
---------------------------------

Thanks [~julianhyde], I created a pull request at [https://github.com/apache/calcite/pull/883], supporting optional unsafe creation of NlsString.
[jira] [Commented] (CALCITE-2619) Reduce string literal creation cost by removing charset check
[ https://issues.apache.org/jira/browse/CALCITE-2619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16647247#comment-16647247 ]

Julian Hyde commented on CALCITE-2619:
--------------------------------------

You've done one of those "let's drill a hole" changes. What about what I suggested, keeping the same functionality but making it cheaper?

> Reduce string literal creation cost by removing charset check
> -------------------------------------------------------------
>
>                 Key: CALCITE-2619
>                 URL: https://issues.apache.org/jira/browse/CALCITE-2619
>             Project: Calcite
>          Issue Type: Improvement
>          Components: core
>            Reporter: Ted Xu
>            Assignee: Ted Xu
>            Priority: Major
>
> The cost of creating NlsString is very high, due to its charset check. In
> some cases, e.g., expression evaluate because of Partition Prune, the
> NlsString creation costs 40%+ of total executor's overhead.
[jira] [Commented] (CALCITE-2619) Reduce string literal creation cost by removing charset check
[ https://issues.apache.org/jira/browse/CALCITE-2619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16647257#comment-16647257 ]

Julian Hyde commented on CALCITE-2619:
--------------------------------------

By the way, what is the expensive part in NlsString? Is it Charset.forName, or encoder.encode, or SqlUtil.translateCharacterSetName, or something else?
[jira] [Commented] (CALCITE-2619) Reduce string literal creation cost by removing charset check
[ https://issues.apache.org/jira/browse/CALCITE-2619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16647315#comment-16647315 ]

Ted Xu commented on CALCITE-2619:
---------------------------------

As for the cost distribution, I did a quick test:

||Name||CPU time||Invocations||
|org.apache.calcite.sql.SqlUtil.translateCharacterSetName(String)|10.7s (0.1%)|16,089|
|java.nio.charset.CharsetEncoder.encode(java.nio.CharBuffer, java.nio.ByteBuffer, boolean)|1.374s (7.1%)|16,089|

Charset.forName has its own cache, so its cost can be ignored.

As for the improvements mentioned above:

# Caching values that have been checked: we've considered exactly this approach, but looking up a string value in a cache is still very expensive, not to mention the memory overhead of the cache.
# Skipping verification for common charsets: [~julianhyde] can you elaborate on this one? However, UTF-8 is commonly adopted in CJK (China, Japan, Korea) countries, and we use UTF-8 as our default charset.
# Skipping verification on copy: copying an NlsString changes the value, so skipping verification is still unsafe.
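For readers without the source at hand, the kind of per-literal check being profiled above can be sketched with JDK calls only. This is an assumption about the shape of the check, not a copy of NlsString; the class and method names are hypothetical. The one-argument `CharsetEncoder.encode(CharBuffer)` encodes the whole string into a new buffer and internally drives the three-argument `encode` that appears in the table, so its cost grows with the length of every literal.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;

/** Hypothetical stand-in for an encoder-based validity check on a string literal. */
final class EncoderCheck {
  private EncoderCheck() {}

  /** Returns true if value can be encoded in charset; encodes the full string to find out. */
  static boolean isEncodable(String value, Charset charset) {
    try {
      // A new encoder reports malformed or unmappable input by throwing.
      charset.newEncoder().encode(CharBuffer.wrap(value));
      return true;
    } catch (CharacterCodingException e) {
      return false;
    }
  }
}
```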
[jira] [Commented] (CALCITE-2619) Reduce string literal creation cost by removing charset check
[ https://issues.apache.org/jira/browse/CALCITE-2619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16647560#comment-16647560 ]

Haisheng Yuan commented on CALCITE-2619:
----------------------------------------

[~julianhyde] Once the CharsetEncoder of a Charset passes the encoding test, do we need to repeat the charset test every time an NlsString is created with the same charset? If not, we can definitely cache the Charset after it passes the charset test, and skip the test if the Charset can be found in the cache.
[jira] [Commented] (CALCITE-2619) Reduce string literal creation cost by removing charset check
[ https://issues.apache.org/jira/browse/CALCITE-2619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16647565#comment-16647565 ]

Haisheng Yuan commented on CALCITE-2619:
----------------------------------------

I am not worried about the memory overhead of caching Charset objects, since there is a limited number of charsets.
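A minimal sketch of a charset cache keyed by name follows, assuming a hypothetical `CharsetCache` class that is not part of Calcite. Note what it does and does not cover: it memoizes only the name-to-Charset resolution. Whether a particular string is encodable depends on the string itself, so a per-charset cache like this cannot replace the per-value encoding check measured earlier in the thread.

```java
import java.nio.charset.Charset;
import java.util.Locale;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical cache of resolved Charset objects, keyed by upper-cased name. */
final class CharsetCache {
  private static final ConcurrentMap<String, Charset> CACHE = new ConcurrentHashMap<>();

  private CharsetCache() {}

  /** Resolves a charset name once and reuses the result on later calls. */
  static Charset lookup(String name) {
    // Charset.forName throws for unknown names; in that case nothing is cached.
    return CACHE.computeIfAbsent(name.toUpperCase(Locale.ROOT), Charset::forName);
  }
}
```

The number of entries is bounded by the number of distinct charset names in use, which matches the point about memory overhead above.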
[jira] [Commented] (CALCITE-2619) Reduce string literal creation cost by removing charset check
[ https://issues.apache.org/jira/browse/CALCITE-2619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16648585#comment-16648585 ]

Julian Hyde commented on CALCITE-2619:
--------------------------------------

I think we should focus on the cost of validating whether a string is valid in UTF-8. (I probably have not seen this issue because I use Latin1 or something similar.) Is there a cheap general check? I did a Google search and the top hit was by my friend [~lemire]: [https://lemire.me/blog/2018/05/09/how-quickly-can-you-check-that-a-string-is-valid-unicode-utf-8/]
[jira] [Commented] (CALCITE-2619) Reduce string literal creation cost by removing charset check
[ https://issues.apache.org/jira/browse/CALCITE-2619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16649007#comment-16649007 ]

Daniel Lemire commented on CALCITE-2619:
----------------------------------------

I am willing to help. I can help test, design, benchmark solutions.
[jira] [Commented] (CALCITE-2619) Reduce string literal creation cost by removing charset check
[ https://issues.apache.org/jira/browse/CALCITE-2619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16649663#comment-16649663 ]

Ted Xu commented on CALCITE-2619:
---------------------------------

Great! Thanks [~julianhyde] and [~lemire]!

[~lemire] I'm assigning this Jira to you, please let me know if there is anything I can help with.
[jira] [Commented] (CALCITE-2619) Reduce string literal creation cost by removing charset check
[ https://issues.apache.org/jira/browse/CALCITE-2619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16649792#comment-16649792 ]

Julian Hyde commented on CALCITE-2619:
--------------------------------------

Probably good enough for our purposes is Guava’s [Utf8.isWellFormed|https://github.com/google/guava/blob/581ba1436ebaa54a7f5d0f1db8cc4da0ca72127e/guava/src/com/google/common/base/Utf8.java#L125] method.

> Reduce string literal creation cost by removing charset check
> -------------------------------------------------------------
>
>                 Key: CALCITE-2619
>                 URL: https://issues.apache.org/jira/browse/CALCITE-2619
>             Project: Calcite
>          Issue Type: Improvement
>          Components: core
>            Reporter: Ted Xu
>            Assignee: Julian Hyde
>            Priority: Major
>
> The cost of creating NlsString is very high, due to its charset check. In
> some cases, e.g., expression evaluate because of Partition Prune, the
> NlsString creation costs 40%+ of total executor's overhead.
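To make concrete what "well-formed UTF-8" means for a byte array, here is a JDK-only sketch. It is deliberately not Guava's implementation: it swaps in a strict `CharsetDecoder` so that the example needs no extra dependency, and it will likely be slower than a hand-written scan. The class name `Utf8Check` is hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Hypothetical JDK-only check that a byte array is well-formed UTF-8. */
final class Utf8Check {
  private Utf8Check() {}

  /** Returns true if bytes decode as UTF-8 without malformed or truncated sequences. */
  static boolean isWellFormedUtf8(byte[] bytes) {
    try {
      StandardCharsets.UTF_8.newDecoder()
          .onMalformedInput(CodingErrorAction.REPORT)
          .onUnmappableCharacter(CodingErrorAction.REPORT)
          .decode(ByteBuffer.wrap(bytes));
      return true;
    } catch (CharacterCodingException e) {
      return false;
    }
  }
}
```

With Guava on the classpath, the linked `Utf8.isWellFormed(byte[])` answers the same question without building the decoded character buffer.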
[jira] [Commented] (CALCITE-2619) Reduce string literal creation cost by removing charset check
[ https://issues.apache.org/jira/browse/CALCITE-2619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16650255#comment-16650255 ]

Daniel Lemire commented on CALCITE-2619:
----------------------------------------

I'll do some evaluation of the problem (hopefully this week) and report back.
[jira] [Commented] (CALCITE-2619) Reduce string literal creation cost by removing charset check
[ https://issues.apache.org/jira/browse/CALCITE-2619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16652404#comment-16652404 ]

Daniel Lemire commented on CALCITE-2619:
----------------------------------------

As for validating UTF-8 bytes, @julianhyde is probably right that Guava is a good choice... if one cares about performance in this instance...

[https://lemire.me/blog/2018/10/16/validating-utf-8-bytes-java-edition/]

Note that their function is going to be super fast if the input is ASCII... so it is good.

The issue posted here is different. I think. Let me comment on that separately.
[jira] [Commented] (CALCITE-2619) Reduce string literal creation cost by removing charset check
[ https://issues.apache.org/jira/browse/CALCITE-2619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16652430#comment-16652430 ]

Daniel Lemire commented on CALCITE-2619:
----------------------------------------

The problem that I see in the code right now is that you take as input a Java String. Now, I would think that all Java strings can be represented as UTF-8 or UTF-16. Is that not the case? If it is not the case, I want to know!

So I would think that something like this should be helpful...

{code:java}
if ((this.charset != StandardCharsets.UTF_8) && (this.charset != StandardCharsets.UTF_16)) {
  // verify
} else {
  // assume that the Java String is valid unicode
}
{code}
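On the question of whether every Java String can be represented as UTF-8: a Java String is a sequence of UTF-16 code units and may contain an unpaired surrogate, which a strict UTF-8 encoder rejects. So the assumption in the snippet above holds for any string without unpaired surrogates. The sketch below, with a hypothetical class name, scans for that one exceptional case; it could serve as a cheaper guard than a full encode, under the assumption that such a scan is all the Unicode charsets need.

```java
/** Hypothetical helper: detects the one kind of Java String a strict UTF-8 encoder rejects. */
final class SurrogateCheck {
  private SurrogateCheck() {}

  /** Returns true if s contains a high or low surrogate that is not part of a valid pair. */
  static boolean hasUnpairedSurrogate(String s) {
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      if (Character.isHighSurrogate(c)) {
        if (i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1))) {
          i++; // skip the low half of a valid pair
        } else {
          return true; // high surrogate with no low surrogate after it
        }
      } else if (Character.isLowSurrogate(c)) {
        return true; // low surrogate with no high surrogate before it
      }
    }
    return false;
  }
}
```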
[jira] [Commented] (CALCITE-2619) Reduce string literal creation cost by removing charset check
[ https://issues.apache.org/jira/browse/CALCITE-2619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16654385#comment-16654385 ]

Julian Hyde commented on CALCITE-2619:
--------------------------------------

It's very possible that the "payload" of an NlsString should be a byte array rather than a Java String.
[jira] [Commented] (CALCITE-2619) Reduce string literal creation cost by removing charset check
[ https://issues.apache.org/jira/browse/CALCITE-2619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16659943#comment-16659943 ]

Ted Xu commented on CALCITE-2619:
---------------------------------

Thanks [~lemire] for the test and blog post. We should move forward by adopting Guava.

Daniel, are you going to submit a patch to Calcite? If not, I'd like to contribute one.
[jira] [Commented] (CALCITE-2619) Reduce string literal creation cost by removing charset check
[ https://issues.apache.org/jira/browse/CALCITE-2619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16662880#comment-16662880 ]

Daniel Lemire commented on CALCITE-2619:
----------------------------------------

Ping me if you need help reviewing the code.
[jira] [Commented] (CALCITE-2619) Reduce string literal creation cost by removing charset check
[ https://issues.apache.org/jira/browse/CALCITE-2619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16680696#comment-16680696 ]

Julian Hyde commented on CALCITE-2619:
--------------------------------------

[~tedxu] Can you develop a patch? Can you also please close [PR 883|https://github.com/apache/calcite/pull/883], since we're not going the 'unsafe' route.
[jira] [Commented] (CALCITE-2619) Reduce string literal creation cost by removing charset check
[ https://issues.apache.org/jira/browse/CALCITE-2619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16680757#comment-16680757 ]

Ted Xu commented on CALCITE-2619:
---------------------------------

[~julianhyde] sorry for the late reply, I was already working on this issue. However, the change is a bit larger than I expected. I'd like to raise some more issues before I submit the patch:

1. I think the original verification of charset can only tell whether a Unicode string is LATIN1 encoded or not, since the 'value' of NlsString is a Java String. I would change the 'value' type from String to byte[].
2. The payload of NlsString is then a byte[], but we still need to cache an encoded String to reduce encoding cost. I would also like to have a method 'getValueBytes() : byte[]' if someone needs to skip encoding entirely.
[jira] [Commented] (CALCITE-2619) Reduce string literal creation cost by removing charset check
[ https://issues.apache.org/jira/browse/CALCITE-2619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16680805#comment-16680805 ]

Julian Hyde commented on CALCITE-2619:
--------------------------------------

That makes sense. I suggest making the {{byte[]}} value private and copying it on creation, so that no one can mess with it. Also make sure on creation that the bytes are valid for the charset/encoding. Consider using an Avatica ByteString, which contains a {{byte[]}} internally but is immutable.

You should keep the current constructor, which takes a java.lang.String, for "simple" encodings like LATIN1 and UTF16.
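The design discussed in the last two comments (a private byte[] payload copied on creation, validated against the charset once, with a cached String and a byte accessor) can be sketched as below. This is an illustration under stated assumptions, not the eventual Calcite patch: `EncodedLiteral` is a hypothetical class, it uses a plain `byte[]` rather than Avatica's ByteString, and it validates with a JDK decoder instead of a specialized UTF-8 check.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;

/** Hypothetical immutable literal: private byte payload, validated once at creation. */
final class EncodedLiteral {
  private final byte[] bytes;    // private copy, never handed out directly
  private final Charset charset;
  private final String decoded;  // cached so later reads need no conversion

  EncodedLiteral(byte[] bytes, Charset charset) {
    this.bytes = bytes.clone();  // defensive copy: later changes by the caller have no effect
    this.charset = charset;
    try {
      // A new decoder reports malformed input by throwing, so invalid bytes are rejected here.
      this.decoded = charset.newDecoder().decode(ByteBuffer.wrap(this.bytes)).toString();
    } catch (CharacterCodingException e) {
      throw new IllegalArgumentException("bytes are not valid " + charset.name(), e);
    }
  }

  /** Returns a copy of the payload, so callers cannot modify the internal array. */
  byte[] getValueBytes() {
    return bytes.clone();
  }

  String getValue() {
    return decoded;
  }

  Charset getCharset() {
    return charset;
  }
}
```

Decoding eagerly keeps validation and caching in one pass; a variant that only validates and decodes lazily would save memory for literals whose String form is never read.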