GitHub user ueshin commented on the pull request:

https://github.com/apache/spark/pull/1586#issuecomment-50637172

Hi @javadba, FYI.

I believe there are 3 types of "length" for a string in Java/Scala.

1) the number of 16-bit characters in the string

To get this, use [`String#length`](http://docs.oracle.com/javase/7/docs/api/java/lang/String.html#length()) like:

```scala
scala> "\uF93D\uF936\uF949\uF942".length // chinese characters
res0: Int = 4

scala> "\uD840\uDC0B\uD842\uDFB7".length // 2 surrogate pairs
res1: Int = 4

scala> "1234567890ABC".length
res2: Int = 13
```

2) the number of code points in the string

This will be covered by `Length`.
To get this, use [`String#codePointCount`](http://docs.oracle.com/javase/7/docs/api/java/lang/String.html#codePointCount(int,%20int)) like:

```scala
scala> "\uF93D\uF936\uF949\uF942".codePointCount(0, 4) // chinese characters
res0: Int = 4

scala> "\uD840\uDC0B\uD842\uDFB7".codePointCount(0, 4) // 2 surrogate pairs
res1: Int = 2

scala> "1234567890ABC".codePointCount(0, 13)
res2: Int = 13
```

3) the length of the byte array encoded from the string in some charset

To get this, use [`String#getBytes(charset)`](http://docs.oracle.com/javase/7/docs/api/java/lang/String.html#getBytes(java.lang.String))`.length` like:

```scala
scala> "\uF93D\uF936\uF949\uF942".getBytes("utf8").length // chinese characters
res0: Int = 12

scala> "\uD840\uDC0B\uD842\uDFB7".getBytes("utf8").length // 2 surrogate pairs
res1: Int = 8

scala> "1234567890ABC".getBytes("utf8").length
res2: Int = 13

scala> "\uF93D\uF936\uF949\uF942".getBytes("utf16").length // chinese characters
res3: Int = 10

scala> "\uD840\uDC0B\uD842\uDFB7".getBytes("utf16").length // 2 surrogate pairs
res4: Int = 10

scala> "1234567890ABC".getBytes("utf16").length
res5: Int = 28

scala> "\uF93D\uF936\uF949\uF942".getBytes("utf32").length // chinese characters
res6: Int = 16

scala> "\uD840\uDC0B\uD842\uDFB7".getBytes("utf32").length // 2 surrogate pairs
res7: Int = 8

scala> "1234567890ABC".getBytes("utf32").length
res8: Int = 52
```

At first I guessed you wanted 3) for `Strlen`, because 3) is the only length that depends on the charset, but then I saw the conversation point to another type of "length" and lost track of it partway through.
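The three kinds of "length" can be put side by side in one small helper, which may make the distinction easier to see. This is only a sketch: `StringLengths` and its method names are hypothetical and not part of Spark or this pull request, and UTF-8 is picked arbitrarily as the charset for the byte count.

```scala
import java.nio.charset.StandardCharsets

// Hypothetical helper for illustration only; not part of Spark or the pull request.
object StringLengths {
  // 1) Number of 16-bit UTF-16 code units (Java `char`s) in the string.
  def codeUnits(s: String): Int = s.length

  // 2) Number of Unicode code points; a surrogate pair counts as one.
  def codePoints(s: String): Int = s.codePointCount(0, s.length)

  // 3) Number of bytes after encoding the string in UTF-8 (an assumed charset choice).
  def utf8Bytes(s: String): Int = s.getBytes(StandardCharsets.UTF_8).length
}
```

For the surrogate-pair string `"\uD840\uDC0B\uD842\uDFB7"`, `codeUnits` gives 4, `codePoints` gives 2, and `utf8Bytes` gives 8, matching the REPL results above. The byte count changes with the charset; for example, the `"utf16"` results above include a 2-byte byte-order mark that Java's `UTF-16` encoder prepends.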