[GitHub] spark pull request: SPARK-2686 Add Length and Strlen support to Sp...

ueshin Wed, 30 Jul 2014 09:19:08 -0700

Github user ueshin commented on the pull request:

    https://github.com/apache/spark/pull/1586#issuecomment-50637172
  
    Hi @javadba, FYI.
    I believe there are three kinds of "length" for a string in Java/Scala.
    
    1) the number of 16-bit characters (`char` values, i.e. UTF-16 code units) in the string
    To get this, use [`String#length`](http://docs.oracle.com/javase/7/docs/api/java/lang/String.html#length()) like:
    
    ```scala
    scala> "\uF93D\uF936\uF949\uF942".length // chinese characters
    res0: Int = 4
    
    scala> "\uD840\uDC0B\uD842\uDFB7".length // 2 surrogate pairs
    res1: Int = 4
    
    scala> "1234567890ABC".length
    res2: Int = 13
    ```
    
    2) the number of code points in the string
    This will be covered by `Length`.
    To get this, use [`String#codePointCount`](http://docs.oracle.com/javase/7/docs/api/java/lang/String.html#codePointCount(int,%20int)) like:
    
    ```scala
    scala> "\uF93D\uF936\uF949\uF942".codePointCount(0, 4) // chinese characters
    res0: Int = 4
    
    scala> "\uD840\uDC0B\uD842\uDFB7".codePointCount(0, 4) // 2 surrogate pairs
    res1: Int = 2
    
    scala> "1234567890ABC".codePointCount(0, 13)
    res2: Int = 13
    ```
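    
    The end index passed to `codePointCount` above is the `String#length` value from 1), so a small helper can count code points over the whole string without hard-coding it. This is only a minimal sketch; the names `CodePointLength` and `codePoints` are hypothetical and not taken from the pull request:
    
    ```scala
    object CodePointLength {
      // Hypothetical helper: counts Unicode code points over the whole string,
      // using the char count (String#length) as the end index.
      def codePoints(s: String): Int = s.codePointCount(0, s.length)

      def main(args: Array[String]): Unit = {
        val bmp = "\uF93D\uF936\uF949\uF942"           // 4 chars, 4 code points
        val supplementary = "\uD840\uDC0B\uD842\uDFB7" // 4 chars, 2 surrogate pairs
        println(codePoints(bmp))            // 4
        println(codePoints(supplementary))  // 2
      }
    }
    ```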
    
    3) the length of the byte array obtained by encoding the string in some charset
    To get this, use [`String#getBytes(charset)`](http://docs.oracle.com/javase/7/docs/api/java/lang/String.html#getBytes(java.lang.String))`.length` like:
    
    ```scala
    scala> "\uF93D\uF936\uF949\uF942".getBytes("utf8").length // chinese characters
    res0: Int = 12
    
    scala> "\uD840\uDC0B\uD842\uDFB7".getBytes("utf8").length // 2 surrogate pairs
    res1: Int = 8
    
    scala> "1234567890ABC".getBytes("utf8").length
    res2: Int = 13
    
    scala> "\uF93D\uF936\uF949\uF942".getBytes("utf16").length // chinese characters
    res3: Int = 10
    
    scala> "\uD840\uDC0B\uD842\uDFB7".getBytes("utf16").length // 2 surrogate pairs
    res4: Int = 10
    
    scala> "1234567890ABC".getBytes("utf16").length
    res5: Int = 28
    
    scala> "\uF93D\uF936\uF949\uF942".getBytes("utf32").length // chinese characters
    res6: Int = 16
    
    scala> "\uD840\uDC0B\uD842\uDFB7".getBytes("utf32").length // 2 surrogate pairs
    res7: Int = 8
    
    scala> "1234567890ABC".getBytes("utf32").length
    res8: Int = 52
    ```
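    
    The `utf16` results above are each 2 bytes larger than twice the `String#length`, because Java's `UTF-16` charset writes a byte order mark when encoding; the `utf32` results show no such extra bytes. As a sketch of that difference, `UTF-16BE` fixes the byte order and emits no mark. The names `Utf16ByteLength`, `withBom` and `withoutBom` are hypothetical:
    
    ```scala
    import java.nio.charset.StandardCharsets

    object Utf16ByteLength {
      // "UTF-16" encodes big-endian and prepends a 2-byte byte order mark.
      def withBom(s: String): Int = s.getBytes(StandardCharsets.UTF_16).length

      // "UTF-16BE" fixes the byte order, so no byte order mark is written.
      def withoutBom(s: String): Int = s.getBytes(StandardCharsets.UTF_16BE).length

      def main(args: Array[String]): Unit = {
        val s = "1234567890ABC"  // 13 chars
        println(withBom(s))      // 28 = 2 (mark) + 13 * 2
        println(withoutBom(s))   // 26 = 13 * 2
      }
    }
    ```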
    
    At first I guessed you wanted 3) for `Strlen`, because 3) is the only charset-dependent length, but I saw a conversation suggesting another type of "length" and lost track of it halfway.


