GitHub user javadba commented on the pull request:

    https://github.com/apache/spark/pull/1586#issuecomment-50484674
  
    @ueshin  Thanks for the thoughtful reviews. Specifically, for the length() function we should emulate Hive. Here is the Hive logic converted to Scala:
    
        def len(s: String): Any = {
          if (s == null) {
            null
          } else {
            // A byte is a UTF-8 start byte unless it is a continuation byte (10xxxxxx).
            @inline def isUtfStartByte(b: Byte) = (b & 0xC0) != 0x80
            // Note: getBytes() with no argument uses the platform default charset.
            s.getBytes.foldLeft(0) { (cnt, b) =>
              cnt + (if (isUtfStartByte(b)) 1 else 0)
            }
          }
        }
    
    So

        len("\u0123abc")

    returns 4 (assuming a UTF-8 default charset, "\u0123" encodes to two bytes, only one of which is a start byte, plus one byte per ASCII character).
    
    The length() method should be pretty uncontroversial at this point.
    
    Now the strlen() method still requires some judgment. The intention is to be able to obtain different string lengths based on the provided encoding.
    
    Here is my proposed canonical example: 
    
        checkEvaluation(Strlen(Literal("\uF93D\uF936\uF949\uF942", StringType), "UTF-8"), 4)
        checkEvaluation(Strlen(Literal("\uF93D\uF936\uF949\uF942", StringType), "UTF-16"), 2)
        checkEvaluation(Strlen(Literal("\uF93D\uF936\uF949\uF942", StringType), "UTF-32"), 1)
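
    For context, the bytes themselves can be obtained platform-independently by passing the charset name to getBytes explicitly (plain JDK behavior, not part of this PR; UTF-32 availability depends on the JDK's extended charsets):

        val s = "\uF93D\uF936\uF949\uF942"
        val utf8Bytes  = s.getBytes("UTF-8")   // 3 bytes per character here, 12 total
        val utf16Bytes = s.getBytes("UTF-16")  // 2 bytes per character plus a 2-byte BOM
        val utf32Bytes = s.getBytes("UTF-32")  // 4 bytes per character

    That only pins down the byte representation, though, not what the "length" of each representation should mean.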
    
    So the outstanding question is:
    
    How do we do a platform-independent conversion of the representation (bytes?) of the string "\uF93D\uF936\uF949\uF942" to obtain the above results?
    
    The len() helper function above does NOT behave platform-independently: getBytes() with no argument uses the platform default charset, so on my Ubuntu machine it gives the answer 3, whereas the CentOS VM gives 1.
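
    If we do go the start-byte-counting route, a minimal sketch of a charset-explicit variant (a hypothetical helper, not what this PR implements today) would be:

        import java.nio.charset.Charset

        def strlen(s: String, encoding: String): Any = {
          if (s == null) {
            null
          } else {
            @inline def isUtfStartByte(b: Byte) = (b & 0xC0) != 0x80
            // The encoding is named explicitly, so the byte representation (and hence
            // the count) no longer depends on the platform default charset.
            s.getBytes(Charset.forName(encoding)).count(isUtfStartByte)
          }
        }

    That removes the platform dependence, but whether counting start bytes of the UTF-16/UTF-32 representations actually yields the 2 and 1 expected above is exactly the part that still needs judgment.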
    
    


