Github user xuejianbest commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22048#discussion_r214778257
  
    --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
    @@ -2794,6 +2794,30 @@ private[spark] object Utils extends Logging {
           }
         }
       }
    +
    +  /**
    +   * Regular expression matching full width characters
    +   */
    +  private val fullWidthRegex = ("""[""" +
    +    // scalastyle:off nonascii
    +    """\u1100-\u115F""" +
    +    """\u2E80-\uA4CF""" +
    +    """\uAC00-\uD7A3""" +
    +    """\uF900-\uFAFF""" +
    +    """\uFE10-\uFE19""" +
    +    """\uFE30-\uFE6F""" +
    +    """\uFF00-\uFF60""" +
    +    """\uFFE0-\uFFE6""" +
    --- End diff --
    
    > Can you describe them there and put a reference to a public Unicode document?
    
    The regular expression matches Unicode code points in the decoded string, so it does not depend on the encoding the bytes originally used.
    For example, the following bytes are encoded in GBK rather than UTF-8, and the match still works:
    `
        val bytes = Array[Byte](0xd6.toByte, 0xd0.toByte, 0xB9.toByte, 0xFA.toByte)
        val s1 = new String(bytes, "gbk")
        
        println(s1) //中国
        
        val fullWidthRegex = ("""[""" +
        // scalastyle:off nonascii
        """\u1100-\u115F""" +
        """\u2E80-\uA4CF""" +
        """\uAC00-\uD7A3""" +
        """\uF900-\uFAFF""" +
        """\uFE10-\uFE19""" +
        """\uFE30-\uFE6F""" +
        """\uFF00-\uFF60""" +
        """\uFFE0-\uFFE6""" +
        // scalastyle:on nonascii
        """]""").r
        
        println(fullWidthRegex.findAllIn(s1).size) //2
    `
    The ranges in this regular expression were obtained experimentally, by checking which characters render at full width under a specific font.
    I don't understand what you are asking for.
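
    To make the intent of these ranges concrete, here is a minimal sketch of how such a regex could be used to compute a display width, counting each full-width character as two columns. The object name `DisplayWidthSketch` and the helpers `displayWidth` and `padToWidth` are hypothetical names chosen for illustration; they are not taken from the patch.

```scala
object DisplayWidthSketch {
  // Same character ranges as in the diff above.
  private val fullWidthRegex = ("""[""" +
    // scalastyle:off nonascii
    """\u1100-\u115F""" +
    """\u2E80-\uA4CF""" +
    """\uAC00-\uD7A3""" +
    """\uF900-\uFAFF""" +
    """\uFE10-\uFE19""" +
    """\uFE30-\uFE6F""" +
    """\uFF00-\uFF60""" +
    """\uFFE0-\uFFE6""" +
    // scalastyle:on nonascii
    """]""").r

  // Hypothetical helper: every character counts as one column,
  // and each full-width match adds one more.
  def displayWidth(s: String): Int =
    if (s == null) 0 else s.length + fullWidthRegex.findAllIn(s).size

  // Hypothetical helper: right-pad with spaces up to `width` display columns.
  def padToWidth(s: String, width: Int): String =
    s + " " * math.max(0, width - displayWidth(s))

  def main(args: Array[String]): Unit = {
    val s1 = "\u4e2d\u56fd" // the same two CJK characters as in the example above
    println(displayWidth("abc"))
    println(displayWidth(s1))
    println("[" + padToWidth(s1, 6) + "]")
  }
}
```

    Padding by display width rather than by `String.length` is what keeps table columns aligned when cells mix half-width and full-width characters.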
    
    
    > How about some additional overheads when calling showString as compared to showString w/o this patch?
    
    I tested with a Dataset of 100 rows and two columns: one column is the index (0-99) and the other is a random string of 100 characters. I then called showString on it with and without this patch.
    The original showString (w/o this patch) took about 42 ms and the patched version took about 46 ms, so the patched version is roughly 10% slower.
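
    For anyone who wants to get a rough feel for the extra cost of the regex scan by itself, the following is a sketch of a standalone micro-benchmark. It is not the measurement described above: it does not use Spark or showString, and the names `WidthOverheadSketch`, `halfWidth`, `randomRows` and `timeMs` are hypothetical. A single un-warmed run on the JVM is noisy, so treat the printed numbers only as an indication.

```scala
import scala.util.Random

object WidthOverheadSketch {
  // Same character ranges as in the diff above.
  private val fullWidthRegex = ("""[""" +
    // scalastyle:off nonascii
    """\u1100-\u115F""" +
    """\u2E80-\uA4CF""" +
    """\uAC00-\uD7A3""" +
    """\uF900-\uFAFF""" +
    """\uFE10-\uFE19""" +
    """\uFE30-\uFE6F""" +
    """\uFF00-\uFF60""" +
    """\uFFE0-\uFFE6""" +
    // scalastyle:on nonascii
    """]""").r

  // Hypothetical helper: string length plus one extra column per full-width match.
  def halfWidth(s: String): Int = s.length + fullWidthRegex.findAllIn(s).size

  // Builds `n` random alphanumeric strings of `len` characters each.
  def randomRows(n: Int, len: Int, seed: Long): Seq[String] = {
    val rnd = new Random(seed)
    Seq.fill(n)(rnd.alphanumeric.take(len).mkString)
  }

  // Evaluates `body` once and returns its result with the elapsed milliseconds.
  def timeMs[A](body: => A): (A, Double) = {
    val start = System.nanoTime()
    val result = body
    (result, (System.nanoTime() - start) / 1e6)
  }

  def main(args: Array[String]): Unit = {
    val rows = randomRows(100, 100, seed = 42L)
    val (plain, plainMs) = timeMs(rows.map(_.length).sum)
    val (wide, wideMs) = timeMs(rows.map(halfWidth).sum)
    println(f"length only:  $plain%d columns in $plainMs%.3f ms")
    println(f"regex width:  $wide%d columns in $wideMs%.3f ms")
  }
}
```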

