[ https://issues.apache.org/jira/browse/SPARK-9213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696325#comment-14696325 ]
Yadong Qi commented on SPARK-9213:
----------------------------------

1. Use Joni regex instead of Java regex. I only need to replace the Java regex function calls.
2. Add Joni regex and keep Java regex. I need to define an abstract engine trait (JavaRegexEngine/JoniRegexEngine). Because the return types of their functions differ (for example, matcher() returns a Matcher in both, but one is java.util.regex.Matcher and the other is org.joni.Matcher), I need to rework some of the code.

I'll try option 2 first, similar to the Java/Kryo choice for serialization.

> Improve regular expression performance (via joni)
> -------------------------------------------------
>
>                 Key: SPARK-9213
>                 URL: https://issues.apache.org/jira/browse/SPARK-9213
>             Project: Spark
>          Issue Type: Umbrella
>          Components: SQL
>            Reporter: Reynold Xin
>
> I'm creating an umbrella ticket to improve regular expression performance for
> string expressions. Right now our use of regular expressions is inefficient
> for two reasons:
> 1. Java regex in general is slow.
> 2. We have to convert everything from UTF8 encoded bytes into Java String,
> and then run regex on it, and then convert it back.
> There are libraries in Java that provide regex support directly on UTF8
> encoded bytes. One prominent example is joni, used in JRuby.
> Note: all regex functions are in
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala
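
As a rough illustration of option 2 in the comment above, the mismatch between java.util.regex.Matcher and org.joni.Matcher could be hidden behind an engine-neutral interface. The sketch below is only an assumption about how such a layer might look, not the code proposed for Spark: the names RegexEngine, CompiledRegex and find are hypothetical, and it is written as a Java interface rather than a Scala trait. Only the Java-backed engine is shown so the example runs on the JDK alone; a Joni-backed engine would implement the same interface and could match on the UTF-8 bytes directly.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

// Hypothetical engine-neutral entry point. Callers never see
// java.util.regex.Matcher or org.joni.Matcher, only these two types.
interface RegexEngine {
    CompiledRegex compile(String pattern);
}

// Hypothetical compiled pattern that works on UTF-8 encoded bytes.
interface CompiledRegex {
    // Returns true if the pattern matches anywhere in the input.
    boolean find(byte[] utf8);
}

// Engine backed by java.util.regex. It has to decode the bytes into a
// String before matching, which is the conversion cost the ticket describes.
final class JavaRegexEngine implements RegexEngine {
    @Override
    public CompiledRegex compile(String pattern) {
        Pattern compiled = Pattern.compile(pattern);
        return utf8 -> compiled
                .matcher(new String(utf8, StandardCharsets.UTF_8))
                .find();
    }
}

final class RegexEngineDemo {
    public static void main(String[] args) {
        // Swapping engines would only change this one line.
        RegexEngine engine = new JavaRegexEngine();
        CompiledRegex regex = engine.compile("jo+ni");
        byte[] input = "used in JRuby: joni".getBytes(StandardCharsets.UTF_8);
        System.out.println(regex.find(input));
    }
}
```

With this shape, the choice of engine is made in one place, much like selecting a serializer, and the string expressions only depend on the shared interface.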