[ https://issues.apache.org/jira/browse/SPARK-9213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696478#comment-14696478 ]
Reynold Xin commented on SPARK-9213: ------------------------------------ [~waterman] I don't think we need to worry too much about code reuse here. In the long run, I think we only need one implementation. However, in the short run, it'd be great to be able to feature flag the regular expression engine based on a SQL config flag. In order to do that, we can have two classes for each regex function. > Improve regular expression performance (via joni) > ------------------------------------------------- > > Key: SPARK-9213 > URL: https://issues.apache.org/jira/browse/SPARK-9213 > Project: Spark > Issue Type: Umbrella > Components: SQL > Reporter: Reynold Xin > > I'm creating an umbrella ticket to improve regular expression performance for > string expressions. Right now our use of regular expressions is inefficient > for two reasons: > 1. Java regex in general is slow. > 2. We have to convert everything from UTF8 encoded bytes into Java String, > and then run regex on it, and then convert it back. > There are libraries in Java that provide regex support directly on UTF8 > encoded bytes. One prominent example is joni, used in JRuby. > Note: all regex functions are in > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org