[ https://issues.apache.org/jira/browse/SPARK-9213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17504016#comment-17504016 ]
tonydoen edited comment on SPARK-9213 at 3/10/22, 5:58 AM:
-----------------------------------------------------------

[~rxin] [~waterman] [~mridulm80] Hi, we recently ran into some problems with regex. I would like to continue the work on this (via joni). Any suggestions would be appreciated.


was (Author: JIRAUSER285351):
[~rxin] [~waterman] [~mridulm80]


> Improve regular expression performance (via joni)
> -------------------------------------------------
>
>                 Key: SPARK-9213
>                 URL: https://issues.apache.org/jira/browse/SPARK-9213
>             Project: Spark
>          Issue Type: Umbrella
>          Components: SQL
>            Reporter: Reynold Xin
>            Priority: Major
>              Labels: bulk-closed
>
> I'm creating an umbrella ticket to improve regular expression performance for
> string expressions. Right now our use of regular expressions is inefficient
> for two reasons:
> 1. Java regex in general is slow.
> 2. We have to convert everything from UTF8 encoded bytes into Java String,
> and then run regex on it, and then convert it back.
> There are libraries in Java that provide regex support directly on UTF8
> encoded bytes. One prominent example is joni, used in JRuby.
> Note: all regex functions are in
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala