Oscar Brück created SPARK-31991:
-----------------------------------

             Summary: Regexp_replace causing problems
                 Key: SPARK-31991
                 URL: https://issues.apache.org/jira/browse/SPARK-31991
             Project: Spark
          Issue Type: Bug
          Components: Java API
    Affects Versions: 2.4.5
            Reporter: Oscar Brück


Chained regexp_replace calls in SparkR give different results depending on which columns the intermediate results are written to. Here's a reprex.

 
{code:r}
# Load packages
library(tidyverse)
library(sparklyr)
library(SparkR)

# Connect to Spark. The original report does not show how sc was created;
# a local sparklyr connection is assumed here.
sc <- sparklyr::spark_connect(master = "local")

# Create data
df <- data.frame(test = c("Less 2", "A1,2", "Over 2", "Resp1 1aa"))
# Transfer data to Spark memory
df <- copy_to(sc, df, "df", overwrite = TRUE)

# Variant 1: spread the replacement steps over test1, test2 and test3
df1 <- df %>%
    dplyr::mutate(
        test = as.character(test),
        test1 = regexp_replace(test, "Less ", "<"),
        test1 = regexp_replace(test1, "A1", "<1"),
        test1 = regexp_replace(test1, "Over ", ">"),
        test2 = regexp_replace(test1, "[a-zA-Z]+", ""),
        test3 = regexp_replace(test2, "[\\,]", "aa"),
        test3 = regexp_replace(test3, " ", ""))

# Variant 2: same steps, but every step overwrites the column test
df2 <- df %>%
    dplyr::mutate(
        test = as.character(test),
        test = regexp_replace(test, "Less ", "<"),
        test = regexp_replace(test, "A1", "<1"),
        test = regexp_replace(test, "Over ", ">"),
        test = regexp_replace(test, "[a-zA-Z]+", ""),
        test = regexp_replace(test, "[\\,]", "aa"),
        test = regexp_replace(test, " ", ""))

# Collect and print
df1_1 <- df1 %>% as.data.frame()
df1_1
df2_1 <- df2 %>% as.data.frame()
df2_1
{code}
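The column test3 in df1_1 is correct, but the column test in df2_1 is not, 
even though the input, the regex patterns and the replacements are identical. 
The only difference is that df2 writes every step back to test, whereas df1 
spreads the steps over test1, test2 and test3.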
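
For reference, the result both variants should produce can be sketched locally, without Spark. This is only an illustration in base R: it uses gsub rather than regexp_replace, and it assumes that R's regex engine and Java's agree on these simple patterns. The names x, steps and expected are made up for the example.

```r
# Same input as the reprex, kept as a plain local character vector
x <- c("Less 2", "A1,2", "Over 2", "Resp1 1aa")

# The six (pattern, replacement) pairs, in the order used in the reprex
steps <- list(
    c("Less ", "<"),
    c("A1", "<1"),
    c("Over ", ">"),
    c("[a-zA-Z]+", ""),
    c("[\\,]", "aa"),
    c(" ", "")
)

# Apply every step to the result of the previous one
expected <- x
for (s in steps) {
    expected <- gsub(s[1], s[2], expected)
}

print(expected)
# [1] "<2"    "<1aa2" ">2"    "11"
```

Both test3 in df1_1 and test in df2_1 would be expected to match this vector.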

I find SparkR really great, but I would prefer to use R regex instead of Java 
regex.

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
