Oscar Brück created SPARK-31991:
-----------------------------------

             Summary: Regexp_replace causing problems
                 Key: SPARK-31991
                 URL: https://issues.apache.org/jira/browse/SPARK-31991
             Project: Spark
          Issue Type: Bug
          Components: Java API
    Affects Versions: 2.4.5
            Reporter: Oscar Brück
The SparkR regex functions are not working as they should. Here's a reprex.

{code:r}
# Load packages
library(tidyverse)
library(sparklyr)
library(SparkR)

# Connect to Spark (a local master is assumed; the connection was not shown)
sc <- spark_connect(master = "local")

# Create data
df <- data.frame(test = c("Less 2", "A1,2", "Over 2", "Resp1 1aa"))

# Transfer data to Spark memory
df <- copy_to(sc, df, "df", overwrite = TRUE)

# Modify data: each replacement writes to a new column
df1 <- df %>%
  dplyr::mutate(
    test = as.character(test),
    test1 = regexp_replace(test, "Less ", "<"),
    test1 = regexp_replace(test1, "A1", "<1"),
    test1 = regexp_replace(test1, "Over ", ">"),
    test2 = regexp_replace(test1, "[a-zA-Z]+", ""),
    test3 = regexp_replace(test2, "[\\,]", "aa"),
    test3 = regexp_replace(test3, " ", ""))

# Same replacements, but every step overwrites the same column
df2 <- df %>%
  dplyr::mutate(
    test = as.character(test),
    test = regexp_replace(test, "Less ", "<"),
    test = regexp_replace(test, "A1", "<1"),
    test = regexp_replace(test, "Over ", ">"),
    test = regexp_replace(test, "[a-zA-Z]+", ""),
    test = regexp_replace(test, "[\\,]", "aa"),
    test = regexp_replace(test, " ", ""))

# Collect and print
df1_1 <- df1 %>% as.data.frame()
df1_1
df2_1 <- df2 %>% as.data.frame()
df2_1
{code}

The column test3 in df1_1 is correct, but the column test in df2_1 is not, although the input, the regex patterns and the replacements are identical. I find SparkR really great, but I would be eager to use R regex instead of Java regex.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
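For reference, Spark's {{regexp_replace}} is backed by {{java.util.regex}}, so the result the chained replacements should produce can be checked in plain Java. A standalone sketch (class and method names are illustrative, not part of any Spark API):

{code:java}
import java.util.List;

public class RegexChainCheck {

    // Apply the reprex's six replacements in order, using Java regex
    // semantics (the engine behind Spark's regexp_replace).
    static String applyChain(String s) {
        return s.replaceAll("Less ", "<")
                .replaceAll("A1", "<1")
                .replaceAll("Over ", ">")
                .replaceAll("[a-zA-Z]+", "")
                .replaceAll("[\\,]", "aa")  // "[\\,]" is the regex [\,]: a literal comma
                .replaceAll(" ", "");
    }

    public static void main(String[] args) {
        for (String s : List.of("Less 2", "A1,2", "Over 2", "Resp1 1aa")) {
            System.out.println(s + " -> " + applyChain(s));
        }
        // Prints:
        // Less 2 -> <2
        // A1,2 -> <1aa2
        // Over 2 -> >2
        // Resp1 1aa -> 11
    }
}
{code}

These values are what the test3 column of df1_1 contains, i.e. the output described above as correct; the test column of df2_1 diverges from them even though the same patterns are applied.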