[ https://issues.apache.org/jira/browse/SPARK-31991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Oscar Brück updated SPARK-31991:
--------------------------------
    Description: 
The SparkR regex functions in R are not working as they should. Here's a reprex.

{code:java}
# Load packages
library(tidyverse)
library(sparklyr)
library(SparkR)

# Connect to Spark
sc <- spark_connect(master = "local")

# Create data
df <- data.frame(test = c("Less 2", "A1,2", "Over 2", "Resp1 1aa"))

# Transfer data to Spark memory
df <- copy_to(sc, df, "df", overwrite = TRUE)

# Modify data: build the result in intermediate columns test1-test3
df1 <- df %>%
  dplyr::mutate(
    test = as.character(test),
    test1 = regexp_replace(test, "Less ", "<"),
    test1 = regexp_replace(test1, "A1", "<1"),
    test1 = regexp_replace(test1, "Over ", ">"),
    test2 = regexp_replace(test1, "[a-zA-Z]+", ""),
    test3 = regexp_replace(test2, "[\\,]", "aa"),
    test3 = regexp_replace(test3, " ", ""))

# Modify data: overwrite the same column test at every step
df2 <- df %>%
  dplyr::mutate(
    test = as.character(test),
    test = regexp_replace(test, "Less ", "<"),
    test = regexp_replace(test, "A1", "<1"),
    test = regexp_replace(test, "Over ", ">"),
    test = regexp_replace(test, "[a-zA-Z]+", ""),
    test = regexp_replace(test, "[\\,]", "aa"),
    test = regexp_replace(test, " ", ""))

# Collect and print
df1_1 <- df1 %>% as.data.frame()
df1_1
df2_1 <- df2 %>% as.data.frame()
df2_1
{code}
The column test3 in df1_1 is correct, but the column test in df2_1 is not, although the input, the regex patterns, and the replacements are identical.

I find SparkR really great, but would be eager to use R regex instead of Java regex.

  was:
The SparkR regex functions are not working as they should. Here's a reprex.
{code:java}
# Load packages
library(tidyverse)
library(sparklyr)
library(SparkR)

# Create data
df <- data.frame(test = c("Less 2", "A1,2", "Over 2", "Resp1 1aa"))

# Transfer data to Spark memory
df <- copy_to(sc, df, "df", overwrite = TRUE)

# Modify data
df1 <- df %>%
  dplyr::mutate(
    test = as.character(test),
    test1 = regexp_replace(test, "Less ", "<"),
    test1 = regexp_replace(test1, "A1", "<1"),
    test1 = regexp_replace(test1, "Over ", ">"),
    test2 = regexp_replace(test1, "[a-zA-Z]+", ""),
    test3 = regexp_replace(test2, "[\\,]", "aa"),
    test3 = regexp_replace(test3, " ", ""))

df2 <- df %>%
  dplyr::mutate(
    test = as.character(test),
    test = regexp_replace(test, "Less ", "<"),
    test = regexp_replace(test, "A1", "<1"),
    test = regexp_replace(test, "Over ", ">"),
    test = regexp_replace(test, "[a-zA-Z]+", ""),
    test = regexp_replace(test, "[\\,]", "aa"),
    test = regexp_replace(test, " ", ""))

# Collect and print
df1_1 <- df1 %>% as.data.frame()
df1_1
df2_1 <- df2 %>% as.data.frame()
df2_1
{code}
The column test3 in df1_1 is correct but the column test in df2_1 is not, although the input, regex patterns and the replacements are identical.

I find SparkR really great, but would be eager to use R regex instead of java regex.

        Summary: The SparkR regexp_replace function causes problems  (was: Regexp_replace causing problems)

> The SparkR regexp_replace function causes problems
> --------------------------------------------------
>
>                  Key: SPARK-31991
>                  URL: https://issues.apache.org/jira/browse/SPARK-31991
>              Project: Spark
>           Issue Type: Bug
>           Components: Java API
>     Affects Versions: 2.4.5
>             Reporter: Oscar Brück
>             Priority: Major
>
> The SparkR regex functions in R are not working as they should. Here's a reprex.
>
> {code:java}
> # Load packages
> library(tidyverse)
> library(sparklyr)
> library(SparkR)
>
> # Connect to Spark
> sc <- spark_connect(master = "local")
>
> # Create data
> df <- data.frame(test = c("Less 2", "A1,2", "Over 2", "Resp1 1aa"))
>
> # Transfer data to Spark memory
> df <- copy_to(sc, df, "df", overwrite = TRUE)
>
> # Modify data: build the result in intermediate columns test1-test3
> df1 <- df %>%
>   dplyr::mutate(
>     test = as.character(test),
>     test1 = regexp_replace(test, "Less ", "<"),
>     test1 = regexp_replace(test1, "A1", "<1"),
>     test1 = regexp_replace(test1, "Over ", ">"),
>     test2 = regexp_replace(test1, "[a-zA-Z]+", ""),
>     test3 = regexp_replace(test2, "[\\,]", "aa"),
>     test3 = regexp_replace(test3, " ", ""))
>
> # Modify data: overwrite the same column test at every step
> df2 <- df %>%
>   dplyr::mutate(
>     test = as.character(test),
>     test = regexp_replace(test, "Less ", "<"),
>     test = regexp_replace(test, "A1", "<1"),
>     test = regexp_replace(test, "Over ", ">"),
>     test = regexp_replace(test, "[a-zA-Z]+", ""),
>     test = regexp_replace(test, "[\\,]", "aa"),
>     test = regexp_replace(test, " ", ""))
>
> # Collect and print
> df1_1 <- df1 %>% as.data.frame()
> df1_1
> df2_1 <- df2 %>% as.data.frame()
> df2_1
> {code}
> The column test3 in df1_1 is correct, but the column test in df2_1 is not, although the input, the regex patterns, and the replacements are identical.
>
> I find SparkR really great, but would be eager to use R regex instead of Java regex.
>
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
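As a cross-check of the intended semantics of the reprex above (not part of the original report): the six (pattern, replacement) steps can be applied strictly in sequence outside Spark. This minimal Python sketch uses re.sub, whose behavior matches Java regex for these simple patterns, and shows the values both test3 and test should contain if each replacement sees the previous step's output:

```python
import re

# The six (pattern, replacement) steps from the reprex, applied in order
steps = [
    ("Less ", "<"),
    ("A1", "<1"),
    ("Over ", ">"),
    ("[a-zA-Z]+", ""),
    (r"[\,]", "aa"),
    (" ", ""),
]

values = ["Less 2", "A1,2", "Over 2", "Resp1 1aa"]

result = []
for v in values:
    for pattern, replacement in steps:
        v = re.sub(pattern, replacement, v)
    result.append(v)

print(result)  # ['<2', '<1aa2', '>2', '11']
```

Under sequential semantics both df1_1$test3 and df2_1$test should equal this result; the reported bug is that the overwrite-in-place variant (df2) does not.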