[ https://issues.apache.org/jira/browse/SPARK-31991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Oscar Brück updated SPARK-31991:
--------------------------------
    Description: 
The SparkR regex functions in R are not working as they should. Here's a reprex.

{code:java}
# Load packages
library(tidyverse)
library(sparklyr)
library(SparkR)

# Connect to Spark
sc <- spark_connect(master = "local")

# Create data
df <- data.frame(test = c("Less 2", "A1,2", "Over 2", "Resp1 1aa"))

# Transfer data to Spark memory
df <- copy_to(sc, df, "df", overwrite = TRUE)

# Modify data: build the result in intermediate columns test1-test3
df1 <- df %>%
  dplyr::mutate(
    test = as.character(test),
    test1 = regexp_replace(test, "Less ", "<"),
    test1 = regexp_replace(test1, "A1", "<1"),
    test1 = regexp_replace(test1, "Over ", ">"),
    test2 = regexp_replace(test1, "[a-zA-Z]+", ""),
    test3 = regexp_replace(test2, "[\\,]", "aa"),
    test3 = regexp_replace(test3, " ", ""))

# Modify data: overwrite the same column test at every step
df2 <- df %>%
  dplyr::mutate(
    test = as.character(test),
    test = regexp_replace(test, "Less ", "<"),
    test = regexp_replace(test, "A1", "<1"),
    test = regexp_replace(test, "Over ", ">"),
    test = regexp_replace(test, "[a-zA-Z]+", ""),
    test = regexp_replace(test, "[\\,]", "aa"),
    test = regexp_replace(test, " ", ""))

# Collect and print
df1_1 <- df1 %>% as.data.frame()
df1_1
df2_1 <- df2 %>% as.data.frame()
df2_1
{code}
The column test3 in df1_1 is correct, but the column test in df2_1 is not, although the input, the regex patterns, and the replacements are identical.

I find SparkR really great, but would be eager to use R regex instead of Java regex.

  was:
The SparkR regex functions are not working as they should. Here's a reprex.
{code:java}
# Load packages
library(tidyverse)
library(sparklyr)
library(SparkR)

# Create data
df <- data.frame(test = c("Less 2", "A1,2", "Over 2", "Resp1 1aa"))

# Transfer data to Spark memory
df <- copy_to(sc, df, "df", overwrite = TRUE)

# Modify data
df1 <- df %>%
  dplyr::mutate(
    test = as.character(test),
    test1 = regexp_replace(test, "Less ", "<"),
    test1 = regexp_replace(test1, "A1", "<1"),
    test1 = regexp_replace(test1, "Over ", ">"),
    test2 = regexp_replace(test1, "[a-zA-Z]+", ""),
    test3 = regexp_replace(test2, "[\\,]", "aa"),
    test3 = regexp_replace(test3, " ", ""))

df2 <- df %>%
  dplyr::mutate(
    test = as.character(test),
    test = regexp_replace(test, "Less ", "<"),
    test = regexp_replace(test, "A1", "<1"),
    test = regexp_replace(test, "Over ", ">"),
    test = regexp_replace(test, "[a-zA-Z]+", ""),
    test = regexp_replace(test, "[\\,]", "aa"),
    test = regexp_replace(test, " ", ""))

# Collect and print
df1_1 <- df1 %>% as.data.frame()
df1_1
df2_1 <- df2 %>% as.data.frame()
df2_1
{code}
The column test3 in df1_1 is correct but the column test in df2_1 is not, although the input, regex patterns and the replacements are identical.

I find SparkR really great, but would be eager to use R regex instead of java regex.

        Summary: The SparkR regexp_replace function causes problems  (was: Regexp_replace causing problems)

> The SparkR regexp_replace function causes problems
> --------------------------------------------------
>
>                  Key: SPARK-31991
>                  URL: https://issues.apache.org/jira/browse/SPARK-31991
>              Project: Spark
>           Issue Type: Bug
>           Components: Java API
>     Affects Versions: 2.4.5
>             Reporter: Oscar Brück
>             Priority: Major
>
> The SparkR regex functions in R are not working as they should. Here's a reprex.
>
> {code:java}
> # Load packages
> library(tidyverse)
> library(sparklyr)
> library(SparkR)
>
> # Connect to Spark
> sc <- spark_connect(master = "local")
>
> # Create data
> df <- data.frame(test = c("Less 2", "A1,2", "Over 2", "Resp1 1aa"))
>
> # Transfer data to Spark memory
> df <- copy_to(sc, df, "df", overwrite = TRUE)
>
> # Modify data: build the result in intermediate columns test1-test3
> df1 <- df %>%
>   dplyr::mutate(
>     test = as.character(test),
>     test1 = regexp_replace(test, "Less ", "<"),
>     test1 = regexp_replace(test1, "A1", "<1"),
>     test1 = regexp_replace(test1, "Over ", ">"),
>     test2 = regexp_replace(test1, "[a-zA-Z]+", ""),
>     test3 = regexp_replace(test2, "[\\,]", "aa"),
>     test3 = regexp_replace(test3, " ", ""))
>
> # Modify data: overwrite the same column test at every step
> df2 <- df %>%
>   dplyr::mutate(
>     test = as.character(test),
>     test = regexp_replace(test, "Less ", "<"),
>     test = regexp_replace(test, "A1", "<1"),
>     test = regexp_replace(test, "Over ", ">"),
>     test = regexp_replace(test, "[a-zA-Z]+", ""),
>     test = regexp_replace(test, "[\\,]", "aa"),
>     test = regexp_replace(test, " ", ""))
>
> # Collect and print
> df1_1 <- df1 %>% as.data.frame()
> df1_1
> df2_1 <- df2 %>% as.data.frame()
> df2_1
> {code}
> The column test3 in df1_1 is correct, but the column test in df2_1 is not, although the input, the regex patterns, and the replacements are identical.
>
> I find SparkR really great, but would be eager to use R regex instead of Java regex.
>
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
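As a cross-check of the intended semantics of the reprex above (not part of the original report): the six (pattern, replacement) steps can be applied strictly in sequence outside Spark. This minimal Python sketch uses re.sub, whose behavior matches Java regex for these simple patterns, and shows the values both test3 and test should contain if each replacement sees the previous step's output:

```python
import re

# The six (pattern, replacement) steps from the reprex, applied in order
steps = [
    ("Less ", "<"),
    ("A1", "<1"),
    ("Over ", ">"),
    ("[a-zA-Z]+", ""),
    (r"[\,]", "aa"),
    (" ", ""),
]

values = ["Less 2", "A1,2", "Over 2", "Resp1 1aa"]

result = []
for v in values:
    for pattern, replacement in steps:
        v = re.sub(pattern, replacement, v)
    result.append(v)

print(result)  # ['<2', '<1aa2', '>2', '11']
```

Under sequential semantics both df1_1$test3 and df2_1$test should equal this result; the reported bug is that the overwrite-in-place variant (df2) does not.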