GitHub user brkyvz opened a pull request:

    https://github.com/apache/spark/pull/16361

    [SPARK-18952] Regex strings not properly escaped in codegen for aggregations

    ## What changes were proposed in this pull request?
    
    When `regexp_extract` is given a regex string that contains the escape character `\`, code generation fails because the `\` is not escaped when the string is embedded in the generated Java code.
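
    For illustration, the pattern from the failing query can be exercised outside Spark with `java.util.regex`. This is only a sketch: the class name, the helper `extractFirstGroup`, and the sample input are hypothetical, and returning an empty string on no match is a choice made for this example. It shows that the regex holds a single backslash at runtime, which has to be written as `\\` in Java source.

    ```java
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class RegexBackslashDemo {
        // The regex from the failing query: letters followed by a literal '['.
        // At runtime the string contains ONE backslash; in Java source it must
        // be written as "\\", otherwise the compiler reports an invalid escape.
        static final String REGEX = "([a-zA-Z]+)\\[.*";

        // Hypothetical helper: returns capture group 1, or "" when nothing matches.
        static String extractFirstGroup(String input) {
            Matcher m = Pattern.compile(REGEX).matcher(input);
            return m.find() ? m.group(1) : "";
        }

        public static void main(String[] args) {
            // Made-up sample input for illustration only.
            System.out.println(extractFirstGroup("sshd[1234]: session opened"));
        }
    }
    ```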
    
    Example generated code and stack trace:
    ```
    /* 059 */     private int maxSteps = 2;
    /* 060 */     private int numRows = 0;
    /* 061 */     private org.apache.spark.sql.types.StructType keySchema = new org.apache.spark.sql.types.StructType().add("date_format(window#325.start, yyyy-MM-dd HH:mm)", org.apache.spark.sql.types.DataTypes.StringType)
    /* 062 */     .add("regexp_extract(source#310.description, ([a-zA-Z]+)\[.*, 1)", org.apache.spark.sql.types.DataTypes.StringType);
    /* 063 */     private org.apache.spark.sql.types.StructType valueSchema = new org.apache.spark.sql.types.StructType().add("sum", org.apache.spark.sql.types.DataTypes.LongType);
    /* 064 */     private Object emptyVBase;
    
    ...
    
    org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 62, Column 58: Invalid escape sequence
        at org.codehaus.janino.Scanner.scanLiteralCharacter(Scanner.java:918)
        at org.codehaus.janino.Scanner.produce(Scanner.java:604)
        at org.codehaus.janino.Parser.peekRead(Parser.java:3239)
        at org.codehaus.janino.Parser.parseArguments(Parser.java:3055)
        at org.codehaus.janino.Parser.parseSelector(Parser.java:2914)
        at org.codehaus.janino.Parser.parseUnaryExpression(Parser.java:2617)
        at org.codehaus.janino.Parser.parseMultiplicativeExpression(Parser.java:2573)
        at org.codehaus.janino.Parser.parseAdditiveExpression(Parser.java:2552)
    ```
    
    In the generated code, the string literal should contain `\\` instead of `\`.
    
    A similar problem was fixed for security reasons in https://github.com/apache/spark/pull/15156.
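
    The kind of escaping described above can be sketched as follows. This is a minimal illustration, not the code from this patch: `LiteralEscapeSketch` and `escapeForJavaLiteral` are hypothetical names, and the set of escaped characters is an assumption chosen for the example.

    ```java
    public class LiteralEscapeSketch {
        // Hypothetical helper: escapes a string so that it can be placed between
        // double quotes in generated Java source without changing its value.
        static String escapeForJavaLiteral(String s) {
            StringBuilder sb = new StringBuilder(s.length() + 8);
            for (int i = 0; i < s.length(); i++) {
                char c = s.charAt(i);
                switch (c) {
                    case '\\': sb.append("\\\\"); break; // one backslash becomes two
                    case '"':  sb.append("\\\""); break; // keep the literal from ending early
                    case '\n': sb.append("\\n");  break;
                    case '\r': sb.append("\\r");  break;
                    case '\t': sb.append("\\t");  break;
                    default:   sb.append(c);
                }
            }
            return sb.toString();
        }

        public static void main(String[] args) {
            // Field name taken from line 062 of the generated code above
            // (written here with "\\" because this is itself Java source).
            String fieldName = "regexp_extract(source#310.description, ([a-zA-Z]+)\\[.*, 1)";
            // Prints the .add(...) call as it would need to appear in generated code.
            System.out.println(".add(\"" + escapeForJavaLiteral(fieldName) + "\", ...)");
        }
    }
    ```

    With this escaping, the emitted literal contains `\\[`, which a Java compiler reads back as the original `\[`.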
    
    ## How was this patch tested?
    
    Regression test in `DataFrameAggregationSuite`

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/brkyvz/spark reg-break

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16361.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #16361
    
----
commit 88e29bb72a69bdc095f5af616b4505664599d22e
Author: Burak Yavuz <brk...@gmail.com>
Date:   2016-12-20T22:51:40Z

    Save

commit b8582048729a154339e9c24d7d9c055c47f0eb62
Author: Burak Yavuz <brk...@gmail.com>
Date:   2016-12-20T23:12:41Z

    Fixed

----

