GitHub user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22104

I mean, the current code will still break partitioned tables:

```
== Physical Plan ==
*(3) Project [_c0#223, pythonUDF0#231 AS v1#226]
+- BatchEvalPython [<lambda>(0)], [_c0#223, pythonUDF0#231]
   +- *(2) Project [_c0#223]
      +- *(2) Filter (pythonUDF0#230 = 0)
         +- BatchEvalPython [<lambda>(0)], [_c0#223, pythonUDF0#230]
            +- *(1) FileScan csv [_c0#223] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/tmp/tab3], PartitionFilters: [(<lambda>(0) = 0)], PushedFilters: [], ReadSchema: struct<_c0:string>
```

For instance:

```python
from pyspark.sql.functions import udf, lit, col

spark.range(1).selectExpr("id", "id as value").write.mode("overwrite").format('csv').partitionBy("id").save("/tmp/tab3")

df = spark.read.csv('/tmp/tab3')
df2 = df.withColumn('v1', udf(lambda x: x, 'int')(lit(0)))
df2 = df2.filter(df2['v1'] == 0)
df2.explain()
```