Github user MaxGekk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21296#discussion_r187604963

    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
    @@ -267,7 +267,7 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te
           .options(Map("header" -> "true", "mode" -> "dropmalformed"))
           .load(testFile(carsFile))

    -    assert(cars.select("year").collect().size === 2)
    +    assert(cars.collect().size === 2)
    --- End diff --

> it's intendedly parsed to keep the backword compatibility.

Right, by selecting all columns I force *UnivocityParser* to fall into the case handled here:
https://github.com/MaxGekk/spark-1/blob/a4a0a549156a15011c33c7877a35f244d75b7a4f/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala#L193-L213
where the number of returned tokens is less than required. In the case of `cars.select("year")`, the uniVocity parser returns only one token, as expected.

> There was an issue about the different number of counts.

The PR changes behavior for some malformed inputs, but I believe we could provide better performance for users who have correct inputs.

> I think you are basically saying cars.select("year").collect().size and cars.collect().size are different and they are correct, right?

Yes, you can say that. You are right; it seems the PR proposes another interpretation of malformed rows. `cars.select("year")` is:

```
+----+
|year|
+----+
|2012|
|1997|
|2015|
+----+
```

and we should not reject `2015` only because there are problems in columns that were not requested. In this particular case, the last row consists of only one value, at position `0`, and that value is correct.
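A minimal, self-contained sketch of the behavior being discussed (not part of the original comment): it assumes a hypothetical local file `cars.csv` whose last data row has a value only in the first (`year`) column, similar to the test fixture referenced in `CSVSuite`. With `DROPMALFORMED` mode, the count can differ depending on whether only `year` or all columns are required.

```scala
import org.apache.spark.sql.SparkSession

object DropMalformedSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DropMalformedSketch")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical input path; the file is assumed to contain a header row
    // and one row that is "short" (only the year value is present).
    val cars = spark.read
      .options(Map("header" -> "true", "mode" -> "dropmalformed"))
      .csv("cars.csv")

    // Only the "year" column is required here, so the short row can still
    // produce a token for that column and is kept.
    println(cars.select("year").count())

    // All columns are required here, so the short row yields fewer tokens
    // than required and is dropped under DROPMALFORMED.
    println(cars.count())

    spark.stop()
  }
}
```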