GitHub user MaxGekk commented on the issue:

    https://github.com/apache/spark/pull/20894
  
    > The case exists in all data source format, right?
    
    Not in all of them. For example, the JSON datasource is tolerant of field order in JSON records. Say you have the schema:
    ```
    import org.apache.spark.sql.types._

    val schema = new StructType().add("f1", IntegerType).add("f2", IntegerType)
    ```
    you can read files from the same folder whose fields appear in different orders:
    *1.json*
    ```
    {"f1":1, "f2":2}
    ```
    *2.json*
    ```
    {"f2":22, "f1":11}
    ```
    ```
    val df = spark.read.schema(schema).json("json-dir")
    df.show
    +---+---+
    | f1| f2|
    +---+---+
    | 11| 22|
    |  1|  2|
    +---+---+
    ```
    > If user didn't provide schema, should we check the header among CSV files?
    
    If the user didn't provide a schema, it will be inferred (proper types if `inferSchema` is set, otherwise string types for all columns). So, with these changes the inferred schema will also be verified against the actual CSV headers.
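
    To make this concrete, here is a minimal sketch of the read path where the verification would kick in (the directory name is hypothetical):
    ```
    // `header` makes Spark treat the first line of each file as column names,
    // and `inferSchema` triggers an extra pass to infer the column types.
    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("csv-dir")
    // With these changes, each file's header is compared to the inferred
    // schema, so a file with a reordered header is reported instead of
    // being silently misread.
    ```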
    
    > Users should be responsible for the specifying data schema. 
    
    Yes, but the schema can also be inferred, and then it can be checked during parsing.
    
    > The proposed behavior can only help users to avoid manually checking the CSV headers.
    
    Yes, this is the problem reported by our customers. They have multiple CSV files received from different sources, and some of the files have a different column order. Spark silently returns a wrong result. The expected behavior is either an error (with the file name) or a correct result (data must land in the right columns of the loaded dataframe, as in the JSON datasource); see the sketch below.
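
    For contrast with the JSON example above, here is a minimal sketch of the current failure mode (file contents and the directory name are illustrative; `spark` is a Spark shell session):
    ```
    // csv-dir/1.csv:        csv-dir/2.csv:
    //   f1,f2                 f2,f1
    //   1,2                   22,11
    import org.apache.spark.sql.types._

    val schema = new StructType().add("f1", IntegerType).add("f2", IntegerType)
    val df = spark.read.option("header", "true").schema(schema).csv("csv-dir")
    df.show
    // The header of 2.csv is only skipped, and the user schema is applied
    // positionally, so its row comes back as f1 = 22, f2 = 11: the values
    // land in the wrong columns and no error is raised.
    ```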

