[ https://issues.apache.org/jira/browse/SPARK-23786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li resolved SPARK-23786.
-----------------------------
       Resolution: Fixed
         Assignee: Maxim Gekk
    Fix Version/s: 2.4.0

> CSV schema validation - column names are not checked
> ----------------------------------------------------
>
>                 Key: SPARK-23786
>                 URL: https://issues.apache.org/jira/browse/SPARK-23786
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Maxim Gekk
>            Assignee: Maxim Gekk
>            Priority: Major
>             Fix For: 2.4.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Here is a CSV file that contains two columns of the same type:
> {code}
> $ cat marina.csv
> depth, temperature
> 10.2, 9.0
> 5.5, 12.3
> {code}
> If we define the schema with correct types but wrong column names (reversed order):
> {code:scala}
> val schema = new StructType().add("temperature", DoubleType).add("depth", DoubleType)
> {code}
> Spark reads the CSV file without any errors:
> {code:scala}
> val ds = spark.read.schema(schema).option("header", "true").csv("marina.csv")
> ds.show
> {code}
> and outputs wrong result:
> {code}
> +-----------+-----+
> |temperature|depth|
> +-----------+-----+
> |       10.2|  9.0|
> |        5.5| 12.3|
> +-----------+-----+
> {code}
> The correct behavior would be either to raise an error or to read the columns according to their names in the schema.
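> A minimal pre-read check (a sketch for illustration, not the fix committed for this issue) can catch the mismatch: compare the file's header line with the schema field names before reading. The strict, order-sensitive comparison below is an assumption; with the reversed schema from the example it fails fast instead of silently swapping the columns.
> {code:scala}
> import org.apache.spark.sql.types._
>
> // Same (reversed) schema as in the example above.
> val schema = new StructType().add("temperature", DoubleType).add("depth", DoubleType)
>
> // Read just the raw header line and compare it with the schema field names.
> val header = spark.read.textFile("marina.csv").first().split(",").map(_.trim)
> require(header.sameElements(schema.fieldNames),
>   s"CSV header [${header.mkString(", ")}] does not match schema [${schema.fieldNames.mkString(", ")}]")
>
> // Only reached when the header and schema names line up.
> val ds = spark.read.schema(schema).option("header", "true").csv("marina.csv")
> {code}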


