Github user MaxGekk commented on the issue:

    https://github.com/apache/spark/pull/20894

> The case exists in all data source format, right?

Not in all of them. For example, the JSON datasource is more tolerant of field order in JSON records. Say you have the schema:

```
val schema = new StructType().add("f1", IntegerType).add("f2", IntegerType)
```

You can read files from the same folder with a different order of fields:

*1.json*
```
{"f1":1, "f2":2}
```

*2.json*
```
{"f2":22, "f1":11}
```

```
spark.read.schema(schema).json("json-dir")
res0.show
+---+---+
| f1| f2|
+---+---+
| 11| 22|
|  1|  2|
+---+---+
```

> If user didn't provide schema, should we check the header among CSV files?

If the user didn't provide a schema, it will be inferred: with `inferSchema` set, proper types are inferred; otherwise all fields are read as strings. With these changes, the inferred schema is then verified against the actual CSV headers.

> Users should be responsible for the specifying data schema.

Yes, but the schema can also be inferred, and it can still be checked during parsing.

> The proposed behavior can only help users to avoid manually checking the CSV headers.

Yes, this is exactly the problem reported by our customers. They have multiple CSV files received from different sources, and some of the files have a different column order. Spark silently returns wrong results. The expected behavior is either an error (including the file name) or a correct result, i.e. the data must land in the right columns of the loaded dataframe, as it does in the JSON datasource.
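The header-versus-schema verification described above could be sketched roughly like this. This is an illustrative sketch only, not Spark's actual implementation: the `checkHeader` name is hypothetical, and a real CSV parser must honor quoting, escapes, and the configured delimiter rather than a plain comma split.

```scala
// Illustrative sketch of a header check, NOT Spark's real code.
// Assumes an unquoted, comma-delimited header line.
object HeaderCheck {
  // Compare the field names of the expected (provided or inferred) schema
  // with the header row of a concrete file; fail with the file name on mismatch.
  def checkHeader(expectedFields: Seq[String], headerLine: String, fileName: String): Unit = {
    val actualFields = headerLine.split(",").map(_.trim).toSeq
    if (actualFields != expectedFields) {
      throw new IllegalArgumentException(
        s"CSV header in $fileName does not match the expected schema. " +
        s"Expected: ${expectedFields.mkString(", ")}; found: ${actualFields.mkString(", ")}")
    }
  }
}
```

With such a check, a file whose columns are reordered relative to the schema raises an error naming the file, instead of silently filling the wrong columns.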