[ https://issues.apache.org/jira/browse/SPARK-20012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15931277#comment-15931277 ]
Sean Owen commented on SPARK-20012:
-----------------------------------

I don't understand the problem from this description. The headers must appear in the same order as the data in the CSV file, of course; that works. What is your input, and what output did you expect versus what you got? A simple reproduction here would be appropriate.

> spark.read.csv schemas effectively ignore headers
> -------------------------------------------------
>
>                 Key: SPARK-20012
>                 URL: https://issues.apache.org/jira/browse/SPARK-20012
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 2.1.0
>         Environment: pyspark
>            Reporter: david cottrell
>            Priority: Minor
>
> New to Spark, so please direct me elsewhere if there is a better place for
> this kind of discussion.
> To my understanding, schemas are ordered *named* structures, but it seems
> the names are not being used when reading files with headers.
> I had a quick look at the DataFrameReader code and it seems like it might not
> be too hard to
> a) let the schema set the "global" order of the columns
> b) per file, map the columns *by name* to the schema ordering and apply the
> types on load.
> A simple way of saying this is that the schema is an ordered dictionary,
> whereas a file with headers only defines an (unordered) dictionary.
> A typical example showing what I think are the implications of this problem:
> {code}
> In [248]: a = spark.read.csv('./data/test.csv.gz', header=True,
> inferSchema=True).toPandas()
> In [249]: b = spark.read.csv('./data/0.csv.gz', header=True,
> inferSchema=True).toPandas()
> In [250]: d = pd.concat([a, b])
> In [251]: df = spark.read.csv('./data/{test,0}.csv.gz', header=True,
> inferSchema=True).toPandas()
> In [252]: df[['b', 'c', 'd', 'e']] = df[['b', 'c', 'd', 'e']].astype(float)
> In [253]: a
> Out[253]:
>       a         b         e         d         c
> 0  test -0.874197  0.168660 -0.948726  0.479723
> 1  test  1.124383  0.620870  0.159186  0.993676
> 2  test -1.429108 -0.048814 -0.057273 -1.331702
> In [254]: b
> Out[254]:
>    a         b         c         d         e
> 0  0 -1.671828 -1.259530  0.905029  0.487244
> 1  0 -0.024553 -1.750904  0.004466  1.978049
> 2  0  1.686806  0.175431  0.677609 -0.851670
> In [255]: d
> Out[255]:
>       a         b         c         d         e
> 0  test -0.874197  0.479723 -0.948726  0.168660
> 1  test  1.124383  0.993676  0.159186  0.620870
> 2  test -1.429108 -1.331702 -0.057273 -0.048814
> 0     0 -1.671828 -1.259530  0.905029  0.487244
> 1     0 -0.024553 -1.750904  0.004466  1.978049
> 2     0  1.686806  0.175431  0.677609 -0.851670
> In [256]: df
> Out[256]:
>       a         b         c         d         e
> 0  test -0.874197  0.168660 -0.948726  0.479723
> 1  test  1.124383  0.620870  0.159186  0.993676
> 2  test -1.429108 -0.048814 -0.057273 -1.331702
> 3     0 -1.671828 -1.259530  0.905029  0.487244
> 4     0 -0.024553 -1.750904  0.004466  1.978049
> 5     0  1.686806  0.175431  0.677609 -0.851670
> {code}
> Example also posted here:
> http://stackoverflow.com/questions/42637497/pyspark-2-1-0-spark-read-csv-scrambles-columns
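[Editor's note] A minimal workaround sketch of the by-name mapping the reporter describes, not a fix from this issue: read each file separately so its own header-to-column mapping is honored, reorder the columns by name, then union. It assumes a pyspark shell where `spark` is the SparkSession, reuses the file paths from the example above, and sticks to the Spark 2.1 API (so positional `union()` after a by-name `select()`, rather than the later `unionByName`).

{code}
from functools import reduce

# Read each file on its own so its header is matched to its own column
# order, instead of letting a multi-file read apply one header order
# positionally across all files.
paths = ['./data/test.csv.gz', './data/0.csv.gz']
parts = [spark.read.csv(p, header=True, inferSchema=True) for p in paths]

# Fix one "global" column order and reorder every part by name.
cols = sorted(parts[0].columns)
aligned = [p.select(*cols) for p in parts]

# union() matches columns positionally, which is now safe because every
# part has its columns in the same order.
df = reduce(lambda x, y: x.union(y), aligned)
{code}

One caveat with this approach: since inferSchema runs per file, the parts can end up with slightly different inferred types, so casting each column to an explicit type before the union may also be needed.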