Aseem Bansal created SPARK-16896:
------------------------------------

             Summary: Loading csv with duplicate column names
                 Key: SPARK-16896
                 URL: https://issues.apache.org/jira/browse/SPARK-16896
             Project: Spark
          Issue Type: Bug
    Affects Versions: 2.0.0
            Reporter: Aseem Bansal


It would be great if the library allowed us to load CSV files with duplicate 
column names. I understand that having duplicate columns in the data is odd, 
but upstream data sometimes arrives that way. We may choose to ignore the 
duplicates, but currently there is no way to drop them because the file cannot 
be loaded at all. As a pre-processing workaround, I load the data into R, 
rename the columns, and write out a fixed version that the Spark Java API can 
work with.

For comparison, other tools already handle this. R's read.csv, for example, 
takes care of the situation automatically by appending a number to the 
duplicated column name.
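
To illustrate the kind of renaming meant here, below is a minimal sketch in 
plain Java with no Spark dependency. The class name HeaderDedup, the method 
dedupe, and the "_N" suffix format are hypothetical choices for this example; 
they are not part of Spark or of R, and R's own suffix format may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not a Spark API: makes header names unique by
// appending "_1", "_2", ... to repeated names.
final class HeaderDedup {

    static List<String> dedupe(List<String> header) {
        Set<String> used = new HashSet<>();
        List<String> result = new ArrayList<>(header.size());
        for (String name : header) {
            String candidate = name;
            int suffix = 1;
            // Set.add returns false when the name is already taken,
            // so keep trying the next numbered variant until one is free.
            while (!used.add(candidate)) {
                candidate = name + "_" + suffix++;
            }
            result.add(candidate);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> header = Arrays.asList("id", "value", "value", "id", "value");
        // prints [id, value, value_1, id_1, value_2]
        System.out.println(dedupe(header));
    }
}
```

The first occurrence of each name is left untouched, so files without 
duplicates would pass through unchanged.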

Case sensitivity in column names can also cause problems. If we have columns 
like

ColumnName, columnName

I may want to keep them as separate columns, but the option to do this is not 
documented.
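
As a way to detect this case before loading, here is a minimal sketch in plain 
Java that reports header names which differ only by letter case. The class 
HeaderCaseCheck and the method caseCollisions are hypothetical names for this 
example and are not Spark APIs. Spark SQL does have a spark.sql.caseSensitive 
setting, but whether it changes how the CSV reader treats such headers in 
2.0.0 is not confirmed here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not a Spark API: finds header names that become
// identical once letter case is ignored.
final class HeaderCaseCheck {

    static Map<String, List<String>> caseCollisions(List<String> header) {
        // Group the original names under their lower-cased form,
        // keeping the order in which they appear in the header.
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String name : header) {
            groups.computeIfAbsent(name.toLowerCase(Locale.ROOT),
                                   key -> new ArrayList<>()).add(name);
        }
        // Keep only groups with two or more names. Exact duplicates
        // land in the same group as well.
        groups.values().removeIf(names -> names.size() < 2);
        return groups;
    }

    public static void main(String[] args) {
        List<String> header = Arrays.asList("ColumnName", "columnName", "id");
        // prints {columnname=[ColumnName, columnName]}
        System.out.println(caseCollisions(header));
    }
}
```

An empty result would mean the header is safe under case-insensitive column 
resolution.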



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
