[ https://issues.apache.org/jira/browse/SPARK-16896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15413211#comment-15413211 ]
Natu Lauchande commented on SPARK-16896: ---------------------------------------- Quick confirmation needed : this change needs to happen in the https://github.com/databricks/spark-csv package if i understand correctly . Will start working on it later today, will come back with some more newbie questions. > Loading csv with duplicate column names > --------------------------------------- > > Key: SPARK-16896 > URL: https://issues.apache.org/jira/browse/SPARK-16896 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.0.0 > Reporter: Aseem Bansal > > It would be great if the library allows us to load csv with duplicate column > names. I understand that having duplicate columns in the data is odd but > sometimes we get data that has duplicate columns. Getting upstream data like > that can happen. We may choose to ignore them but currently there is no way > to drop those as we are not able to load them at all. Currently as a > pre-processing I loaded the data into R, changed the column names and then > make a fixed version with which Spark Java API can work. > But if talk about other options, e.g. R has read.csv which automatically > takes care of such situation by appending a number to the column name. > Also case sensitivity in column names can also cause problems. I mean if we > have columns like > ColumnName, columnName > I may want to have them as separate. But the option to do this is not > documented. -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org