[ https://issues.apache.org/jira/browse/SPARK-26406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16728143#comment-16728143 ]
Hyukjin Kwon commented on SPARK-26406:
--------------------------------------

Spark allows RDD operations. You can also read the file as text, skip the first few lines explicitly, and then load the remainder via the `csv(Dataset[String])` API.

> Add option to skip rows when reading csv files
> ----------------------------------------------
>
>                 Key: SPARK-26406
>                 URL: https://issues.apache.org/jira/browse/SPARK-26406
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.4.0
>            Reporter: Thomas Kastl
>            Priority: Minor
>
> Real-world data can contain multiple header lines. Spark currently does not offer any way to skip more than one header row.
> Several workarounds are proposed on Stack Overflow (manually editing each csv file by prefixing the extra rows with "#" and using the comment option, or filtering after reading), but all of them have more or less obvious drawbacks and restrictions.
> The option
> {code:java}
> header=True{code}
> already treats the first row of csv files differently, so the argument that Spark wants to be row-order agnostic does not really hold here in my opinion.
> A solution like pandas'
> {code:java}
> skiprows={code}
> would be highly preferable.
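A minimal Scala sketch of the workaround suggested in the comment above. The input path and the number of rows to drop ({{skipLines}}) are illustrative assumptions, not part of the original issue; the only Spark API it relies on is `DataFrameReader.csv(Dataset[String])`.

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("skip-rows-sketch").getOrCreate()
import spark.implicits._

// Assumed values for illustration only.
val path = "/path/to/data.csv"
val skipLines = 3 // number of leading header rows to drop

// Read the file as plain text, drop the first `skipLines` lines by index,
// then parse the remaining lines with the csv(Dataset[String]) API.
val lines = spark.read.textFile(path)
val withoutHeaders = lines.rdd
  .zipWithIndex()
  .filter { case (_, idx) => idx >= skipLines }
  .map { case (line, _) => line }
  .toDS()

val df = spark.read
  .option("inferSchema", "true")
  .csv(withoutHeaders)
{code}

Note that zipWithIndex assigns indices in partition order, so this keeps the file's original line order only to the extent that the text source preserves it; for a single uncompressed file this is generally the case, but it remains a workaround rather than a built-in skiprows option.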