[ https://issues.apache.org/jira/browse/SPARK-26406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16728143#comment-16728143 ]
Hyukjin Kwon commented on SPARK-26406:
--------------------------------------

Spark allows RDD operations. You can also read the file as text, skip the first few lines explicitly, and then load the remainder via the `csv(Dataset[String])` API.

> Add option to skip rows when reading csv files
> ----------------------------------------------
>
>                 Key: SPARK-26406
>                 URL: https://issues.apache.org/jira/browse/SPARK-26406
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.4.0
>            Reporter: Thomas Kastl
>            Priority: Minor
>
> Real-world data can contain multiple header lines. Spark currently does not offer any way to skip more than one header row.
> Several workarounds are proposed on Stack Overflow (manually editing each csv file by prefixing the extra rows with "#" and using the comment option, or filtering after reading), but all of them have more or less obvious drawbacks and restrictions.
> The option
> {code:java}
> header=True{code}
> already treats the first row of csv files differently, so the argument that Spark wants to be row-order agnostic does not really hold here in my opinion.
> A solution like pandas'
> {code:java}
> skiprows={code}
> would be highly preferable.
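A minimal Scala sketch of the workaround suggested in the comment above. The input path and the number of rows to drop ({{skipLines}}) are illustrative assumptions, not part of the original issue; the only Spark API it relies on is `DataFrameReader.csv(Dataset[String])`.

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("skip-rows-sketch").getOrCreate()
import spark.implicits._

// Assumed values for illustration only.
val path = "/path/to/data.csv"
val skipLines = 3 // number of leading header rows to drop

// Read the file as plain text, drop the first `skipLines` lines by index,
// then parse the remaining lines with the csv(Dataset[String]) API.
val lines = spark.read.textFile(path)
val withoutHeaders = lines.rdd
  .zipWithIndex()
  .filter { case (_, idx) => idx >= skipLines }
  .map { case (line, _) => line }
  .toDS()

val df = spark.read
  .option("inferSchema", "true")
  .csv(withoutHeaders)
{code}

Note that zipWithIndex assigns indices in partition order, so this keeps the file's original line order only to the extent that the text source preserves it; for a single uncompressed file this is generally the case, but it remains a workaround rather than a built-in skiprows option.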