[ 
https://issues.apache.org/jira/browse/SPARK-26406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Kastl updated SPARK-26406:
---------------------------------
    Description: 
Real-world data can contain multiple header lines. Spark currently offers no 
way to skip more than one header row when reading CSV files.

Several workarounds have been proposed on Stack Overflow (manually editing each 
CSV file to prefix the extra rows with "#" and then using the comment option, 
or filtering after reading), but all of them have more or less obvious 
drawbacks and restrictions.

The option
{code:python}
header=True{code}
already treats the first row of CSV files differently, so the argument that 
Spark wants to be row-order agnostic does not really hold here in my opinion. A 
solution like pandas'
{code:python}
skiprows={code}
would be highly preferable.
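For illustration, the behaviour such an option would provide can be sketched in plain Python, with no Spark involved (the sample file contents and the SKIP_ROWS value here are made up for the example):

```python
import csv
import io
from itertools import islice

# A CSV payload with two junk header lines before the real column names,
# as commonly produced by tool exports.
raw = """generated by sensor export v2
export date: 2018-12-19
id,value
1,10
2,20
"""

SKIP_ROWS = 2  # number of leading non-data lines to drop

# Mimic a pandas-style skiprows option: drop the first SKIP_ROWS lines,
# then treat the next line as the header.
reader = csv.reader(islice(io.StringIO(raw), SKIP_ROWS, None))
header = next(reader)
rows = [dict(zip(header, row)) for row in reader]

print(header)   # ['id', 'value']
print(rows[0])  # {'id': '1', 'value': '10'}
```

In Spark today the equivalent requires either pre-editing the files or filtering rows out after the read, which is exactly the kind of workaround this issue asks to make unnecessary.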

  was:
Real-world data can contain multiple header lines. Spark currently does not 
offer any way to skip more than one header row.

Several workarounds are proposed on stackoverflow (manually editing each csv 
file by adding "#" to the rows and using the comment option, or filtering after 
reading) but all of them are workarounds with more or less obvious drawbacks 
and restrictions.

The option
{code:java}
header=True{code}
already treats the first row of csv files differently, so the argument that 
Spark wants to be row-agnostic does not really hold here in my opinion. A 
solution like pandas
{code:java}
skiprows={code}
would be highly preferable.


> Add option to skip rows when reading csv files
> ----------------------------------------------
>
>                 Key: SPARK-26406
>                 URL: https://issues.apache.org/jira/browse/SPARK-26406
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.4.0
>            Reporter: Thomas Kastl
>            Priority: Minor
>
> (description as above)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
