[jira] [Commented] (SPARK-42359) Support row skipping when reading CSV files

Willi Raschkowski (Jira) Mon, 06 Feb 2023 05:43:09 -0800


    [ https://issues.apache.org/jira/browse/SPARK-42359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17684683#comment-17684683 ]


Willi Raschkowski commented on SPARK-42359:
-------------------------------------------

In our experience, such CSV files tend to be Excel exports in which users have 
populated the rows above the header with descriptions of the data.

To give a real-world example: [here's a dataset made available by the UK government (data.gov.uk)|https://www.data.gov.uk/dataset/9003012e-4564-4a6b-b5f0-8765ccb23a03/average-road-fuel-sales-deliveries-and-stock-levels].
The dataset is only available via Excel files that look like this:

!Screenshot 2023-02-06 at 13.23.34.png!

Exporting from Excel for consumption in Spark results in a CSV that looks like 
this:
{code}
cat ~/Downloads/20230202_Average_road_fuel_sales_deliveries_and_stock_levels.csv | head -n 15 | cut -c1-150
"Average road fuel deliveries at sampled filling stations: United Kingdom, from 
27 January 2020 [note 1][note 2][note 3]",,,,,,,,,,,,,,,,,,,,,,,,,,,,
This worksheet contains one table. Some cells refer to notes which can be found 
in the notes worksheet.,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
"Freeze panes are turned on. To turn off freeze panes select the 'View' ribbon 
then 'Freeze Panes' then 'Unfreeze Panes' or use [Alt,W,F]",,,,,,,,,,,,
Source: 
BEIS,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Released: 02 February 
2023,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Return to 
contents,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Units: Volume in 
litres,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Date,Weekday,Fuel Type,North East,North West,Yorkshire and The Humber,"East
Midlands","West
Midlands",East,London,South East,South West,Northern 
Ireland,Wales,Scotland,"England
[note 3]",United 
Kingdom,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
27/01/2020,Monday,Diesel," 10,583 "," 9,422 "," 11,687 "," 11,205 "," 11,353 
"," 10,284 "," 7,501 "," 10,023 "," 9,535 "," 8,511 "," 9,961 "," 9,600 "
28/01/2020,Tuesday,Diesel," 11,643 "," 10,440 "," 13,172 "," 11,885 "," 12,943 
"," 12,255 "," 7,310 "," 10,106 "," 11,144 "," 7,740 "," 10,306 "," 10,
29/01/2020,Wednesday,Diesel," 10,839 "," 10,021 "," 11,417 "," 12,195 "," 
11,370 "," 12,542 "," 8,102 "," 11,235 "," 10,840 "," 6,943 "," 11,532 "," 9
30/01/2020,Thursday,Diesel," 8,808 "," 10,673 "," 11,871 "," 13,469 "," 12,727 
"," 12,445 "," 7,708 "," 11,044 "," 9,741 "," 7,456 "," 10,647 "," 10,2
{code}
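
For illustration, the pre-processing work-around mentioned in the issue description might look roughly like the sketch below, written in plain Python rather than Spark. The function name {{read_csv_skipping}} and the shortened sample data are hypothetical and only stand in for the export above. Note that the sketch skips physical lines, so it assumes the unwanted lines at the top contain no quoted line breaks.

```python
import csv
import io


def read_csv_skipping(text, skip_lines):
    """Drop the first `skip_lines` physical lines, then parse the rest as CSV."""
    # Keep the line endings so that newlines inside quoted fields
    # (such as "East\nMidlands" in the header) reach the CSV parser intact.
    lines = text.splitlines(keepends=True)
    remainder = "".join(lines[skip_lines:])
    reader = csv.reader(io.StringIO(remainder, newline=""))
    header = next(reader)
    rows = list(reader)
    return header, rows


# Hypothetical, shortened stand-in for the Excel export shown above.
sample = (
    "Source: BEIS,,\n"
    "Units: Volume in litres,,\n"
    'Date,Weekday,"East\nMidlands"\n'
    '27/01/2020,Monday," 11,205 "\n'
)

header, rows = read_csv_skipping(sample, skip_lines=2)
print(header)
print(rows)
```

Every user would have to repeat a step like this before handing the file to Spark, which is the friction a {{skipLines}} option would remove.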

> Support row skipping when reading CSV files
> -------------------------------------------
>
>                 Key: SPARK-42359
>                 URL: https://issues.apache.org/jira/browse/SPARK-42359
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.3.1
>            Reporter: Willi Raschkowski
>            Priority: Major
>         Attachments: Screenshot 2023-02-06 at 13.23.34.png
>
>
> Spark currently can't read CSV files that contain lines with comments or 
> annotations above the header and data. Work-arounds include pre-processing 
> CSVs, or using RDDs and something like {{zipWithIndex}}. But all of these 
> increase friction for less technical users.
> This issue proposes a {{skipLines}} option for Spark's CSV parser to drop a 
> number of unwanted lines at the top of a CSV file.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
