[ https://issues.apache.org/jira/browse/SPARK-42359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17684683#comment-17684683 ]
Willi Raschkowski commented on SPARK-42359:
-------------------------------------------

In our experience such CSV files tend to be Excel exports where users like to populate rows above the header with descriptions of the data.

To give a real-world example: [here's a dataset made available by the UK government (data.gov.uk)|https://www.data.gov.uk/dataset/9003012e-4564-4a6b-b5f0-8765ccb23a03/average-road-fuel-sales-deliveries-and-stock-levels]. The dataset is only available via Excel files that look like this:

!Screenshot 2023-02-06 at 13.23.34.png!

Exporting from Excel for consumption in Spark results in a CSV that looks like this:

{code}
cat ~/Downloads/20230202_Average_road_fuel_sales_deliveries_and_stock_levels.csv | head -n 15 | cut -c1-150
"Average road fuel deliveries at sampled filling stations: United Kingdom, from 27 January 2020 [note 1][note 2][note 3]",,,,,,,,,,,,,,,,,,,,,,,,,,,,
This worksheet contains one table. Some cells refer to notes which can be found in the notes worksheet.,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
"Freeze panes are turned on. 
To turn off freeze panes select the 'View' ribbon then 'Freeze Panes' then 'Unfreeze Panes' or use [Alt,W,F]",,,,,,,,,,,,
Source: BEIS,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Released: 02 February 2023,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Return to contents,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Units: Volume in litres,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Date,Weekday,Fuel Type,North East,North West,Yorkshire and The Humber,"East Midlands","West Midlands",East,London,South East,South West,Northern Ireland,Wales,Scotland,"England [note 3]",United Kingdom,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
27/01/2020,Monday,Diesel," 10,583 "," 9,422 "," 11,687 "," 11,205 "," 11,353 "," 10,284 "," 7,501 "," 10,023 "," 9,535 "," 8,511 "," 9,961 "," 9,600 "
28/01/2020,Tuesday,Diesel," 11,643 "," 10,440 "," 13,172 "," 11,885 "," 12,943 "," 12,255 "," 7,310 "," 10,106 "," 11,144 "," 7,740 "," 10,306 "," 10,
29/01/2020,Wednesday,Diesel," 10,839 "," 10,021 "," 11,417 "," 12,195 "," 11,370 "," 12,542 "," 8,102 "," 11,235 "," 10,840 "," 6,943 "," 11,532 "," 9
30/01/2020,Thursday,Diesel," 8,808 "," 10,673 "," 11,871 "," 13,469 "," 12,727 "," 12,445 "," 7,708 "," 11,044 "," 9,741 "," 7,456 "," 10,647 "," 10,2
{code}

> Support row skipping when reading CSV files
> -------------------------------------------
>
>                 Key: SPARK-42359
>                 URL: https://issues.apache.org/jira/browse/SPARK-42359
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.3.1
>            Reporter: Willi Raschkowski
>            Priority: Major
>         Attachments: Screenshot 
2023-02-06 at 13.23.34.png
>
>
> Spark currently can't read CSV files that contain lines with comments or
> annotations above the header and data. Work-arounds include pre-processing
> CSVs, or using RDDs and something like {{zipWithIndex}}. But all of these
> increase friction for less technical users.
> This issue proposes a {{skipLines}} option for Spark's CSV parser to drop a
> number of unwanted lines at the top of a CSV file.
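
For illustration only, the "pre-processing CSVs" work-around mentioned in the description could look roughly like the Python sketch below. It is not Spark code and not the proposed {{skipLines}} implementation; the function name `read_csv_skipping`, its `skip_records` parameter, and the sample text are all made up for this example. It assumes the number of rows above the header is known in advance, and it skips CSV records rather than physical lines, because a quoted cell such as the "Freeze panes" one in the export above contains a line break.

```python
import csv
import io
from itertools import islice


def read_csv_skipping(text, skip_records):
    """Drop the first `skip_records` CSV records, then treat the next one as the header.

    Hypothetical helper for illustration; returns (header, rows).
    """
    # newline="" leaves line endings untouched so the csv module can handle
    # line breaks that sit inside quoted cells.
    reader = csv.reader(io.StringIO(text, newline=""))
    records = islice(reader, skip_records, None)
    header = next(records)
    # Excel exports pad every row with empty trailing cells; trim them from
    # the header and cut each data row to the same width.
    while header and header[-1] == "":
        header.pop()
    rows = [record[: len(header)] for record in records]
    return header, rows


# Made-up sample shaped like the export above: three descriptive records
# (one of them spanning two physical lines), then the header, then data.
sample = (
    '"Title row, with a comma",,,\n'
    '"Freeze panes are turned on. \nTo turn them off use the View ribbon",,,\n'
    "Units: Volume in litres,,,\n"
    "Date,Weekday,Fuel Type,\n"
    "27/01/2020,Monday,Diesel,\n"
)

header, rows = read_csv_skipping(sample, skip_records=3)
print(header)
print(rows)
```

The sample has four physical lines above the header but only three records, so a line-based skip count would differ from a record-based one for files like this.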