> > Is the behavior of spark.read.csv not same as pyarrow csv?
No, I don't think the Python/C++ Arrow implementation has made any guarantees about compatibility with other engines' CSV parsers; this includes Pandas and Spark. (I'd actually be shocked if any of them covered all possible edge cases in the same way.)

On Thu, Mar 17, 2022 at 4:39 PM Sricheta Ruj <[email protected]> wrote:

> Thanks for replying Micah.
>
> But Spark is able to read this file without the commas for missing data.
> Is the behavior of spark.read.csv not same as pyarrow csv?
>
> This file is part of spark csv tests -
> https://github.com/apache/spark/blob/f36a5fb2b88620c1c490d087b0293c4e58d29979/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala#L120
>
> Thanks
> Sricheta.
>
> *From:* Micah Kornfield <[email protected]>
> *Sent:* Thursday, March 17, 2022 4:28 PM
> *To:* [email protected]
> *Subject:* [EXTERNAL] Re: [pyarrow] CSV parse error: Expected 5 columns, got 3
>
> I believe the Arrow parser expects the last line to be:
>
> "2015,Chevy,Volt,,"
>
> (i.e. have commas for the missing data).
>
> On Thu, Mar 17, 2022 at 3:23 PM Sricheta Ruj <[email protected]> wrote:
>
> Hello.
>
> I am using pyarrow csv module.
>
> from pyarrow import csv
>
> fn = '/home/srruj/cars.csv'
> column_names = ['year', 'make', 'model', 'comment', 'blank']
> read_options = csv.ReadOptions(column_names=column_names)
> convert_options = csv.ConvertOptions(include_columns=column_names,
>                                      include_missing_columns=True,
>                                      strings_can_be_null=True)
>
> table = csv.read_csv(fn, read_options=read_options,
>                      convert_options=convert_options)
> table
>
> I am getting the following error:
>
> Csv parse error: Expected 5 columns, got 3
>
> This is how file looks:
>
> year,make,model,comment,blank
> "2012","Tesla","S","No comment",
> 1997,Ford,E350,"Go get one now they are going fast",
> 2015,Chevy,Volt
>
> I am able to read this file from spark using spark.read.csv(..) but not using pyarrow.
>
> Can you please help?
>
> Thanks
> Sricheta.
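
Since the Arrow parser appears to want every row to carry the same number of fields, one possible workaround is to pad the short rows before handing the data to pyarrow. The sketch below is only an illustration and not something proposed in the thread: the helper name `pad_short_rows` is hypothetical, it uses only Python's standard `csv` module, and it assumes the file is small enough to hold in memory as text.

```python
import csv
import io


def pad_short_rows(text: str, width: int) -> str:
    """Return `text` with every CSV row padded with empty fields up to `width` columns.

    Hypothetical helper for illustration; rows that already have `width`
    or more fields are written back unchanged, and blank lines are dropped.
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip completely blank lines
        # A negative multiplier yields an empty list, so long rows are untouched.
        writer.writerow(row + [""] * (width - len(row)))
    return out.getvalue()


# The file contents quoted in the original message.
sample = (
    "year,make,model,comment,blank\n"
    '"2012","Tesla","S","No comment",\n'
    '1997,Ford,E350,"Go get one now they are going fast",\n'
    "2015,Chevy,Volt\n"
)

padded = pad_short_rows(sample, width=5)
print(padded)
```

The padded text can then be encoded and wrapped in `io.BytesIO` and passed to pyarrow's `read_csv` in place of the file path, since that function also accepts file-like objects. Note that the standard-library `csv` module and `from pyarrow import csv` share a name, so one of them would need an alias if both are used in the same script.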
