The quote options seem to be related to escaping quotes, and the dataset isn't escaping quotes. As I said, quoted strings with embedded commas are something that pandas handles easily, and even Excel handles them as well.
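(As a minimal sketch of the convention in question, using Python's standard csv module rather than Spark, with a field taken from the sample data quoted below: a quoted field may contain literal commas, and a doubled quote "" stands for a literal quote character.)

```python
import csv
import io

# A fragment of the problematic line from the sample data below:
# a quoted field with an embedded comma, and "" escaping literal
# quote characters (the RFC 4180 convention pandas and Excel use).
line = '80015360210876000,"32 XIY ""W"" JK, RE LK",23.0\n'

row = next(csv.reader(io.StringIO(line)))
# The quoted field survives as a single value, comma included.
print(row)  # ['80015360210876000', '32 XIY "W" JK, RE LK', '23.0']
```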
Femi

On Sun, Nov 6, 2016 at 6:59 AM, Hyukjin Kwon <gurwls...@gmail.com> wrote:
> Hi Femi,
>
> Have you maybe tried the quote related options specified in the
> documentation?
>
> http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.csv
>
> Thanks.
>
> 2016-11-06 6:58 GMT+09:00 Femi Anthony <femib...@gmail.com>:
>
>> Hi, I am trying to process a very large comma-delimited csv file and I am
>> running into problems.
>> The main problem is that some fields contain quoted strings with embedded
>> commas.
>> It seems as if PySpark is unable to properly parse lines containing such
>> fields the way that, say, Pandas does.
>>
>> Here is the code I am using to read the file in PySpark:
>>
>> df_raw = spark.read.option("header", "true").csv(csv_path)
>>
>> Here is an example of a good and a 'bad' line in such a file:
>>
>> col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,col11,col12,col13,col14,col15,col16,col17,col18,col19
>> 80015360210876000,11.22,X,4076710258,,,sxsw,,"32 YIU ""A""",S5,,"32 XIY ""W"" JK, RE LK",SOMETHINGLIKEAPHENOMENON#YOUGOTSOUL~BRINGDANOISE,23.0,cyclingstats,2012-25-19,432,2023-05-17,CODERED
>> 61670000229561918,137.12,U,8234971771,,,woodstock,,,T4,,,OUTKAST#THROOTS~WUTANG#RUNDMC,0.0,runstats,2013-21-22,1333,2019-11-23,CODEBLUE
>>
>> Line 0 is the header.
>> Line 1 is the 'problematic' line.
>> Line 2 is a good line.
>>
>> Pandas can handle this easily:
>>
>> In [1]: import pandas as pd
>>
>> In [2]: pdf = pd.read_csv('malformed_data.csv')
>>
>> In [4]: pdf[['col12','col13','col14']]
>> Out[4]:
>>                   col12                                             col13 \
>> 0  32 XIY "W" JK, RE LK  SOMETHINGLIKEAPHENOMENON#YOUGOTSOUL~BRINGDANOISE
>> 1                   NaN                     OUTKAST#THROOTS~WUTANG#RUNDMC
>>
>>    col14
>> 0   23.0
>> 1    0.0
>>
>> while PySpark seems to parse this erroneously:
>>
>> In [5]: sdf = spark.read.format("org.apache.spark.csv").csv('malformed_data.csv', header=True)
>>
>> In [6]: sdf.select("col12", "col13", "col14").show()
>> +------------------+--------------------+--------------------+
>> |             col12|               col13|               col14|
>> +------------------+--------------------+--------------------+
>> |  "32 XIY ""W"" JK|              RE LK"|SOMETHINGLIKEAPHE...|
>> |              null|OUTKAST#THROOTS~W...|                 0.0|
>> +------------------+--------------------+--------------------+
>>
>> Is this a bug or am I doing something wrong?
>> I am working with Spark 2.0.
>> Any help is appreciated.
>>
>> Thanks,
>> -- Femi
>>
>> http://www.nextmatrix.com
>> "Great spirits have always encountered violent opposition from mediocre
>> minds." - Albert Einstein.
>>

--
http://www.femibyte.com/twiki5/bin/view/Tech/
http://www.nextmatrix.com
"Great spirits have always encountered violent opposition from mediocre minds." - Albert Einstein.
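(Editor's aside: the behaviour shown above is consistent with Spark 2.x's CSV reader defaulting to backslash escaping rather than the doubled-quote convention; per the DataFrameReader.csv documentation linked earlier, the usual workaround is to set the escape option to the quote character itself, e.g. spark.read.option("header", "true").option("escape", '"').csv(csv_path). The contrast between the two escaping conventions can be sketched with Python's stdlib csv module, standing in for Spark, using the field from the thread:)

```python
import csv
import io

# The problematic field from the thread: an embedded comma, and ""
# used to escape literal quote characters inside a quoted field.
line = '"32 XIY ""W"" JK, RE LK",SOMETHINGLIKEAPHENOMENON,23.0\n'

# RFC 4180 convention (pandas/Excel, and csv.reader's default):
# "" is a literal quote, so the whole field stays together.
rfc4180 = next(csv.reader(io.StringIO(line)))

# Backslash-escape convention (roughly what Spark 2.0 defaults to):
# "" no longer escapes anything, so the parser leaves quoted mode at
# the first lone quote and the embedded comma then splits the field.
backslash = next(csv.reader(io.StringIO(line),
                            doublequote=False, escapechar='\\'))

print(rfc4180)    # 3 fields; the first is: 32 XIY "W" JK, RE LK
print(backslash)  # 4 fields; the quoted field has been split in two
```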