alokchowdary created SPARK-27028:
------------------------------------

             Summary: PySpark read .dat file. Multiline issue
                 Key: SPARK-27028
                 URL: https://issues.apache.org/jira/browse/SPARK-27028
             Project: Spark
          Issue Type: Question
          Components: PySpark
    Affects Versions: 2.4.0
         Environment: Pyspark(2.4) in AWS EMR
            Reporter: alokchowdary


* I am trying to read a .dat file using the PySpark CSV reader. The file contains 
newline characters ("\n") as part of the data. Spark does not read such a value 
as a single column; it treats the text after the newline as a new row. I tried 
the "multiLine" option while reading, but it still does not work.

 * {{spark.read.csv(file_path, schema=schema, sep=delimiter,multiLine=True)}}

 * The data looks something like this. Every line below is treated as a separate 
row in the dataframe.

 * Here '\x01' is the actual delimiter (',' is used below for ease of reading).

{{1. name,test,12345,}}
{{2. x, }}
{{3. desc }}
{{4. name2,test2,12345 }}
{{5. ,y}}
{{6. ,desc2}}

 * So PySpark treats x and desc as new rows in the dataframe, with nulls 
for the other columns.

How can such data be read in PySpark?
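
As far as I understand, "multiLine" only keeps a newline inside a field when 
that field is enclosed in quotes, which is not the case in this file. One 
possible workaround, sketched below in plain Python, is to regroup the physical 
lines into logical records by counting delimiters before handing the rows to 
Spark. The helper name merge_records and the field count of 5 are hypothetical 
and only match the sample above. The sketch assumes every record has a fixed 
number of fields and that the first line of each record contains at least one 
delimiter; it is a heuristic, not a general CSV parser.

```python
SEP = "\x01"  # actual delimiter in the file, per the description above


def merge_records(lines, n_fields, sep=SEP):
    """Regroup physical lines into logical records of n_fields fields.

    Heuristic: a line is appended to the current record as long as the
    record would not contain more than n_fields - 1 delimiters.
    """
    max_seps = n_fields - 1
    records, current, seps = [], [], 0
    for line in lines:
        line = line.rstrip("\n")
        count = line.count(sep)
        # Adding this line would overflow the record, so the line must
        # belong to the next record: flush what has been collected.
        if current and seps + count > max_seps:
            records.append("\n".join(current).split(sep))
            current, seps = [], 0
        current.append(line)
        seps += count
    if current:
        records.append("\n".join(current).split(sep))
    return records


# The six physical lines from the sample, with the real delimiter.
sample = [
    SEP.join(["name", "test", "12345", ""]),
    "x" + SEP,
    "desc",
    SEP.join(["name2", "test2", "12345"]),
    SEP + "y",
    SEP + "desc2",
]
rows = merge_records(sample, n_fields=5)
for row in rows:
    print(row)
```

The resulting lists could then be passed to spark.createDataFrame together 
with the schema, assuming the schema has one string column per field. A record 
that ends up with fewer fields than expected indicates input that breaks the 
assumptions above.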



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

