alokchowdary created SPARK-27028:
------------------------------------

Summary: PySpark read .dat file. Multiline issue
Key: SPARK-27028
URL: https://issues.apache.org/jira/browse/SPARK-27028
Project: Spark
Issue Type: Question
Components: PySpark
Affects Versions: 2.4.0
Environment: Pyspark(2.4) in AWS EMR
Reporter: alokchowdary

I am trying to read a .dat file with the PySpark CSV reader. The file contains newline characters ("\n") as part of the data. Spark does not keep such a value in a single column; it treats the text after each newline as a new row. I tried the "multiLine" option while reading, but it still does not work:

{{spark.read.csv(file_path, schema=schema, sep=delimiter, multiLine=True)}}

The data looks something like this. Each numbered line below is read as a separate row in the DataFrame. The actual delimiter is '\x01' (a comma is used here for ease of reading).

{{1. name,test,12345,}}
{{2. x,}}
{{3. desc}}
{{4. name2,test2,12345}}
{{5. ,y}}
{{6. ,desc2}}

So PySpark treats x and desc as new rows in the DataFrame, with nulls for the other columns. How can such data be read in PySpark?
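
One possible workaround, offered as a sketch rather than a confirmed fix: as far as I understand, the CSV reader's multiLine option only keeps a newline inside a value when that value is quoted, so unquoted data like the sample would need to be reassembled before parsing. The plain-Python helper below groups physical lines into logical records by counting delimiters. The function name {{reassemble_records}}, the record width of 5 fields, and the sample list (the six lines from the report with the numbering dropped and commas swapped for '\x01') are all assumptions made for illustration; the report does not state the schema.

```python
from typing import Iterable, Iterator, List

DELIM = "\x01"  # the real delimiter named in the report


def reassemble_records(
    lines: Iterable[str], delimiter: str, num_fields: int
) -> Iterator[List[str]]:
    """Group physical lines into logical records by counting delimiters.

    Hypothetical helper: a record is assumed to hold exactly `num_fields`
    values, so it contains `num_fields - 1` delimiters. Lines are buffered
    until the next line would push the delimiter count past that limit.
    """
    max_delims = num_fields - 1
    buffer: List[str] = []
    seen = 0
    for line in lines:
        count = line.count(delimiter)
        # The next line does not fit in the current record: emit it first.
        if buffer and seen + count > max_delims:
            yield "\n".join(buffer).split(delimiter)
            buffer, seen = [], 0
        buffer.append(line)
        seen += count
    if buffer:
        yield "\n".join(buffer).split(delimiter)


# Sample from the report, with ',' replaced by the real delimiter.
# The field count of 5 is an assumption inferred from the sample.
sample = [
    s.replace(",", DELIM)
    for s in ["name,test,12345,", "x,", "desc", "name2,test2,12345", ",y", ",desc2"]
]
records = list(reassemble_records(sample, DELIM, 5))
for record in records:
    print(record)
```

The heuristic has a known limit: a line with no delimiter is always attached to the record being built, so a record whose first field itself contains a newline would be merged into the previous record. With Spark, a function like this could be applied to the lines of each file (for example after reading whole files as text) and the resulting lists passed to {{spark.createDataFrame}} with the schema; records that straddle a partition boundary would not be handled if the lines are processed per partition.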