Hi Aakash,

What I see in the picture seems correct: Spark (PySpark) is reading your F2 cell as multi-line text. Where are the nulls you're referring to?

You might find pyspark.sql.functions.regexp_replace
<http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.regexp_replace>
useful for removing newlines and other unwanted whitespace ('\s' already matches '\n', so '\s+' alone is enough):

df.select(..., regexp_replace(<column-name>, '\s+', ' '), ...)
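For illustration, here is a minimal sketch of what that replacement does, using plain Python's re module (Python and Java regex agree on the '\s+' semantics used here; the column value is a made-up example, not your actual data):

```python
import re

# An F cell that Spark read as multi-line text (hypothetical value).
cell = "2017-09-03\nspilled remainder of the row"

# Collapse any run of whitespace -- including the embedded newline --
# into a single space, exactly as regexp_replace(<column>, '\s+', ' ') would.
cleaned = re.sub(r"\s+", " ", cell)
print(cleaned)  # -> 2017-09-03 spilled remainder of the row
```

In PySpark the equivalent column expression would be regexp_replace applied inside df.select or df.withColumn, as in the snippet above.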
Best,

On Sun, Sep 3, 2017 at 12:15 PM, Aakash Basu <aakash.spark....@gmail.com> wrote:

> Hi,
>
> I've a dataset where a few rows of column F, as shown below, have line
> breaks in the CSV file.
>
> [image: Inline image 1]
>
> When Spark reads it, it comes out as below, as a completely new line.
>
> [image: Inline image 2]
>
> I want my PySpark 2.1.0 to read it while forcefully avoiding the line
> break after the date, which is not happening as I am using the
> com.databricks.csv reader. Nulls are being created after the date on
> line 2 for the rest of the columns, from G to the end.
>
> Can I please get some help with how to handle this?
>
> Thanks,
> Aakash.
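One more note on the underlying CSV issue: a line break can only survive inside a single field when the field is quoted; unquoted embedded newlines are indistinguishable from record separators, which is why the tail of the row shows up as a new line with nulls. A minimal illustration with Python's stdlib csv module (not Spark; whether the databricks CSV reader reassembles such records depends on its parser options, and the data here is invented):

```python
import csv
import io

# A quoted field may legally contain a newline; the reader keeps the
# whole thing as one record with two fields.
data = 'A,B\n1,"2017-09-03\nspilled text"\n'
rows = list(csv.reader(io.StringIO(data)))
print(rows)  # [['A', 'B'], ['1', '2017-09-03\nspilled text']]
```

If the source file does not quote those multi-line cells, no CSV reader can recover the original row boundaries, and cleaning the text before (or after) parsing is the practical fix.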