GREGORY WERNER created SPARK-33488:
--------------------------------------

             Summary: Re SPARK-21820. Creating Spark DataFrame with carriage return/line feed leaves \r behind in multiLine mode
                 Key: SPARK-33488
                 URL: https://issues.apache.org/jira/browse/SPARK-33488
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 2.4.5
         Environment: Apache Spark 2.4.5
Databricks 6.6
            Reporter: GREGORY WERNER


In SPARK-21820 I see what seems to be the same issue reported, but resolved.  Over 
the past few days I have battled a dataset whose records occasionally end in \r\n.

In my code, I do:
{code:python}
# CSV options
infer_schema = "false"
first_row_is_header = "true"
multi_line = "true"
delimiter = ","

# The applied options are for CSV files. For other file types, these will be ignored.
# (train_file_type and train_file_location are defined elsewhere.)
df_train = spark.read.format(train_file_type) \
  .option("inferSchema", infer_schema) \
  .option("header", first_row_is_header) \
  .option("sep", delimiter) \
  .option("multiLine", multi_line) \
  .option("escape", '"') \
  .load(train_file_location)
{code}
So I am reading in a CSV file with multiLine set to true.  However, wherever the 
training file contains \r\n, the \r is left behind.  This includes the header, 
which has a column name ending in \r.  The only way I have been able to work 
around this is to manually edit the data file to remove the \r, but I do not want 
to do that on a case-by-case basis.

Therefore, I am claiming this behavior is still present in Spark 2.4.5 and is a bug.
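For reference, a minimal sketch of the kind of input that shows the behavior, assuming a local Spark session so the driver-local file is visible (the path and contents are illustrative, not from the original dataset):

{code:python}
# Write a tiny CSV with Windows-style \r\n line terminators (illustrative).
with open("/tmp/crlf_sample.csv", "wb") as f:
    f.write(b'id,text\r\n1,"line one\nline two"\r\n2,plain\r\n')

df = (spark.read.format("csv")
      .option("header", "true")
      .option("multiLine", "true")
      .option("escape", '"')
      .load("/tmp/crlf_sample.csv"))

# As described above, with multiLine=true the trailing \r is left behind,
# so the last column name and the last value of each record end in \r.
print(df.columns)
df.show(truncate=False)
{code}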

 



