GREGORY WERNER created SPARK-33488:
--------------------------------------

             Summary: Re SPARK-21820. Creating Spark dataframe with carriage return/line feed leaves CR in multiLine
                 Key: SPARK-33488
                 URL: https://issues.apache.org/jira/browse/SPARK-33488
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 2.4.5
         Environment: Apache 2.4.5
Databricks 6.6
            Reporter: GREGORY WERNER


In SPARK-21820 I see what appears to be the same issue reported, and resolved. Over the past few days I have battled a dataset that occasionally has \r\n line endings. In my code, I do:

{code:python}
# CSV options
infer_schema = "false"
first_row_is_header = "true"
multi_line = "true"
delimiter = ","

# The applied options are for CSV files. For other file types, these will be ignored.
df_train = spark.read.format(train_file_type) \
    .option("inferSchema", infer_schema) \
    .option("header", first_row_is_header) \
    .option("sep", delimiter) \
    .option("multiLine", multi_line) \
    .option("escape", '"') \
    .load(train_file_location)
{code}

So I am reading in a CSV file with multiLine set to true. However, in every case where the training file contains \r\n, the \r is left behind; this includes the header, which has a column name ending in \r. The only way I have been able to work around this is to manually edit the data file to remove the \r, but I do not want to do that on a case-by-case basis. Therefore, I am claiming this behavior is still present in 2.4.5 and is a bug.
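For reference, here is a rough sketch of a post-read cleanup that avoids hand-editing the file: it strips a stray trailing \r from the column names and from the string columns. It is only a sketch against the df_train from the snippet above (regexp_replace is one option among several), not a fix for the underlying parser behavior.

{code:python}
from pyspark.sql import functions as F

# Workaround sketch, not a fix: drop the trailing \r that the multiLine
# CSV read leaves on the last column name of the header row.
df_clean = df_train.toDF(*[c.rstrip("\r") for c in df_train.columns])

# Likewise strip a trailing \r from the values of every string column.
for field in df_clean.schema.fields:
    if field.dataType.simpleString() == "string":
        df_clean = df_clean.withColumn(
            field.name,
            F.regexp_replace(F.col(field.name), "\r$", "")
        )
{code}

Even with a cleanup like this, the parser leaving the \r behind in multiLine mode still looks like a bug to me.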