[ https://issues.apache.org/jira/browse/SPARK-46876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17812098#comment-17812098 ]
Jie Han edited comment on SPARK-46876 at 1/30/24 3:01 AM:
----------------------------------------------------------

{{The reason is that, before parsing the CSV lines, Spark calls `CSVExprUtils.filterCommentAndEmpty` to filter out `empty` lines, i.e. lines that contain only characters <= ' '. I doubt that this is necessary, because such lines may be actual data. I've learned that apache/commons-csv trims each column rather than the whole line before parsing, and that trimming is optional there.}}

> Data is silently lost in Tab separated CSV with empty (whitespace) rows
> -----------------------------------------------------------------------
>
> Key: SPARK-46876
> URL: https://issues.apache.org/jira/browse/SPARK-46876
> Project: Spark
> Issue Type: Bug
> Components: Input/Output
> Affects Versions: 3.4.1
> Reporter: Martin Rueckl
> Priority: Critical
>
> When reading a tab separated file that contains lines that only contain tabs
> (i.e. empty strings as values of the columns for that row), these rows
> will silently be skipped (as empty lines) and the resulting dataframe will
> have fewer rows than expected.
> This behavior is inconsistent with the behavior for e.g. semicolon separated
> files, where the resulting dataframe will have a row with only empty string
> values.
> A minimal reproducible example: a file containing this
> {code:java}
> a\tb\tc\r\n
> \t\t\r\n
> 1\t2\t3{code}
> will create a dataframe with one row (a=1,b=2,c=3)
> whereas this
> {code:java}
> a;b;c\r\n
> ;;\r\n
> 1;2;3{code}
> will read as two rows (first row contains empty strings)
> I used the following pyspark commands to read the dataframes
> {code:java}
> spark.read.option("header","true").option("sep","\t").csv("<tabseparated file>").collect()
> spark.read.option("header","true").option("sep",";").csv("<semicolon file>").collect()
> {code}
> I ran into this particularly on databricks (I assume they use the same
> reader), but [this stack overflow
> post|https://stackoverflow.com/questions/47823858/replacing-empty-lines-with-characters-when-reading-csv-using-spark#comment137288546_47823858]
> indicates that this is an old issue that may have been taken over from
> databricks when their csv reader was adopted in SPARK-12420
> I recommend at least adding a corresponding test case to the CSV reader.
>
> Why is this behaviour a problem:
> * It violates some of the core assumptions
> ** a properly configured roundtrip via csv write/read should result in the
> same set of rows
> ** changing the csv separator (when everything is properly escaped) should
> have no effect
> Potential resolutions:
> * When the configured delimiter consists of only whitespace
> ** deactivate the "skip empty line feature"
> ** or skip only lines that are completely empty (only a (carriage return)
> newline)
> * Change the skip empty line feature to only skip if the line is completely
> empty (only contains a newline)
> ** this may break some user code that relies on the current behaviour
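
For illustration, here is a minimal Python sketch of the filtering behaviour described in the comment above. It assumes that `CSVExprUtils.filterCommentAndEmpty` effectively drops any line that is empty after a Java-style `String.trim`, which strips leading and trailing characters <= ' '. The names `java_trim` and `filter_empty` are hypothetical helpers made up for this sketch, not Spark APIs, and comment-prefix handling is left out.

```python
def java_trim(s: str) -> str:
    """Mimic java.lang.String.trim: strip leading and trailing
    characters whose code point is <= U+0020 (this includes '\t')."""
    start, end = 0, len(s)
    while start < end and s[start] <= " ":
        start += 1
    while end > start and s[end - 1] <= " ":
        end -= 1
    return s[start:end]


def filter_empty(lines):
    """Hypothetical stand-in for the 'skip empty lines' step:
    keep only lines that are non-empty after trimming."""
    return [line for line in lines if java_trim(line)]


# The two files from the report, already split into lines.
tab_lines = ["a\tb\tc", "\t\t", "1\t2\t3"]
semicolon_lines = ["a;b;c", ";;", "1;2;3"]

print(len(filter_empty(tab_lines)))        # the "\t\t" line is dropped
print(len(filter_empty(semicolon_lines)))  # the ";;" line is kept
```

Under this assumption, the tab-only row is discarded before the parser ever sees it, because every character in it is <= ' ', while the `;;` row survives. That would explain why the result depends on the separator.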