[jira] [Commented] (SPARK-32888) reading a parallized rdd with two identical records results in a zero count df when read via spark.read.csv

2020-09-16 Thread L. C. Hsieh (Jira)
[ https://issues.apache.org/jira/browse/SPARK-32888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17197073#comment-17197073 ] L. C. Hsieh commented on SPARK-32888: Yes, there is a difference. But it is due to reading file and

[jira] [Commented] (SPARK-32888) reading a parallized rdd with two identical records results in a zero count df when read via spark.read.csv

2020-09-16 Thread Punit Shah (Jira)
[ https://issues.apache.org/jira/browse/SPARK-32888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17197069#comment-17197069 ] Punit Shah commented on SPARK-32888: Thank you for your reply, [~viirya]. However, what I've noticed

[jira] [Commented] (SPARK-32888) reading a parallized rdd with two identical records results in a zero count df when read via spark.read.csv

2020-09-16 Thread L. C. Hsieh (Jira)
[ https://issues.apache.org/jira/browse/SPARK-32888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17197046#comment-17197046 ] L. C. Hsieh commented on SPARK-32888: Reading CSV files is simple. We can just remove the first line.

[jira] [Commented] (SPARK-32888) reading a parallized rdd with two identical records results in a zero count df when read via spark.read.csv

2020-09-16 Thread Punit Shah (Jira)
[ https://issues.apache.org/jira/browse/SPARK-32888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17196985#comment-17196985 ] Punit Shah commented on SPARK-32888: Why do we remove lines that are the same as the header? The

[jira] [Commented] (SPARK-32888) reading a parallized rdd with two identical records results in a zero count df when read via spark.read.csv

2020-09-15 Thread Apache Spark (Jira)
[ https://issues.apache.org/jira/browse/SPARK-32888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17196591#comment-17196591 ] Apache Spark commented on SPARK-32888: User 'viirya' has created a pull request for this issue:

[jira] [Commented] (SPARK-32888) reading a parallized rdd with two identical records results in a zero count df when read via spark.read.csv

2020-09-15 Thread L. C. Hsieh (Jira)
[ https://issues.apache.org/jira/browse/SPARK-32888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17196590#comment-17196590 ] L. C. Hsieh commented on SPARK-32888: This was documented in the CSV-related code, although it seems