[ https://issues.apache.org/jira/browse/SPARK-21011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Steven Landes updated SPARK-21011:
----------------------------------
Description:

I used PySpark to read in some CSV files (actually separated by backspace, which might be relevant). The resulting dataframe.show() gives me good data: all my columns are there, everything's great.

    df = spark.read.option('delimiter', '\b').csv('<some S3 location>')
    df.show()  # all is good here

Now I want to filter this bad boy, but I want to use the RDD filter because it's just nicer to use.

    my_rdd = df.rdd
    my_rdd.take(5)  # all my columns are still here
    filtered_rdd = my_rdd.filter(<some filter criteria here>)
    filtered_rdd.take(5)

My filtered_rdd is missing a column. Specifically, _c2 has been mashed into _c1. Here's a relevant record (anonymized) from df.show():

    |3 |Text Field |12345|<some alphanumeric ID mess here>|150.00|UserName|2012-08-14 00:50:00|2015-02-24 01:23:45|2017-02-34 13:02:33|true|false|

...and the same record as returned by filtered_rdd.take():

    Row(_c0=u'3', _c1=u'"Text Field"\x08"12345"', _c2=u'|<some alphanumeric ID mess here>', _c3=u'150.00', _c4=u'UserName', _c5=u'2012-08-14 00:50:00', _c6=u'2015-02-24 01:23:45', _c7=u'2017-02-34 13:02:33', _c8=u'true', _c9=u'false', _c10=None)

Look at _c1 there: it has been mashed together with what was formerly _c2 (with an ASCII backspace, \x08, between them), and poor old _c10 is left without a value.
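Note that the corrupted _c1 is exactly the two original quoted fields still joined by the raw backspace delimiter. A small stdlib-only sketch (using Python's csv module rather than Spark, with made-up sample values) illustrates the expected split on the \x08 delimiter, and shows that re-parsing the merged value recovers the two lost columns:

    import csv
    import io

    # Illustrative backspace-delimited record (sample values, not the real data).
    line = '3\x08"Text Field"\x08"12345"\x08150.00'

    # With '\x08' as the delimiter, each quoted value is its own field --
    # the clean split that df.show() reflects.
    row = next(csv.reader(io.StringIO(line), delimiter='\x08'))
    print(row)  # ['3', 'Text Field', '12345', '150.00']

    # The corrupted _c1 from the report is two quoted fields still joined by
    # the raw \x08 delimiter; re-parsing it yields the two original columns.
    merged = '"Text Field"\x08"12345"'
    print(next(csv.reader(io.StringIO(merged), delimiter='\x08')))
    # ['Text Field', '12345']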
> RDD filter can combine/corrupt columns
> --------------------------------------
>
>                 Key: SPARK-21011
>                 URL: https://issues.apache.org/jira/browse/SPARK-21011
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.1.0
>            Reporter: Steven Landes
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org