Kalle Jepsen created SPARK-7278:
-----------------------------------

             Summary: Inconsistent handling of dates in PySpark's Row object
                 Key: SPARK-7278
                 URL: https://issues.apache.org/jira/browse/SPARK-7278
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 1.3.1
            Reporter: Kalle Jepsen
Consider the following Python code:

{code:none}
import datetime

rdd = sc.parallelize([[0, datetime.date(2014, 11, 11)], [1, datetime.date(2015, 6, 4)]])
df = rdd.toDF(schema=['rid', 'date'])
row = df.first()
{code}

Accessing the {{date}} column via {{\_\_getitem\_\_}} returns a {{datetime.datetime}} instance:

{code:none}
>>> row[1]
datetime.datetime(2014, 11, 11, 0, 0)
{code}

while access via {{getattr}} returns a {{datetime.date}} instance:

{code:none}
>>> row.date
datetime.date(2014, 11, 11)
{code}

The problem seems to be that Java deserializes the {{datetime.date}} objects to {{datetime.datetime}}. This is taken care of [here|https://github.com/apache/spark/blob/master/python/pyspark/sql/_types.py#L1027] when using {{getattr}}, but is overlooked when the tuple is accessed directly by index. Is there an easy way to fix this?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
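For reference, the asymmetry described above can be reproduced in plain Python without a Spark cluster. The sketch below is hypothetical (a minimal stand-in, not PySpark's actual {{Row}} class): it applies a datetime-to-date conversion only on the attribute-access path, which is the shape of the behaviour reported, assuming the conversion step is the one linked above.

```python
import datetime

# Hypothetical minimal stand-in for PySpark's Row (a tuple subclass),
# reproducing the reported asymmetry: attribute access applies a
# datetime -> date conversion, while index access returns the raw value
# as deserialized from the JVM.
class FakeRow(tuple):
    def __new__(cls, fields, values):
        obj = tuple.__new__(cls, values)
        obj._fields = fields
        return obj

    def __getattr__(self, name):
        try:
            idx = self._fields.index(name)
        except ValueError:
            raise AttributeError(name)
        value = tuple.__getitem__(self, idx)
        # Conversion happens only on this path, mirroring the report:
        if isinstance(value, datetime.datetime):
            return value.date()
        return value

# The JVM side hands back a datetime.datetime for a date column:
row = FakeRow(['rid', 'date'], [0, datetime.datetime(2014, 11, 11, 0, 0)])
print(row[1])    # datetime.datetime(2014, 11, 11, 0, 0)
print(row.date)  # datetime.date(2014, 11, 11)
```

A symmetric fix would apply the same conversion in {{\_\_getitem\_\_}} (or convert once when the row is constructed), so both access paths return {{datetime.date}}.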