Kalle Jepsen created SPARK-7278:
-----------------------------------

             Summary: Inconsistent handling of dates in PySpark's Row object
                 Key: SPARK-7278
                 URL: https://issues.apache.org/jira/browse/SPARK-7278
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 1.3.1
            Reporter: Kalle Jepsen


Consider the following Python code:

{code:none}
import datetime

rdd = sc.parallelize([[0, datetime.date(2014, 11, 11)],
                      [1, datetime.date(2015, 6, 4)]])
df = rdd.toDF(schema=['rid', 'date'])
row = df.first()
{code}

Accessing the {{date}} column via {{__getitem__}} returns a 
{{datetime.datetime}} instance

{code:none}
>>> row[1]
datetime.datetime(2014, 11, 11, 0, 0)
{code}

while access via {{getattr}} returns a {{datetime.date}} instance:

{code:none}
>>> row.date
datetime.date(2014, 11, 11)
{code}

The problem seems to be that Java deserializes the {{datetime.date}} 
objects to {{datetime.datetime}}. This is corrected 
[here|https://github.com/apache/spark/blob/master/python/pyspark/sql/_types.py#L1027]
 when using {{getattr}}, but is overlooked when accessing the tuple directly 
by index.

Is there an easy way to fix this?
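
In the meantime, a minimal workaround sketch (the {{as_date}} helper is hypothetical, not part of Spark) that normalizes a value obtained by index access so it matches what {{getattr}} returns:

{code:none}
import datetime

def as_date(value):
    # Java deserializes DateType columns to datetime.datetime, so
    # convert back to datetime.date; pass other values through unchanged.
    if isinstance(value, datetime.datetime):
        return value.date()
    return value

print(as_date(datetime.datetime(2014, 11, 11, 0, 0)))  # 2014-11-11
{code}

With this, {{as_date(row[1])}} agrees with {{row.date}}; the real fix would presumably apply the same datatype conversion on index access that the {{getattr}} path already performs.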



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
