[ https://issues.apache.org/jira/browse/SPARK-9131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Reynold Xin closed SPARK-9131.
------------------------------
          Resolution: Fixed
       Fix Version/s: 1.5.0
    Target Version/s: 1.5.0  (was: 1.4.2, 1.5.0)

Going to close this since it's most likely fixed. [~lfag] [~luispeguerra] can you try it on branch-1.5? If it doesn't work, we should reopen this.

> Python UDFs change data values
> ------------------------------
>
>                 Key: SPARK-9131
>                 URL: https://issues.apache.org/jira/browse/SPARK-9131
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 1.4.0, 1.4.1
>         Environment: Pyspark 1.4 and 1.4.1
>            Reporter: Luis Guerra
>            Assignee: Davies Liu
>            Priority: Blocker
>             Fix For: 1.5.0
>
>         Attachments: testjson_jira9131.z01, testjson_jira9131.z02,
>                      testjson_jira9131.z03, testjson_jira9131.z04,
>                      testjson_jira9131.z05, testjson_jira9131.z06,
>                      testjson_jira9131.zip
>
>
> I am having some trouble when using a custom udf in dataframes with pyspark 1.4.
> I have rewritten the udf to simplify the problem and it gets even weirder. The udfs I am using do absolutely nothing: they just receive a value and output the same value in the same format.
> I show you my code below:
> {code}
> from pyspark.sql.functions import UserDefinedFunction
> from pyspark.sql.types import DateType
>
> c = a.join(b, a['ID'] == b['ID_new'], 'inner')
> c.filter(c['ID'] == '6000000002698917').show()
> # Identity UDFs: each returns its input unchanged, declared as DateType.
> udf_A = UserDefinedFunction(lambda x: x, DateType())
> udf_B = UserDefinedFunction(lambda x: x, DateType())
> udf_C = UserDefinedFunction(lambda x: x, DateType())
> # The report referenced 'vinc_muestra' here; it is assumed to be the joined dataframe c.
> d = c.select(c['ID'], c['t1'].alias('ta'),
>              udf_A(c['t2']).alias('tb'),
>              udf_B(c['t1']).alias('tc'),
>              udf_C(c['t2']).alias('td'))
> d.filter(d['ID'] == '6000000002698917').show()
> {code}
> I am showing here the results from the outputs:
> {code}
> +----------------+----------------+----------+----------+
> |              ID|          ID_new|        t1|        t2|
> +----------------+----------------+----------+----------+
> |6000000002698917|6000000002698917|2012-02-28|2014-02-28|
> |6000000002698917|6000000002698917|2012-02-20|2013-02-20|
> |6000000002698917|6000000002698917|2012-02-28|2014-02-28|
> |6000000002698917|6000000002698917|2012-02-20|2013-02-20|
> |6000000002698917|6000000002698917|2012-02-20|2013-02-20|
> |6000000002698917|6000000002698917|2012-02-28|2014-02-28|
> |6000000002698917|6000000002698917|2012-02-28|2014-02-28|
> |6000000002698917|6000000002698917|2012-02-20|2013-02-20|
> +----------------+----------------+----------+----------+
> +----------------+----------+----------+----------+----------+
> |              ID|        ta|        tb|        tc|        td|
> +----------------+----------+----------+----------+----------+
> |6000000002698917|2012-02-28|2007-03-05|2003-03-05|2014-02-28|
> |6000000002698917|2012-02-20|2007-02-15|2002-02-15|2013-02-20|
> |6000000002698917|2012-02-28|2007-03-10|2005-03-10|2014-02-28|
> |6000000002698917|2012-02-20|2007-03-05|2003-03-05|2013-02-20|
> |6000000002698917|2012-02-20|2013-08-02|2013-01-02|2013-02-20|
> |6000000002698917|2012-02-28|2007-02-15|2002-02-15|2014-02-28|
> |6000000002698917|2012-02-28|2007-02-15|2002-02-15|2014-02-28|
> |6000000002698917|2012-02-20|2014-01-02|2013-01-02|2013-02-20|
> +----------------+----------+----------+----------+----------+
> {code}
> The problem here is that the values in columns 'tb', 'tc' and 'td' of dataframe 'd' are completely different from the values of 't1' and 't2' in dataframe 'c', even though my udfs do nothing. It seems as if the values were somehow taken from other records (or just invented). Results differ between executions (apparently at random).
> Thanks in advance

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
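
For anyone retesting on branch-1.5, one way to check the symptom described above is to compare each identity-UDF output column against its source column after collecting the rows. The following is only a minimal sketch in plain Python, not part of the original report: the helper name find_udf_mismatches is hypothetical, and the sample rows combine the first two rows of both outputs above on the assumption that the two tables list rows in the same order.

```python
from datetime import date


def find_udf_mismatches(rows, column_pairs):
    """Return (row_index, source_col, udf_col, expected, actual) for every
    cell where an identity UDF's output differs from its source column."""
    mismatches = []
    for index, row in enumerate(rows):
        for source_col, udf_col in column_pairs:
            if row[source_col] != row[udf_col]:
                mismatches.append(
                    (index, source_col, udf_col, row[source_col], row[udf_col])
                )
    return mismatches


# Column pairing taken from the report: tb = udf_A(t2), tc = udf_B(t1), td = udf_C(t2).
column_pairs = [("t2", "tb"), ("t1", "tc"), ("t2", "td")]

# Assumption: row N of the first output corresponds to row N of the second.
rows = [
    {"t1": date(2012, 2, 28), "t2": date(2014, 2, 28),
     "tb": date(2007, 3, 5), "tc": date(2003, 3, 5), "td": date(2014, 2, 28)},
    {"t1": date(2012, 2, 20), "t2": date(2013, 2, 20),
     "tb": date(2007, 2, 15), "tc": date(2002, 2, 15), "td": date(2013, 2, 20)},
]

mismatches = find_udf_mismatches(rows, column_pairs)
for mismatch in mismatches:
    print(mismatch)
```

With a real dataframe, the input could be built by selecting the source and UDF columns side by side and passing [r.asDict() for r in df.collect()] as rows. An empty result would suggest that the identity UDFs now preserve their inputs.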