Thanks for reporting this, could you file a JIRA for it?

On Thu, Jul 16, 2015 at 8:22 AM, Luis Guerra <luispelay...@gmail.com> wrote:
> Hi all,
>
> I am having trouble using a custom udf in dataframes with pyspark 1.4.
>
> I have rewritten the udf to simplify the problem and it gets even weirder.
> The udfs I am using do absolutely nothing: they just receive a value and
> output the same value in the same format.
>
> My code is below:
>
> c = a.join(b, a['ID'] == b['ID_new'], 'inner')
>
> c.filter(c['ID'] == 'XX').show()
>
> udf_A = UserDefinedFunction(lambda x: x, DateType())
> udf_B = UserDefinedFunction(lambda x: x, DateType())
> udf_C = UserDefinedFunction(lambda x: x, DateType())
>
> d = c.select(c['ID'], c['t1'].alias('ta'),
>              udf_A(vinc_muestra['t2']).alias('tb'),
>              udf_B(vinc_muestra['t1']).alias('tc'),
>              udf_C(vinc_muestra['t2']).alias('td'))
>
> d.filter(d['ID'] == 'XX').show()
>
> These are the results of the two show() calls:
>
> +----------------+----------------+----------+----------+
> |              ID|          ID_new|        t1|        t2|
> +----------------+----------------+----------+----------+
> |6000000002698917|6000000002698917|2012-02-28|2014-02-28|
> |6000000002698917|6000000002698917|2012-02-20|2013-02-20|
> |6000000002698917|6000000002698917|2012-02-28|2014-02-28|
> |6000000002698917|6000000002698917|2012-02-20|2013-02-20|
> |6000000002698917|6000000002698917|2012-02-20|2013-02-20|
> |6000000002698917|6000000002698917|2012-02-28|2014-02-28|
> |6000000002698917|6000000002698917|2012-02-28|2014-02-28|
> |6000000002698917|6000000002698917|2012-02-20|2013-02-20|
> +----------------+----------------+----------+----------+
>
> +----------------+----------+--------+--------+--------+
> |              ID|        ta|      tb|      tc|      td|
> +----------------+----------+--------+--------+--------+
> |6000000002698917|2012-02-28|20070305|20030305|20140228|
> |6000000002698917|2012-02-20|20070215|20020215|20130220|
> |6000000002698917|2012-02-28|20070310|20050310|20140228|
> |6000000002698917|2012-02-20|20070305|20030305|20130220|
> |6000000002698917|2012-02-20|20130802|20130102|20130220|
> |6000000002698917|2012-02-28|20070215|20020215|20140228|
> |6000000002698917|2012-02-28|20070215|20020215|20140228|
> |6000000002698917|2012-02-20|20140102|20130102|20130220|
> +----------------+----------+--------+--------+--------+
>
> My problem is that the values in columns 'tb', 'tc' and 'td' of dataframe
> 'd' are completely different from the values of 't1' and 't2' in dataframe
> 'c', even though my udfs do nothing. It seems as if the values were somehow
> taken from other records (or just invented). Results differ between
> executions (apparently at random).
>
> Any insight on this?
>
> Thanks in advance
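
The mismatch described in the quoted message could be summarised mechanically before it goes into a JIRA. What follows is only a minimal sketch and not part of the original report: `find_mismatches` and `PAIRS` are hypothetical names, the helper works on plain Python dicts rather than on a DataFrame, and the pairing of columns assumes that `vinc_muestra` holds the same `t1` and `t2` values as `c`, which the quoted code does not confirm. The first sample row is the first row of the second table, with the displayed values read as dates; the second row is invented to show the clean case.

```python
from datetime import date


def find_mismatches(rows, pairs):
    """List every place where two columns that should be equal differ.

    rows  -- list of dicts, one per collected row
    pairs -- list of (expected_col, actual_col) column-name tuples
    """
    mismatches = []
    for index, row in enumerate(rows):
        for expected_col, actual_col in pairs:
            if row[expected_col] != row[actual_col]:
                mismatches.append(
                    (index, expected_col, actual_col,
                     row[expected_col], row[actual_col]))
    return mismatches


# Assumption: with identity UDFs, 'tc' should equal 'ta' (both come from t1)
# and 'tb' should equal 'td' (both come from t2).
PAIRS = [("ta", "tc"), ("td", "tb")]

rows = [
    # First row of the second table in the report, values read as dates.
    {"ID": "6000000002698917", "ta": date(2012, 2, 28),
     "tb": date(2007, 3, 5), "tc": date(2003, 3, 5), "td": date(2014, 2, 28)},
    # Hypothetical row showing what a consistent result would look like.
    {"ID": "6000000002698917", "ta": date(2012, 2, 20),
     "tb": date(2013, 2, 20), "tc": date(2012, 2, 20), "td": date(2013, 2, 20)},
]

for index, expected_col, actual_col, expected, actual in find_mismatches(rows, PAIRS):
    print("row %d: %s=%s but %s=%s"
          % (index, expected_col, expected, actual_col, actual))
```

With a live session, the input could presumably be built as `[r.asDict() for r in d.collect()]`. Running the check over several executions would also document the apparent randomness mentioned in the report.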