Re: pyspark 1.4 udf change date values
Sure, I have created JIRA SPARK-9131 - UDF change data values
https://issues.apache.org/jira/browse/SPARK-9131

On Thu, Jul 16, 2015 at 7:09 PM, Davies Liu dav...@databricks.com wrote:
> Thanks for reporting this, could you file a JIRA for it?
>
> On Thu, Jul 16, 2015 at 8:22 AM, Luis Guerra luispelay...@gmail.com wrote:
>> Hi all, I am having some troubles when using a custom udf in dataframes with pyspark 1.4. [...]
pyspark 1.4 udf change date values
Hi all,

I am having trouble using a custom UDF on DataFrames with PySpark 1.4. I have reduced the UDF to simplify the problem, and the behavior gets even weirder. The UDFs below do absolutely nothing: they receive a value and return the same value in the same format.

My code:

    from pyspark.sql.functions import UserDefinedFunction
    from pyspark.sql.types import DateType

    # a and b are existing DataFrames, joined on their ID columns
    c = a.join(b, a['ID'] == b['ID_new'], 'inner')
    c.filter(c['ID'] == 'XX').show()

    # Identity UDFs: each returns its input unchanged, declared as DateType
    udf_A = UserDefinedFunction(lambda x: x, DateType())
    udf_B = UserDefinedFunction(lambda x: x, DateType())
    udf_C = UserDefinedFunction(lambda x: x, DateType())

    d = c.select(c['ID'],
                 c['t1'].alias('ta'),
                 udf_A(c['t2']).alias('tb'),
                 udf_B(c['t1']).alias('tc'),
                 udf_C(c['t2']).alias('td'))
    d.filter(d['ID'] == 'XX').show()

Output of the first show() (dataframe 'c'):

    +--------+--------+----------+----------+
    |      ID|  ID_new|        t1|        t2|
    +--------+--------+----------+----------+
    |62698917|62698917|2012-02-28|2014-02-28|
    |62698917|62698917|2012-02-20|2013-02-20|
    |62698917|62698917|2012-02-28|2014-02-28|
    |62698917|62698917|2012-02-20|2013-02-20|
    |62698917|62698917|2012-02-20|2013-02-20|
    |62698917|62698917|2012-02-28|2014-02-28|
    |62698917|62698917|2012-02-28|2014-02-28|
    |62698917|62698917|2012-02-20|2013-02-20|
    +--------+--------+----------+----------+

Output of the second show() (dataframe 'd'):

    +--------+----------+--------+--------+--------+
    |      ID|        ta|      tb|      tc|      td|
    +--------+----------+--------+--------+--------+
    |62698917|2012-02-28|20070305|20030305|20140228|
    |62698917|2012-02-20|20070215|20020215|20130220|
    |62698917|2012-02-28|20070310|20050310|20140228|
    |62698917|2012-02-20|20070305|20030305|20130220|
    |62698917|2012-02-20|20130802|20130102|20130220|
    |62698917|2012-02-28|20070215|20020215|20140228|
    |62698917|2012-02-28|20070215|20020215|20140228|
    |62698917|2012-02-20|20140102|20130102|20130220|
    +--------+----------+--------+--------+--------+

My problem is that the values in columns 'tb', 'tc' and 'td' of dataframe 'd' are completely different from the values of 't1' and 't2' in dataframe 'c', even though my UDFs do nothing. It seems as if the values were somehow taken from other records (or simply invented). The results also differ between executions, apparently at random.

Any insight on this?

Thanks in advance
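
As a triage aid, the displayed values can be compared outside Spark. The following is only a minimal sketch in plain Python, not part of the original report. It assumes that the eight-digit outputs are dates printed as yyyymmdd and that both tables list their rows in the same order (the 'ta' column matching 't1' row by row suggests this, but it is an assumption). The helper names parse_iso, parse_compact and mismatches are hypothetical and are not PySpark functions.

```python
from datetime import datetime

# Values copied from the two tables in the post, in displayed row order.
t1 = ["2012-02-28", "2012-02-20", "2012-02-28", "2012-02-20",
      "2012-02-20", "2012-02-28", "2012-02-28", "2012-02-20"]
t2 = ["2014-02-28", "2013-02-20", "2014-02-28", "2013-02-20",
      "2013-02-20", "2014-02-28", "2014-02-28", "2013-02-20"]
tb = ["20070305", "20070215", "20070310", "20070305",
      "20130802", "20070215", "20070215", "20140102"]  # udf_A(t2)
tc = ["20030305", "20020215", "20050310", "20030305",
      "20130102", "20020215", "20020215", "20130102"]  # udf_B(t1)
td = ["20140228", "20130220", "20140228", "20130220",
      "20130220", "20140228", "20140228", "20130220"]  # udf_C(t2)


def parse_iso(text):
    """Parse a date shown as yyyy-mm-dd (the input columns)."""
    return datetime.strptime(text, "%Y-%m-%d").date()


def parse_compact(text):
    """Parse a date shown as yyyymmdd (assumed format of the UDF outputs)."""
    return datetime.strptime(text, "%Y%m%d").date()


def mismatches(expected, actual):
    """Return the row indices where the UDF output differs from its input."""
    return [i for i, (e, a) in enumerate(zip(expected, actual))
            if parse_iso(e) != parse_compact(a)]


tb_bad = mismatches(t2, tb)
tc_bad = mismatches(t1, tc)
td_bad = mismatches(t2, td)

print("tb rows differing from t2:", tb_bad)
print("tc rows differing from t1:", tc_bad)
print("td rows differing from t2:", td_bad)
```

Under those assumptions, every row of 'tb' and 'tc' differs from its input on the posted data, while 'td' agrees with 't2' once the formatting difference is ignored. That may help narrow down which of the UDF columns are affected.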
Re: pyspark 1.4 udf change date values
Thanks for reporting this, could you file a JIRA for it?

On Thu, Jul 16, 2015 at 8:22 AM, Luis Guerra luispelay...@gmail.com wrote:
> Hi all, I am having some troubles when using a custom udf in dataframes with pyspark 1.4. [...]