Re: pyspark 1.4 udf change date values

2015-07-17 Thread Luis Guerra
Sure, I have created JIRA SPARK-9131 - UDF change data values
https://issues.apache.org/jira/browse/SPARK-9131
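
In the meantime, one thing I can try to narrow it down is the same identity
UDF over strings instead of dates, to see whether the problem is specific to
DateType. This is only a sketch, reusing the joined dataframe 'c' and the
column names from my previous message:

from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import StringType

# Same identity UDF as before, but declared over strings instead of dates
identity_str = UserDefinedFunction(lambda x: x, StringType())

d2 = c.select(c['ID'],
              identity_str(c['t1'].cast('string')).alias('t1_str'),
              identity_str(c['t2'].cast('string')).alias('t2_str'))

d2.filter(d2['ID'] == 'XX').show()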

On Thu, Jul 16, 2015 at 7:09 PM, Davies Liu dav...@databricks.com wrote:

 Thanks for reporting this, could you file a JIRA for it?



pyspark 1.4 udf change date values

2015-07-16 Thread Luis Guerra
Hi all,

I am having some trouble using a custom UDF on DataFrames with PySpark 1.4.

I have rewritten the UDF to simplify the problem, and it gets even weirder:
the UDFs I am using do absolutely nothing, they just take a value and
return the same value in the same format.

My code is shown below:

c = a.join(b, a['ID'] == b['ID_new'], 'inner')

c.filter(c['ID'] == 'XX').show()

from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import DateType

udf_A = UserDefinedFunction(lambda x: x, DateType())
udf_B = UserDefinedFunction(lambda x: x, DateType())
udf_C = UserDefinedFunction(lambda x: x, DateType())

d = c.select(c['ID'],
             c['t1'].alias('ta'),
             udf_A(vinc_muestra['t2']).alias('tb'),
             udf_B(vinc_muestra['t1']).alias('tc'),
             udf_C(vinc_muestra['t2']).alias('td'))

d.filter(d['ID'] == 'XX').show()

These are the outputs of the two show() calls:

+--------+--------+----------+----------+
|      ID|  ID_new|        t1|        t2|
+--------+--------+----------+----------+
|62698917|62698917|2012-02-28|2014-02-28|
|62698917|62698917|2012-02-20|2013-02-20|
|62698917|62698917|2012-02-28|2014-02-28|
|62698917|62698917|2012-02-20|2013-02-20|
|62698917|62698917|2012-02-20|2013-02-20|
|62698917|62698917|2012-02-28|2014-02-28|
|62698917|62698917|2012-02-28|2014-02-28|
|62698917|62698917|2012-02-20|2013-02-20|
+--------+--------+----------+----------+

+--------+----------+--------+--------+--------+
|      ID|        ta|      tb|      tc|      td|
+--------+----------+--------+--------+--------+
|62698917|2012-02-28|20070305|20030305|20140228|
|62698917|2012-02-20|20070215|20020215|20130220|
|62698917|2012-02-28|20070310|20050310|20140228|
|62698917|2012-02-20|20070305|20030305|20130220|
|62698917|2012-02-20|20130802|20130102|20130220|
|62698917|2012-02-28|20070215|20020215|20140228|
|62698917|2012-02-28|20070215|20020215|20140228|
|62698917|2012-02-20|20140102|20130102|20130220|
+--------+----------+--------+--------+--------+

My problem is that the values in columns 'tb', 'tc' and 'td' of dataframe
'd' are completely different from the values of 't1' and 't2' in dataframe
'c', even though my UDFs do nothing. It looks as if the values were somehow
taken from other rows (or simply invented). The results also change from one
execution to the next (apparently at random).
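
In case someone wants to try the same pattern without my data, here is a
minimal self-contained sketch. It is only a sketch: the toy rows are copied
from the tables above, and it assumes a running SQLContext named sqlContext
(my real dataframes come from elsewhere).

from datetime import date
from pyspark.sql import Row
from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import DateType

# Toy stand-ins for dataframes 'a' and 'b', with values from the tables above
a = sqlContext.createDataFrame([
    Row(ID='62698917', t1=date(2012, 2, 28), t2=date(2014, 2, 28)),
    Row(ID='62698917', t1=date(2012, 2, 20), t2=date(2013, 2, 20)),
])
b = sqlContext.createDataFrame([Row(ID_new='62698917')])

# Identity UDF that should return its input date unchanged
identity_date = UserDefinedFunction(lambda x: x, DateType())

c = a.join(b, a['ID'] == b['ID_new'], 'inner')
d = c.select(c['ID'],
             c['t1'].alias('ta'),
             identity_date(c['t2']).alias('tb'))

c.show()
d.show()  # 'tb' should match 't2' row by row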

Any insight on this?

Thanks in advance


Re: pyspark 1.4 udf change date values

2015-07-16 Thread Davies Liu
Thanks for reporting this, could you file a JIRA for it?


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org