Re: pyspark 1.4 udf change date values

Davies Liu Thu, 16 Jul 2015 10:10:00 -0700

Thanks for reporting this. Could you file a JIRA for it?

On Thu, Jul 16, 2015 at 8:22 AM, Luis Guerra <luispelay...@gmail.com> wrote:
> Hi all,
>
> I am having some trouble using a custom UDF on DataFrames with PySpark
> 1.4.
>
> I have rewritten the UDF to simplify the problem, and it gets even weirder.
> The UDFs I am using do absolutely nothing: they just receive a value and
> return the same value in the same format.
>
> My code is below:
>
> from pyspark.sql.functions import UserDefinedFunction
> from pyspark.sql.types import DateType
>
> # a and b are existing DataFrames
> c = a.join(b, a['ID'] == b['ID_new'], 'inner')
>
> c.filter(c['ID'] == 'XX').show()
>
> # Identity UDFs: each returns its input unchanged, declared as DateType
> udf_A = UserDefinedFunction(lambda x: x, DateType())
> udf_B = UserDefinedFunction(lambda x: x, DateType())
> udf_C = UserDefinedFunction(lambda x: x, DateType())
>
> d = c.select(c['ID'], c['t1'].alias('ta'),
>              udf_A(c['t2']).alias('tb'),
>              udf_B(c['t1']).alias('tc'),
>              udf_C(c['t2']).alias('td'))
>
> d.filter(d['ID'] == 'XX').show()
>
> Here is the output of the two show() calls:
>
> +----------------+----------------+----------+----------+
> |              ID|          ID_new|        t1|        t2|
> +----------------+----------------+----------+----------+
> |6000000002698917|6000000002698917|2012-02-28|2014-02-28|
> |6000000002698917|6000000002698917|2012-02-20|2013-02-20|
> |6000000002698917|6000000002698917|2012-02-28|2014-02-28|
> |6000000002698917|6000000002698917|2012-02-20|2013-02-20|
> |6000000002698917|6000000002698917|2012-02-20|2013-02-20|
> |6000000002698917|6000000002698917|2012-02-28|2014-02-28|
> |6000000002698917|6000000002698917|2012-02-28|2014-02-28|
> |6000000002698917|6000000002698917|2012-02-20|2013-02-20|
> +----------------+----------------+----------+----------+
>
> +----------------+---------------+---------------+------------+------------+
> |              ID|             ta|             tb|          tc|          td|
> +----------------+---------------+---------------+------------+------------+
> |6000000002698917|     2012-02-28|       20070305|    20030305|    20140228|
> |6000000002698917|     2012-02-20|       20070215|    20020215|    20130220|
> |6000000002698917|     2012-02-28|       20070310|    20050310|    20140228|
> |6000000002698917|     2012-02-20|       20070305|    20030305|    20130220|
> |6000000002698917|     2012-02-20|       20130802|    20130102|    20130220|
> |6000000002698917|     2012-02-28|       20070215|    20020215|    20140228|
> |6000000002698917|     2012-02-28|       20070215|    20020215|    20140228|
> |6000000002698917|     2012-02-20|       20140102|    20130102|    20130220|
> +----------------+---------------+---------------+------------+------------+
>
> My problem is that the values in columns 'tb', 'tc' and 'td' of dataframe
> 'd' are completely different from the values of 't1' and 't2' in dataframe
> 'c', even though my UDFs do nothing. It seems as if the values were somehow
> taken from other records (or just invented). The results differ between
> executions (apparently at random).
>
> Any insight on this?
>
> Thanks in advance
>
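
For anyone triaging this, the posted values can be compared with a small standard-library sketch. It rests on two assumptions that the thread does not confirm: that both show() calls listed the rows in the same order, and that the eight-digit values are dates rendered as yyyymmdd. The names ROWS, parse_iso, parse_compact and count_mismatches are made up for this illustration.

```python
from datetime import datetime

# Rows copied from the two tables in the post, paired by position
# (assumption: both show() calls returned rows in the same order).
# Each tuple is (t1, t2, tb, tc, td).
ROWS = [
    ("2012-02-28", "2014-02-28", "20070305", "20030305", "20140228"),
    ("2012-02-20", "2013-02-20", "20070215", "20020215", "20130220"),
    ("2012-02-28", "2014-02-28", "20070310", "20050310", "20140228"),
    ("2012-02-20", "2013-02-20", "20070305", "20030305", "20130220"),
    ("2012-02-20", "2013-02-20", "20130802", "20130102", "20130220"),
    ("2012-02-28", "2014-02-28", "20070215", "20020215", "20140228"),
    ("2012-02-28", "2014-02-28", "20070215", "20020215", "20140228"),
    ("2012-02-20", "2013-02-20", "20140102", "20130102", "20130220"),
]


def parse_iso(text):
    """Parse a yyyy-mm-dd string, as printed for t1 and t2."""
    return datetime.strptime(text, "%Y-%m-%d").date()


def parse_compact(text):
    """Parse a yyyymmdd string (assumed format of the UDF output columns)."""
    return datetime.strptime(text, "%Y%m%d").date()


def count_mismatches(rows):
    """Count, per UDF column, the rows where the output differs from its input.

    Per the posted select(): tb = udf_A(t2), tc = udf_B(t1), td = udf_C(t2).
    """
    counts = {"tb": 0, "tc": 0, "td": 0}
    for t1, t2, tb, tc, td in rows:
        expected = {"tb": parse_iso(t2), "tc": parse_iso(t1), "td": parse_iso(t2)}
        actual = {
            "tb": parse_compact(tb),
            "tc": parse_compact(tc),
            "td": parse_compact(td),
        }
        for name in counts:
            if actual[name] != expected[name]:
                counts[name] += 1
    return counts


if __name__ == "__main__":
    print(count_mismatches(ROWS))
```

Under those assumptions, 'td' agrees with 't2' on all eight rows and differs only in how it is printed, while 'tb' and 'tc' never match their inputs. That detail may be worth including in the JIRA report.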

