> epoch2numUDF = udf(foo, FloatType())

Is it possible that the return value from foo is not a FloatType?
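
epoch2num() returns a numpy.float64 rather than a plain Python float, as far
as I can tell, and Spark's Pyrolite unpickler cannot reconstruct numpy scalars
on the JVM side, which would explain the "expected zero arguments for
construction of ClassDict (for numpy.dtype)" error. A possible workaround
(untested sketch; DoubleType is assumed here because Python floats are 64-bit)
is to cast the result to a plain Python float inside the UDF:

from matplotlib.dates import epoch2num
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

def foo(e):
    # cast the numpy.float64 from epoch2num() to a plain Python float
    return float(epoch2num(e))

epoch2numUDF = udf(foo, DoubleType())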

On Wed, Mar 9, 2016 at 3:09 PM, Andy Davidson <a...@santacruzintegration.com> wrote:

> I need to convert timestamps into a format I can use with matplotlib
> plot_date(). epoch2num() works fine if I use it in my driver; however, I get
> a numpy constructor error if I use it in a UDF.
>
> Any idea what the problem is?
>
> Thanks
>
> Andy
>
> P.S. I am using Python 3 and Spark 1.6
>
> from pyspark.sql.functions import udf
> from pyspark.sql.types import FloatType, DoubleType, DecimalType
>
>
> import pandas as pd
> import numpy as np
>
> from matplotlib.dates import epoch2num
>
> gdf1 = cdf1.selectExpr("count", "row_key", "created", 
> "unix_timestamp(created) as ms")
>
> gdf1.printSchema()
>
> gdf1.show(10, truncate=False)
>
> root
>  |-- count: long (nullable = true)
>  |-- row_key: string (nullable = true)
>  |-- created: timestamp (nullable = true)
>  |-- ms: long (nullable = true)
>
> +-----+---------------+---------------------+----------+
> |count|row_key        |created              |ms        |
> +-----+---------------+---------------------+----------+
> |1    |HillaryClinton |2016-03-09 11:44:15.0|1457552655|
> |2    |HillaryClinton |2016-03-09 11:44:30.0|1457552670|
> |1    |HillaryClinton |2016-03-09 11:44:45.0|1457552685|
> |2    |realDonaldTrump|2016-03-09 11:44:15.0|1457552655|
> |1    |realDonaldTrump|2016-03-09 11:44:30.0|1457552670|
> |1    |realDonaldTrump|2016-03-09 11:44:45.0|1457552685|
> |3    |realDonaldTrump|2016-03-09 11:45:00.0|1457552700|
> +-----+---------------+---------------------+----------+
>
>
>
> def foo(e):
>     return epoch2num(e)
>
> epoch2numUDF = udf(foo, FloatType())
> #epoch2numUDF = udf(lambda e: epoch2num(e), FloatType())
> #epoch2numUDF = udf(lambda e: e + 5000000.5, FloatType())
>
> gdf2 = gdf1.withColumn("date", epoch2numUDF(gdf1.ms))
> gdf2.printSchema()
> gdf2.show(truncate=False)
>
>
> Py4JJavaError: An error occurred while calling o925.showString.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 32.0 failed 1 times, most recent failure: Lost task 0.0 in stage 
> 32.0 (TID 91, localhost): net.razorvine.pickle.PickleException: expected zero 
> arguments for construction of ClassDict (for numpy.dtype)
>       at 
> net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
>       at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:707)
>       at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:175)
>       at net.razorvine.pickle.Unpickler.load(Unpickler.java:99)
>       at net.razorvine.pickle.Unpickler.loads(Unpickler.java:112)
>
>
> It works fine if I use pandas:
>
> pdf = gdf1.toPandas()
> pdf['date'] = epoch2num(pdf['ms'] )
>