Hi Ted,

In Python the data type is float64. I have tried using both the SQL FloatType and DoubleType, however I get the same error.
Strange

andy

From: Ted Yu <yuzhih...@gmail.com>
Date: Wednesday, March 9, 2016 at 3:28 PM
To: Andrew Davidson <a...@santacruzintegration.com>
Cc: "user @spark" <user@spark.apache.org>
Subject: Re: trouble with NUMPY constructor in UDF

> bq. epoch2numUDF = udf(foo, FloatType())
>
> Is it possible that the return value from foo is not FloatType?
>
> On Wed, Mar 9, 2016 at 3:09 PM, Andy Davidson <a...@santacruzintegration.com> wrote:
>> I need to convert timestamps into a format I can use with matplotlib
>> plot_date(). epoch2num() works fine if I use it in my driver, however I get
>> a numpy constructor error if I use it in a UDF.
>>
>> Any idea what the problem is?
>>
>> Thanks
>>
>> Andy
>>
>> P.S. I am using python3 and spark-1.6
>>
>> from pyspark.sql.functions import udf
>> from pyspark.sql.types import FloatType, DoubleType, DecimalType
>>
>> import pandas as pd
>> import numpy as np
>>
>> from matplotlib.dates import epoch2num
>>
>> gdf1 = cdf1.selectExpr("count", "row_key", "created",
>>                        "unix_timestamp(created) as ms")
>> gdf1.printSchema()
>> gdf1.show(10, truncate=False)
>>
>> root
>>  |-- count: long (nullable = true)
>>  |-- row_key: string (nullable = true)
>>  |-- created: timestamp (nullable = true)
>>  |-- ms: long (nullable = true)
>>
>> +-----+---------------+---------------------+----------+
>> |count|row_key        |created              |ms        |
>> +-----+---------------+---------------------+----------+
>> |1    |HillaryClinton |2016-03-09 11:44:15.0|1457552655|
>> |2    |HillaryClinton |2016-03-09 11:44:30.0|1457552670|
>> |1    |HillaryClinton |2016-03-09 11:44:45.0|1457552685|
>> |2    |realDonaldTrump|2016-03-09 11:44:15.0|1457552655|
>> |1    |realDonaldTrump|2016-03-09 11:44:30.0|1457552670|
>> |1    |realDonaldTrump|2016-03-09 11:44:45.0|1457552685|
>> |3    |realDonaldTrump|2016-03-09 11:45:00.0|1457552700|
>> +-----+---------------+---------------------+----------+
>>
>> def foo(e):
>>     return epoch2num(e)
>>
>> epoch2numUDF = udf(foo, FloatType())
>> # epoch2numUDF = udf(lambda e: epoch2num(e), FloatType())
>> # epoch2numUDF = udf(lambda e: e + 5000000.5, FloatType())
>>
>> gdf2 = gdf1.withColumn("date", epoch2numUDF(gdf1.ms))
>> gdf2.printSchema()
>> gdf2.show(truncate=False)
>>
>> Py4JJavaError: An error occurred while calling o925.showString.
>> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
>> in stage 32.0 failed 1 times, most recent failure: Lost task 0.0 in stage
>> 32.0 (TID 91, localhost): net.razorvine.pickle.PickleException: expected zero
>> arguments for construction of ClassDict (for numpy.dtype)
>>     at net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
>>     at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:707)
>>     at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:175)
>>     at net.razorvine.pickle.Unpickler.load(Unpickler.java:99)
>>     at net.razorvine.pickle.Unpickler.loads(Unpickler.java:112)
>>
>> Works fine if I use pandas:
>>
>> pdf = gdf1.toPandas()
>> pdf['date'] = epoch2num(pdf['ms'])
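The ClassDict error in the trace above comes from Pyrolite on the JVM side refusing to unpickle the numpy.float64 scalar that epoch2num() returns; Spark's FloatType/DoubleType expect a builtin Python float. A common workaround (a sketch, not confirmed in this thread; the 719163.0 offset is matplotlib's day number for 1970-01-01) is to cast the numpy result to float inside the UDF:

```python
import numpy as np

SECONDS_PER_DAY = 86400.0
EPOCH_OFFSET_DAYS = 719163.0  # matplotlib day number of 1970-01-01

def epoch2num_fixed(e):
    """epoch2num-style conversion: seconds since the Unix epoch -> matplotlib
    day number, returned as a builtin float so Pyrolite can unpickle it."""
    days = np.float64(e) / SECONDS_PER_DAY + EPOCH_OFFSET_DAYS  # numpy scalar
    return float(days)  # cast away numpy.float64 before it crosses to the JVM

# Hypothetical Spark wiring, matching the thread's UDF:
# epoch2numUDF = udf(epoch2num_fixed, DoubleType())
# gdf2 = gdf1.withColumn("date", epoch2numUDF(gdf1.ms))
```

The same cast applies to the lambda form in the quoted code: udf(lambda e: float(epoch2num(e)), DoubleType()).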