Re: how to find the nearest holiday

2017-04-25 Thread Wen Pei Yu
TypeError: unorderable types: str() >= datetime.date()
 
You should convert the strings to date objects before comparing.
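For example, a minimal sketch of that fix for the code quoted below (keeping the holidays in 'YYYY-MM-DD' format, parsing them into datetime.date up front, and converting back to ISO strings so the result still matches the StringType fields of the UDF's return type):

import datetime

# Broadcast the holidays as datetime.date objects instead of strings.
holidays = ['2017-09-01', '2017-10-01']
index = spark.sparkContext.broadcast(
    sorted(datetime.datetime.strptime(d, '%Y-%m-%d').date() for d in holidays))

def nearest_holiday(date):
    last_holiday = index.value[0]
    for next_holiday in index.value:
        if next_holiday >= date:   # date-to-date comparison, no TypeError
            break
        last_holiday = next_holiday
    if last_holiday > date:
        last_holiday = None
    if next_holiday < date:
        next_holiday = None
    # convert back to ISO strings to match the StringType fields in return_type
    return (last_holiday.isoformat() if last_holiday else None,
            next_holiday.isoformat() if next_holiday else None)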
 
Yu Wenpei.
 
- Original message -
From: Zeming Yu
To: user
Cc:
Subject: how to find the nearest holiday
Date: Tue, Apr 25, 2017 3:39 PM
I have a column of dates (date type) and am trying to find the nearest holiday for each date. Does anyone have any idea what went wrong below?
 
 
 
start_date_test = flight3.select("start_date").distinct()
start_date_test.show()
 
holidays = ['2017-09-01', '2017-10-01']
 
+----------+
|start_date|
+----------+
|2017-08-11|
|2017-09-11|
|2017-09-28|
|2017-06-29|
|2017-09-29|
|2017-07-31|
|2017-08-14|
|2017-08-18|
|2017-04-09|
|2017-09-21|
|2017-08-10|
|2017-06-30|
|2017-08-19|
|2017-07-06|
|2017-06-28|
|2017-09-14|
|2017-08-08|
|2017-08-22|
|2017-07-03|
|2017-07-30|
+----------+
only showing top 20 rows
 
 
 
index = spark.sparkContext.broadcast(sorted(holidays))
 
def nearest_holiday(date):
    last_holiday = index.value[0]
    for next_holiday in index.value:
        if next_holiday >= date:
            break
        last_holiday = next_holiday
    if last_holiday > date:
        last_holiday = None
    if next_holiday < date:
        next_holiday = None
    return (last_holiday, next_holiday)
 
 
from pyspark.sql.types import *
return_type = StructType([StructField('last_holiday', StringType()), StructField('next_holiday', StringType())])
 
from pyspark.sql.functions import udf
nearest_holiday_udf = udf(nearest_holiday, return_type)
 
start_date_test.withColumn('holiday', nearest_holiday_udf('start_date')).show(5, False)
 
 
Here's the error I got:
 
---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
 in ()
     24 nearest_holiday_udf = udf(nearest_holiday, return_type)
     25
---> 26 start_date_test.withColumn('holiday', nearest_holiday_udf('start_date')).show(5, False)

C:\spark-2.1.0-bin-hadoop2.7\spark-2.1.0-bin-hadoop2.7\python\pyspark\sql\dataframe.py in show(self, n, truncate)
    318             print(self._jdf.showString(n, 20))
    319         else:
--> 320             print(self._jdf.showString(n, int(truncate)))
    321
    322     def __repr__(self):

C:\spark-2.1.0-bin-hadoop2.7\spark-2.1.0-bin-hadoop2.7\python\lib\py4j-0.10.4-src.zip\py4j\java_gateway.py in __call__(self, *args)
   1131         answer = self.gateway_client.send_command(command)
   1132         return_value = get_return_value(
-> 1133             answer, self.gateway_client, self.target_id, self.name)
   1134
   1135         for temp_arg in temp_args:

C:\spark-2.1.0-bin-hadoop2.7\spark-2.1.0-bin-hadoop2.7\python\pyspark\sql\utils.py in deco(*a, **kw)
     61     def deco(*a, **kw):
     62         try:
---> 63             return f(*a, **kw)
     64         except py4j.protocol.Py4JJavaError as e:
     65             s = e.java_exception.toString()

C:\spark-2.1.0-bin-hadoop2.7\spark-2.1.0-bin-hadoop2.7\python\lib\py4j-0.10.4-src.zip\py4j\protocol.py in get_return_value(answer, gateway_client, target_id, name)
    317                 raise Py4JJavaError(
    318                     "An error occurred while calling {0}{1}{2}.\n".
--> 319                     format(target_id, ".", name), value)
    320             else:
    321                 raise Py4JError(

Py4JJavaError: An error occurred while calling o566.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 98.0 failed 1 times,
most recent failure: Lost task 0.0 in stage 98.0 (TID 521, localhost, executor driver):
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "C:\spark-2.1.0-bin-hadoop2.7\spark-2.1.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\worker.py", line 174, in main
  File "C:\spark-2.1.0-bin-hadoop2.7\spark-2.1.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\worker.py", line 169, in process
  File "C:\spark-2.1.0-bin-hadoop2.7\spark-2.1.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\serializers.py", line 220, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "C:\spark-2.1.0-bin-hadoop2.7\spark-2.1.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\serializers.py", line 138, in dump_stream
    for obj in iterator:
  File "C:\spark-2.1.0-bin-hadoop2.7\spark-2.1.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\serializers.py", line 209, in _batched
    for item in iterator:
  File "C:\spark-2.1.0-bin-hadoop2.7\spark-2.1.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\worker.py", line 92, in
  File "C:\spark-2.1.0-bin-hadoop2.7\spark-2.1.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\worker.py", line 68, in
  File "", line 10, in nearest_holiday
TypeError: unorderable types: str() >= datetime.date()
    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
    at org.apache.spark.api.python.PythonRunner$$anon$1.(PythonRDD.scala:234)
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
    at

Re: Aggregated column name

2017-03-23 Thread Wen Pei Yu

Thanks, Kevin.

This works for aggregating one or two columns.
But it does not work for this:

val expr = (Map("forCount" -> "count") ++ features.map((_ -> "mean")))
val averageDF = originalDF
  .withColumn("forCount", lit(0))
  .groupBy(col("..."))
  .agg(expr)
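For what it's worth, a PySpark sketch of one workaround (assuming, as in the Scala snippet above, that features is the list of column names to average; the alias names are just illustrative) is to build the aliased aggregation expressions explicitly and unpack them into agg():

from pyspark.sql.functions import col, count, lit, mean

# One count column plus an aliased mean for every feature column.
exprs = [count(lit(1)).alias("forCount")] + \
        [mean(c).alias(c + "_mean") for c in features]

averageDF = originalDF.groupBy(col("...")).agg(*exprs)

With explicit aliases the resulting column names no longer depend on how a particular Spark version renders count(...) or avg(...).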

Yu Wenpei.



From:   Kevin Mellott <kevin.r.mell...@gmail.com>
To: Wen Pei Yu <yuw...@cn.ibm.com>
Cc: user <user@spark.apache.org>
Date:   03/24/2017 09:48 AM
Subject:Re: Aggregated column name



I'm not sure of the answer to your question; however, when performing
aggregates I find it useful to specify an alias for each column. That will
give you explicit control over the name of the resulting column.

In your example, that would look something like:

df.groupby(col("...")).agg(count("number").alias("ColumnNameCount"))

Hope that helps!
Kevin

On Thu, Mar 23, 2017 at 2:41 AM, Wen Pei Yu <yuw...@cn.ibm.com> wrote:
  Hi All

  I found that some Spark versions (Spark 1.4) return the aggregated column
  name in upper case, and some return it in lower case.
  For the code below,
  df.groupby(col("...")).agg(count("number"))
  may return

  COUNT(number)  -- spark 1.4
  count(number)  -- spark 1.6

  Does anyone know if there is a configuration parameter for this, or which
  PR changed this?

  Thank you very much.
  Yu Wenpei.





Aggregated column name

2017-03-23 Thread Wen Pei Yu
Hi All
 
I found that some Spark versions (Spark 1.4) return the aggregated column name in upper case, and some return it in lower case.
For the code below,
df.groupby(col("...")).agg(count("number")) 
may return
 
COUNT(number)  -- spark 1.4
count(number)  -- spark 1.6
 
Does anyone know if there is a configuration parameter for this, or which PR changed this?
 
Thank you very much.
Yu Wenpei.


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Apply ML to grouped dataframe

2016-08-23 Thread Wen Pei Yu

Thank you Ayan.

For example, I have the dataframe below. Consider the column "group" as the
key that splits this dataframe into three parts; I then want to run k-means on
each part, to get each group's k-means result.

+-----------+-----+------------+
|     userID|group|    features|
+-----------+-----+------------+
|12462563356|    1|  [5.0,43.0]|
|12462563701|    2|   [1.0,8.0]|
|12462563701|    1|  [2.0,12.0]|
|12462564356|    1|   [1.0,1.0]|
|12462565487|    3|   [2.0,3.0]|
|12462565698|    2|   [1.0,1.0]|
|12462565698|    1|   [1.0,1.0]|
|12462566081|    2|   [1.0,2.0]|
|12462566081|    1|  [1.0,15.0]|
|12462566225|    2|   [1.0,1.0]|
|12462566225|    1|  [9.0,85.0]|
|12462566526|    2|   [1.0,1.0]|
|12462566526|    1|  [3.0,79.0]|
|12462567006|    2| [11.0,15.0]|
|12462567006|    1| [10.0,15.0]|
|12462567006|    3| [10.0,15.0]|
|12462586595|    2|  [2.0,42.0]|
|12462586595|    3|  [2.0,16.0]|
|12462589343|    3|   [1.0,1.0]|
+-----------+-----+------------+
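For reference, a rough PySpark sketch of the filter-per-group approach suggested earlier in the thread (assuming this dataframe is df and that the "features" column is already a Vector; it still loops over the groups on the driver, which is the part I would like to avoid):

from pyspark.ml.clustering import KMeans
from pyspark.sql.functions import col

# One k-means fit per distinct value of the "group" column.
groups = [r["group"] for r in df.select("group").distinct().collect()]

models = {}
for g in groups:
    subset = df.filter(col("group") == g)
    models[g] = KMeans(k=2, featuresCol="features", seed=1).fit(subset)

# e.g. inspect the cluster centers per group
for g, model in models.items():
    print(g, model.clusterCenters())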



From:   ayan guha <guha.a...@gmail.com>
To: Wen Pei Yu/China/IBM@IBMCN
Cc: user <user@spark.apache.org>, Nirmal Fernando <nir...@wso2.com>
Date:   08/23/2016 05:13 PM
Subject:Re: Apply ML to grouped dataframe



I would suggest you construct a toy problem and post it for a solution. At
the moment it's a little unclear what your intentions are.


Generally speaking, a group-by on a data frame creates another data frame,
not multiple ones.


On 23 Aug 2016 16:35, "Wen Pei Yu" <yuw...@cn.ibm.com> wrote:
  Hi Nirmal

  Filter works fine if I want to handle one of the grouped dataframes. But I
  have multiple grouped dataframes, and I want to apply an ML algorithm to all
  of them in one job, not in a for loop.

  Wenpei.


  From: Nirmal Fernando <nir...@wso2.com>
  To: Wen Pei Yu/China/IBM@IBMCN
  Cc: User <user@spark.apache.org>
  Date: 08/23/2016 01:55 PM
  Subject: Re: Apply ML to grouped dataframe





  On Tue, Aug 23, 2016 at 10:56 AM, Wen Pei Yu <yuw...@cn.ibm.com> wrote:
We can group a dataframe by one column like

df.groupBy(df.col("gender"))


  On top of this DF, use a filter that would enable you to extract the
  grouped DF as separate DFs. Then you can apply ML on top of each DF.

  eg: xyzDF.filter(col("x").equalTo(x))

It is like splitting a dataframe into multiple dataframes. Currently, we
can only apply simple SQL functions such as agg, max, etc. to this
GroupedData.

What we want is to apply one ML algorithm to each group.

Regards.


From: Nirmal Fernando <nir...@wso2.com>
To: Wen Pei Yu/China/IBM@IBMCN
Cc: User <user@spark.apache.org>
Date: 08/23/2016 01:14 PM



Subject: Re: Apply ML to grouped dataframe



Hi Wen,

AFAIK Spark MLlib implements its machine learning algorithms on top
of Spark dataframe API. What did you mean by a grouped dataframe?

On Tue, Aug 23, 2016 at 10:42 AM, Wen Pei Yu <yuw...@cn.ibm.com>
wrote:
Hi Nirmal

I didn't get your point.
Can you tell me more about how to use MLlib on a grouped
dataframe?

Regards.
Wenpei.



From: Nirmal Fernando <nir...@wso2.com>
To: Wen Pei Yu/China/IBM@IBMCN
Cc: User <user@spark.apache.org>
Date: 08/23/2016 10:26 AM
Subject: Re: Apply ML to grouped dataframe




You can use Spark MLlib

http://spark.apache.org/docs/latest/ml-guide.html#announcement-dataframe-based-api-is-primary-api


On Tue, Aug 23, 2016 at 7:34 AM, Wen Pei Yu <
yuw...@cn.ibm.com> wrote:
Hi

We have a dataframe, then want
   

Re: Apply ML to grouped dataframe

2016-08-23 Thread Wen Pei Yu

Hi Nirmal

Filter works fine if I want to handle one of the grouped dataframes. But I
have multiple grouped dataframes, and I want to apply an ML algorithm to all
of them in one job, not in a for loop.

Wenpei.



From:   Nirmal Fernando <nir...@wso2.com>
To:     Wen Pei Yu/China/IBM@IBMCN
Cc: User <user@spark.apache.org>
Date:   08/23/2016 01:55 PM
Subject:Re: Apply ML to grouped dataframe





On Tue, Aug 23, 2016 at 10:56 AM, Wen Pei Yu <yuw...@cn.ibm.com> wrote:
  We can group a dataframe by one column like

  df.groupBy(df.col("gender"))



On top of this DF, use a filter that would enable you to extract the
grouped DF as separate DFs. Then you can apply ML on top of each DF.

eg: xyzDF.filter(col("x").equalTo(x))

  It is like splitting a dataframe into multiple dataframes. Currently, we can
  only apply simple SQL functions such as agg, max, etc. to this GroupedData.

  What we want is to apply one ML algorithm to each group.

  Regards.


  From: Nirmal Fernando <nir...@wso2.com>
  To: Wen Pei Yu/China/IBM@IBMCN
  Cc: User <user@spark.apache.org>
  Date: 08/23/2016 01:14 PM



  Subject: Re: Apply ML to grouped dataframe



  Hi Wen,

  AFAIK Spark MLlib implements its machine learning algorithms on top of
  Spark dataframe API. What did you mean by a grouped dataframe?

  On Tue, Aug 23, 2016 at 10:42 AM, Wen Pei Yu <yuw...@cn.ibm.com> wrote:
Hi Nirmal

I didn't get your point.
Can you tell me more about how to use MLlib on a grouped dataframe?

Regards.
Wenpei.



    From: Nirmal Fernando <nir...@wso2.com>
To: Wen Pei Yu/China/IBM@IBMCN
Cc: User <user@spark.apache.org>
Date: 08/23/2016 10:26 AM
Subject: Re: Apply ML to grouped dataframe




You can use Spark MLlib

http://spark.apache.org/docs/latest/ml-guide.html#announcement-dataframe-based-api-is-primary-api


On Tue, Aug 23, 2016 at 7:34 AM, Wen Pei Yu <yuw...@cn.ibm.com>
wrote:
Hi

We have a dataframe and want to group it and apply an ML
algorithm or a statistic (say a t-test) to each group. Is
there any efficient way to do this?

Currently, we convert to PySpark, use groupByKey, and
apply a numpy function to each array. But this isn't an
efficient way, right?

Regards.
Wenpei.




--

Thanks & regards,
Nirmal

Team Lead - WSO2 Machine Learner
Associate Technical Lead - Data Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/






  --

  Thanks & regards,
  Nirmal

  Team Lead - WSO2 Machine Learner
  Associate Technical Lead - Data Technologies Team, WSO2 Inc.
  Mobile: +94715779733
  Blog: http://nirmalfdo.blogspot.com/








--

Thanks & regards,
Nirmal

Team Lead - WSO2 Machine Learner
Associate Technical Lead - Data Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/





Re: Apply ML to grouped dataframe

2016-08-22 Thread Wen Pei Yu

We can group a dataframe by one column like

df.groupBy(df.col("gender"))

It is like splitting a dataframe into multiple dataframes. Currently, we can
only apply simple SQL functions such as agg, max, etc. to this GroupedData.

What we want is to apply one ML algorithm to each group.

Regards.



From:   Nirmal Fernando <nir...@wso2.com>
To: Wen Pei Yu/China/IBM@IBMCN
Cc: User <user@spark.apache.org>
Date:   08/23/2016 01:14 PM
Subject:Re: Apply ML to grouped dataframe



Hi Wen,

AFAIK Spark MLlib implements its machine learning algorithms on top of
Spark dataframe API. What did you mean by a grouped dataframe?

On Tue, Aug 23, 2016 at 10:42 AM, Wen Pei Yu <yuw...@cn.ibm.com> wrote:
  Hi Nirmal

  I didn't get your point.
  Can you tell me more about how to use MLlib on a grouped dataframe?

  Regards.
  Wenpei.



  From: Nirmal Fernando <nir...@wso2.com>
  To: Wen Pei Yu/China/IBM@IBMCN
  Cc: User <user@spark.apache.org>
  Date: 08/23/2016 10:26 AM
  Subject: Re: Apply ML to grouped dataframe




  You can use Spark MLlib
  
http://spark.apache.org/docs/latest/ml-guide.html#announcement-dataframe-based-api-is-primary-api


  On Tue, Aug 23, 2016 at 7:34 AM, Wen Pei Yu <yuw...@cn.ibm.com> wrote:
Hi

We have a dataframe and want to group it and apply an ML algorithm or
a statistic (say a t-test) to each group. Is there any efficient way to
do this?

Currently, we convert to PySpark, use groupByKey, and apply a numpy
function to each array. But this isn't an efficient way, right?

Regards.
Wenpei.




  --

  Thanks & regards,
  Nirmal

  Team Lead - WSO2 Machine Learner
  Associate Technical Lead - Data Technologies Team, WSO2 Inc.
  Mobile: +94715779733
  Blog: http://nirmalfdo.blogspot.com/








--

Thanks & regards,
Nirmal

Team Lead - WSO2 Machine Learner
Associate Technical Lead - Data Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/




Re: Apply ML to grouped dataframe

2016-08-22 Thread Wen Pei Yu

Hi Nirmal

I didn't get your point.
Can you tell me more about how to use MLlib on a grouped dataframe?

Regards.
Wenpei.



From:   Nirmal Fernando <nir...@wso2.com>
To:     Wen Pei Yu/China/IBM@IBMCN
Cc: User <user@spark.apache.org>
Date:   08/23/2016 10:26 AM
Subject:Re: Apply ML to grouped dataframe



You can use Spark MLlib
http://spark.apache.org/docs/latest/ml-guide.html#announcement-dataframe-based-api-is-primary-api

On Tue, Aug 23, 2016 at 7:34 AM, Wen Pei Yu <yuw...@cn.ibm.com> wrote:
  Hi

  We have a dataframe and want to group it and apply an ML algorithm or
  a statistic (say a t-test) to each group. Is there any efficient way to do
  this?

  Currently, we convert to PySpark, use groupByKey, and apply a numpy
  function to each array. But this isn't an efficient way, right?

  Regards.
  Wenpei.





--

Thanks & regards,
Nirmal

Team Lead - WSO2 Machine Learner
Associate Technical Lead - Data Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/




Apply ML to grouped dataframe

2016-08-22 Thread Wen Pei Yu

Hi

We have a dataframe and want to group it and apply an ML algorithm or
a statistic (say a t-test) to each group. Is there any efficient way to do
this?

Currently, we convert to PySpark, use groupByKey, and apply a numpy function
to each array. But this isn't an efficient way, right?
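For reference, a rough sketch of that current approach (hypothetical names: assumes the dataframe df has a "group" column and a numeric array column "features", and uses scipy's one-sample t-test purely as an example statistic):

import numpy as np
from scipy import stats

# Group the feature arrays by key, then run a numpy/scipy function per group.
grouped = df.rdd.map(lambda r: (r["group"], r["features"])).groupByKey()

def t_test(rows):
    arr = np.array([list(v) for v in rows])   # one row per record in the group
    return stats.ttest_1samp(arr, 0.0)

results = grouped.mapValues(t_test).collect()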

Regards.
Wenpei.


Re: LogisticsRegression in ML pipeline help page

2016-01-06 Thread Wen Pei Yu

You can find the older documentation under
http://spark.apache.org/documentation.html

And the linear methods docs for 1.5.2 are here:

http://spark.apache.org/docs/1.5.2/mllib-linear-methods.html#logistic-regression
http://spark.apache.org/docs/1.5.2/ml-linear-methods.html
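For reference, a minimal pipeline sketch with LogisticRegression (the column names x1, x2, label and the dataframe train are hypothetical):

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler

# Assemble the numeric columns into a feature vector, then fit the classifier.
assembler = VectorAssembler(inputCols=["x1", "x2"], outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.01,
                        featuresCol="features", labelCol="label")

model = Pipeline(stages=[assembler, lr]).fit(train)
predictions = model.transform(train)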


Regards.
Yu Wenpei.


From:   Arunkumar Pillai 
To: user@spark.apache.org
Date:   01/07/2016 12:54 PM
Subject:LogisticsRegression in ML pipeline help page



Hi

I need the help page for Logistic Regression in the ML pipeline. When I browsed,
I got the 1.6 help. Please help me.

--
Thanks and Regards
        Arun