Re: how to find the nearest holiday
TypeError: unorderable types: str() >= datetime.date()

You should convert the strings to dates (or the dates to strings) before comparing them.

Yu Wenpei.

----- Original message -----
From: Zeming Yu
To: user
Subject: how to find the nearest holiday
Date: Tue, Apr 25, 2017 3:39 PM

I have a column of dates (date type) and am just trying to find the nearest holiday for each date. Does anyone have any idea what went wrong below?

    start_date_test = flight3.select("start_date").distinct()
    start_date_test.show()

    holidays = ['2017-09-01', '2017-10-01']

    +----------+
    |start_date|
    +----------+
    |2017-08-11|
    |2017-09-11|
    |2017-09-28|
    |2017-06-29|
    |2017-09-29|
    |2017-07-31|
    |2017-08-14|
    |2017-08-18|
    |2017-04-09|
    |2017-09-21|
    |2017-08-10|
    |2017-06-30|
    |2017-08-19|
    |2017-07-06|
    |2017-06-28|
    |2017-09-14|
    |2017-08-08|
    |2017-08-22|
    |2017-07-03|
    |2017-07-30|
    +----------+
    only showing top 20 rows

    index = spark.sparkContext.broadcast(sorted(holidays))

    def nearest_holiday(date):
        last_holiday = index.value[0]
        for next_holiday in index.value:
            if next_holiday >= date:
                break
            last_holiday = next_holiday
        if last_holiday > date:
            last_holiday = None
        if next_holiday < date:
            next_holiday = None
        return (last_holiday, next_holiday)

    from pyspark.sql.types import *
    return_type = StructType([StructField('last_holiday', StringType()),
                              StructField('next_holiday', StringType())])

    from pyspark.sql.functions import udf
    nearest_holiday_udf = udf(nearest_holiday, return_type)

    start_date_test.withColumn('holiday', nearest_holiday_udf('start_date')).show(5, False)

Here's the error I got:

    ---------------------------------------------------------------------------
    Py4JJavaError                             Traceback (most recent call last)
    <ipython-input> in <module>()
         24 nearest_holiday_udf = udf(nearest_holiday, return_type)
         25
    ---> 26 start_date_test.withColumn('holiday', nearest_holiday_udf('start_date')).show(5, False)

    C:\spark-2.1.0-bin-hadoop2.7\spark-2.1.0-bin-hadoop2.7\python\pyspark\sql\dataframe.py in show(self, n, truncate)
        318             print(self._jdf.showString(n, 20))
        319         else:
    --> 320             print(self._jdf.showString(n, int(truncate)))
        321
        322     def __repr__(self):

    C:\spark-2.1.0-bin-hadoop2.7\spark-2.1.0-bin-hadoop2.7\python\lib\py4j-0.10.4-src.zip\py4j\java_gateway.py in __call__(self, *args)
       1131         answer = self.gateway_client.send_command(command)
       1132         return_value = get_return_value(
    -> 1133             answer, self.gateway_client, self.target_id, self.name)
       1134
       1135         for temp_arg in temp_args:

    C:\spark-2.1.0-bin-hadoop2.7\spark-2.1.0-bin-hadoop2.7\python\pyspark\sql\utils.py in deco(*a, **kw)
         61     def deco(*a, **kw):
         62         try:
    ---> 63             return f(*a, **kw)
         64         except py4j.protocol.Py4JJavaError as e:
         65             s = e.java_exception.toString()

    C:\spark-2.1.0-bin-hadoop2.7\spark-2.1.0-bin-hadoop2.7\python\lib\py4j-0.10.4-src.zip\py4j\protocol.py in get_return_value(answer, gateway_client, target_id, name)
        317                 raise Py4JJavaError(
        318                     "An error occurred while calling {0}{1}{2}.\n".
    --> 319                     format(target_id, ".", name), value)
        320             else:
        321                 raise Py4JError(

    Py4JJavaError: An error occurred while calling o566.showString.
    : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 98.0 failed 1 times, most recent failure: Lost task 0.0 in stage 98.0 (TID 521, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
      File "C:\spark-2.1.0-bin-hadoop2.7\spark-2.1.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\worker.py", line 174, in main
      File "C:\spark-2.1.0-bin-hadoop2.7\spark-2.1.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\worker.py", line 169, in process
      File "C:\spark-2.1.0-bin-hadoop2.7\spark-2.1.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\serializers.py", line 220, in dump_stream
        self.serializer.dump_stream(self._batched(iterator), stream)
      File "C:\spark-2.1.0-bin-hadoop2.7\spark-2.1.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\serializers.py", line 138, in dump_stream
        for obj in iterator:
      File "C:\spark-2.1.0-bin-hadoop2.7\spark-2.1.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\serializers.py", line 209, in _batched
        for item in iterator:
      File "C:\spark-2.1.0-bin-hadoop2.7\spark-2.1.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\worker.py", line 92, in <lambda>
      File "C:\spark-2.1.0-bin-hadoop2.7\spark-2.1.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\worker.py", line 68, in <lambda>
      File "<ipython-input>", line 10, in nearest_holiday
    TypeError: unorderable types: str() >= datetime.date()
        at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
        at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
        at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
        at
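[Editor's note] For example, here is a minimal, untested sketch of one such fix: since the holidays are kept as 'YYYY-MM-DD' strings, converting the incoming datetime.date to its ISO string inside the UDF makes both sides of every comparison strings, and ISO date strings sort the same way as the dates they represent:

    def nearest_holiday(date):
        # Sketch of a fix: convert datetime.date -> 'YYYY-MM-DD' string so the
        # comparisons against the broadcast holiday strings are str vs. str.
        date = date.isoformat()
        last_holiday = index.value[0]
        for next_holiday in index.value:
            if next_holiday >= date:
                break
            last_holiday = next_holiday
        if last_holiday > date:
            last_holiday = None
        if next_holiday < date:
            next_holiday = None
        return (last_holiday, next_holiday)

Parsing the holiday strings into datetime.date objects before broadcasting would work equally well; the point is only that both sides must be the same type.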
Re: Aggregated column name
Thanks, Kevin. That works when aggregating one or two columns, but not for this, where the aggregate expressions are built up as a Map:

    val expr = Map("forCount" -> "count") ++ features.map(_ -> "mean")
    val averageDF = originalDF
      .withColumn("forCount", lit(0))
      .groupBy(col("..."))
      .agg(expr)

Yu Wenpei.

From: Kevin Mellott <kevin.r.mell...@gmail.com>
To: Wen Pei Yu <yuw...@cn.ibm.com>
Cc: user <user@spark.apache.org>
Date: 03/24/2017 09:48 AM
Subject: Re: Aggregated column name

I'm not sure of the answer to your question; however, when performing aggregates I find it useful to specify an alias for each column. That gives you explicit control over the name of the resulting column. In your example, that would look something like:

    df.groupBy(col("...")).agg(count("number").alias("ColumnNameCount"))

Hope that helps!
Kevin

On Thu, Mar 23, 2017 at 2:41 AM, Wen Pei Yu <yuw...@cn.ibm.com> wrote:

Hi All

I found that some Spark versions (Spark 1.4) return an upper-case aggregated column name, and some return a lower-case one. The code below,

    df.groupBy(col("...")).agg(count("number"))

may return:

    COUNT(number) -- Spark 1.4
    count(number) -- Spark 1.6

Does anyone know whether there is a configuration parameter for this, or which PR changed it? Thank you very much.

Yu Wenpei.
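[Editor's note] The explicit-alias idea does extend to a whole list of columns if the aggregates are built as a list of aliased columns rather than a Map, which also pins the output names across Spark versions. A minimal, untested PySpark sketch of that pattern; `features`, "group_col", and originalDF are placeholders mirroring the Scala snippet above:

    from pyspark.sql import functions as F

    # Hypothetical names for illustration only.
    features = ["f1", "f2"]

    # One aliased aggregate per feature column, plus an explicit row count;
    # counting lit(1) avoids the helper "forCount" column entirely.
    exprs = ([F.count(F.lit(1)).alias("forCount")]
             + [F.mean(c).alias(c + "_mean") for c in features])

    averageDF = originalDF.groupBy("group_col").agg(*exprs)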
Aggregated column name
Hi All

I found that some Spark versions (Spark 1.4) return an upper-case aggregated column name, and some return a lower-case one. The code below,

    df.groupBy(col("...")).agg(count("number"))

may return:

    COUNT(number) -- Spark 1.4
    count(number) -- Spark 1.6

Does anyone know whether there is a configuration parameter for this, or which PR changed it? Thank you very much.

Yu Wenpei.
Re: Apply ML to grouped dataframe
Thank you, Ayan. For example, I have the dataframe below. Consider the column "group" as the key that splits this dataframe into three parts; I then want to run kmeans on each part, to get each group's kmeans result.

    +-----------+-----+-----------+
    |     userID|group|   features|
    +-----------+-----+-----------+
    |12462563356|    1| [5.0,43.0]|
    |12462563701|    2|  [1.0,8.0]|
    |12462563701|    1| [2.0,12.0]|
    |12462564356|    1|  [1.0,1.0]|
    |12462565487|    3|  [2.0,3.0]|
    |12462565698|    2|  [1.0,1.0]|
    |12462565698|    1|  [1.0,1.0]|
    |12462566081|    2|  [1.0,2.0]|
    |12462566081|    1| [1.0,15.0]|
    |12462566225|    2|  [1.0,1.0]|
    |12462566225|    1| [9.0,85.0]|
    |12462566526|    2|  [1.0,1.0]|
    |12462566526|    1| [3.0,79.0]|
    |12462567006|    2|[11.0,15.0]|
    |12462567006|    1|[10.0,15.0]|
    |12462567006|    3|[10.0,15.0]|
    |12462586595|    2| [2.0,42.0]|
    |12462586595|    3| [2.0,16.0]|
    |12462589343|    3|  [1.0,1.0]|
    +-----------+-----+-----------+

From: ayan guha <guha.a...@gmail.com>
To: Wen Pei Yu/China/IBM@IBMCN
Cc: user <user@spark.apache.org>, Nirmal Fernando <nir...@wso2.com>
Date: 08/23/2016 05:13 PM
Subject: Re: Apply ML to grouped dataframe

I would suggest you construct a toy problem and post it for a solution. At the moment it's a little unclear what your intentions are. Generally speaking, a group-by on a dataframe creates another dataframe, not multiple ones.

On 23 Aug 2016 16:35, "Wen Pei Yu" <yuw...@cn.ibm.com> wrote:

Hi Nirmal

The filter works fine if I want to handle one of the grouped dataframes. But I have multiple grouped dataframes, and I wish I could apply an ML algorithm to all of them in one job, not in a for loop.

Wenpei.

From: Nirmal Fernando <nir...@wso2.com>
To: Wen Pei Yu/China/IBM@IBMCN
Cc: User <user@spark.apache.org>
Date: 08/23/2016 01:55 PM
Subject: Re: Apply ML to grouped dataframe

On Tue, Aug 23, 2016 at 10:56 AM, Wen Pei Yu <yuw...@cn.ibm.com> wrote:

We can group a dataframe by one column, like

    df.groupBy(df.col("gender"))

On top of this DF, use a filter that would enable you to extract each grouped DF as a separate DF. Then you can apply ML on top of each DF. e.g.:

    xyzDF.filter(col("x").equalTo(x))

It is like splitting a dataframe into multiple dataframes. Currently, we can only apply simple SQL functions to this GroupedData, like agg, max, etc. What we want is to apply one ML algorithm to each group.

Regards.

From: Nirmal Fernando <nir...@wso2.com>
To: Wen Pei Yu/China/IBM@IBMCN
Cc: User <user@spark.apache.org>
Date: 08/23/2016 01:14 PM
Subject: Re: Apply ML to grouped dataframe

Hi Wen,

AFAIK Spark MLlib implements its machine learning algorithms on top of the Spark dataframe API. What did you mean by a grouped dataframe?

On Tue, Aug 23, 2016 at 10:42 AM, Wen Pei Yu <yuw...@cn.ibm.com> wrote:

Hi Nirmal

I didn't get your point. Can you tell me more about how to use MLlib on a grouped dataframe?

Regards.
Wenpei.

From: Nirmal Fernando <nir...@wso2.com>
To: Wen Pei Yu/China/IBM@IBMCN
Cc: User <user@spark.apache.org>
Date: 08/23/2016 10:26 AM
Subject: Re: Apply ML to grouped dataframe

You can use Spark MLlib: http://spark.apache.org/docs/latest/ml-guide.html#announcement-dataframe-based-api-is-primary-api

On Tue, Aug 23, 2016 at 7:34 AM, Wen Pei Yu <yuw...@cn.ibm.com> wrote:

Hi

We have a dataframe, then want to group it and apply an ML algorithm or a statistic (say a t test) to each group.
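[Editor's note] One way to make the filter-per-group suggestion from this thread concrete is to filter once per distinct group value and fit one model per part. A minimal, untested PySpark sketch, assuming the schema shown above with `features` already a Vector column; `df` and k=2 are placeholders, and note this is still a sequential loop over groups:

    from pyspark.ml.clustering import KMeans

    # `df` stands for the dataframe shown above; k=2 is an arbitrary choice.
    groups = [r["group"] for r in df.select("group").distinct().collect()]
    models = {}
    for g in groups:
        part = df.filter(df["group"] == g)          # one group's rows
        models[g] = KMeans(k=2, seed=1, featuresCol="features").fit(part)

Each fit is a separate Spark job, so the groups are not trained in parallel; this only shows the split-then-fit idea, not a single-job solution.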
Re: Apply ML to grouped dataframe
Hi Nirmal

The filter works fine if I want to handle one of the grouped dataframes. But I have multiple grouped dataframes, and I wish I could apply an ML algorithm to all of them in one job, not in a for loop.

Wenpei.

From: Nirmal Fernando <nir...@wso2.com>
To: Wen Pei Yu/China/IBM@IBMCN
Cc: User <user@spark.apache.org>
Date: 08/23/2016 01:55 PM
Subject: Re: Apply ML to grouped dataframe

On Tue, Aug 23, 2016 at 10:56 AM, Wen Pei Yu <yuw...@cn.ibm.com> wrote:

We can group a dataframe by one column, like

    df.groupBy(df.col("gender"))

On top of this DF, use a filter that would enable you to extract each grouped DF as a separate DF. Then you can apply ML on top of each DF. e.g.:

    xyzDF.filter(col("x").equalTo(x))

It is like splitting a dataframe into multiple dataframes. Currently, we can only apply simple SQL functions to this GroupedData, like agg, max, etc. What we want is to apply one ML algorithm to each group.

Regards.

From: Nirmal Fernando <nir...@wso2.com>
To: Wen Pei Yu/China/IBM@IBMCN
Cc: User <user@spark.apache.org>
Date: 08/23/2016 01:14 PM
Subject: Re: Apply ML to grouped dataframe

Hi Wen,

AFAIK Spark MLlib implements its machine learning algorithms on top of the Spark dataframe API. What did you mean by a grouped dataframe?

On Tue, Aug 23, 2016 at 10:42 AM, Wen Pei Yu <yuw...@cn.ibm.com> wrote:

Hi Nirmal

I didn't get your point. Can you tell me more about how to use MLlib on a grouped dataframe?

Regards.
Wenpei.

From: Nirmal Fernando <nir...@wso2.com>
To: Wen Pei Yu/China/IBM@IBMCN
Cc: User <user@spark.apache.org>
Date: 08/23/2016 10:26 AM
Subject: Re: Apply ML to grouped dataframe

You can use Spark MLlib: http://spark.apache.org/docs/latest/ml-guide.html#announcement-dataframe-based-api-is-primary-api

On Tue, Aug 23, 2016 at 7:34 AM, Wen Pei Yu <yuw...@cn.ibm.com> wrote:

Hi

We have a dataframe, then want to group it and apply an ML algorithm or a statistic (say a t test) to each group. Is there an efficient way to handle this situation? Currently, we transfer to pyspark, use groupByKey, and apply a numpy function to each array. But this isn't an efficient way, right?

Regards.
Wenpei.

--
Thanks & regards,
Nirmal
Team Lead - WSO2 Machine Learner
Associate Technical Lead - Data Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/
Re: Apply ML to grouped dataframe
We can group a dataframe by one column, like

    df.groupBy(df.col("gender"))

It is like splitting a dataframe into multiple dataframes. Currently, we can only apply simple SQL functions to this GroupedData, like agg, max, etc. What we want is to apply one ML algorithm to each group.

Regards.

From: Nirmal Fernando <nir...@wso2.com>
To: Wen Pei Yu/China/IBM@IBMCN
Cc: User <user@spark.apache.org>
Date: 08/23/2016 01:14 PM
Subject: Re: Apply ML to grouped dataframe

Hi Wen,

AFAIK Spark MLlib implements its machine learning algorithms on top of the Spark dataframe API. What did you mean by a grouped dataframe?

On Tue, Aug 23, 2016 at 10:42 AM, Wen Pei Yu <yuw...@cn.ibm.com> wrote:

Hi Nirmal

I didn't get your point. Can you tell me more about how to use MLlib on a grouped dataframe?

Regards.
Wenpei.

From: Nirmal Fernando <nir...@wso2.com>
To: Wen Pei Yu/China/IBM@IBMCN
Cc: User <user@spark.apache.org>
Date: 08/23/2016 10:26 AM
Subject: Re: Apply ML to grouped dataframe

You can use Spark MLlib: http://spark.apache.org/docs/latest/ml-guide.html#announcement-dataframe-based-api-is-primary-api

On Tue, Aug 23, 2016 at 7:34 AM, Wen Pei Yu <yuw...@cn.ibm.com> wrote:

Hi

We have a dataframe, then want to group it and apply an ML algorithm or a statistic (say a t test) to each group. Is there an efficient way to handle this situation? Currently, we transfer to pyspark, use groupByKey, and apply a numpy function to each array. But this isn't an efficient way, right?

Regards.
Wenpei.

--
Thanks & regards,
Nirmal
Team Lead - WSO2 Machine Learner
Associate Technical Lead - Data Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/
Re: Apply ML to grouped dataframe
Hi Nirmal

I didn't get your point. Can you tell me more about how to use MLlib on a grouped dataframe?

Regards.
Wenpei.

From: Nirmal Fernando <nir...@wso2.com>
To: Wen Pei Yu/China/IBM@IBMCN
Cc: User <user@spark.apache.org>
Date: 08/23/2016 10:26 AM
Subject: Re: Apply ML to grouped dataframe

You can use Spark MLlib: http://spark.apache.org/docs/latest/ml-guide.html#announcement-dataframe-based-api-is-primary-api

On Tue, Aug 23, 2016 at 7:34 AM, Wen Pei Yu <yuw...@cn.ibm.com> wrote:

Hi

We have a dataframe, then want to group it and apply an ML algorithm or a statistic (say a t test) to each group. Is there an efficient way to handle this situation? Currently, we transfer to pyspark, use groupByKey, and apply a numpy function to each array. But this isn't an efficient way, right?

Regards.
Wenpei.

--
Thanks & regards,
Nirmal
Team Lead - WSO2 Machine Learner
Associate Technical Lead - Data Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/
Apply ML to grouped dataframe
Hi

We have a dataframe, then want to group it and apply an ML algorithm or a statistic (say a t test) to each group. Is there an efficient way to handle this situation? Currently, we transfer to pyspark, use groupByKey, and apply a numpy function to each array. But this isn't an efficient way, right?

Regards.
Wenpei.
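[Editor's note] For reference, a minimal, untested sketch of the groupByKey approach described above: each group's values are shuffled to one executor and a local statistic is applied per group. `df`, "group", and "value" are hypothetical names (one key column, one numeric column), and scipy is assumed to be available on the executors:

    from scipy import stats

    # Collect each group's values on one executor, then run a local
    # one-sample t test (against mean 0.0) per group.
    per_group = (df.rdd
                 .map(lambda row: (row["group"], row["value"]))
                 .groupByKey()
                 .mapValues(lambda xs: stats.ttest_1samp(list(xs), 0.0))
                 .collect())

As noted in the thread, groupByKey moves every value for a key to a single executor, which is a large part of why this pattern feels inefficient for big groups.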
Re: LogisticRegression in ML pipeline help page
You can get documentation for older releases at http://spark.apache.org/documentation.html

The linear-methods docs for 1.5.2, including logistic regression, are here:

http://spark.apache.org/docs/1.5.2/mllib-linear-methods.html#logistic-regression
http://spark.apache.org/docs/1.5.2/ml-linear-methods.html

Regards.
Yu Wenpei.

From: Arunkumar Pillai
To: user@spark.apache.org
Date: 01/07/2016 12:54 PM
Subject: LogisticRegression in ML pipeline help page

Hi

I need the help page for LogisticRegression in the ML pipeline. When I browse, I only get the 1.6 help. Please help me.

--
Thanks and Regards
Arun
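[Editor's note] For orientation alongside those docs, a minimal, untested PySpark sketch of LogisticRegression used inside an ML Pipeline, matching the 1.5/1.6-era API linked above; "f1", "f2", "label", train_df, and test_df are placeholder names:

    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import VectorAssembler

    # Assemble raw numeric columns into the single vector column that
    # LogisticRegression expects, then chain both steps in one Pipeline.
    assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
    lr = LogisticRegression(featuresCol="features", labelCol="label", maxIter=10)
    pipeline = Pipeline(stages=[assembler, lr])

    model = pipeline.fit(train_df)          # fits both stages in order
    predictions = model.transform(test_df)  # adds prediction/probability columns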