Are you just applying a function to every row in the DataFrame? If so, you don't need pandas at all. Just get the RDD of Row from it, map a function that makes another Row, and go back to a DataFrame. Or write a UDF that operates on all columns and returns a new value. mapPartitions is also available if you want to transform an iterator of Row into another iterator of Row.
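For example, a rough sketch of both (untested; assumes df has an integer column 'x' to double, just for illustration):

```
from pyspark.sql import Row

# Row -> Row via the RDD, then back to a DataFrame
new_df = df.rdd.map(lambda row: Row(x=row.x * 2)).toDF()

# Or transform an iterator of Row into another iterator of Row, per partition
def double_partition(rows):
    for row in rows:
        yield Row(x=row.x * 2)

new_df2 = df.rdd.mapPartitions(double_partition).toDF()
```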
On Thu, Mar 7, 2019 at 2:33 PM peng yu <yupb...@gmail.com> wrote:
>
> It is very similar to SCALAR, but for SCALAR the output can't be a struct/row,
> and the input has to be a pd.Series, which doesn't support a row.
>
> I'm doing TensorFlow batch inference in Spark:
> https://github.com/yupbank/tf-spark-serving/blob/master/tss/serving.py#L108
>
> I have to do the groupBy in order to use the apply function, so I'm
> wondering: why not just enable apply on a DataFrame?
>
> On Thu, Mar 7, 2019 at 3:15 PM Sean Owen <sro...@gmail.com> wrote:
>>
>> Are you looking for SCALAR? That lets you map one row to one row, but
>> do it more efficiently in batch. What are you trying to do?
>>
>> On Thu, Mar 7, 2019 at 2:03 PM peng yu <yupb...@gmail.com> wrote:
>> >
>> > I'm looking for a mapPartition(pandas_udf) for a pyspark.sql.DataFrame:
>> >
>> > ```
>> > @pandas_udf(df.schema, PandasUDFType.MAP)
>> > def do_nothing(pandas_df):
>> >     return pandas_df
>> >
>> > new_df = df.mapPartition(do_nothing)
>> > ```
>> >
>> > pandas_udf only supports SCALAR or GROUPED_MAP. Why not support just MAP?
>> >
>> > On Thu, Mar 7, 2019 at 2:57 PM Sean Owen <sro...@gmail.com> wrote:
>> >>
>> >> Are you looking for @pandas_udf in Python? Or just mapPartitions? Those
>> >> already exist.
>> >>
>> >> On Thu, Mar 7, 2019, 1:43 PM peng yu <yupb...@gmail.com> wrote:
>> >>>
>> >>> There is a nice map-partition function in R, `dapply`, so that a user
>> >>> can pass rows to a UDF.
>> >>>
>> >>> I'm wondering why we don't have that in Python?
>> >>>
>> >>> I'm trying to get a map_partition function with pandas_udf support.
>> >>>
>> >>> Thanks!
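For what it's worth, a sketch of the groupBy workaround you mention, using spark_partition_id() to make each group roughly one partition (just an illustration of the pattern, not necessarily what your linked code does):

```
from pyspark.sql.functions import pandas_udf, PandasUDFType, spark_partition_id

# A GROUPED_MAP pandas_udf receives and returns a pandas.DataFrame per group
@pandas_udf(df.schema, PandasUDFType.GROUPED_MAP)
def do_inference(pandas_df):
    # run batch inference over pandas_df here; identity for illustration
    return pandas_df

# Group by the physical partition id so each group maps to one partition
new_df = df.groupBy(spark_partition_id()).apply(do_inference)
```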