Re: [pyspark] dataframe map_partition

peng yu Thu, 07 Mar 2019 12:46:20 -0800

pandas/Arrow is for memory efficiency, and mapPartitions is only
available on RDDs. For sure I can do everything with RDDs.


But I thought that's the whole point of having pandas_udf, so that my program
runs faster and consumes less memory?

On Thu, Mar 7, 2019 at 3:40 PM Sean Owen <[email protected]> wrote:

> Are you just applying a function to every row in the DataFrame? You
> don't need pandas at all. Just get the RDD of Row from it, map a
> function that makes another Row, and go back to a DataFrame. Or make a UDF
> that operates on all columns and returns a new value. mapPartitions is
> also available if you want to transform an iterator of Row to another
> iterator of Row.
>
> On Thu, Mar 7, 2019 at 2:33 PM peng yu <[email protected]> wrote:
> >
> > It is very similar to SCALAR, but for SCALAR the output can't be a
> > struct/row and the input has to be a pd.Series, which doesn't support a row.
> >
> > I'm doing TensorFlow batch inference in Spark:
> > https://github.com/yupbank/tf-spark-serving/blob/master/tss/serving.py#L108
> >
> > There I have to do a groupBy in order to use the apply function. I'm
> > wondering why not just enable apply on a df?
> >
> > On Thu, Mar 7, 2019 at 3:15 PM Sean Owen <[email protected]> wrote:
> >>
> >> Are you looking for SCALAR? That lets you map one row to one row, but
> >> do it more efficiently in batch. What are you trying to do?
> >>
> >> On Thu, Mar 7, 2019 at 2:03 PM peng yu <[email protected]> wrote:
> >> >
> >> > I'm looking for a mapPartition(pandas_udf) for a pyspark DataFrame.
> >> >
> >> > ```
> >> > @pandas_udf(df.schema, PandasUDFType.MAP)
> >> > def do_nothing(pandas_df):
> >> >     return pandas_df
> >> >
> >> >
> >> > new_df = df.mapPartition(do_nothing)
> >> > ```
> >> > pandas_udf only supports SCALAR or GROUPED_MAP. Why not support just
> >> > MAP?
> >> >
> >> > On Thu, Mar 7, 2019 at 2:57 PM Sean Owen <[email protected]> wrote:
> >> >>
> >> >> Are you looking for @pandas_udf in Python? Or just mapPartition?
> >> >> Those exist already.
> >> >>
> >> >> On Thu, Mar 7, 2019, 1:43 PM peng yu <[email protected]> wrote:
> >> >>>
> >> >>> There is a nice map_partition function in R, `dapply`, so that a user
> >> >>> can pass a row to a UDF.
> >> >>>
> >> >>> I'm wondering why we don't have that in python?
> >> >>>
> >> >>> I'm trying to have a map_partition function with pandas_udf
> >> >>> supported.
> >> >>>
> >> >>> thanks!
>
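
For reference, the RDD route suggested in the thread (get the RDD of Row, use mapPartitions to turn one iterator of Row into another, go back to a DataFrame) could look roughly like the sketch below. It is only an illustration: `map_in_batches`, `map_partitions_df`, `batch_fn` and `batch_size` are hypothetical names that do not come from Spark or from the thread, and `spark` is assumed to be an existing SparkSession. It also goes through the ordinary RDD path, so it does not give the Arrow transfer that pandas_udf provides.

```python
from itertools import islice


def map_in_batches(rows, batch_fn, batch_size=1024):
    """Apply batch_fn to successive lists of rows taken from one partition.

    rows: an iterator of records (pyspark Row objects when run under Spark).
    batch_fn: hypothetical user function, list of rows -> iterable of rows.
    """
    rows = iter(rows)
    while True:
        batch = list(islice(rows, batch_size))
        if not batch:
            return
        # Only one batch is held in memory at a time.
        yield from batch_fn(batch)


def map_partitions_df(spark, df, batch_fn, schema=None, batch_size=1024):
    """Assumed wiring into Spark: DataFrame -> RDD of Row -> DataFrame."""
    rdd = df.rdd.mapPartitions(
        lambda it: map_in_batches(it, batch_fn, batch_size)
    )
    out_schema = schema if schema is not None else df.schema
    return spark.createDataFrame(rdd, schema=out_schema)
```

`map_in_batches` is plain Python and can be tried without a cluster by passing any iterator of tuples. Inside `batch_fn` one could build a pandas DataFrame from the batch (for example to feed a model) and return its rows, which is close to what the do_nothing example in the original post asks for, minus the Arrow serialization.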
