Thanks for the answers. I am trying to avoid reading the same data multiple times (once per model).

One approach I can think of is 'filtering' on the column I want to split on and training each model on its subset; I was hoping to find a more elegant approach.
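For reference, a minimal sketch of the pandas UDF route Sean suggested (assuming Spark 3.x's applyInPandas; the DataFrame, column names, and the scikit-learn model are placeholders):

import pandas as pd
from pyspark.sql import SparkSession
from sklearn.linear_model import LinearRegression

spark = SparkSession.builder.getOrCreate()

# Hypothetical data: one simple regression per value of "group".
df = spark.createDataFrame(
    [("a", 1.0, 2.0), ("a", 2.0, 4.1), ("b", 1.0, 3.0), ("b", 2.0, 5.9)],
    ["group", "x", "y"],
)

def fit_per_group(pdf: pd.DataFrame) -> pd.DataFrame:
    # Spark hands each group to this function as one pandas DataFrame,
    # so the source is read and shuffled once, not filtered N times.
    model = LinearRegression().fit(pdf[["x"]], pdf["y"])
    return pd.DataFrame({"group": [pdf["group"].iloc[0]],
                         "coef": [float(model.coef_[0])],
                         "intercept": [float(model.intercept_)]})

results = df.groupBy("group").applyInPandas(
    fit_per_group, schema="group string, coef double, intercept double")
results.show()

Note the UDF can only return DataFrame columns, so this sketch returns coefficients; to keep the fitted models themselves you would have to pickle them into a binary column or persist them from inside the UDF.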
On Thu, Jan 21, 2021 at 5:28 PM Sean Owen <sro...@gmail.com> wrote:

> If you mean you want to train N models in parallel, you wouldn't be able
> to do that with a groupBy first. You apply logic to the result of groupBy
> with Spark, but can't use Spark within Spark. You can run N Spark jobs in
> parallel on the driver, but you'd have to have each one read the subset of
> data that it's meant to model separately.
>
> A pandas UDF is a fine solution here, because I assume that implies your
> groups aren't that big, so maybe there is no need for a Spark pipeline.
>
> On Thu, Jan 21, 2021 at 9:20 AM Riccardo Ferrari <ferra...@gmail.com> wrote:
>
>> Hi list,
>>
>> I am looking for an efficient solution to apply a training pipeline to
>> each group of a DataFrame.groupBy.
>>
>> This is very easy using a pandas UDF (i.e. groupBy().apply()), but I am
>> not able to find the equivalent for a Spark pipeline.
>>
>> The ultimate goal is to fit multiple models, one per group of data.
>>
>> Thanks,
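For completeness, a rough sketch of the "N Spark jobs in parallel on the driver" alternative Sean describes, reusing the hypothetical df and column names from the sketch above; the cache() call is what keeps the per-group filters from re-reading the source once per model:

from concurrent.futures import ThreadPoolExecutor
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

assembled = VectorAssembler(inputCols=["x"], outputCol="features").transform(df)
assembled.cache()  # materialized on first action; later filters scan the cache

def train_for_group(value):
    # Each call triggers a Spark job over that group's subset; the thread
    # pool lets the driver run several such jobs concurrently.
    subset = assembled.filter(assembled["group"] == value)
    model = LinearRegression(featuresCol="features", labelCol="y").fit(subset)
    return value, model

groups = [r["group"] for r in assembled.select("group").distinct().collect()]
with ThreadPoolExecutor(max_workers=4) as pool:
    models = dict(pool.map(train_for_group, groups))

Unlike the pandas UDF version, this keeps the full pyspark.ml pipeline per group, at the cost of one job per group.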