Thanks for the answer; I'm currently doing exactly that. I'll try to sum up the usual Pandas <=> Spark DataFrame caveats soon.
Regards,
Olivier.

On Tue, Jun 2, 2015 at 02:38, Davies Liu <dav...@databricks.com> wrote:

> The second one sounds reasonable, I think.
>
> On Thu, Apr 30, 2015 at 1:42 AM, Olivier Girardot
> <o.girar...@lateral-thoughts.com> wrote:
> > Hi everyone,
> > Let's assume I have a complex workflow of more than 10 datasources as input
> > - 20 computations (some creating intermediary datasets and some merging
> > everything for the final computation) - some taking on average 1 minute to
> > complete and some taking more than 30 minutes.
> >
> > What would be for you the best strategy to port this to Apache Spark?
> >
> > 1. Transform the whole flow into a Spark Job (PySpark or Scala)
> > 2. Transform only part of the flow (the heavy lifting ~30 min parts) using the
> >    same language (PySpark)
> > 3. Transform only part of the flow and pipe the rest from Scala to Python
> >
> > Regards,
> >
> > Olivier.