Re: Best strategy for Pandas -> Spark
Thanks for the answer, I'm currently doing exactly that.
I'll try to sum up the usual Pandas <=> Spark DataFrame caveats soon.

Regards,

Olivier.

On Tue, Jun 2, 2015 at 02:38, Davies Liu wrote:

> The second one sounds reasonable, I think.
>
> On Thu, Apr 30, 2015 at 1:42 AM, Olivier Girardot wrote:
> > Hi everyone,
> >
> > Let's assume I have a complex workflow with more than 10 data sources as
> > input and 20 computations (some creating intermediary datasets and some
> > merging everything for the final computation), some taking on average
> > 1 minute to complete and some taking more than 30 minutes.
> >
> > What would you consider the best strategy to port this to Apache Spark?
> >
> > - Transform the whole flow into a Spark Job (PySpark or Scala)
> > - Transform only part of the flow (the heavy lifting ~30 min parts) using
> >   the same language (PySpark)
> > - Transform only part of the flow and pipe the rest from Scala to Python
> >
> > Regards,
> >
> > Olivier.
Re: Best strategy for Pandas -> Spark
The second one sounds reasonable, I think.

On Thu, Apr 30, 2015 at 1:42 AM, Olivier Girardot wrote:

> Hi everyone,
>
> Let's assume I have a complex workflow with more than 10 data sources as
> input and 20 computations (some creating intermediary datasets and some
> merging everything for the final computation), some taking on average
> 1 minute to complete and some taking more than 30 minutes.
>
> What would you consider the best strategy to port this to Apache Spark?
>
> - Transform the whole flow into a Spark Job (PySpark or Scala)
> - Transform only part of the flow (the heavy lifting ~30 min parts) using
>   the same language (PySpark)
> - Transform only part of the flow and pipe the rest from Scala to Python
>
> Regards,
>
> Olivier.
Best strategy for Pandas -> Spark
Hi everyone,

Let's assume I have a complex workflow with more than 10 data sources as
input and 20 computations (some creating intermediary datasets and some
merging everything for the final computation), some taking on average
1 minute to complete and some taking more than 30 minutes.

What would you consider the best strategy to port this to Apache Spark?

- Transform the whole flow into a Spark Job (PySpark or Scala)
- Transform only part of the flow (the heavy lifting ~30 min parts) using
  the same language (PySpark)
- Transform only part of the flow and pipe the rest from Scala to Python

Regards,

Olivier.
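
For anyone weighing the second option (port only the heavy parts), a first step could be to time each stage of the existing workflow and flag the slow ones. The sketch below is only an illustration of that triage in plain Python: the step names, the timings, and the 30-minute threshold are all made up, and nothing here calls Spark or Pandas.

```python
from dataclasses import dataclass

# Assumed cut-off: steps at or above this runtime are candidates for Spark.
HEAVY_THRESHOLD_MIN = 30.0


@dataclass(frozen=True)
class Step:
    name: str            # hypothetical step name
    avg_minutes: float   # measured average runtime in the existing workflow


def split_by_runtime(steps, threshold_min=HEAVY_THRESHOLD_MIN):
    """Return (to_port, to_keep), preserving the original step order."""
    to_port = [s for s in steps if s.avg_minutes >= threshold_min]
    to_keep = [s for s in steps if s.avg_minutes < threshold_min]
    return to_port, to_keep


# Made-up timings, for illustration only.
workflow = [
    Step("load_sources", 1.0),
    Step("clean", 1.5),
    Step("join_all", 35.0),
    Step("aggregate", 42.0),
    Step("report", 0.5),
]

to_port, to_keep = split_by_runtime(workflow)
print("port to PySpark:", [s.name for s in to_port])
print("keep as is:     ", [s.name for s in to_keep])
```

Runtime alone is a rough criterion: a short step that sits between two heavy ones may be worth porting too, so that data does not have to move back and forth between the two environments.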