Re: Best strategy for Pandas -> Spark

2015-06-02 Thread Olivier Girardot
Thanks for the answer; I'm currently doing exactly that.
I'll try to sum up the usual Pandas <=> Spark DataFrame caveats soon.
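
In the meantime, here is a minimal sketch of the round-trip in question
(a sketch only, assuming Spark 1.3+'s DataFrame API; the data and column
names are just illustrative):

import pandas as pd
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="pandas-spark-roundtrip")
sqlContext = SQLContext(sc)

# Pandas -> Spark: the pandas DataFrame is serialized from the driver,
# so this only works when the data fits in driver memory.
pdf = pd.DataFrame({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})
df = sqlContext.createDataFrame(pdf)

# Spark -> Pandas: toPandas() collects every row back to the driver,
# so it should only be called on small (e.g. already aggregated) results.
result = df.groupBy("id").sum("value").toPandas()

The first caveat on the list is exactly that: both directions materialize
the whole dataset on the driver, so they only make sense at the edges of
the workflow.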

Regards,

Olivier.

On Tue, Jun 2, 2015 at 02:38, Davies Liu wrote:

> The second one sounds reasonable, I think.
>
> On Thu, Apr 30, 2015 at 1:42 AM, Olivier Girardot wrote:
> > Hi everyone,
> > Let's assume I have a complex workflow with more than 10 data sources as
> > input and 20 computations (some creating intermediary datasets and some
> > merging everything for the final computation), some taking on average 1
> > minute to complete and some taking more than 30 minutes.
> >
> > What would, in your view, be the best strategy for porting this to
> > Apache Spark?
> >
> >    - Transform the whole flow into a Spark job (PySpark or Scala)
> >    - Transform only part of the flow (the heavy lifting ~30 min parts)
> >    using the same language (PySpark)
> >    - Transform only part of the flow and pipe the rest from Scala to Python
> >
> > Regards,
> >
> > Olivier.
>


Re: Best strategy for Pandas -> Spark

2015-06-01 Thread Davies Liu
The second one sounds reasonable, I think.

On Thu, Apr 30, 2015 at 1:42 AM, Olivier Girardot wrote:
> Hi everyone,
> Let's assume I have a complex workflow with more than 10 data sources as
> input and 20 computations (some creating intermediary datasets and some
> merging everything for the final computation), some taking on average 1
> minute to complete and some taking more than 30 minutes.
>
> What would, in your view, be the best strategy for porting this to
> Apache Spark?
>
>    - Transform the whole flow into a Spark job (PySpark or Scala)
>    - Transform only part of the flow (the heavy lifting ~30 min parts)
>    using the same language (PySpark)
>    - Transform only part of the flow and pipe the rest from Scala to Python
>
> Regards,
>
> Olivier.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Best strategy for Pandas -> Spark

2015-04-30 Thread Olivier Girardot
Hi everyone,
Let's assume I have a complex workflow with more than 10 data sources as
input and 20 computations (some creating intermediary datasets and some
merging everything for the final computation), some taking on average 1
minute to complete and some taking more than 30 minutes.

What would, in your view, be the best strategy for porting this to Apache
Spark?

   - Transform the whole flow into a Spark job (PySpark or Scala)
   - Transform only part of the flow (the heavy lifting ~30 min parts)
   using the same language (PySpark) - see the sketch after this list
   - Transform only part of the flow and pipe the rest from Scala to Python
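
Since the replies above favour the second option, here is a rough sketch of
what that hybrid could look like (a sketch only: the paths, column names,
and the join/aggregation step are hypothetical, and it assumes Spark 1.4's
sqlContext.read):

import pandas as pd
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="hybrid-pandas-spark")
sqlContext = SQLContext(sc)

def heavy_step(events_path, users_path):
    # The ~30 min part: a large join + aggregation, done in Spark so it
    # scales beyond a single machine.
    events = sqlContext.read.json(events_path)
    users = sqlContext.read.json(users_path)
    joined = events.join(users, events.user_id == users.id)
    # Only the small aggregated result is collected back to the driver.
    return joined.groupBy("country").count().toPandas()

# The fast (~1 min) computations keep their existing pandas code.
summary = heavy_step("hdfs:///data/events", "hdfs:///data/users")
summary["share"] = summary["count"] / summary["count"].sum()

This way only the expensive steps pay the cost of moving to Spark, and
everything around them stays plain pandas.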

Regards,

Olivier.