Ted,

Sure. This was presented by my colleague at a Data Science London meetup.
The talk was "Scalable Predictive Pipelines with Spark & Scala". Links to
the meetup and the slides are below:

http://www.meetup.com/Data-Science-London/events/229755935/
http://files.meetup.com/3183732/Scalable%20Predictive%20Pipelines%20with%20Spark%20and%20Scala.pdf


---------- Forwarded message ----------
From: Ted Yu <yuzhih...@gmail.com>
Date: 26 March 2016 at 12:51
Subject: Re: Any plans to migrate Transformer API to Spark SQL (closer to
DataFrames)?
To: Michał Zieliński <zielinski.mich...@gmail.com>


Michal:
Can you share the slide deck?

Thanks

On Mar 26, 2016, at 4:10 AM, Michał Zieliński <zielinski.mich...@gmail.com>
wrote:

The Spark ML Pipelines API (not just Transformers, but Estimators and
custom Pipeline classes as well) is definitely not machine-learning
specific.

We use them heavily in our development. We're building machine learning
pipelines, *BUT* many steps involve joining, schema manipulation, and
pre/post-processing of data for the actual statistical algorithm, all
within a monoidal architecture (I have a slide deck if you're interested).
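
To give a flavour, here is a simplified sketch of such a stage (not our
production code; the class name and columns are made up, and it assumes
the Transformer signature that takes a DataFrame):

  import org.apache.spark.ml.Transformer
  import org.apache.spark.ml.param.ParamMap
  import org.apache.spark.ml.util.Identifiable
  import org.apache.spark.sql.DataFrame
  import org.apache.spark.sql.types.StructType

  // A PipelineStage that does plain data manipulation, no ML at all:
  // it drops columns the statistical algorithm should never see.
  class ColumnDropper(override val uid: String, toDrop: Seq[String])
      extends Transformer {

    def this(toDrop: Seq[String]) =
      this(Identifiable.randomUID("columnDropper"), toDrop)

    override def transform(dataset: DataFrame): DataFrame =
      toDrop.foldLeft(dataset)((df, c) => df.drop(c))

    // transformSchema describes the schema change without touching data
    override def transformSchema(schema: StructType): StructType =
      StructType(schema.filterNot(f => toDrop.contains(f.name)))

    override def copy(extra: ParamMap): ColumnDropper =
      new ColumnDropper(uid, toDrop)
  }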

The Pipelines API is a powerful abstraction that makes things very easy
for us. It's not always perfect (imho transformSchema is a bit of a mess;
maybe the future Dataset API will help), but it makes our pipelines very
customisable and pluggable (you can add/swap/remove any PipelineStage at
any point).
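
For example, wiring such a stage next to the actual algorithm looks
roughly like this (again just a sketch; the column names and the
ColumnDropper above are made up):

  import org.apache.spark.ml.Pipeline
  import org.apache.spark.ml.classification.LogisticRegression
  import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

  // Data-manipulation stages sit next to the learning algorithm;
  // swapping a PipelineStage is just editing this array.
  val dropper = new ColumnDropper(Seq("raw_id", "debug_flags"))
  val indexer = new StringIndexer()
    .setInputCol("country").setOutputCol("countryIndex")
  val assembler = new VectorAssembler()
    .setInputCols(Array("countryIndex", "age", "income"))
    .setOutputCol("features")
  val lr = new LogisticRegression()  // expects "features" and "label" columns

  val pipeline = new Pipeline()
    .setStages(Array(dropper, indexer, assembler, lr))

  // Fitting runs every stage in order (placeholder DataFrames):
  // val model = pipeline.fit(trainingDF)
  // model.transform(testDF).show()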

On 26 March 2016 at 09:26, Jacek Laskowski <ja...@japila.pl> wrote:

> Hi Joseph,
>
> Thanks for the response. I'm one who doesn't understand all the
> hype/need for Machine Learning...yet, and I'm looking at the ML space
> through Spark ML(lib) glasses. In the meantime I've got a few
> assignments (in a project with Spark and Scala) that have required
> quite extensive dataset manipulation.
>
> That was when I switched to using DataFrame/Dataset for data
> manipulation instead of RDD (I remember talking to Brian about how RDD
> is an "assembly" language compared to the higher-level concept of
> DataFrames with Catalyst and other optimizations). After a few days
> with DataFrame I learnt he was so right! (sorry Brian, it took me
> longer to understand your point).
>
> I started using DataFrames in far more places than one could ever
> accept :-) I was so...carried away with DataFrames (esp. show vs
> foreach(println) and UDFs via the udf() function).
>
> And then I moved to the Pipeline API and discovered Transformers, and
> PipelineStage, which can create pipelines of DataFrame manipulation.
> They read so well that I'm pretty sure people would love using them
> more often, but...they belong to MLlib so they are part of the ML space
> (which not many devs have tackled yet). I applied the approach of using
> withColumn to have a better debugging experience (if I ever need it). I
> learnt it after watching your presentation about the Pipeline API. It
> was so helpful in my RDD/DataFrame space.
>
> So, to promote a more extensive use of Pipelines, PipelineStages, and
> Transformers, I was thinking about moving that part to the
> SQL/DataFrame API where they really belong. If not, I think people
> might miss the beauty of the very fine and so helpful Transformers.
>
> Transformers are *not* an ML thing -- they are a DataFrame thing and
> should live where they really belong (for their greater adoption).
>
> What do you think?
>
>
> Pozdrawiam,
> Jacek Laskowski
> ----
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Sat, Mar 26, 2016 at 3:23 AM, Joseph Bradley <jos...@databricks.com>
> wrote:
> > There have been some comments about using Pipelines outside of ML,
> > but I have not yet seen a real need for it. If a user does want to
> > use Pipelines for non-ML tasks, they can still use Transformers +
> > PipelineModels. Will that work?
> >
> > On Fri, Mar 25, 2016 at 8:05 AM, Jacek Laskowski <ja...@japila.pl>
> > wrote:
> >>
> >> Hi,
> >>
> >> After a few weeks with spark.ml, I have come to the conclusion that
> >> the Transformer concept from the Pipeline API (spark.ml/MLlib) should
> >> be part of DataFrame (SQL), where it fits better. Are there any plans
> >> to migrate the Transformer API (ML) to DataFrame (SQL)?
> >>
> >> Pozdrawiam,
> >> Jacek Laskowski
> >> ----
> >> https://medium.com/@jaceklaskowski/
> >> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> >> Follow me at https://twitter.com/jaceklaskowski
> >>
> >
>
