SQLTransformer is a good solution if all of your operators can be expressed in SQL. That said, if you don't mind getting your hands dirty, writing a custom Transformer in Scala is not hard, and multiple output columns are valid in that case.
On Fri, Mar 17, 2017 at 9:10 PM, Yanbo Liang <yblia...@gmail.com> wrote:

> Hi Adrian,
>
> Did you try SQLTransformer? Your preprocessing steps are SQL operations
> and can be handled by SQLTransformer within the MLlib pipeline scope.
>
> Thanks
> Yanbo
>
> On Thu, Mar 9, 2017 at 11:02 AM, aATv <adr...@vidora.com> wrote:
>
>> I want to start using PySpark MLlib pipelines, but I don't understand
>> how/where preprocessing fits into the pipeline.
>>
>> My preprocessing steps are generally of the following form:
>>
>> 1) Load log files (from S3) and parse them into a Spark DataFrame with
>> columns user_id, event_type, timestamp, etc.
>>
>> 2) Group by a column, then pivot and count another column, e.g.
>> df.groupby("user_id").pivot("event_type").count()
>> The columns this creates besides user_id can be thought of as features,
>> where the count of each event type is a different feature.
>>
>> 3) Join the data from step 1 with other metadata, usually stored in
>> Cassandra, then perform a transformation similar to the one in step 2,
>> where the column that is pivoted and counted came from the Cassandra
>> data.
>>
>> After this preprocessing, I would use transformers to create other
>> features and feed them into a model, say Logistic Regression.
>>
>> I would like to make at least step 2 a custom Transformer and add it to
>> a pipeline, but it doesn't fit the Transformer abstraction: it takes a
>> single input column and outputs multiple columns, and it has a different
>> number of input rows than output rows due to the group-by operation.
>>
>> Given that, how do I fit this into an MLlib pipeline? And if it doesn't
>> fit as part of a pipeline, what is the best way to include it in my code
>> so that it can easily be reused both for training and testing, as well
>> as in production?
>>
>> I'm using PySpark 2.1, and here is an example of 2):
>>
>>
>>
>> Note: My question is in some way related to this question, but I don't
>> think it is answered there:
>> http://apache-spark-developers-list.1001551.n3.nabble.com/Why-can-t-a-Transformer-have-multiple-output-columns-td18689.html
>>
>> Thanks
>> Adrian
>>
>> --
>> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-does-preprocessing-fit-into-Spark-MLlib-pipeline-tp28473.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.