Re: Building several models in parallel

2015-07-09 Thread Till Rohrmann
It's perfectly fine. That way you will construct a data flow graph which
will calculate the linear model for all different groups once you define an
output and trigger the execution.

On Thu, Jul 9, 2015 at 6:05 PM, Felix Neutatz 
wrote:

> The problem is that I am not able to incorporate this into a Flink
> iteration since the MultipleLinearRegression() already contains an
> iteration. Therefore this would be a nested iteration ;-)
>
> So at the moment I just use a for loop like this:
>
> data: DataSet[model_ID, DataVector]
>
> for (current_model_ID <- uniqueModel_IDs){
>
> trainingDS = data.filter(t => t.model_ID == current_model_ID).map(t =>
> t.DataVector)
>
> val mlr = MultipleLinearRegression()
>
> .setIterations(10)
> .setStepsize(stepsize)
>
> mlr.fit(trainingDS)
>
> mlr.predict()
>
> ...
>
> }
>
> Is there a more efficient way to do this?
>
> Thank you for your help,
>
> Felix
>
>
>
>
> 2015-07-08 10:58 GMT+02:00 Till Rohrmann :
>
> > Yes it is. But you can still run the calculation in parallel because
> `fit`
> > does not trigger the execution of the job graph. It simply builds the
> data
> > flow. Only if you call `predict` or collect the weights, it is executed.
> >
> > Cheers,
> > Till
> >
> > On Wed, Jul 8, 2015 at 10:52 AM, Felix Neutatz 
> > wrote:
> >
> > > Thanks for the information Till :)
> > >
> > > So at the moment the iteration is the only way.
> > >
> > > Best regards,
> > > Felix
> > >
> > > 2015-07-08 10:43 GMT+02:00 Till Rohrmann :
> > >
> > > > Hi Felix,
> > > >
> > > > this is currently not supported by FlinkML. The
> > MultipleLinearRegression
> > > > algorithm expects a DataSet and not a GroupedDataSet as input. What
> you
> > > can
> > > > do is to extract each group from the original DataSet by using a
> filter
> > > > operation. Once you have done this, you can train the linear model on
> > > each
> > > > sub part of the DataSet.
> > > >
> > > > Cheers,
> > > > Till
> > > > ​
> > > >
> > > > On Wed, Jul 8, 2015 at 10:37 AM, Felix Neutatz <
> neut...@googlemail.com
> > >
> > > > wrote:
> > > >
> > > > > Hi Felix,
> > > > >
> > > > > thanks for the idea. But doesn't this mean that I can only train
> one
> > > > model
> > > > > per partition? The thing is, I have way more models than that :(
> > > > >
> > > > > Best regards,
> > > > > Felix
> > > > >
> > > > > 2015-07-07 22:37 GMT+02:00 Felix Schüler :
> > > > >
> > > > > > Hi Felix!
> > > > > >
> > > > > > We had a similar usecase and I trained multiple models on
> > partitions
> > > of
> > > > > > my data with mapPartition and the model-parameters (weights) as
> > > > > > broadcast variable. If I understood broadcast variables in Flink
> > > > > > correctly, you should end up with one model on each TaskManager.
> > > > > >
> > > > > > Does that work?
> > > > > >
> > > > > > Felix
> > > > > >
> > > > > > Am 07.07.2015 um 17:32 schrieb Felix Neutatz:
> > > > > > > Hi,
> > > > > > >
> > > > > > > at the moment I have a dataset which looks like this:
> > > > > > >
> > > > > > > DataSet[model_ID, DataVector] data
> > > > > > >
> > > > > > > So what I want to do is group by the model_ID and build for
> each
> > > > > model_ID
> > > > > > > one regression model
> > > > > > >
> > > > > > > in pseudo code:
> > > > > > > data.groupBy(model_ID)
> > > > > > > --> MultipleLinearRegression().fit(data_grouped)
> > > > > > >
> > > > > > > Is there anyway besides an iteration how to do this at the
> > moment?
> > > > > > >
> > > > > > > Thanks for your help,
> > > > > > >
> > > > > > > Felix Neutatz
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>


Re: Building several models in parallel

2015-07-09 Thread Felix Neutatz
The problem is that I am not able to incorporate this into a Flink
iteration since the MultipleLinearRegression() already contains an
iteration. Therefore this would be a nested iteration ;-)

So at the moment I just use a for loop like this:

data: DataSet[model_ID, DataVector]

for (current_model_ID <- uniqueModel_IDs){

trainingDS = data.filter(t => t.model_ID == current_model_ID).map(t =>
t.DataVector)

val mlr = MultipleLinearRegression()

.setIterations(10)
.setStepsize(stepsize)

mlr.fit(trainingDS)

mlr.predict()

...

}

Is there a more efficient way to do this?

Thank you for your help,

Felix




2015-07-08 10:58 GMT+02:00 Till Rohrmann :

> Yes it is. But you can still run the calculation in parallel because `fit`
> does not trigger the execution of the job graph. It simply builds the data
> flow. Only if you call `predict` or collect the weights, it is executed.
>
> Cheers,
> Till
>
> On Wed, Jul 8, 2015 at 10:52 AM, Felix Neutatz 
> wrote:
>
> > Thanks for the information Till :)
> >
> > So at the moment the iteration is the only way.
> >
> > Best regards,
> > Felix
> >
> > 2015-07-08 10:43 GMT+02:00 Till Rohrmann :
> >
> > > Hi Felix,
> > >
> > > this is currently not supported by FlinkML. The
> MultipleLinearRegression
> > > algorithm expects a DataSet and not a GroupedDataSet as input. What you
> > can
> > > do is to extract each group from the original DataSet by using a filter
> > > operation. Once you have done this, you can train the linear model on
> > each
> > > sub part of the DataSet.
> > >
> > > Cheers,
> > > Till
> > > ​
> > >
> > > On Wed, Jul 8, 2015 at 10:37 AM, Felix Neutatz  >
> > > wrote:
> > >
> > > > Hi Felix,
> > > >
> > > > thanks for the idea. But doesn't this mean that I can only train one
> > > model
> > > > per partition? The thing is, I have way more models than that :(
> > > >
> > > > Best regards,
> > > > Felix
> > > >
> > > > 2015-07-07 22:37 GMT+02:00 Felix Schüler :
> > > >
> > > > > Hi Felix!
> > > > >
> > > > > We had a similar usecase and I trained multiple models on
> partitions
> > of
> > > > > my data with mapPartition and the model-parameters (weights) as
> > > > > broadcast variable. If I understood broadcast variables in Flink
> > > > > correctly, you should end up with one model on each TaskManager.
> > > > >
> > > > > Does that work?
> > > > >
> > > > > Felix
> > > > >
> > > > > Am 07.07.2015 um 17:32 schrieb Felix Neutatz:
> > > > > > Hi,
> > > > > >
> > > > > > at the moment I have a dataset which looks like this:
> > > > > >
> > > > > > DataSet[model_ID, DataVector] data
> > > > > >
> > > > > > So what I want to do is group by the model_ID and build for each
> > > > model_ID
> > > > > > one regression model
> > > > > >
> > > > > > in pseudo code:
> > > > > > data.groupBy(model_ID)
> > > > > > --> MultipleLinearRegression().fit(data_grouped)
> > > > > >
> > > > > > Is there anyway besides an iteration how to do this at the
> moment?
> > > > > >
> > > > > > Thanks for your help,
> > > > > >
> > > > > > Felix Neutatz
> > > > > >
> > > > >
> > > >
> > >
> >
>


Re: Building several models in parallel

2015-07-08 Thread Till Rohrmann
Yes it is. But you can still run the calculation in parallel because `fit`
does not trigger the execution of the job graph. It simply builds the data
flow. Only if you call `predict` or collect the weights, it is executed.

Cheers,
Till

On Wed, Jul 8, 2015 at 10:52 AM, Felix Neutatz 
wrote:

> Thanks for the information Till :)
>
> So at the moment the iteration is the only way.
>
> Best regards,
> Felix
>
> 2015-07-08 10:43 GMT+02:00 Till Rohrmann :
>
> > Hi Felix,
> >
> > this is currently not supported by FlinkML. The MultipleLinearRegression
> > algorithm expects a DataSet and not a GroupedDataSet as input. What you
> can
> > do is to extract each group from the original DataSet by using a filter
> > operation. Once you have done this, you can train the linear model on
> each
> > sub part of the DataSet.
> >
> > Cheers,
> > Till
> > ​
> >
> > On Wed, Jul 8, 2015 at 10:37 AM, Felix Neutatz 
> > wrote:
> >
> > > Hi Felix,
> > >
> > > thanks for the idea. But doesn't this mean that I can only train one
> > model
> > > per partition? The thing is, I have way more models than that :(
> > >
> > > Best regards,
> > > Felix
> > >
> > > 2015-07-07 22:37 GMT+02:00 Felix Schüler :
> > >
> > > > Hi Felix!
> > > >
> > > > We had a similar usecase and I trained multiple models on partitions
> of
> > > > my data with mapPartition and the model-parameters (weights) as
> > > > broadcast variable. If I understood broadcast variables in Flink
> > > > correctly, you should end up with one model on each TaskManager.
> > > >
> > > > Does that work?
> > > >
> > > > Felix
> > > >
> > > > Am 07.07.2015 um 17:32 schrieb Felix Neutatz:
> > > > > Hi,
> > > > >
> > > > > at the moment I have a dataset which looks like this:
> > > > >
> > > > > DataSet[model_ID, DataVector] data
> > > > >
> > > > > So what I want to do is group by the model_ID and build for each
> > > model_ID
> > > > > one regression model
> > > > >
> > > > > in pseudo code:
> > > > > data.groupBy(model_ID)
> > > > > --> MultipleLinearRegression().fit(data_grouped)
> > > > >
> > > > > Is there anyway besides an iteration how to do this at the moment?
> > > > >
> > > > > Thanks for your help,
> > > > >
> > > > > Felix Neutatz
> > > > >
> > > >
> > >
> >
>


Re: Building several models in parallel

2015-07-08 Thread Felix Neutatz
Thanks for the information Till :)

So at the moment the iteration is the only way.

Best regards,
Felix

2015-07-08 10:43 GMT+02:00 Till Rohrmann :

> Hi Felix,
>
> this is currently not supported by FlinkML. The MultipleLinearRegression
> algorithm expects a DataSet and not a GroupedDataSet as input. What you can
> do is to extract each group from the original DataSet by using a filter
> operation. Once you have done this, you can train the linear model on each
> sub part of the DataSet.
>
> Cheers,
> Till
> ​
>
> On Wed, Jul 8, 2015 at 10:37 AM, Felix Neutatz 
> wrote:
>
> > Hi Felix,
> >
> > thanks for the idea. But doesn't this mean that I can only train one
> model
> > per partition? The thing is, I have way more models than that :(
> >
> > Best regards,
> > Felix
> >
> > 2015-07-07 22:37 GMT+02:00 Felix Schüler :
> >
> > > Hi Felix!
> > >
> > > We had a similar usecase and I trained multiple models on partitions of
> > > my data with mapPartition and the model-parameters (weights) as
> > > broadcast variable. If I understood broadcast variables in Flink
> > > correctly, you should end up with one model on each TaskManager.
> > >
> > > Does that work?
> > >
> > > Felix
> > >
> > > Am 07.07.2015 um 17:32 schrieb Felix Neutatz:
> > > > Hi,
> > > >
> > > > at the moment I have a dataset which looks like this:
> > > >
> > > > DataSet[model_ID, DataVector] data
> > > >
> > > > So what I want to do is group by the model_ID and build for each
> > model_ID
> > > > one regression model
> > > >
> > > > in pseudo code:
> > > > data.groupBy(model_ID)
> > > > --> MultipleLinearRegression().fit(data_grouped)
> > > >
> > > > Is there anyway besides an iteration how to do this at the moment?
> > > >
> > > > Thanks for your help,
> > > >
> > > > Felix Neutatz
> > > >
> > >
> >
>


Re: Building several models in parallel

2015-07-08 Thread Till Rohrmann
Hi Felix,

this is currently not supported by FlinkML. The MultipleLinearRegression
algorithm expects a DataSet and not a GroupedDataSet as input. What you can
do is to extract each group from the original DataSet by using a filter
operation. Once you have done this, you can train the linear model on each
sub part of the DataSet.

Cheers,
Till
​

On Wed, Jul 8, 2015 at 10:37 AM, Felix Neutatz 
wrote:

> Hi Felix,
>
> thanks for the idea. But doesn't this mean that I can only train one model
> per partition? The thing is, I have way more models than that :(
>
> Best regards,
> Felix
>
> 2015-07-07 22:37 GMT+02:00 Felix Schüler :
>
> > Hi Felix!
> >
> > We had a similar usecase and I trained multiple models on partitions of
> > my data with mapPartition and the model-parameters (weights) as
> > broadcast variable. If I understood broadcast variables in Flink
> > correctly, you should end up with one model on each TaskManager.
> >
> > Does that work?
> >
> > Felix
> >
> > Am 07.07.2015 um 17:32 schrieb Felix Neutatz:
> > > Hi,
> > >
> > > at the moment I have a dataset which looks like this:
> > >
> > > DataSet[model_ID, DataVector] data
> > >
> > > So what I want to do is group by the model_ID and build for each
> model_ID
> > > one regression model
> > >
> > > in pseudo code:
> > > data.groupBy(model_ID)
> > > --> MultipleLinearRegression().fit(data_grouped)
> > >
> > > Is there anyway besides an iteration how to do this at the moment?
> > >
> > > Thanks for your help,
> > >
> > > Felix Neutatz
> > >
> >
>


Re: Building several models in parallel

2015-07-08 Thread Felix Neutatz
Hi Felix,

thanks for the idea. But doesn't this mean that I can only train one model
per partition? The thing is, I have way more models than that :(

Best regards,
Felix

2015-07-07 22:37 GMT+02:00 Felix Schüler :

> Hi Felix!
>
> We had a similar usecase and I trained multiple models on partitions of
> my data with mapPartition and the model-parameters (weights) as
> broadcast variable. If I understood broadcast variables in Flink
> correctly, you should end up with one model on each TaskManager.
>
> Does that work?
>
> Felix
>
> Am 07.07.2015 um 17:32 schrieb Felix Neutatz:
> > Hi,
> >
> > at the moment I have a dataset which looks like this:
> >
> > DataSet[model_ID, DataVector] data
> >
> > So what I want to do is group by the model_ID and build for each model_ID
> > one regression model
> >
> > in pseudo code:
> > data.groupBy(model_ID)
> > --> MultipleLinearRegression().fit(data_grouped)
> >
> > Is there anyway besides an iteration how to do this at the moment?
> >
> > Thanks for your help,
> >
> > Felix Neutatz
> >
>


Re: Building several models in parallel

2015-07-07 Thread Felix Schüler
Hi Felix!

We had a similar usecase and I trained multiple models on partitions of
my data with mapPartition and the model-parameters (weights) as
broadcast variable. If I understood broadcast variables in Flink
correctly, you should end up with one model on each TaskManager.

Does that work?

Felix

Am 07.07.2015 um 17:32 schrieb Felix Neutatz:
> Hi,
> 
> at the moment I have a dataset which looks like this:
> 
> DataSet[model_ID, DataVector] data
> 
> So what I want to do is group by the model_ID and build for each model_ID
> one regression model
> 
> in pseudo code:
> data.groupBy(model_ID)
> --> MultipleLinearRegression().fit(data_grouped)
> 
> Is there anyway besides an iteration how to do this at the moment?
> 
> Thanks for your help,
> 
> Felix Neutatz
> 


Building several models in parallel

2015-07-07 Thread Felix Neutatz
Hi,

at the moment I have a dataset which looks like this:

DataSet[model_ID, DataVector] data

So what I want to do is group by the model_ID and build for each model_ID
one regression model

in pseudo code:
data.groupBy(model_ID)
--> MultipleLinearRegression().fit(data_grouped)

Is there anyway besides an iteration how to do this at the moment?

Thanks for your help,

Felix Neutatz