Re: Building several models in parallel
It's perfectly fine. That way you will construct a data flow graph which will
calculate the linear model for all different groups once you define an output
and trigger the execution.

On Thu, Jul 9, 2015 at 6:05 PM, Felix Neutatz wrote:

> The problem is that I am not able to incorporate this into a Flink
> iteration since the MultipleLinearRegression() already contains an
> iteration. Therefore this would be a nested iteration ;-)
>
> So at the moment I just use a for loop like this:
>
> data: DataSet[model_ID, DataVector]
>
> for (current_model_ID <- uniqueModel_IDs){
>   trainingDS = data.filter(t => t.model_ID == current_model_ID).map(t => t.DataVector)
>   val mlr = MultipleLinearRegression()
>     .setIterations(10)
>     .setStepsize(stepsize)
>   mlr.fit(trainingDS)
>   mlr.predict()
>   ...
> }
>
> Is there a more efficient way to do this?
>
> Thank you for your help,
>
> Felix
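A minimal sketch of the loop approach discussed above, written against the FlinkML Scala API of that period. The toy input, the fixed list `uniqueModelIds`, the step size and the object name are assumptions made for illustration; they do not come from the thread. Each loop pass only adds operators to the plan, and the single `collect()` at the end is what triggers the job for all groups.

```scala
import org.apache.flink.api.scala._
import org.apache.flink.ml.common.{LabeledVector, WeightVector}
import org.apache.flink.ml.math.DenseVector
import org.apache.flink.ml.regression.MultipleLinearRegression

object ModelsPerGroup {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    // Hypothetical toy input: (model_ID, labeled sample).
    val data: DataSet[(Int, LabeledVector)] = env.fromElements(
      (1, LabeledVector(1.0, DenseVector(1.0))),
      (1, LabeledVector(2.0, DenseVector(2.0))),
      (2, LabeledVector(3.0, DenseVector(1.0))),
      (2, LabeledVector(6.0, DenseVector(2.0)))
    )

    // Assumption: the model IDs are already known on the client.
    val uniqueModelIds = Seq(1, 2)

    // Each pass only extends the data flow graph; nothing is executed here.
    val weightsPerModel: Seq[DataSet[(Int, WeightVector)]] =
      uniqueModelIds.map { id =>
        val trainingDS = data.filter(t => t._1 == id).map(t => t._2)
        val mlr = MultipleLinearRegression()
          .setIterations(10)
          .setStepsize(0.1)
        mlr.fit(trainingDS)
        // weightsOption is set by fit; tag the weights with their model ID.
        mlr.weightsOption.get.map(w => (id, w))
      }

    // One output for all groups, so a single job computes every model.
    val allWeights = weightsPerModel.reduce(_ union _)
    allWeights.collect().foreach(println)
  }
}
```

With many model IDs the plan grows by one filter and one regression per ID, so this is best suited to a moderate number of groups.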
Re: Building several models in parallel
The problem is that I am not able to incorporate this into a Flink iteration,
since MultipleLinearRegression() already contains an iteration. Therefore this
would be a nested iteration ;-)

So at the moment I just use a for loop like this:

data: DataSet[model_ID, DataVector]

for (current_model_ID <- uniqueModel_IDs) {
  // keep only the records that belong to the current model
  val trainingDS = data.filter(t => t.model_ID == current_model_ID).map(t => t.DataVector)
  val mlr = MultipleLinearRegression()
    .setIterations(10)
    .setStepsize(stepsize)
  mlr.fit(trainingDS)
  mlr.predict()
  ...
}

Is there a more efficient way to do this?

Thank you for your help,

Felix

2015-07-08 10:58 GMT+02:00 Till Rohrmann:

> Yes it is. But you can still run the calculation in parallel because `fit`
> does not trigger the execution of the job graph. It simply builds the data
> flow. Only if you call `predict` or collect the weights, it is executed.
>
> Cheers,
> Till
Re: Building several models in parallel
Yes it is. But you can still run the calculation in parallel because `fit`
does not trigger the execution of the job graph. It simply builds the data
flow. It is executed only when you call `predict` or collect the weights.

Cheers,
Till

On Wed, Jul 8, 2015 at 10:52 AM, Felix Neutatz wrote:

> Thanks for the information Till :)
>
> So at the moment the iteration is the only way.
>
> Best regards,
> Felix
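The idea of building a data flow first and executing it later can be illustrated with a toy model in plain Scala. This is not Flink code: `Plan`, `source` and `evaluations` are hypothetical names invented for this sketch, and the class only mimics the behaviour of recording operations until a result is requested.

```scala
object DeferredExecutionToy {
  // Hypothetical stand-in for a lazily evaluated data set: it stores a recipe
  // and computes nothing until collect() is called.
  final class Plan[A](compute: () => Seq[A]) {
    def map[B](f: A => B): Plan[B] = new Plan(() => compute().map(f))
    def filter(p: A => Boolean): Plan[A] = new Plan(() => compute().filter(p))
    def collect(): Seq[A] = compute()
  }

  // Counts how often the source is actually read.
  var evaluations = 0

  def source(values: Seq[Int]): Plan[Int] =
    new Plan(() => { evaluations += 1; values })

  def main(args: Array[String]): Unit = {
    val data = source(Seq(1, 2, 3, 4))
    // Only the plan is built here; the source has not been read yet.
    val doubledEvens = data.filter(_ % 2 == 0).map(_ * 2)
    println(s"evaluations before collect: $evaluations")
    // Requesting the result is what triggers the work.
    val result = doubledEvens.collect()
    println(s"result: $result, evaluations after collect: $evaluations")
  }
}
```

In the same way, several `fit` calls in a loop can all be recorded before anything runs.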
Re: Building several models in parallel
Thanks for the information, Till :)

So at the moment the iteration is the only way.

Best regards,
Felix

2015-07-08 10:43 GMT+02:00 Till Rohrmann:

> Hi Felix,
>
> this is currently not supported by FlinkML. The MultipleLinearRegression
> algorithm expects a DataSet and not a GroupedDataSet as input. What you can
> do is to extract each group from the original DataSet by using a filter
> operation. Once you have done this, you can train the linear model on each
> sub part of the DataSet.
>
> Cheers,
> Till
Re: Building several models in parallel
Hi Felix,

this is currently not supported by FlinkML. The MultipleLinearRegression
algorithm expects a DataSet and not a GroupedDataSet as input. What you can do
is extract each group from the original DataSet by using a filter operation.
Once you have done this, you can train the linear model on each subset of the
DataSet.

Cheers,
Till

On Wed, Jul 8, 2015 at 10:37 AM, Felix Neutatz wrote:

> Hi Felix,
>
> thanks for the idea. But doesn't this mean that I can only train one model
> per partition? The thing is, I have way more models than that :(
>
> Best regards,
> Felix
Re: Building several models in parallel
Hi Felix,

thanks for the idea. But doesn't this mean that I can only train one model per
partition? The thing is, I have way more models than that :(

Best regards,
Felix

2015-07-07 22:37 GMT+02:00 Felix Schüler:

> Hi Felix!
>
> We had a similar usecase and I trained multiple models on partitions of
> my data with mapPartition and the model-parameters (weights) as
> broadcast variable. If I understood broadcast variables in Flink
> correctly, you should end up with one model on each TaskManager.
>
> Does that work?
>
> Felix
Re: Building several models in parallel
Hi Felix!

We had a similar use case, and I trained multiple models on partitions of my
data with mapPartition and the model parameters (weights) as a broadcast
variable. If I understood broadcast variables in Flink correctly, you should
end up with one model on each TaskManager.

Does that work?

Felix

On 07.07.2015 at 17:32, Felix Neutatz wrote:

> Hi,
>
> at the moment I have a dataset which looks like this:
>
> DataSet[model_ID, DataVector] data
>
> So what I want to do is group by the model_ID and build for each model_ID
> one regression model
>
> in pseudo code:
> data.groupBy(model_ID)
> --> MultipleLinearRegression().fit(data_grouped)
>
> Is there anyway besides an iteration how to do this at the moment?
>
> Thanks for your help,
>
> Felix Neutatz
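A rough sketch of the mapPartition-plus-broadcast idea, assuming the Flink DataSet Scala API. It is not the code referred to in the message: the one-dimensional model, the plain gradient descent loop, the learning rate, the toy data and the broadcast name "initialWeights" are all assumptions chosen to keep the example short. Each parallel partition reads the broadcast starting weight and emits one fitted weight.

```scala
import org.apache.flink.api.common.functions.RichMapPartitionFunction
import org.apache.flink.api.scala._
import org.apache.flink.configuration.Configuration
import org.apache.flink.util.Collector

import scala.collection.JavaConverters._

object PerPartitionModels {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    // Hypothetical toy samples (x, y) with y = 2 * x.
    val data: DataSet[(Double, Double)] =
      env.fromElements((1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0))

    // Starting weight shared with every parallel task as a broadcast set.
    val initialWeights: DataSet[Double] = env.fromElements(0.0)

    val models: DataSet[Double] = data
      .mapPartition(new RichMapPartitionFunction[(Double, Double), Double] {
        private var w0 = 0.0

        // Read the broadcast variable once per task.
        override def open(parameters: Configuration): Unit = {
          w0 = getRuntimeContext
            .getBroadcastVariable[Double]("initialWeights")
            .get(0)
        }

        // Fit y = w * x on this partition's samples with gradient descent.
        override def mapPartition(
            values: java.lang.Iterable[(Double, Double)],
            out: Collector[Double]): Unit = {
          val samples = values.asScala.toVector
          if (samples.nonEmpty) {
            var w = w0
            for (_ <- 1 to 100) {
              val grad =
                samples.map { case (x, y) => (w * x - y) * x }.sum / samples.size
              w -= 0.05 * grad
            }
            out.collect(w)
          }
        }
      })
      .withBroadcastSet(initialWeights, "initialWeights")

    // collect() triggers the job and returns one weight per non-empty partition.
    models.collect().foreach(println)
  }
}
```

Note that this yields one model per partition rather than one per model_ID, which is the limitation raised in the reply to this message.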
Building several models in parallel
Hi,

at the moment I have a dataset which looks like this:

DataSet[model_ID, DataVector] data

What I want to do is group by the model_ID and build one regression model for
each model_ID.

In pseudo code:

data.groupBy(model_ID)
--> MultipleLinearRegression().fit(data_grouped)

Is there any way besides an iteration to do this at the moment?

Thanks for your help,

Felix Neutatz