It's perfectly fine. That way you will construct a data flow graph which will calculate the linear model for all different groups once you define an output and trigger the execution.
On Thu, Jul 9, 2015 at 6:05 PM, Felix Neutatz <neut...@googlemail.com> wrote: > The problem is that I am not able to incorporate this into a Flink > iteration since the MultipleLinearRegression() already contains an > iteration. Therefore this would be a nested iteration ;-) > > So at the moment I just use a for loop like this: > > data: DataSet[model_ID, DataVector] > > for (current_model_ID <- uniqueModel_IDs){ > > trainingDS = data.filter(t => t.model_ID == current_model_ID).map(t => > t.DataVector) > > val mlr = MultipleLinearRegression() > > .setIterations(10) > .setStepsize(stepsize) > > mlr.fit(trainingDS) > > mlr.predict() > > ... > > } > > Is there a more efficient way to do this? > > Thank you for your help, > > Felix > > > > > 2015-07-08 10:58 GMT+02:00 Till Rohrmann <trohrm...@apache.org>: > > > Yes it is. But you can still run the calculation in parallel because > `fit` > > does not trigger the execution of the job graph. It simply builds the > data > > flow. Only if you call `predict` or collect the weights, it is executed. > > > > Cheers, > > Till > > > > On Wed, Jul 8, 2015 at 10:52 AM, Felix Neutatz <neut...@googlemail.com> > > wrote: > > > > > Thanks for the information Till :) > > > > > > So at the moment the iteration is the only way. > > > > > > Best regards, > > > Felix > > > > > > 2015-07-08 10:43 GMT+02:00 Till Rohrmann <trohrm...@apache.org>: > > > > > > > Hi Felix, > > > > > > > > this is currently not supported by FlinkML. The > > MultipleLinearRegression > > > > algorithm expects a DataSet and not a GroupedDataSet as input. What > you > > > can > > > > do is to extract each group from the original DataSet by using a > filter > > > > operation. Once you have done this, you can train the linear model on > > > each > > > > sub part of the DataSet. > > > > > > > > Cheers, > > > > Till > > > > > > > > > > > > On Wed, Jul 8, 2015 at 10:37 AM, Felix Neutatz < > neut...@googlemail.com > > > > > > > wrote: > > > > > > > > > Hi Felix, > > > > > > > > > > thanks for the idea. But doesn't this mean that I can only train > one > > > > model > > > > > per partition? The thing is, I have way more models than that :( > > > > > > > > > > Best regards, > > > > > Felix > > > > > > > > > > 2015-07-07 22:37 GMT+02:00 Felix Schüler <fschue...@posteo.de>: > > > > > > > > > > > Hi Felix! > > > > > > > > > > > > We had a similar usecase and I trained multiple models on > > partitions > > > of > > > > > > my data with mapPartition and the model-parameters (weights) as > > > > > > broadcast variable. If I understood broadcast variables in Flink > > > > > > correctly, you should end up with one model on each TaskManager. > > > > > > > > > > > > Does that work? > > > > > > > > > > > > Felix > > > > > > > > > > > > Am 07.07.2015 um 17:32 schrieb Felix Neutatz: > > > > > > > Hi, > > > > > > > > > > > > > > at the moment I have a dataset which looks like this: > > > > > > > > > > > > > > DataSet[model_ID, DataVector] data > > > > > > > > > > > > > > So what I want to do is group by the model_ID and build for > each > > > > > model_ID > > > > > > > one regression model > > > > > > > > > > > > > > in pseudo code: > > > > > > > data.groupBy(model_ID) > > > > > > > --> MultipleLinearRegression().fit(data_grouped) > > > > > > > > > > > > > > Is there anyway besides an iteration how to do this at the > > moment? > > > > > > > > > > > > > > Thanks for your help, > > > > > > > > > > > > > > Felix Neutatz > > > > > > > > > > > > > > > > > > > > > > > > > > > >