The problem is that I am not able to incorporate this into a Flink iteration since the MultipleLinearRegression() already contains an iteration. Therefore this would be a nested iteration ;-)
So at the moment I just use a for loop like this: data: DataSet[model_ID, DataVector] for (current_model_ID <- uniqueModel_IDs){ trainingDS = data.filter(t => t.model_ID == current_model_ID).map(t => t.DataVector) val mlr = MultipleLinearRegression() .setIterations(10) .setStepsize(stepsize) mlr.fit(trainingDS) mlr.predict() ... } Is there a more efficient way to do this? Thank you for your help, Felix 2015-07-08 10:58 GMT+02:00 Till Rohrmann <trohrm...@apache.org>: > Yes it is. But you can still run the calculation in parallel because `fit` > does not trigger the execution of the job graph. It simply builds the data > flow. Only if you call `predict` or collect the weights, it is executed. > > Cheers, > Till > > On Wed, Jul 8, 2015 at 10:52 AM, Felix Neutatz <neut...@googlemail.com> > wrote: > > > Thanks for the information Till :) > > > > So at the moment the iteration is the only way. > > > > Best regards, > > Felix > > > > 2015-07-08 10:43 GMT+02:00 Till Rohrmann <trohrm...@apache.org>: > > > > > Hi Felix, > > > > > > this is currently not supported by FlinkML. The > MultipleLinearRegression > > > algorithm expects a DataSet and not a GroupedDataSet as input. What you > > can > > > do is to extract each group from the original DataSet by using a filter > > > operation. Once you have done this, you can train the linear model on > > each > > > sub part of the DataSet. > > > > > > Cheers, > > > Till > > > > > > > > > On Wed, Jul 8, 2015 at 10:37 AM, Felix Neutatz <neut...@googlemail.com > > > > > wrote: > > > > > > > Hi Felix, > > > > > > > > thanks for the idea. But doesn't this mean that I can only train one > > > model > > > > per partition? The thing is, I have way more models than that :( > > > > > > > > Best regards, > > > > Felix > > > > > > > > 2015-07-07 22:37 GMT+02:00 Felix Schüler <fschue...@posteo.de>: > > > > > > > > > Hi Felix! > > > > > > > > > > We had a similar usecase and I trained multiple models on > partitions > > of > > > > > my data with mapPartition and the model-parameters (weights) as > > > > > broadcast variable. If I understood broadcast variables in Flink > > > > > correctly, you should end up with one model on each TaskManager. > > > > > > > > > > Does that work? > > > > > > > > > > Felix > > > > > > > > > > Am 07.07.2015 um 17:32 schrieb Felix Neutatz: > > > > > > Hi, > > > > > > > > > > > > at the moment I have a dataset which looks like this: > > > > > > > > > > > > DataSet[model_ID, DataVector] data > > > > > > > > > > > > So what I want to do is group by the model_ID and build for each > > > > model_ID > > > > > > one regression model > > > > > > > > > > > > in pseudo code: > > > > > > data.groupBy(model_ID) > > > > > > --> MultipleLinearRegression().fit(data_grouped) > > > > > > > > > > > > Is there anyway besides an iteration how to do this at the > moment? > > > > > > > > > > > > Thanks for your help, > > > > > > > > > > > > Felix Neutatz > > > > > > > > > > > > > > > > > > > > >