How many iterations are you doing on the data? Like Jörn said, you don't necessarily need a billion samples for linear regression.
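To make the sampling point concrete, here is a rough sketch of the check I have in mind: fit on everything, fit on a small random sample, and compare the coefficients. It is plain Python on synthetic data, not Spark code. The data-generating line (y = 3x + 2 plus noise), the row counts, and the helper name fit_line are all made up for illustration and are not from your job.

```python
import random


def fit_line(points):
    """Closed-form ordinary least squares for y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


rng = random.Random(42)  # fixed seed so the comparison is repeatable

# Synthetic stand-in for the real data (assumption): y = 3x + 2 plus unit noise.
full = []
for _ in range(100_000):
    x = rng.uniform(0.0, 10.0)
    full.append((x, 3.0 * x + 2.0 + rng.gauss(0.0, 1.0)))

# 1% random sample, drawn without replacement.
sample = rng.sample(full, 1_000)

slope_full, intercept_full = fit_line(full)
slope_sample, intercept_sample = fit_line(sample)

print(f"full:   slope={slope_full:.3f} intercept={intercept_full:.3f}")
print(f"sample: slope={slope_sample:.3f} intercept={intercept_sample:.3f}")
```

If the two fits agree to within the accuracy you need, the extra rows are not buying much. On the real dataframe the analogous step would presumably be df.sample(...) with a small fraction before calling fit, but I have not tried that on your data.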
On Tue, Aug 22, 2017 at 6:28 PM, Sea aj <saj3...@gmail.com> wrote:

> Jorn,
>
> My question is not about the model type but instead, the spark capability
> on reusing any already trained ml model in training a new model.
>
> On Tue, Aug 22, 2017 at 1:13 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>
>> Is it really required to have one billion samples for just linear
>> regression? Probably your model would do equally well with much less
>> samples. Have you checked bias and variance if you use much less random
>> samples?
>>
>> On 22. Aug 2017, at 12:58, Sea aj <saj3...@gmail.com> wrote:
>>
>> I have a large dataframe of 1 billion rows of type LabeledPoint. I tried
>> to train a linear regression model on the df but it failed due to lack of
>> memory although I'm using 9 slaves, each with 100gb of ram and 16 cores of
>> CPU.
>>
>> I decided to split my data into multiple chunks and train the model in
>> multiple phases but I learned the linear regression model in ml library
>> does not have "setinitialmodel" function to be able to pass the trained
>> model from one chunk to the rest of chunks. In another word, each time I
>> call the fit function over a chunk of my data, it overwrites the previous
>> mode.
>>
>> So far the only solution I found is using Spark Streaming to be able to
>> split the data to multiple dfs and then train over each individually to
>> overcome memory issue.
>>
>> Do you know if there's any other solution?
>>
>> On Mon, Jul 10, 2017 at 7:57 AM, Jayant Shekhar <jayantbaya...@gmail.com>
>> wrote:
>>
>>> Hello Mahesh,
>>>
>>> We have built one. You can download from here :
>>> https://www.sparkflows.io/download
>>>
>>> Feel free to ping me for any questions, etc.
>>>
>>> Best Regards,
>>> Jayant
>>>
>>> On Sun, Jul 9, 2017 at 9:35 PM, Mahesh Sawaiker <
>>> mahesh_sawai...@persistent.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> 1) Is anyone aware of any workbench kind of tool to run ML jobs in
>>>> spark. Specifically is the tool could be something like a Web application
>>>> that is configured to connect to a spark cluster.
>>>>
>>>> User is able to select input training sets probably from hdfs , train
>>>> and then run predictions, without having to write any Scala code.
>>>>
>>>> 2) If there is not tool, is there value in having such tool, what could
>>>> be the challenges.
>>>>
>>>> Thanks,
>>>>
>>>> Mahesh
>>>>
>>>> DISCLAIMER
>>>> ==========
>>>> This e-mail may contain privileged and confidential information which
>>>> is the property of Persistent Systems Ltd. It is intended only for the use
>>>> of the individual or entity to which it is addressed. If you are not the
>>>> intended recipient, you are not authorized to read, retain, copy, print,
>>>> distribute or use this message. If you have received this communication in
>>>> error, please notify the sender and delete all copies of this message.
>>>> Persistent Systems Ltd. does not accept any liability for virus infected
>>>> mails.

--
Cheers!