Hi Ayan,

After deployment, we might re-train it every month. That is a whole different problem that I haven't explored yet. The classification algorithm will for sure need to Fit against the new dataset to produce a new model. Correct me if I am wrong, but I think I will also Fit a new IDF model based on the new dataset. At that time as well, I will follow the same training-validation split (or cross-validation) to evaluate model performance on the new data before releasing it to make predictions. So AFAIK, every time you need to re-train the model, you will need to cross-validate using some data split strategy.
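To make the monthly re-training idea above a bit more concrete, here is a rough pure-Python sketch, not Spark code. It assumes the smoothed IDF formula log((|D| + 1) / (DF(t, D) + 1)), uses a toy per-label sum of TF-IDF vectors as a stand-in for NaiveBayes, and all function names and sample documents are made up for illustration. The point it tries to show is only that both the IDF model and the classifier are re-fit inside every fold, and once more on all of the new data before release.

```python
import math
from collections import Counter, defaultdict

def fit_idf(docs):
    """Fit: learn IDF weights from these documents only."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return n, {term: math.log((n + 1) / (count + 1)) for term, count in df.items()}

def to_tfidf(idf_model, doc):
    """Transform: term counts of `doc` times the already-fitted IDF weights."""
    n, idf = idf_model
    # A term never seen during fit has DF = 0, hence log(n + 1).
    return {t: c * idf.get(t, math.log(n + 1)) for t, c in Counter(doc).items()}

def fit_classifier(vectors, labels):
    """Toy stand-in for NaiveBayes: sum the TF-IDF vectors per label."""
    centroids = defaultdict(lambda: defaultdict(float))
    for vec, label in zip(vectors, labels):
        for term, weight in vec.items():
            centroids[label][term] += weight
    return centroids

def predict(centroids, vec):
    """Pick the label whose summed vector has the largest dot product with `vec`."""
    def score(label):
        return sum(w * centroids[label].get(t, 0.0) for t, w in vec.items())
    return max(centroids, key=score)

def cross_validate(docs, labels, k=3):
    """k-fold CV where IDF and classifier are both re-fit inside every fold."""
    accuracies = []
    for fold in range(k):
        train_idx = [i for i in range(len(docs)) if i % k != fold]
        test_idx = [i for i in range(len(docs)) if i % k == fold]
        idf_model = fit_idf([docs[i] for i in train_idx])  # training part only
        clf = fit_classifier([to_tfidf(idf_model, docs[i]) for i in train_idx],
                             [labels[i] for i in train_idx])
        hits = sum(predict(clf, to_tfidf(idf_model, docs[i])) == labels[i]
                   for i in test_idx)
        accuracies.append(hits / len(test_idx))
    return sum(accuracies) / k

# Hypothetical labeled data for the new month.
new_docs = [["invoice", "refund"], ["quota", "bonus"], ["invoice", "late"],
            ["bonus", "quota"], ["refund", "invoice"], ["quota", "commission"]]
new_labels = ["billing", "sales", "billing", "sales", "billing", "sales"]

mean_accuracy = cross_validate(new_docs, new_labels, k=3)

# If the score looks acceptable, re-fit on all of the new data and release that.
release_idf = fit_idf(new_docs)
release_clf = fit_classifier([to_tfidf(release_idf, d) for d in new_docs], new_labels)
```

In Spark the same shape would presumably be a Pipeline wrapped in a cross-validator, but I have not verified that here, so treat this only as an outline of the control flow.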
I think the Spark ML documentation should start by explaining the mathematical model or a simple algorithm for what Fit and Transform mean for a particular algorithm (IDF, NaiveBayes).

Thanks

On Tue, Nov 1, 2016 at 5:45 AM, ayan guha <guha.a...@gmail.com> wrote:

> I have come across similar situation recently and decided to run Training
> workflow less frequently than scoring workflow.
>
> In your use case I would imagine you will run IDF fit workflow once in say
> a week. It will produce a model object which will be saved. In scoring
> workflow, you will typically see new unseen dataset and the model generated
> in training flow will be used to score or label this new dataset.
>
> Note, train and test datasets are used during development phase when you
> are trying to find out which model to use and efficiency/performance/accuracy
> etc. It will never be part of workflow. In a little elaborate setting you
> may want to automate model evaluations, but that's a different story.
>
> Not sure if I could explain properly, please feel free to comment.
> On 1 Nov 2016 22:54, "Nirav Patel" <npa...@xactlycorp.com> wrote:
>
>> Yes, I do apply NaiveBayes after IDF.
>>
>> "you can re-train (fit) on all your data before applying it to unseen
>> data." Did you mean I can reuse that model to Transform both training and
>> test data?
>>
>> Here's the process:
>>
>> Datasets:
>>
>> 1. Full sample data (labeled)
>> 2. Training (labeled)
>> 3. Test (labeled)
>> 4. Unseen (non-labeled)
>>
>> Here are two workflow options I see:
>>
>> Option - 1 (currently using)
>>
>> 1. Fit IDF model (idf-1) on full Sample data
>> 2. Apply(Transform) idf-1 on full sample data
>> 3. Split data set into Training and Test data
>> 4. Fit ML model on Training data
>> 5. Apply(Transform) model on Test data
>> 6. Apply(Transform) idf-1 on Unseen data
>> 7. Apply(Transform) model on Unseen data
>>
>> Option - 2
>>
>> 1. Split sample data into Training and Test data
>> 2. Fit IDF model (idf-1) only on training data
>> 3. Apply(Transform) idf-1 on training data
>> 4. Apply(Transform) idf-1 on test data
>> 5. Fit ML model on Training data
>> 6. Apply(Transform) model on Test data
>> 7. Apply(Transform) idf-1 on Unseen data
>> 8. Apply(Transform) model on Unseen data
>>
>> So you are suggesting Option-2 in this particular case, right?
>>
>> On Tue, Nov 1, 2016 at 4:24 AM, Robin East <robin.e...@xense.co.uk>
>> wrote:
>>
>>> Fit it on training data to evaluate the model. You can either use that
>>> model to apply to unseen data or you can re-train (fit) on all your data
>>> before applying it to unseen data.
>>>
>>> fit and transform are 2 different things: fit creates a model, transform
>>> applies a model to data to create transformed output. If you are using your
>>> training data in a subsequent step (e.g. running logistic regression or
>>> some other machine learning algorithm) then you need to transform your
>>> training data using the IDF model before passing it through the next step.
>>>
>>> -------------------------------------------------------------------------------
>>> Robin East
>>> *Spark GraphX in Action* Michael Malak and Robin East
>>> Manning Publications Co.
>>> http://www.manning.com/books/spark-graphx-in-action
>>>
>>> On 1 Nov 2016, at 11:18, Nirav Patel <npa...@xactlycorp.com> wrote:
>>>
>>> Just to re-iterate what you said, I should fit IDF model only on
>>> training data and then re-use it for both test data and then later on
>>> unseen data to make predictions.
>>>
>>> On Tue, Nov 1, 2016 at 3:49 AM, Robin East <robin.e...@xense.co.uk>
>>> wrote:
>>>
>>>> The point of setting aside a portion of your data as a test set is to
>>>> try and mimic applying your model to unseen data. If you fit your IDF model
>>>> to all your data, any evaluation you perform on your test set is likely to
>>>> over perform compared to ‘real’ unseen data. Effectively you would have
>>>> overfit your model.
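The difference between Option 1 and Option 2 above can be illustrated with a small pure-Python sketch. It is not Spark code. It assumes the smoothed formula log((|D| + 1) / (DF(t, D) + 1)), and the documents, the 4/2 split and the helper names are invented for the example. It only shows that an IDF model fit on the full sample gives different term weights than one fit on the training split, which means the test documents have already influenced the features in Option 1.

```python
import math
from collections import Counter

def fit_idf(docs):
    """Fit: count document frequencies in `docs` and turn them into IDF weights."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return n, {term: math.log((n + 1) / (count + 1)) for term, count in df.items()}

def idf_of(model, term):
    """Look up a fitted weight; a term with DF = 0 gets log(n + 1)."""
    n, weights = model
    return weights.get(term, math.log(n + 1))

# Hypothetical tokenized sample data (labels omitted, they do not affect IDF).
sample = [["refund", "invoice"], ["invoice", "late"], ["quota", "bonus"],
          ["bonus", "late"], ["refund", "bonus"], ["quota", "invoice"]]
train, test = sample[:4], sample[4:]

idf_option1 = fit_idf(sample)  # Option 1: fit on full sample, test docs included
idf_option2 = fit_idf(train)   # Option 2: fit on training data only

# How much each weight shifts because the test documents were seen during fit.
vocabulary = sorted({term for doc in sample for term in doc})
shift = {term: idf_of(idf_option1, term) - idf_of(idf_option2, term)
         for term in vocabulary}

for term in vocabulary:
    print(f"{term:8s} option1={idf_of(idf_option1, term):.3f} "
          f"option2={idf_of(idf_option2, term):.3f}")
```

Any non-zero entry in `shift` is information from the test split that reached the training features, which is the leak described in the reply above.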
>>>>
>>>> On 1 Nov 2016, at 10:15, Nirav Patel <npa...@xactlycorp.com> wrote:
>>>>
>>>> FYI, I do reuse IDF model while making prediction against new unlabeled
>>>> data but not between training and test data while training a model.
>>>>
>>>> On Tue, Nov 1, 2016 at 3:10 AM, Nirav Patel <npa...@xactlycorp.com>
>>>> wrote:
>>>>
>>>>> I am using IDF estimator/model (TF-IDF) to convert text features into
>>>>> vectors. Currently, I fit IDF model on all sample data and then transform
>>>>> them. I read somewhere that I should split my data into training and test
>>>>> before fitting IDF model; Fit IDF only on training data and then use same
>>>>> transformer to transform training and test data.
>>>>> This raises more questions:
>>>>> 1) Why would you do that? What exactly does IDF learn during the fitting
>>>>> process that it can reuse to transform any new dataset? Perhaps the idea is to
>>>>> keep the same values for |D| and DF|t, D| while using a new TF|t, D|?
>>>>> 2) If not then fitting and transforming seems redundant for IDF model
>>>>>
>>>>
>>>> [image: What's New with Xactly]
>>>> <http://www.xactlycorp.com/email-click/>
>>>>
>>>> <https://www.nyse.com/quote/XNYS:XTLY> [image: LinkedIn]
>>>> <https://www.linkedin.com/company/xactly-corporation> [image: Twitter]
>>>> <https://twitter.com/Xactly> [image: Facebook]
>>>> <https://www.facebook.com/XactlyCorp> [image: YouTube]
>>>> <http://www.youtube.com/xactlycorporation>
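On the original question of what IDF learns during Fit: a minimal pure-Python sketch, assuming the smoothed formula log((|D| + 1) / (DF(t, D) + 1)) and a plain dict as the "model". This is an illustration of the idea and not Spark's implementation, and the function names and documents are hypothetical. Fit freezes |D| and DF(t, D) into one weight per term. Transform only computes TF for the document it is given and multiplies it by those frozen weights, so the two steps are not redundant.

```python
import math
from collections import Counter

def fit_idf(docs):
    """Fit step: learn one weight per term from the fitting corpus only."""
    num_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log((num_docs + 1) / (df + 1))
           for term, df in doc_freq.items()}
    return {"num_docs": num_docs, "idf": idf}

def transform(model, doc):
    """Transform step: new TF from `doc`, frozen IDF weights from `model`."""
    # A term never seen during fit has DF = 0, so it gets log(|D| + 1).
    unseen_idf = math.log(model["num_docs"] + 1)
    tf = Counter(doc)
    return {term: count * model["idf"].get(term, unseen_idf)
            for term, count in tf.items()}

# Hypothetical tokenized training documents.
training_docs = [["spark", "ml", "idf"],
                 ["spark", "naive", "bayes"],
                 ["spark", "idf"]]
model = fit_idf(training_docs)

# A new document is transformed without touching |D| or DF again.
new_doc = ["spark", "idf", "idf", "pipeline"]
weights = transform(model, new_doc)
```

Here "spark" appears in every training document, so its weight is zero regardless of how often it occurs in `new_doc`, while "pipeline" was never seen during Fit and falls back to the DF = 0 weight.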