Hi Ayan,

"classification algorithm will for sure need to Fit against new dataset to produce new model"

I said this in the context of re-training the model. Is that not correct? Isn't fitting part of re-training?

Thanks

On Tue, Nov 1, 2016 at 4:01 PM, ayan guha <guha.a...@gmail.com> wrote:

> Hi
>
> "classification algorithm will for sure need to Fit against new dataset
> to produce new model" - I do not think this is correct. Maybe we are
> talking semantics but AFAIU, you "train" one model using some dataset, and
> then use it for scoring new datasets.
>
> You may re-train every month, yes. And you may run cross validation once a
> month (after re-training) or at a lower frequency, like once in 2-3 months,
> to validate model quality. Here, the number of months is not important, but
> you would typically run cross validation and similar "model evaluation"
> workflows at a lower frequency than the re-training process.
>
> On Wed, Nov 2, 2016 at 5:48 AM, Nirav Patel <npa...@xactlycorp.com> wrote:
>
>> Hi Ayan,
>> After deployment, we might re-train it every month. That is a whole
>> different problem I have not explored yet. The classification algorithm
>> will for sure need to Fit against the new dataset to produce a new model.
>> Correct me if I am wrong, but I think I will also Fit a new IDF model based
>> on the new dataset. At that time as well I will follow the same
>> training-validation split (or cross-validation) to evaluate model
>> performance on the new data before releasing it to make predictions. So
>> AFAIK, every time you need to re-train the model you will need to
>> cross-validate using some data split strategy.
>>
>> I think the Spark ML documentation should start explaining the mathematical
>> model or a simple algorithm for what Fit and Transform mean for a
>> particular algorithm (IDF, NaiveBayes).
>>
>> Thanks
>>
>> On Tue, Nov 1, 2016 at 5:45 AM, ayan guha <guha.a...@gmail.com> wrote:
>>
>>> I have come across a similar situation recently and decided to run the
>>> training workflow less frequently than the scoring workflow.
>>>
>>> In your use case I would imagine you will run the IDF fit workflow once
>>> in, say, a week. It will produce a model object which will be saved.
>>> In the scoring workflow, you will typically see a new unseen dataset, and
>>> the model generated in the training flow will be used to score or label
>>> this new dataset.
>>>
>>> Note, train and test datasets are used during the development phase when
>>> you are trying to find out which model to use and its
>>> efficiency/performance/accuracy etc. They will never be part of the
>>> workflow. In a slightly more elaborate setting you may want to automate
>>> model evaluations, but that's a different story.
>>>
>>> Not sure if I could explain properly, please feel free to comment.
>>> On 1 Nov 2016 22:54, "Nirav Patel" <npa...@xactlycorp.com> wrote:
>>>
>>>> Yes, I do apply NaiveBayes after IDF.
>>>>
>>>> "you can re-train (fit) on all your data before applying it to unseen
>>>> data." Did you mean I can reuse that model to Transform both training and
>>>> test data?
>>>>
>>>> Here's the process:
>>>>
>>>> Datasets:
>>>>
>>>> 1. Full sample data (labeled)
>>>> 2. Training (labeled)
>>>> 3. Test (labeled)
>>>> 4. Unseen (non-labeled)
>>>>
>>>> Here are two workflow options I see:
>>>>
>>>> Option - 1 (currently using)
>>>>
>>>> 1. Fit IDF model (idf-1) on full Sample data
>>>> 2. Apply(Transform) idf-1 on full sample data
>>>> 3. Split data set into Training and Test data
>>>> 4. Fit ML model on Training data
>>>> 5. Apply(Transform) model on Test data
>>>> 6. Apply(Transform) idf-1 on Unseen data
>>>> 7. Apply(Transform) model on Unseen data
>>>>
>>>> Option - 2
>>>>
>>>> 1. Split sample data into Training and Test data
>>>> 2. Fit IDF model (idf-1) only on training data
>>>> 3. Apply(Transform) idf-1 on training data
>>>> 4. Apply(Transform) idf-1 on test data
>>>> 5. Fit ML model on Training data
>>>> 6. Apply(Transform) model on Test data
>>>> 7. Apply(Transform) idf-1 on Unseen data
>>>> 8. Apply(Transform) model on Unseen data
>>>>
>>>> So you are suggesting Option-2 in this particular case, right?
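
The ordering in Option 2 above can be sketched in plain Python. This is only an illustrative stand-in, not the Spark ML API: `fit_idf`, `transform`, and the tiny `sample` dataset are hypothetical names invented here, and the classifier steps are left as a comment. The smoothed weight log((|D| + 1) / (DF + 1)) is the formula given in the Spark MLlib documentation. The last lines contrast with Option 1 to show how the test rows influence the weights when IDF is fitted on the full sample.

```python
import math
from collections import Counter

def fit_idf(docs):
    """Return a function term -> IDF weight, learned from `docs` only."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))  # document frequency per term
    # Counter returns 0 for terms it has never seen, so unseen terms get log(n + 1).
    return lambda t: math.log((n + 1) / (df[t] + 1))

def transform(docs, idf):
    """Apply an already-fitted IDF to any dataset (training, test, or unseen)."""
    return [{t: c * idf(t) for t, c in Counter(doc).items()} for doc in docs]

# Hypothetical labeled sample: (tokens, label)
sample = [
    (["refund", "late"], "complaint"),
    (["refund", "thanks"], "praise"),
    (["late", "delivery"], "complaint"),
    (["great", "thanks"], "praise"),
    (["late", "refund", "angry"], "complaint"),
    (["great", "service"], "praise"),
]

# Option 2, step 1: split first (a fixed split here, to keep the sketch deterministic).
train, test = sample[:4], sample[4:]
train_docs = [tokens for tokens, _ in train]
test_docs = [tokens for tokens, _ in test]

# Steps 2-4: fit idf-1 on training data only, then reuse it for both splits.
idf_1 = fit_idf(train_docs)
train_vecs = transform(train_docs, idf_1)
test_vecs = transform(test_docs, idf_1)

# Steps 5-6 (fit a classifier on train_vecs, evaluate on test_vecs) are omitted.

# Steps 7-8: the same idf-1 is applied to unseen, unlabeled data.
unseen_vecs = transform([["refund", "angry"]], idf_1)

# Contrast with Option 1: fitting on the full sample lets test rows shape the weights.
idf_all = fit_idf([tokens for tokens, _ in sample])
leaked = idf_all("angry") != idf_1("angry")
```

In this toy data "angry" only occurs in a test row, so its weight differs between the two fits; that difference is the information leak Option 2 avoids.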
>>>>
>>>> On Tue, Nov 1, 2016 at 4:24 AM, Robin East <robin.e...@xense.co.uk>
>>>> wrote:
>>>>
>>>>> Fit it on training data to evaluate the model. You can either use that
>>>>> model to apply to unseen data or you can re-train (fit) on all your data
>>>>> before applying it to unseen data.
>>>>>
>>>>> fit and transform are 2 different things: fit creates a model,
>>>>> transform applies a model to data to create transformed output. If you are
>>>>> using your training data in a subsequent step (e.g. running logistic
>>>>> regression or some other machine learning algorithm) then you need to
>>>>> transform your training data using the IDF model before passing it through
>>>>> the next step.
>>>>>
>>>>> -------------------------------------------------------------------------------
>>>>> Robin East
>>>>> *Spark GraphX in Action* Michael Malak and Robin East
>>>>> Manning Publications Co.
>>>>> http://www.manning.com/books/spark-graphx-in-action
>>>>>
>>>>> On 1 Nov 2016, at 11:18, Nirav Patel <npa...@xactlycorp.com> wrote:
>>>>>
>>>>> Just to reiterate what you said, I should fit the IDF model only on
>>>>> training data and then re-use it for both test data and, later on,
>>>>> unseen data to make predictions.
>>>>>
>>>>> On Tue, Nov 1, 2016 at 3:49 AM, Robin East <robin.e...@xense.co.uk>
>>>>> wrote:
>>>>>
>>>>>> The point of setting aside a portion of your data as a test set is to
>>>>>> try and mimic applying your model to unseen data. If you fit your IDF
>>>>>> model to all your data, any evaluation you perform on your test set is
>>>>>> likely to overperform compared to ‘real’ unseen data. Effectively you
>>>>>> would have overfit your model.
>>>>>>
>>>>>> On 1 Nov 2016, at 10:15, Nirav Patel <npa...@xactlycorp.com> wrote:
>>>>>>
>>>>>> FYI, I do reuse the IDF model while making predictions against new
>>>>>> unlabeled data, but not between training and test data while training a
>>>>>> model.
>>>>>>
>>>>>> On Tue, Nov 1, 2016 at 3:10 AM, Nirav Patel <npa...@xactlycorp.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I am using the IDF estimator/model (TF-IDF) to convert text features
>>>>>>> into vectors. Currently, I fit the IDF model on all sample data and then
>>>>>>> transform it. I read somewhere that I should split my data into
>>>>>>> training and test sets before fitting the IDF model: fit IDF only on
>>>>>>> training data and then use the same transformer to transform both
>>>>>>> training and test data.
>>>>>>> This raises more questions:
>>>>>>> 1) Why would you do that? What exactly does IDF learn during the fitting
>>>>>>> process that it can reuse to transform any new dataset? Perhaps the idea
>>>>>>> is to keep the same values for |D| and DF(t, D) while using a new
>>>>>>> TF(t, d)?
>>>>>>> 2) If not then fitting and transforming seems redundant for IDF model
>>>>>>
>>>>>> [image: What's New with Xactly]
>>>>>> <http://www.xactlycorp.com/email-click/>
>>>>>>
>>>>>> <https://www.nyse.com/quote/XNYS:XTLY> [image: LinkedIn]
>>>>>> <https://www.linkedin.com/company/xactly-corporation> [image:
>>>>>> Twitter] <https://twitter.com/Xactly> [image: Facebook]
>>>>>> <https://www.facebook.com/XactlyCorp> [image: YouTube]
>>>>>> <http://www.youtube.com/xactlycorporation>
>
> --
> Best Regards,
> Ayan Guha
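
On question 1 in the original post (what IDF learns during fitting), a minimal plain-Python sketch may help. It is not the Spark implementation: `fit_idf` and `transform_tfidf` are hypothetical helpers keyed by term rather than by hashed feature index, and the toy documents are made up. The only thing taken from the Spark MLlib documentation is the smoothed formula IDF(t, D) = log((|D| + 1) / (DF(t, D) + 1)). Fit computes |D| and DF(t, D) once and freezes the resulting weights; transform computes a fresh TF for each new document and multiplies it by those frozen weights.

```python
import math
from collections import Counter

def fit_idf(docs):
    """Learn one IDF weight per term from a list of tokenized documents."""
    n_docs = len(docs)                      # |D|
    df = Counter()
    for doc in docs:
        df.update(set(doc))                 # DF(t, D): count each term once per document
    weights = {t: math.log((n_docs + 1) / (df_t + 1)) for t, df_t in df.items()}
    default = math.log(n_docs + 1)          # weight for a term never seen during fit (DF = 0)
    return weights, default

def transform_tfidf(doc, model):
    """Compute TF on the new document and scale it by the frozen IDF weights."""
    weights, default = model
    tf = Counter(doc)                       # TF(t, d) comes from the document being transformed
    return {t: count * weights.get(t, default) for t, count in tf.items()}

# Hypothetical training corpus
train = [["spark", "ml", "idf"], ["spark", "naive", "bayes"], ["spark", "idf"]]
model = fit_idf(train)

# A new document: its term counts are new, the IDF weights are not re-estimated.
unseen = ["spark", "idf", "idf", "graphx"]
vec = transform_tfidf(unseen, model)
```

Here "spark" appears in every training document, so its weight is log(4/4) = 0 no matter how often it occurs in the new document, which is the sense in which |D| and DF are kept fixed while TF is recomputed.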