Yes, that is correct. I think I misread part of it in terms of scoring. I think we are both saying the same thing, so that's a good thing :)
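For the archive, here is a minimal pure-Python sketch of the fit/transform split discussed in this thread. It is not Spark's implementation: `fit_idf` and `transform_tfidf` are made-up names, the toy documents are invented, and it assumes the smoothed formula log((|D| + 1) / (DF(t, D) + 1)) that the Spark MLlib documentation gives for IDF. Giving terms never seen during fit a weight of 0.0 is a simplification chosen for this sketch only.

```python
import math
from collections import Counter


def fit_idf(train_docs):
    """Learn IDF weights from tokenized training documents only."""
    num_docs = len(train_docs)
    doc_freq = Counter()
    for doc in train_docs:
        doc_freq.update(set(doc))  # count each term once per document
    # Smoothed IDF (assumed formula): log((|D| + 1) / (DF(t, D) + 1))
    return {t: math.log((num_docs + 1) / (df + 1)) for t, df in doc_freq.items()}


def transform_tfidf(idf, doc):
    """Weight one document's own term counts by the stored IDF values."""
    tf = Counter(doc)
    # Terms never seen during fit get 0.0 here (a choice made for this sketch).
    return {t: count * idf.get(t, 0.0) for t, count in tf.items()}


# Hypothetical toy data standing in for the training and test splits.
train = [["spark", "ml", "idf"], ["spark", "naive", "bayes"], ["spark", "idf"]]
test = [["idf", "bayes", "bayes"], ["spark", "unseen"]]

idf = fit_idf(train)  # fit once, on training data only
train_vecs = [transform_tfidf(idf, d) for d in train]
test_vecs = [transform_tfidf(idf, d) for d in test]  # reuse the same weights
```

What fit stores is |D| and the per-term document frequencies. Transform only brings in the term counts of the document being scored, so test and unseen documents never change the learned weights.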

On Wed, Nov 2, 2016 at 10:04 AM, Nirav Patel <npa...@xactlycorp.com> wrote:

> Hi Ayan,
>
> "classification algorithm will for sure need to Fit against new dataset
> to produce new model" I said this in the context of re-training the
> model. Is it not correct? Isn't it part of re-training?
>
> Thanks
>
> On Tue, Nov 1, 2016 at 4:01 PM, ayan guha <guha.a...@gmail.com> wrote:
>
>> Hi
>>
>> "classification algorithm will for sure need to Fit against new dataset
>> to produce new model" - I do not think this is correct. Maybe we are
>> talking semantics, but AFAIU you "train" one model using some dataset
>> and then use it for scoring new datasets.
>>
>> You may re-train every month, yes. And you may run cross-validation
>> once a month (after re-training) or at a lower frequency, like once in
>> 2-3 months, to validate model quality. The number of months is not
>> important here, but you would typically run cross-validation and
>> similar "model evaluation" workflows at a lower frequency than the
>> re-training process.
>>
>> On Wed, Nov 2, 2016 at 5:48 AM, Nirav Patel <npa...@xactlycorp.com>
>> wrote:
>>
>>> Hi Ayan,
>>> After deployment, we might re-train it every month. That is a whole
>>> different problem I have not explored yet. The classification
>>> algorithm will for sure need to Fit against the new dataset to produce
>>> a new model. Correct me if I am wrong, but I think I will also Fit a
>>> new IDF model based on the new dataset. At that time as well I will
>>> follow the same training-validation split (or cross-validation) to
>>> evaluate model performance on new data before releasing it to make
>>> predictions. So AFAIK, every time you need to re-train the model you
>>> will need to cross-validate using some data split strategy.
>>>
>>> I think the Spark ML documentation should start by explaining, with a
>>> mathematical model or a simple algorithm, what Fit and Transform mean
>>> for a particular algorithm (IDF, NaiveBayes).
>>>
>>> Thanks
>>>
>>> On Tue, Nov 1, 2016 at 5:45 AM, ayan guha <guha.a...@gmail.com> wrote:
>>>
>>>> I have come across a similar situation recently and decided to run
>>>> the training workflow less frequently than the scoring workflow.
>>>>
>>>> In your use case I would imagine you will run the IDF fit workflow
>>>> once in, say, a week. It will produce a model object, which will be
>>>> saved. In the scoring workflow, you will typically see a new, unseen
>>>> dataset, and the model generated in the training flow will be used to
>>>> score or label this new dataset.
>>>>
>>>> Note, train and test datasets are used during the development phase,
>>>> when you are trying to find out which model to use and its
>>>> efficiency/performance/accuracy etc. They will never be part of the
>>>> workflow. In a slightly more elaborate setting you may want to
>>>> automate model evaluations, but that's a different story.
>>>>
>>>> Not sure if I could explain properly, please feel free to comment.
>>>> On 1 Nov 2016 22:54, "Nirav Patel" <npa...@xactlycorp.com> wrote:
>>>>
>>>>> Yes, I do apply NaiveBayes after IDF.
>>>>>
>>>>> "you can re-train (fit) on all your data before applying it to
>>>>> unseen data." Did you mean I can reuse that model to Transform both
>>>>> training and test data?
>>>>>
>>>>> Here's the process:
>>>>>
>>>>> Datasets:
>>>>>
>>>>> 1. Full sample data (labeled)
>>>>> 2. Training (labeled)
>>>>> 3. Test (labeled)
>>>>> 4. Unseen (non-labeled)
>>>>>
>>>>> Here are two workflow options I see:
>>>>>
>>>>> Option - 1 (currently using)
>>>>>
>>>>> 1. Fit IDF model (idf-1) on full Sample data
>>>>> 2. Apply(Transform) idf-1 on full sample data
>>>>> 3. Split data set into Training and Test data
>>>>> 4. Fit ML model on Training data
>>>>> 5. Apply(Transform) model on Test data
>>>>> 6. Apply(Transform) idf-1 on Unseen data
>>>>> 7. Apply(Transform) model on Unseen data
>>>>>
>>>>> Option - 2
>>>>>
>>>>> 1. Split sample data into Training and Test data
>>>>> 2. Fit IDF model (idf-1) only on training data
>>>>> 3. Apply(Transform) idf-1 on training data
>>>>> 4. Apply(Transform) idf-1 on test data
>>>>> 5. Fit ML model on Training data
>>>>> 6. Apply(Transform) model on Test data
>>>>> 7. Apply(Transform) idf-1 on Unseen data
>>>>> 8. Apply(Transform) model on Unseen data
>>>>>
>>>>> So you are suggesting Option-2 in this particular case, right?
>>>>>
>>>>> On Tue, Nov 1, 2016 at 4:24 AM, Robin East <robin.e...@xense.co.uk>
>>>>> wrote:
>>>>>
>>>>>> Fit it on training data to evaluate the model. You can either use
>>>>>> that model to apply to unseen data or you can re-train (fit) on all
>>>>>> your data before applying it to unseen data.
>>>>>>
>>>>>> fit and transform are 2 different things: fit creates a model,
>>>>>> transform applies a model to data to create transformed output. If
>>>>>> you are using your training data in a subsequent step (e.g. running
>>>>>> logistic regression or some other machine learning algorithm) then
>>>>>> you need to transform your training data using the IDF model before
>>>>>> passing it through the next step.
>>>>>>
>>>>>> ----------------------------------------------------------------
>>>>>> Robin East
>>>>>> *Spark GraphX in Action* Michael Malak and Robin East
>>>>>> Manning Publications Co.
>>>>>> http://www.manning.com/books/spark-graphx-in-action
>>>>>>
>>>>>> On 1 Nov 2016, at 11:18, Nirav Patel <npa...@xactlycorp.com> wrote:
>>>>>>
>>>>>> Just to re-iterate what you said, I should fit the IDF model only
>>>>>> on training data and then re-use it for both test data and then
>>>>>> later on unseen data to make predictions.
>>>>>>
>>>>>> On Tue, Nov 1, 2016 at 3:49 AM, Robin East <robin.e...@xense.co.uk>
>>>>>> wrote:
>>>>>>
>>>>>>> The point of setting aside a portion of your data as a test set is
>>>>>>> to try and mimic applying your model to unseen data. If you fit
>>>>>>> your IDF model to all your data, any evaluation you perform on
>>>>>>> your test set is likely to over perform compared to ‘real’ unseen
>>>>>>> data. Effectively you would have overfit your model.
>>>>>>>
>>>>>>> On 1 Nov 2016, at 10:15, Nirav Patel <npa...@xactlycorp.com> wrote:
>>>>>>>
>>>>>>> FYI, I do reuse the IDF model while making predictions against new
>>>>>>> unlabeled data, but not between training and test data while
>>>>>>> training a model.
>>>>>>>
>>>>>>> On Tue, Nov 1, 2016 at 3:10 AM, Nirav Patel <npa...@xactlycorp.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I am using the IDF estimator/model (TF-IDF) to convert text
>>>>>>>> features into vectors. Currently, I fit the IDF model on all
>>>>>>>> sample data and then transform them. I read somewhere that I
>>>>>>>> should split my data into training and test before fitting the
>>>>>>>> IDF model; fit IDF only on training data and then use the same
>>>>>>>> transformer to transform training and test data.
>>>>>>>> This raises more questions:
>>>>>>>> 1) Why would you do that? What exactly does IDF learn during the
>>>>>>>> fitting process that it can reuse to transform any new dataset?
>>>>>>>> Perhaps the idea is to keep the same values for |D| and DF(t, D)
>>>>>>>> while using the new TF(t, d)?
>>>>>>>> 2) If not, then fitting and transforming seem redundant for the
>>>>>>>> IDF model.
>>>>>>>
>>>>>>> [image: What's New with Xactly]
>>>>>>> <http://www.xactlycorp.com/email-click/>
>>>>>>>
>>>>>>> <https://www.nyse.com/quote/XNYS:XTLY> [image: LinkedIn]
>>>>>>> <https://www.linkedin.com/company/xactly-corporation> [image:
>>>>>>> Twitter] <https://twitter.com/Xactly> [image: Facebook]
>>>>>>> <https://www.facebook.com/XactlyCorp> [image: YouTube]
>>>>>>> <http://www.youtube.com/xactlycorporation>
>>
>> --
>> Best Regards,
>> Ayan Guha

--
Best Regards,
Ayan Guha