Yes, I do apply NaiveBayes after IDF.

Regarding "you can re-train (fit) on all your data before applying it to unseen data": did you mean I can reuse that same fitted model to transform both training and test data?

Here's the process:

Datasets:
1. Full sample data (labeled)
2. Training (labeled)
3. Test (labeled)
4. Unseen (non-labeled)

Here are two workflow options I see:

Option 1 (currently using)
1. Fit IDF model (idf-1) on full sample data
2. Apply (transform) idf-1 on full sample data
3. Split data set into training and test data
4. Fit ML model on training data
5. Apply (transform) model on test data
6. Apply (transform) idf-1 on unseen data
7. Apply (transform) model on unseen data

Option 2
1. Split sample data into training and test data
2. Fit IDF model (idf-1) only on training data
3. Apply (transform) idf-1 on training data
4. Apply (transform) idf-1 on test data
5. Fit ML model on training data
6. Apply (transform) model on test data
7. Apply (transform) idf-1 on unseen data
8. Apply (transform) model on unseen data

So you are suggesting Option 2 in this particular case, right?

On Tue, Nov 1, 2016 at 4:24 AM, Robin East <robin.e...@xense.co.uk> wrote:

> Fit it on training data to evaluate the model. You can either use that
> model to apply to unseen data or you can re-train (fit) on all your data
> before applying it to unseen data.
>
> fit and transform are 2 different things: fit creates a model, transform
> applies a model to data to create transformed output. If you are using your
> training data in a subsequent step (e.g. running logistic regression or
> some other machine learning algorithm) then you need to transform your
> training data using the IDF model before passing it through the next step.
>
> -------------------------------------------------------------------------------
> Robin East
> *Spark GraphX in Action* Michael Malak and Robin East
> Manning Publications Co.
> http://www.manning.com/books/spark-graphx-in-action
>
> On 1 Nov 2016, at 11:18, Nirav Patel <npa...@xactlycorp.com> wrote:
>
> Just to reiterate what you said, I should fit the IDF model only on training
> data and then reuse it for both test data and, later on, unseen data to
> make predictions.
>
> On Tue, Nov 1, 2016 at 3:49 AM, Robin East <robin.e...@xense.co.uk> wrote:
>
>> The point of setting aside a portion of your data as a test set is to try
>> and mimic applying your model to unseen data. If you fit your IDF model to
>> all your data, any evaluation you perform on your test set is likely to
>> overperform compared to ‘real’ unseen data. Effectively you would have
>> overfit your model.
>>
>> On 1 Nov 2016, at 10:15, Nirav Patel <npa...@xactlycorp.com> wrote:
>>
>> FYI, I do reuse the IDF model while making predictions against new unlabeled
>> data, but not between training and test data while training a model.
>>
>> On Tue, Nov 1, 2016 at 3:10 AM, Nirav Patel <npa...@xactlycorp.com>
>> wrote:
>>
>>> I am using the IDF estimator/model (TF-IDF) to convert text features into
>>> vectors. Currently, I fit the IDF model on all sample data and then transform
>>> it. I read somewhere that I should split my data into training and test
>>> before fitting the IDF model: fit IDF only on training data and then use the
>>> same transformer to transform training and test data.
>>> This raises more questions:
>>> 1) Why would you do that? What exactly does IDF learn during the fitting
>>> process that it can reuse to transform any new dataset? Perhaps the idea is
>>> to keep the same values for |D| and DF(t, D) while using a new TF(t, d)?
>>> 2) If not, then fitting and transforming seem redundant for the IDF model.

-- 
[image: What's New with Xactly] <http://www.xactlycorp.com/email-click/>
<https://www.nyse.com/quote/XNYS:XTLY>
[image: LinkedIn] <https://www.linkedin.com/company/xactly-corporation>
[image: Twitter] <https://twitter.com/Xactly>
[image: Facebook] <https://www.facebook.com/XactlyCorp>
[image: YouTube] <http://www.youtube.com/xactlycorporation>
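
P.S. To make the fit/transform distinction from this thread concrete, here is a minimal pure-Python sketch of Option 2. It is not Spark's implementation. It assumes the smoothed formula idf(t) = log((|D| + 1) / (DF(t, D) + 1)) described in the Spark MLlib feature-extraction docs, and the names fit_idf and transform_tfidf and the toy documents are hypothetical, made up for illustration. The point is that fit only learns one weight per term from |D| and DF(t, D) of the training data, while transform combines those frozen weights with the fresh term counts of whatever document it is given.

```python
import math
from collections import Counter


def fit_idf(docs):
    """'Fit': learn one IDF weight per term from tokenized documents."""
    n_docs = len(docs)  # |D|
    doc_freq = Counter()  # DF(t, D)
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    # Smoothed IDF (assumed formula): log((|D| + 1) / (DF(t, D) + 1))
    return {t: math.log((n_docs + 1) / (df + 1)) for t, df in doc_freq.items()}


def transform_tfidf(idf, tokens):
    """'Transform': weight a document's own term counts by the fitted IDF."""
    tf = Counter(tokens)  # TF(t, d) comes from the new document
    # Assumption of this sketch: terms never seen during fit get weight 0.0.
    return {t: count * idf.get(t, 0.0) for t, count in tf.items()}


# Toy, hypothetical data standing in for the training / test / unseen sets.
train = [["spark", "ml", "idf"], ["spark", "graphx"], ["spark", "naive", "bayes"]]
test = [["spark", "idf", "idf"], ["flink", "ml"]]
unseen = [["graphx", "ml"]]

# Option 2: fit on training data only, then reuse the same model everywhere.
idf_1 = fit_idf(train)
train_vecs = [transform_tfidf(idf_1, d) for d in train]
test_vecs = [transform_tfidf(idf_1, d) for d in test]
unseen_vecs = [transform_tfidf(idf_1, d) for d in unseen]

print(idf_1)
print(test_vecs)
```

Because idf_1 never sees the test or unseen documents, their document frequencies cannot leak into the weights, which is the reason given earlier in the thread for fitting on training data only. Under Option 1 the same fit_idf call would receive train + test, and the test set would influence the features it is later evaluated on.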