The point of setting aside a portion of your data as a test set is to mimic 
applying your model to unseen data. If you fit your IDF model on all of your 
data, the document frequencies it learns already include the test documents, 
so any evaluation you perform on the test set is likely to look better than 
it would on ‘real’ unseen data. Effectively, you would have overfit your 
model to the test set.
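
To make the fit/transform split concrete, here is a rough plain-Python sketch 
(not Spark code, and not how Spark implements it internally). The function 
names fit_idf, idf_weight and transform are made up for illustration, and the 
smoothed formula log((|D| + 1) / (DF + 1)) is my understanding of what the 
MLlib documentation describes, so please check it against your Spark version. 
The idea is that fitting only records |D| and the per-term document 
frequencies from the training data, while transforming combines those stored 
values with the term frequencies of whatever documents you pass in.

```python
import math
from collections import Counter


def fit_idf(train_docs):
    """Learn |D| and per-term document frequencies from training docs only."""
    num_docs = len(train_docs)
    doc_freq = Counter()
    for tokens in train_docs:
        doc_freq.update(set(tokens))  # count each term once per document
    return num_docs, doc_freq


def idf_weight(term, num_docs, doc_freq):
    """Smoothed IDF (assumed formula): log((|D| + 1) / (DF + 1))."""
    return math.log((num_docs + 1) / (doc_freq.get(term, 0) + 1))


def transform(docs, num_docs, doc_freq):
    """Weight each document's own term counts by the stored IDF values."""
    vectors = []
    for tokens in docs:
        term_freq = Counter(tokens)  # TF comes from the document being transformed
        vectors.append(
            {t: c * idf_weight(t, num_docs, doc_freq) for t, c in term_freq.items()}
        )
    return vectors


# Toy data, invented for this example.
train = [["spark", "idf", "model"], ["spark", "graphx"]]
test = [["spark", "spark", "streaming"]]

num_docs, doc_freq = fit_idf(train)                 # fit on training data only
train_vecs = transform(train, num_docs, doc_freq)   # reuse the same statistics
test_vecs = transform(test, num_docs, doc_freq)     # test docs never touch fit
```

Note that the test document contributes nothing to num_docs or doc_freq, which 
is exactly what keeps the evaluation honest: a term that appears only in the 
test set is treated as having a document frequency of zero.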
-------------------------------------------------------------------------------
Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/books/spark-graphx-in-action 





> On 1 Nov 2016, at 10:15, Nirav Patel <npa...@xactlycorp.com> wrote:
> 
> FYI, I do reuse the IDF model when making predictions on new unlabeled data, 
> but not between training and test data while training a model. 
> 
> On Tue, Nov 1, 2016 at 3:10 AM, Nirav Patel <npa...@xactlycorp.com> wrote:
> I am using the IDF estimator/model (TF-IDF) to convert text features into 
> vectors. Currently, I fit the IDF model on all of the sample data and then 
> transform it. I read somewhere that I should split my data into training and 
> test sets before fitting the IDF model: fit IDF only on the training data and 
> then use the same transformer to transform both the training and test data. 
> This raises more questions:
> 1) Why would you do that? What exactly does IDF learn during the fitting 
> process that it can reuse to transform any new dataset? Perhaps the idea is 
> to keep the same values for |D| and DF(t, D) while using the new TF(t, d)?
> 2) If not, then fitting and transforming seem redundant for the IDF model.
