Re: Spark ML - Is IDF model reusable

2016-11-01 Thread Nirav Patel
Cool! So going back to the IDF Estimator and Model problem, do you know what an IDF estimator really does during the fitting process? It must be storing some state (information) as I mentioned in OP (|D|, DF|t, D| and perhaps TF|t, D|) that it re-uses to Transform test data (labeled data). Or does it
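A rough sketch of the state in question, written in plain Python rather than Spark so it runs anywhere. It assumes the smoothed formula log((|D| + 1) / (DF(t, D) + 1)) given in the Spark MLlib documentation; the names fit_idf and transform and the toy documents are made up for illustration and are not Spark's implementation. Note that TF is computed per document at transform time, so it is not part of the fitted state here.

```python
import math
from collections import Counter

def fit_idf(docs):
    """Return the state a fit would need to keep: |D| and DF(t, D)."""
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    return {"num_docs": len(docs), "doc_freq": dict(doc_freq)}

def transform(model, tokens):
    """Compute TF-IDF for one document using only the stored state."""
    m = model["num_docs"]
    tf = Counter(tokens)  # term frequency comes from the document itself
    return {
        term: count * math.log((m + 1) / (model["doc_freq"].get(term, 0) + 1))
        for term, count in tf.items()
    }

# Hypothetical toy corpus, already tokenized.
training_docs = [["spark", "ml"], ["spark", "idf"], ["naive", "bayes"]]
model = fit_idf(training_docs)

# The same fitted state is reused for a document that was not in the fit.
weights = transform(model, ["spark", "spark", "unseen"])
```

A term never seen during the fit ("unseen" above) simply gets a document frequency of 0, which is why the +1 smoothing matters.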

Re: Spark ML - Is IDF model reusable

2016-11-01 Thread ayan guha
Yes, that is correct. I think I misread a part of it in terms of scoring. I think we are both saying the same thing, so that's a good thing :) On Wed, Nov 2, 2016 at 10:04 AM, Nirav Patel wrote: > Hi Ayan, > > "classification algorithm will for sure need to Fit against new

Re: Spark ML - Is IDF model reusable

2016-11-01 Thread Nirav Patel
Hi Ayan, "classification algorithm will for sure need to Fit against new dataset to produce new model" I said this in context of re-training the model. Is it not correct? Isn't it part of re-training? Thanks On Tue, Nov 1, 2016 at 4:01 PM, ayan guha wrote: > Hi > >

Re: Spark ML - Is IDF model reusable

2016-11-01 Thread ayan guha
Hi "classification algorithm will for sure need to Fit against new dataset to produce new model" - I do not think this is correct. Maybe we are talking semantics but AFAIU, you "train" one model using some dataset, and then use it for scoring new datasets. You may re-train every month, yes. And

Re: Spark ML - Is IDF model reusable

2016-11-01 Thread Nirav Patel
Hi Ayan, After deployment, we might re-train it every month. That is a whole different problem I have not explored yet. classification algorithm will for sure need to Fit against new dataset to produce new model. Correct me if I am wrong but I think I will also fit a new IDF model based on the new dataset. At

Re: Spark ML - Is IDF model reusable

2016-11-01 Thread ayan guha
I have come across a similar situation recently and decided to run the training workflow less frequently than the scoring workflow. In your use case I would imagine you will run the IDF fit workflow once in, say, a week. It will produce a model object which will be saved. In scoring workflow, you will typically
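The split between an infrequent training workflow and a frequent scoring workflow could look roughly like the sketch below. It is a plain-Python stand-in, not Spark code: the JSON file, the helper names and the toy documents are all assumptions for illustration. In Spark ML the fitted model would be persisted with its own save and load methods instead.

```python
import json
import math
import os
import tempfile
from collections import Counter

def fit_idf(docs):
    """Training workflow step: derive |D| and per-term document frequencies."""
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))
    return {"num_docs": len(docs), "doc_freq": dict(doc_freq)}

def save_model(model, path):
    with open(path, "w") as f:
        json.dump(model, f)

def load_model(path):
    with open(path) as f:
        return json.load(f)

def score(model, tokens):
    """Scoring workflow step: apply the saved state, no refitting."""
    m = model["num_docs"]
    return {
        term: count * math.log((m + 1) / (model["doc_freq"].get(term, 0) + 1))
        for term, count in Counter(tokens).items()
    }

# Hypothetical location for the saved model.
model_path = os.path.join(tempfile.mkdtemp(), "idf_model.json")

# Training workflow: run occasionally (for example weekly).
save_model(fit_idf([["spark", "ml"], ["spark", "idf"]]), model_path)

# Scoring workflow: run as often as needed against new documents.
loaded = load_model(model_path)
scored = score(loaded, ["idf", "model"])
```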

Re: Spark ML - Is IDF model reusable

2016-11-01 Thread Nirav Patel
Yes, I do apply NaiveBayes after IDF. "you can re-train (fit) on all your data before applying it to unseen data." Did you mean I can reuse that model to Transform both training and test data? Here's the process: Datasets: 1. Full sample data (labeled) 2. Training (labeled) 3. Test

Re: Spark ML - Is IDF model reusable

2016-11-01 Thread Robin East
Fit it on training data to evaluate the model. You can either use that model to apply to unseen data or you can re-train (fit) on all your data before applying it to unseen data. fit and transform are 2 different things: fit creates a model, transform applies a model to data to create

Re: Spark ML - Is IDF model reusable

2016-11-01 Thread Nirav Patel
Just to re-iterate what you said, I should fit the IDF model only on training data and then re-use it for both the test data and, later on, unseen data to make predictions. On Tue, Nov 1, 2016 at 3:49 AM, Robin East wrote: > The point of setting aside a portion of your data

Re: Spark ML - Is IDF model reusable

2016-11-01 Thread Robin East
The point of setting aside a portion of your data as a test set is to try and mimic applying your model to unseen data. If you fit your IDF model to all your data, any evaluation you perform on your test set is likely to overperform compared to ‘real’ unseen data. Effectively you would have
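A small plain-Python illustration of that point, assuming the smoothed IDF formula log((|D| + 1) / (DF(t, D) + 1)) from the Spark MLlib documentation. The helper names and the toy train/test split are hypothetical. It only shows that fitting on train plus test changes the weights the test documents receive; it does not measure any effect on classifier accuracy.

```python
import math
from collections import Counter

def fit_idf_stats(docs):
    """Return (|D|, document frequencies) for a tokenized corpus."""
    doc_freq = Counter(term for tokens in docs for term in set(tokens))
    return len(docs), doc_freq

def idf(num_docs, doc_freq, term):
    return math.log((num_docs + 1) / (doc_freq.get(term, 0) + 1))

# Hypothetical split: "streaming" occurs only in the held-out test document.
train = [["spark", "ml"], ["spark", "idf"]]
test = [["spark", "streaming"]]

m_train, df_train = fit_idf_stats(train)
m_all, df_all = fit_idf_stats(train + test)

# Fitted on training data only: "streaming" is treated as never seen.
idf_train_only = idf(m_train, df_train, "streaming")

# Fitted on everything: the test document has already shaped the weight.
idf_with_leak = idf(m_all, df_all, "streaming")
```

With the train-only fit the test set is treated the way genuinely unseen data would be; with the combined fit the model has already seen the test vocabulary.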

Spark ML - Is IDF model reusable

2016-11-01 Thread Nirav Patel
FYI, I do reuse the IDF model when making predictions against new unlabeled data, but not between training and test data while training a model. On Tue, Nov 1, 2016 at 3:10 AM, Nirav Patel wrote: > I am using IDF estimator/model (TF-IDF) to convert text features into >