Re: Spark ML - Is IDF model reusable

ayan guha Tue, 01 Nov 2016 16:10:38 -0700

Yes, that is correct. I think I misread a part of it in terms of
scoring. I think we are both saying the same thing, so that's a good thing :)


On Wed, Nov 2, 2016 at 10:04 AM, Nirav Patel <npa...@xactlycorp.com> wrote:

> Hi Ayan,
>
> "classification algorithm will for sure need to Fit against new dataset
> to produce new model" I said this in the context of re-training the model.
> Is that not correct? Isn't it part of re-training?
>
> Thanks
>
> On Tue, Nov 1, 2016 at 4:01 PM, ayan guha <guha.a...@gmail.com> wrote:
>
>> Hi
>>
>> "classification algorithm will for sure need to Fit against new dataset
>> to produce new model" - I do not think this is correct. Maybe we are
>> talking semantics but AFAIU, you "train" one model using some dataset, and
>> then use it for scoring new datasets.
>>
>> You may re-train every month, yes. And you may run cross-validation once
>> a month (after re-training) or at a lower frequency, such as once every
>> 2-3 months, to validate model quality. The exact number of months is not
>> important, but cross-validation and similar "model evaluation" workflows
>> typically run at a lower frequency than the re-training process.
>>
>> On Wed, Nov 2, 2016 at 5:48 AM, Nirav Patel <npa...@xactlycorp.com>
>> wrote:
>>
>>> Hi Ayan,
>>> After deployment, we might re-train it every month. That is a whole
>>> different problem I have not explored yet. classification algorithm will for
>>> sure need to Fit against new dataset to produce new model. Correct me if I
>>> am wrong, but I think I will also fit a new IDF model based on the new
>>> dataset. At that time as well, I will follow the same training-validation
>>> split (or cross-validation) to evaluate model performance on new data before
>>> releasing it to make predictions. So AFAIK, every time you need to re-train
>>> the model you will need to cross-validate using some data split strategy.
>>>
>>> I think the Spark ML documentation should start explaining the mathematical
>>> model, or a simple algorithm, for what fit and transform mean for a
>>> particular algorithm (IDF, NaiveBayes).
>>>
>>> Thanks
>>>
>>> On Tue, Nov 1, 2016 at 5:45 AM, ayan guha <guha.a...@gmail.com> wrote:
>>>
>>>> I have come across a similar situation recently and decided to run the
>>>> training workflow less frequently than the scoring workflow.
>>>>
>>>> In your use case I would imagine you will run the IDF fit workflow once
>>>> in, say, a week. It will produce a model object which will be saved. In the
>>>> scoring workflow, you will typically see a new unseen dataset, and the model
>>>> generated in the training flow will be used to score or label this new dataset.
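
The split between an occasional training workflow and a frequent scoring workflow might look roughly like the plain-Python sketch below. It is not Spark code: the JSON file stands in for whatever model persistence you actually use, and the names fit_idf, score_features, model_path and the toy documents are all hypothetical. The IDF weights assume the smoothed formula log((|D| + 1) / (DF(t, D) + 1)) from the Spark MLlib guide.

```python
import json
import math
import os
import tempfile
from collections import Counter

def fit_idf(docs):
    """Learn one IDF weight per term: log((|D| + 1) / (DF(t, D) + 1))."""
    num_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {t: math.log((num_docs + 1) / (df + 1)) for t, df in doc_freq.items()}

def score_features(idf, doc):
    """Weight the term counts of a new document with previously learned IDF."""
    tf = Counter(doc)
    # Simplification of this sketch: terms with no learned weight are dropped.
    return {t: count * idf[t] for t, count in tf.items() if t in idf}

# Training workflow (run occasionally): fit, then persist the fitted model.
model_path = os.path.join(tempfile.mkdtemp(), "idf_model.json")  # hypothetical location
weekly_docs = [["spark", "idf"], ["spark", "bayes"]]
with open(model_path, "w") as f:
    json.dump(fit_idf(weekly_docs), f)

# Scoring workflow (run often): load the saved model and apply it, no re-fit.
with open(model_path) as f:
    loaded_idf = json.load(f)

features = score_features(loaded_idf, ["bayes", "bayes", "spark"])
```

The scoring half never calls fit_idf; it only needs the saved weights, which is why it can run on a different schedule from training.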
>>>>
>>>> Note, train and test datasets are used during the development phase,
>>>> when you are trying to find out which model to use and its
>>>> efficiency/performance/accuracy etc. They will never be part of the
>>>> workflow. In a slightly more elaborate setting you may want to automate
>>>> model evaluations, but that's a different story.
>>>>
>>>> Not sure if I could explain properly, please feel free to comment.
>>>> On 1 Nov 2016 22:54, "Nirav Patel" <npa...@xactlycorp.com> wrote:
>>>>
>>>>> Yes, I do apply NaiveBayes after IDF.
>>>>>
>>>>> " you can re-train (fit) on all your data before applying it to
>>>>> unseen data." Did you mean I can reuse that model to Transform both
>>>>> training and test data?
>>>>>
>>>>> Here's the process:
>>>>>
>>>>> Datasets:
>>>>>
>>>>>    1. Full sample data (labeled)
>>>>>    2. Training (labeled)
>>>>>    3. Test (labeled)
>>>>>    4. Unseen (non-labeled)
>>>>>
>>>>> Here are two workflow options I see:
>>>>>
>>>>> Option - 1 (currently using)
>>>>>
>>>>>    1. Fit IDF model (idf-1) on full Sample data
>>>>>    2. Apply(Transform) idf-1 on full sample data
>>>>>    3. Split data set into Training and Test data
>>>>>    4. Fit ML model on Training data
>>>>>    5. Apply(Transform) model on Test data
>>>>>    6. Apply(Transform) idf-1 on Unseen data
>>>>>    7. Apply(Transform) model on Unseen data
>>>>>
>>>>> Option - 2
>>>>>
>>>>>    1. Split sample data into Training and Test data
>>>>>    2. Fit IDF model (idf-1) only on training data
>>>>>    3. Apply(Transform) idf-1 on training data
>>>>>    4. Apply(Transform) idf-1 on test data
>>>>>    5. Fit ML model on Training data
>>>>>    6. Apply(Transform) model on Test data
>>>>>    7. Apply(Transform) idf-1 on Unseen data
>>>>>    8. Apply(Transform) model on Unseen data
>>>>>
>>>>> So you are suggesting Option-2 in this particular case, right?
>>>>>
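
The difference between the two options can be seen on a toy example. This is a plain-Python sketch, not Spark code: fit_idf, the four token lists and the fixed "first three rows are training" split are hypothetical, and the weights assume the smoothed formula log((|D| + 1) / (DF(t, D) + 1)) from the Spark MLlib guide.

```python
import math
from collections import Counter

def fit_idf(docs):
    """Learn one IDF weight per term: log((|D| + 1) / (DF(t, D) + 1))."""
    num_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {t: math.log((num_docs + 1) / (df + 1)) for t, df in doc_freq.items()}

sample = [
    ["refund", "late"],        # training row
    ["refund", "invoice"],     # training row
    ["invoice", "late"],       # training row
    ["refund", "chargeback"],  # held-out test row
]
# A fixed split keeps the sketch deterministic; a real split would be random.
train, test = sample[:3], sample[3:]

idf_option1 = fit_idf(sample)  # Option 1: fit on the full sample, then split
idf_option2 = fit_idf(train)   # Option 2: split first, fit on training data only

# Terms whose weights Option 1 could only have learned from the test row.
leaked_terms = set(idf_option1) - set(idf_option2)
```

Under Option 1 the weight for "refund" also shifts because the test row was counted in both |D| and DF, so the test set is no longer a clean stand-in for unseen data.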
>>>>> On Tue, Nov 1, 2016 at 4:24 AM, Robin East <robin.e...@xense.co.uk>
>>>>> wrote:
>>>>>
>>>>>> Fit it on training data to evaluate the model. You can either use
>>>>>> that model to apply to unseen data or you can re-train (fit) on all your
>>>>>> data before applying it to unseen data.
>>>>>>
>>>>>> fit and transform are 2 different things: fit creates a model,
>>>>>> transform applies a model to data to create transformed output. If you
>>>>>> are using your training data in a subsequent step (e.g. running logistic
>>>>>> regression or some other machine learning algorithm) then you need to
>>>>>> transform your training data using the IDF model before passing it
>>>>>> through the next step.
>>>>>>
>>>>>> -------------------------------------------------------------------------------
>>>>>> Robin East
>>>>>> *Spark GraphX in Action* Michael Malak and Robin East
>>>>>> Manning Publications Co.
>>>>>> http://www.manning.com/books/spark-graphx-in-action
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 1 Nov 2016, at 11:18, Nirav Patel <npa...@xactlycorp.com> wrote:
>>>>>>
>>>>>> Just to re-iterate what you said, I should fit IDF model only on
>>>>>> training data and then re-use it for both test data and then later on
>>>>>> unseen data to make predictions.
>>>>>>
>>>>>> On Tue, Nov 1, 2016 at 3:49 AM, Robin East <robin.e...@xense.co.uk>
>>>>>> wrote:
>>>>>>
>>>>>>> The point of setting aside a portion of your data as a test set is
>>>>>>> to try and mimic applying your model to unseen data. If you fit your IDF
>>>>>>> model to all your data, any evaluation you perform on your test set is
>>>>>>> likely to overperform compared to ‘real’ unseen data. Effectively you
>>>>>>> would have overfit your model.
>>>>>>> -------------------------------------------------------------------------------
>>>>>>> Robin East
>>>>>>> *Spark GraphX in Action* Michael Malak and Robin East
>>>>>>> Manning Publications Co.
>>>>>>> http://www.manning.com/books/spark-graphx-in-action
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 1 Nov 2016, at 10:15, Nirav Patel <npa...@xactlycorp.com> wrote:
>>>>>>>
>>>>>>> FYI, I do reuse IDF model while making prediction against new
>>>>>>> unlabeled data but not between training and test data while training a
>>>>>>> model.
>>>>>>>
>>>>>>> On Tue, Nov 1, 2016 at 3:10 AM, Nirav Patel <npa...@xactlycorp.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I am using the IDF estimator/model (TF-IDF) to convert text features
>>>>>>>> into vectors. Currently, I fit the IDF model on all sample data and then
>>>>>>>> transform it. I read somewhere that I should split my data into training
>>>>>>>> and test sets before fitting the IDF model: fit IDF only on the training
>>>>>>>> data and then use the same transformer to transform both training and
>>>>>>>> test data.
>>>>>>>> This raises more questions:
>>>>>>>> 1) Why would you do that? What exactly does IDF learn during the fitting
>>>>>>>> process that it can reuse to transform any new dataset? Perhaps the idea
>>>>>>>> is to keep the same values for |D| and DF(t, D) while using the new
>>>>>>>> TF(t, d)?
>>>>>>>> 2) If not, then fitting and transforming seem redundant for the IDF
>>>>>>>> model.
>>>>>>>>
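
To make question 1 concrete, here is a minimal plain-Python sketch (not Spark code). It assumes the smoothed formula given in the Spark MLlib guide, IDF(t, D) = log((|D| + 1) / (DF(t, D) + 1)); the helper names fit_idf and transform and the toy documents are hypothetical. fit only records |D| and DF(t, D), folded into one weight per term; transform multiplies the term counts of whatever document it is given by those stored weights.

```python
import math
from collections import Counter

def fit_idf(docs):
    """Learn one IDF weight per term: log((|D| + 1) / (DF(t, D) + 1))."""
    num_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {t: math.log((num_docs + 1) / (df + 1)) for t, df in doc_freq.items()}

def transform(idf, doc):
    """Scale the raw term counts (TF) of one document by the stored weights."""
    tf = Counter(doc)
    # Simplification of this sketch: terms with no learned weight are dropped.
    return {t: count * idf[t] for t, count in tf.items() if t in idf}

training_docs = [
    ["spark", "ml", "idf"],
    ["spark", "naive", "bayes"],
    ["spark", "idf"],
]
idf_model = fit_idf(training_docs)  # |D| and DF come from the training data only

new_doc = ["idf", "idf", "bayes", "unseen"]
weighted = transform(idf_model, new_doc)  # TF comes from the new document
```

So the guess in the question matches this sketch: the corpus-level quantities are frozen at fit time and only the term frequencies change per document. Dropping terms that were absent at fit time is a choice made here for brevity; with hashed features in Spark the handling may differ.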
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> [image: What's New with Xactly]
>>>>>>> <http://www.xactlycorp.com/email-click/>
>>>>>>>
>>>>>>> <https://www.nyse.com/quote/XNYS:XTLY>  [image: LinkedIn]
>>>>>>> <https://www.linkedin.com/company/xactly-corporation>  [image:
>>>>>>> Twitter] <https://twitter.com/Xactly>  [image: Facebook]
>>>>>>> <https://www.facebook.com/XactlyCorp>  [image: YouTube]
>>>>>>> <http://www.youtube.com/xactlycorporation>
>>>>>>>
>>>>>>>
>>>>>>>
>>
>>
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>



-- 
Best Regards,
Ayan Guha
