Re: Spark ML - Is IDF model reusable

Nirav Patel Tue, 01 Nov 2016 11:49:14 -0700

Hi Ayan,
After deployment, we might re-train it every month. That is a whole different
problem I have not explored yet. The classification algorithm will certainly
need to be fit against the new dataset to produce a new model. Correct me if I
am wrong, but I think I will also fit a new IDF model on the new dataset. At
that time I will follow the same training-validation split (or
cross-validation) to evaluate model performance on the new data before
releasing it to make predictions. So AFAIK, every time you need to re-train a
model you will need to cross-validate using some data split strategy.
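
A minimal sketch of that retraining loop, in plain Python rather than Spark so it runs anywhere. The helper names (fit_idf, k_fold_splits), the round-robin fold assignment, and the sample documents are all made up for illustration; fit_idf uses the smoothed formula from the Spark MLlib docs, log((|D| + 1) / (DF(t, D) + 1)). The point is only that the IDF weights are re-fit on the training part of each fold, never on the held-out part.

```python
import math
from collections import Counter

def fit_idf(docs):
    """Hypothetical stand-in for IDF.fit: one weight per term."""
    m = len(docs)                                      # |D|
    df = Counter(t for doc in docs for t in set(doc))  # DF(t, D)
    return {t: math.log((m + 1) / (n + 1)) for t, n in df.items()}

def k_fold_splits(n_items, k):
    """Yield (train_indices, test_indices), assigning items round-robin."""
    for fold in range(k):
        test = [i for i in range(n_items) if i % k == fold]
        train = [i for i in range(n_items) if i % k != fold]
        yield train, test

# Illustrative tokenized documents standing in for the new monthly dataset.
new_docs = [
    ["refund", "late"], ["refund", "card"], ["login", "error"],
    ["login", "reset"], ["card", "late"], ["error", "reset"],
]

fold_models = []
for train_idx, test_idx in k_fold_splits(len(new_docs), k=3):
    # Fit on the training fold only, so the held-out fold stays unseen.
    idf = fit_idf([new_docs[i] for i in train_idx])
    fold_models.append((idf, test_idx))
    # A classifier would be fit on the transformed training fold here
    # and scored on the transformed held-out fold.

# Once evaluation looks acceptable, one option is to fit again on all data
# before scoring unseen data.
final_idf = fit_idf(new_docs)
```

In Spark itself the usual way to get the same effect would be to put the IDF stage and the classifier in one Pipeline and hand that to CrossValidator, so each fold re-fits both.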


I think the Spark ML documentation should start by explaining the mathematical
model, or a simple algorithm, for what Fit and Transform mean for a particular
algorithm (IDF, NaiveBayes).
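
For IDF specifically, here is my understanding as a small plain-Python sketch (not Spark code; fit_idf, transform, and the sample data are hypothetical names for illustration). Fit learns |D| and DF(t, D) from the corpus it is given and stores one weight per term, using the smoothed formula from the MLlib docs, log((|D| + 1) / (DF(t, D) + 1)). Transform recomputes only the term frequency for each new document and multiplies it by the stored weight.

```python
import math
from collections import Counter

def fit_idf(docs):
    """fit: learn one weight per term from the corpus it is given."""
    m = len(docs)                                      # |D|, number of documents
    df = Counter(t for doc in docs for t in set(doc))  # DF(t, D)
    return {t: math.log((m + 1) / (n + 1)) for t, n in df.items()}

def transform(idf, doc):
    """transform: scale a document's raw term counts by the stored weights."""
    tf = Counter(doc)                                  # TF(t, d), per document
    # Terms never seen during fit have no stored weight. Dropping them is a
    # simplification of this sketch, not necessarily what Spark does.
    return {t: tf[t] * idf[t] for t in tf if t in idf}

training = [["spark", "ml", "idf"], ["spark", "naive", "bayes"], ["spark", "idf"]]
unseen = ["idf", "idf", "bayes", "kafka"]

idf_model = fit_idf(training)          # weights depend only on the training docs
vector = transform(idf_model, unseen)  # same weights reused on new data
```

Here "spark" appears in every training document, so its weight is log(4/4) = 0, and the unseen document is weighted with the training-time |D| and DF values without changing them.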

Thanks

On Tue, Nov 1, 2016 at 5:45 AM, ayan guha <guha.a...@gmail.com> wrote:

> I came across a similar situation recently and decided to run the training
> workflow less frequently than the scoring workflow.
>
> In your use case I imagine you would run the IDF fit workflow once, say,
> a week. It will produce a model object, which will be saved. In the scoring
> workflow, you will typically see a new, unseen dataset, and the model generated
> in the training flow will be used to score or label it.
>
> Note that train and test datasets are used during the development phase, when
> you are trying to find out which model to use and its
> efficiency/performance/accuracy, etc. They will never be part of the workflow.
> In a slightly more elaborate setting you may want to automate model
> evaluations, but that's a different story.
>
> Not sure if I explained this properly; please feel free to comment.
> On 1 Nov 2016 22:54, "Nirav Patel" <npa...@xactlycorp.com> wrote:
>
>> Yes, I do apply NaiveBayes after IDF.
>>
>> "you can re-train (fit) on all your data before applying it to unseen
>> data." Did you mean I can reuse that model to transform both training and
>> test data?
>>
>> Here's the process:
>>
>> Datasets:
>>
>>    1. Full sample data (labeled)
>>    2. Training (labeled)
>>    3. Test (labeled)
>>    4. Unseen (non-labeled)
>>
>> Here are two workflow options I see:
>>
>> Option - 1 (currently using)
>>
>>    1. Fit IDF model (idf-1) on full Sample data
>>    2. Apply(Transform) idf-1 on full sample data
>>    3. Split data set into Training and Test data
>>    4. Fit ML model on Training data
>>    5. Apply(Transform) model on Test data
>>    6. Apply(Transform) idf-1 on Unseen data
>>    7. Apply(Transform) model on Unseen data
>>
>> Option - 2
>>
>>    1. Split sample data into Training and Test data
>>    2. Fit IDF model (idf-1) only on training data
>>    3. Apply(Transform) idf-1 on training data
>>    4. Apply(Transform) idf-1 on test data
>>    5. Fit ML model on Training data
>>    6. Apply(Transform) model on Test data
>>    7. Apply(Transform) idf-1 on Unseen data
>>    8. Apply(Transform) model on Unseen data
>>
>> So you are suggesting Option-2 in this particular case, right?
>>
>> On Tue, Nov 1, 2016 at 4:24 AM, Robin East <robin.e...@xense.co.uk>
>> wrote:
>>
>>> Fit it on training data to evaluate the model. You can either use that
>>> model to apply to unseen data or you can re-train (fit) on all your data
>>> before applying it to unseen data.
>>>
>>> fit and transform are 2 different things: fit creates a model, transform
>>> applies a model to data to create transformed output. If you are using your
>>> training data in a subsequent step (e.g. running logistic regression or
>>> some other machine learning algorithm) then you need to transform your
>>> training data using the IDF model before passing it through the next step.
>>>
>>> -------------------------------------------------------------------------------
>>> Robin East
>>> *Spark GraphX in Action* Michael Malak and Robin East
>>> Manning Publications Co.
>>> http://www.manning.com/books/spark-graphx-in-action
>>>
>>>
>>>
>>>
>>>
>>> On 1 Nov 2016, at 11:18, Nirav Patel <npa...@xactlycorp.com> wrote:
>>>
>>> Just to reiterate what you said: I should fit the IDF model only on
>>> training data and then re-use it for the test data and, later on, for
>>> unseen data to make predictions.
>>>
>>> On Tue, Nov 1, 2016 at 3:49 AM, Robin East <robin.e...@xense.co.uk>
>>> wrote:
>>>
>>>> The point of setting aside a portion of your data as a test set is to
>>>> mimic applying your model to unseen data. If you fit your IDF model
>>>> to all your data, any evaluation you perform on your test set is likely to
>>>> look better than it would on ‘real’ unseen data, because the IDF weights
>>>> have already seen the test documents. Effectively you would have
>>>> overfit your model.
>>>>
>>>> On 1 Nov 2016, at 10:15, Nirav Patel <npa...@xactlycorp.com> wrote:
>>>>
>>>> FYI, I do reuse the IDF model when making predictions against new
>>>> unlabeled data, but not between training and test data while training a
>>>> model.
>>>>
>>>> On Tue, Nov 1, 2016 at 3:10 AM, Nirav Patel <npa...@xactlycorp.com>
>>>> wrote:
>>>>
>>>>> I am using the IDF estimator/model (TF-IDF) to convert text features into
>>>>> vectors. Currently, I fit the IDF model on all sample data and then
>>>>> transform it. I read somewhere that I should split my data into training
>>>>> and test sets before fitting the IDF model: fit IDF only on training data
>>>>> and then use the same transformer to transform both training and test data.
>>>>> This raises more questions:
>>>>> 1) Why would you do that? What exactly does IDF learn during the fitting
>>>>> process that it can reuse to transform any new dataset? Perhaps the idea is
>>>>> to keep the same values for |D| and DF(t, D) while using a new TF(t, d)?
>>>>> 2) If not, then fitting and transforming seem redundant for the IDF model.
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>> [image: What's New with Xactly]
>>>> <http://www.xactlycorp.com/email-click/>
>>>>
>>>> <https://www.nyse.com/quote/XNYS:XTLY>  [image: LinkedIn]
>>>> <https://www.linkedin.com/company/xactly-corporation>  [image: Twitter]
>>>> <https://twitter.com/Xactly>  [image: Facebook]
>>>> <https://www.facebook.com/XactlyCorp>  [image: YouTube]
>>>> <http://www.youtube.com/xactlycorporation>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>
>
