Re: Spark ML - Is IDF model reusable

Nirav Patel Tue, 01 Nov 2016 16:05:36 -0700

Hi Ayan,

"classification algorithm will for sure need to Fit against new dataset to
produce new model" I said this in the context of re-training the model. Is
that not correct? Isn't fitting part of re-training?
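
To make the fit/transform distinction in this thread concrete, here is a
minimal plain-Python sketch (not Spark code). It assumes the smoothed IDF
formula documented for Spark MLlib, log((|D| + 1) / (DF(t, D) + 1)). The names
IdfModel and fit_idf are made up for illustration and are not Spark APIs.

```python
import math
from collections import Counter


class IdfModel:
    """Hypothetical stand-in for a fitted IDF model: stores only |D| and DF(t, D)."""

    def __init__(self, n_docs, doc_freq):
        self.n_docs = n_docs
        self.doc_freq = doc_freq

    def idf(self, term):
        # Smoothed IDF; a term never seen during fit has DF = 0.
        return math.log((self.n_docs + 1) / (self.doc_freq.get(term, 0) + 1))

    def transform(self, doc):
        # Transform: TF comes from the new document, IDF from the fitted state.
        tf = Counter(doc)
        return {term: count * self.idf(term) for term, count in tf.items()}


def fit_idf(docs):
    # Fit: count in how many documents each term appears.
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))
    return IdfModel(len(docs), dict(doc_freq))


train = [["spark", "ml"], ["spark", "idf"], ["naive", "bayes"]]
model = fit_idf(train)                      # fit once on training data

unseen = ["spark", "spark", "graphx"]
scored = model.transform(unseen)            # reuse the same model for scoring

# Re-training means calling fit again on a new dataset, which yields a new model.
retrained = fit_idf(train + [["spark", "graphx"]])
```

In this sketch, scoring the unseen document leaves the fitted state untouched;
only a new call to fit_idf changes |D| and DF(t, D).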


Thanks

On Tue, Nov 1, 2016 at 4:01 PM, ayan guha <guha.a...@gmail.com> wrote:

> Hi
>
> "classification algorithm will for sure need to Fit against new dataset
> to produce new model" - I do not think this is correct. Maybe we are
> talking semantics but AFAIU, you "train" one model using some dataset, and
> then use it for scoring new datasets.
>
> You may re-train every month, yes. And you may run cross validation once a
> month (after re-training) or at a lower frequency, such as once every 2-3
> months, to validate model quality. The exact number of months is not
> important, but cross validation and similar "model evaluation" workflows
> typically run at a lower frequency than the re-training process.
>
> On Wed, Nov 2, 2016 at 5:48 AM, Nirav Patel <npa...@xactlycorp.com> wrote:
>
>> Hi Ayan,
>> After deployment, we might re-train it every month. That is a whole
>> different problem that I have not explored yet. classification algorithm
>> will for sure need to Fit against new dataset to produce new model. Correct
>> me if I am wrong, but I think I will also Fit a new IDF model on the new
>> dataset. At that time I will follow the same training-validation split (or
>> cross-validation) to evaluate model performance on the new data before
>> releasing it to make predictions. So AFAIK, every time you need to re-train
>> a model you will need to cross-validate using some data split strategy.
>>
>> I think the Spark ML documentation should explain, with the mathematical
>> model or a simple algorithm, what Fit and Transform mean for each
>> particular algorithm (IDF, NaiveBayes).
>>
>> Thanks
>>
>> On Tue, Nov 1, 2016 at 5:45 AM, ayan guha <guha.a...@gmail.com> wrote:
>>
>>> I have come across a similar situation recently and decided to run the
>>> Training workflow less frequently than the scoring workflow.
>>>
>>> In your use case I would imagine you will run the IDF fit workflow once,
>>> say, a week. It will produce a model object which will be saved. In the
>>> scoring workflow, you will typically see a new unseen dataset, and the
>>> model generated in the training flow will be used to score or label it.
>>>
>>> Note, train and test datasets are used during the development phase, when
>>> you are trying to find out which model to use and its
>>> efficiency/performance/accuracy etc. They will never be part of the
>>> workflow. In a slightly more elaborate setting you may want to automate
>>> model evaluations, but that's a different story.
>>>
>>> Not sure if I could explain properly, please feel free to comment.
>>> On 1 Nov 2016 22:54, "Nirav Patel" <npa...@xactlycorp.com> wrote:
>>>
>>>> Yes, I do apply NaiveBayes after IDF .
>>>>
>>>> " you can re-train (fit) on all your data before applying it to unseen
>>>> data." Did you mean I can reuse that model to Transform both training and
>>>> test data?
>>>>
>>>> Here's the process:
>>>>
>>>> Datasets:
>>>>
>>>>    1. Full sample data (labeled)
>>>>    2. Training (labeled)
>>>>    3. Test (labeled)
>>>>    4. Unseen (non-labeled)
>>>>
>>>> Here are two workflow options I see:
>>>>
>>>> Option - 1 (currently using)
>>>>
>>>>    1. Fit IDF model (idf-1) on full Sample data
>>>>    2. Apply(Transform) idf-1 on full sample data
>>>>    3. Split data set into Training and Test data
>>>>    4. Fit ML model on Training data
>>>>    5. Apply(Transform) model on Test data
>>>>    6. Apply(Transform) idf-1 on Unseen data
>>>>    7. Apply(Transform) model on Unseen data
>>>>
>>>> Option - 2
>>>>
>>>>    1. Split sample data into Training and Test data
>>>>    2. Fit IDF model (idf-1) only on training data
>>>>    3. Apply(Transform) idf-1 on training data
>>>>    4. Apply(Transform) idf-1 on test data
>>>>    5. Fit ML model on Training data
>>>>    6. Apply(Transform) model on Test data
>>>>    7. Apply(Transform) idf-1 on Unseen data
>>>>    8. Apply(Transform) model on Unseen data
>>>>
>>>> So you are suggesting Option-2 in this particular case, right?
>>>>
>>>> On Tue, Nov 1, 2016 at 4:24 AM, Robin East <robin.e...@xense.co.uk>
>>>> wrote:
>>>>
>>>>> Fit it on training data to evaluate the model. You can either use that
>>>>> model to apply to unseen data or you can re-train (fit) on all your data
>>>>> before applying it to unseen data.
>>>>>
>>>>> fit and transform are 2 different things: fit creates a model,
>>>>> transform applies a model to data to create transformed output. If you are
>>>>> using your training data in a subsequent step (e.g. running logistic
>>>>> regression or some other machine learning algorithm) then you need to
>>>>> transform your training data using the IDF model before passing it through
>>>>> the next step.
>>>>>
>>>>> ------------------------------------------------------------
>>>>> -------------------
>>>>> Robin East
>>>>> *Spark GraphX in Action* Michael Malak and Robin East
>>>>> Manning Publications Co.
>>>>> http://www.manning.com/books/spark-graphx-in-action
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 1 Nov 2016, at 11:18, Nirav Patel <npa...@xactlycorp.com> wrote:
>>>>>
>>>>> Just to re-iterate what you said, I should fit IDF model only on
>>>>> training data and then re-use it for both test data and then later on
>>>>> unseen data to make predictions.
>>>>>
>>>>> On Tue, Nov 1, 2016 at 3:49 AM, Robin East <robin.e...@xense.co.uk>
>>>>> wrote:
>>>>>
>>>>>> The point of setting aside a portion of your data as a test set is to
>>>>>> try and mimic applying your model to unseen data. If you fit your IDF
>>>>>> model to all your data, any evaluation you perform on your test set is
>>>>>> likely to overperform compared to ‘real’ unseen data. Effectively you
>>>>>> would have overfit your model.
>>>>>> ------------------------------------------------------------
>>>>>> -------------------
>>>>>> Robin East
>>>>>> *Spark GraphX in Action* Michael Malak and Robin East
>>>>>> Manning Publications Co.
>>>>>> http://www.manning.com/books/spark-graphx-in-action
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 1 Nov 2016, at 10:15, Nirav Patel <npa...@xactlycorp.com> wrote:
>>>>>>
>>>>>> FYI, I do reuse IDF model while making prediction against new
>>>>>> unlabeled data but not between training and test data while training a
>>>>>> model.
>>>>>>
>>>>>> On Tue, Nov 1, 2016 at 3:10 AM, Nirav Patel <npa...@xactlycorp.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I am using IDF estimator/model (TF-IDF) to convert text features
>>>>>>> into vectors. Currently, I fit the IDF model on all sample data and
>>>>>>> then transform it. I read somewhere that I should split my data into
>>>>>>> training and test sets before fitting the IDF model: fit IDF only on
>>>>>>> the training data and then use the same transformer to transform both
>>>>>>> training and test data.
>>>>>>> This raises more questions:
>>>>>>> 1) Why would you do that? What exactly does IDF learn during the
>>>>>>> fitting process that it can reuse to transform any new dataset? Perhaps
>>>>>>> the idea is to keep the same values for |D| and DF(t, D) while using a
>>>>>>> new TF(t, d)?
>>>>>>> 2) If not, then fitting and transforming seem redundant for the IDF
>>>>>>> model.
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> [image: What's New with Xactly]
>>>>>> <http://www.xactlycorp.com/email-click/>
>>>>>>
>>>>>> <https://www.nyse.com/quote/XNYS:XTLY>  [image: LinkedIn]
>>>>>> <https://www.linkedin.com/company/xactly-corporation>  [image:
>>>>>> Twitter] <https://twitter.com/Xactly>  [image: Facebook]
>>>>>> <https://www.facebook.com/XactlyCorp>  [image: YouTube]
>>>>>> <http://www.youtube.com/xactlycorporation>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>>
>>
>
>
>
> --
> Best Regards,
> Ayan Guha
>

