On our site we will use logistic regression in batch mode: customers
who entered during one time frame (such as 2010/1/1 ~ 2010/12/31) will
be used to train the model, customers who entered during a later time
frame (such as 2011/1/1 ~ 2011/5/31) will be used to validate it, and
the model will then be used to predict for users who enter after
2011/6/1. Does this make sense, or should we feed all the data from
2010/1/1 to 2011/5/31 to ALR and let it do the hold-out internally?
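
To make the split I have in mind concrete, here is a minimal sketch in
plain Python (not Mahout code). The record layout, the function name
split_by_entry_date, and the sample customers are all hypothetical and
only illustrate the time boundaries described above.

```python
from datetime import date


def split_by_entry_date(customers, train_end, validate_end):
    """Partition customers into train / validate / score sets by entry date.

    customers is an iterable of (customer_id, entry_date) pairs; this
    layout is an assumption made for the example.
    """
    train, validate, score = [], [], []
    for customer_id, entered in customers:
        if entered <= train_end:
            train.append(customer_id)        # used to fit the model
        elif entered <= validate_end:
            validate.append(customer_id)     # time-segregated hold-out
        else:
            score.append(customer_id)        # later users to predict
    return train, validate, score


# Made-up sample records, one on each side of every boundary.
customers = [
    ("a", date(2010, 3, 14)),
    ("b", date(2010, 12, 31)),
    ("c", date(2011, 1, 1)),
    ("d", date(2011, 5, 31)),
    ("e", date(2011, 6, 1)),
]

train, validate, score = split_by_entry_date(
    customers,
    train_end=date(2010, 12, 31),
    validate_end=date(2011, 5, 31),
)
print(train, validate, score)
```

The validation set here is strictly later than the training set, so it
plays the same role relative to training that production data will.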



On Wed, Jun 1, 2011 at 10:18 PM, Ted Dunning <[email protected]> wrote:
> You don't *have* to have a separate validation set, but it isn't a bad idea.
>
> In particular, with large scale classifiers production data almost always
> comes from the future with respect to the training data.  The ALR can't hold
> out that way because it does on-line training only.  Thus, I would
> recommend that you still have some kind of evaluation hold-out set
> segregated by time.
>
> Another very serious issue can happen if you have near duplicates in your
> data set.  That often happens in news-wire text, for example.  In that case,
> you would have significant over-fitting with ALR and you wouldn't have a
> clue without a real time-segregated hold-out set.
>
> On Wed, Jun 1, 2011 at 2:22 AM, Xiaobo Gu <[email protected]> wrote:
>
>> Hi,
>>
>> Because ALR splits the training data internally and automatically, I
>> think we don't have to make a separate validation data set.
>>
>> Regards,
>>
>> Xiaobo Gu
>>
>
