I've used systems before that stored the mapping from the original feature
indices to the classifier-specific ones.
This is nice because you can add new features and an old model may still
work: the new features fall outside the range of the old mapping.
It also provides a place to store per-feature statistics (such as min / max /
avg / std dev) for classifiers that need to normalize their features, such
as the linear models.

It could be something like this:

FeatureInfo
  int32 original_index
  int32 internal_index
  float min_value
  float max_value

FeatureSetInfo
  repeated FeatureInfo features
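
To make the idea concrete, here is a rough sketch of how such a mapping could be used at scoring time. It is in Python for brevity, and the class and method names (FeatureSetInfo.encode and so on) are my own illustration, not an existing API. It assumes features arrive as a sparse {original_index: value} map, that unknown features are skipped, and that normalization is simple min-max scaling.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FeatureInfo:
    original_index: int  # index in the caller's feature space
    internal_index: int  # dense index used inside the classifier
    min_value: float     # statistics recorded at training time
    max_value: float


class FeatureSetInfo:
    """Hypothetical container for the per-feature records of one model."""

    def __init__(self, features: List[FeatureInfo]):
        self._by_original = {f.original_index: f for f in features}
        self.size = len(features)

    def encode(self, sparse: Dict[int, float]) -> List[float]:
        """Map {original_index: value} to a dense, min-max scaled vector."""
        # Assumption: features missing from the input default to 0.0.
        dense = [0.0] * self.size
        for original_index, value in sparse.items():
            info = self._by_original.get(original_index)
            if info is None:
                # Feature added after this model was trained: skip it,
                # so the old model keeps working on newer inputs.
                continue
            span = info.max_value - info.min_value
            scaled = (value - info.min_value) / span if span else 0.0
            dense[info.internal_index] = scaled
        return dense


info = FeatureSetInfo([
    FeatureInfo(original_index=10, internal_index=0, min_value=0.0, max_value=4.0),
    FeatureInfo(original_index=42, internal_index=1, min_value=-1.0, max_value=1.0),
])
# Feature 99 is unknown to this model and is dropped.
encoded = info.encode({10: 2.0, 42: 1.0, 99: 7.0})
print(encoded)  # [0.5, 1.0]
```

Skipping unknown indices is what lets an old model tolerate new features; a stricter system might log or reject them instead.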

The drawback is potentially adding 32 bytes per feature, which could hurt
model size, especially for high-dimensional feature spaces (e.g. text).
If the writable interface could make this optional, it would work.
Alternatively, all classifiers could write a fixed header containing the
common metadata, followed by the actual model itself.
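
A minimal sketch of the fixed-header idea, again in Python and purely illustrative: the magic tag, flag bit, and field layout below are assumptions of mine, not an existing format. The header carries a flag saying whether per-feature records follow, so models that don't need them pay only for the header.

```python
import io
import struct

MAGIC = b"MDL1"          # hypothetical format tag
FLAG_FEATURE_INFO = 0x1  # set when per-feature records follow the header

HEADER = struct.Struct(">4sBI")  # magic, flags, number of feature records
RECORD = struct.Struct(">iiff")  # original_index, internal_index, min, max
                                 # (16 bytes per record with these four fields)


def write_model(out, model_bytes, features=None):
    """Write header, optional feature records, then the opaque model body."""
    features = features or []
    flags = FLAG_FEATURE_INFO if features else 0
    out.write(HEADER.pack(MAGIC, flags, len(features)))
    for record in features:
        out.write(RECORD.pack(*record))
    out.write(model_bytes)


def read_model(inp):
    """Return (feature_records, model_bytes) from a stream written above."""
    magic, flags, count = HEADER.unpack(inp.read(HEADER.size))
    if magic != MAGIC:
        raise ValueError("not a model stream")
    features = []
    if flags & FLAG_FEATURE_INFO:
        features = [RECORD.unpack(inp.read(RECORD.size)) for _ in range(count)]
    return features, inp.read()


buf = io.BytesIO()
write_model(buf, b"weights", [(10, 0, 0.0, 4.0)])
buf.seek(0)
features, body = read_model(buf)

plain = io.BytesIO()
write_model(plain, b"weights")  # no feature info: header plus body only
```

The classifier-specific part stays opaque here, so each model type can keep its own serialization after the shared header.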

On Mon, Jun 6, 2011 at 3:17 AM, Ted Dunning <[email protected]> wrote:

> You have to remember that mapping.  You will have created it when you
> encoded the target variable.
>
> This is occasionally a nasty problem.  I have considered adding the ability
> to record a dictionary in the classification models, but have not done so.
>
> What interface would you like to see?
>
> Hector, you might like a vote on this.  What do you think?
>
> Jeff, what do you think about the impact on the clustering/classification
> unification?
>
> On Sat, Jun 4, 2011 at 10:39 PM, XiaoboGu <[email protected]> wrote:
>
> > How can I find the map between original target labels and the encoded
> > target codes?
> >
>



-- 
Yee Yang Li Hector
http://hectorgon.blogspot.com/ (tech + travel)
http://hectorgon.com (book reviews)
