I've used systems before that kept a mapping from the original feature indices to classifier-specific indices. It can be nice because you can add new features and an old model may still work: the new features fall outside the range of the old mapping and are simply ignored. It also provides a place to store per-feature statistics (such as min / max / avg / std dev) for classifiers that need to normalize their features, such as the linear models.
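A minimal sketch of that idea, in Python for brevity. The class and method names here (`FeatureInfo`, `FeatureSetInfo`, `encode`) are hypothetical and for illustration only; they are not an existing Mahout API, and min-max scaling is just one assumed way to use the stored statistics.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FeatureInfo:
    original_index: int   # index in the shared, growing feature space
    internal_index: int   # dense index used by this particular model
    min_value: float      # statistics recorded at training time
    max_value: float


class FeatureSetInfo:
    """Hypothetical per-model mapping stored alongside the model."""

    def __init__(self, features: List[FeatureInfo]):
        self._by_original: Dict[int, FeatureInfo] = {
            f.original_index: f for f in features
        }

    def encode(self, raw: Dict[int, float]) -> Dict[int, float]:
        """Map raw features to internal indices, min-max scaled."""
        encoded: Dict[int, float] = {}
        for original_index, value in raw.items():
            info = self._by_original.get(original_index)
            if info is None:
                # Feature was added after this model was trained: skip it.
                continue
            span = info.max_value - info.min_value
            scaled = (value - info.min_value) / span if span > 0 else 0.0
            encoded[info.internal_index] = scaled
        return encoded


old_model_features = FeatureSetInfo([
    FeatureInfo(original_index=10, internal_index=0, min_value=0.0, max_value=4.0),
    FeatureInfo(original_index=42, internal_index=1, min_value=-1.0, max_value=1.0),
])

# Feature 99 did not exist when the old model was trained, so it is dropped.
print(old_model_features.encode({10: 2.0, 42: 1.0, 99: 7.5}))
# prints {0: 0.5, 1: 1.0}
```

The old model keeps working on inputs that carry newer features because anything missing from its mapping never reaches it.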
It could be something like this:

    FeatureInfo
        int32 original_index
        int32 internal_index
        float min_value
        float max_value

    FeatureSetInfo
        repeated FeatureInfo

The drawback is potentially adding 32 bytes per feature, which could be detrimental in terms of size, especially for high-dimensional feature spaces (e.g. text). If the writable interface could make this optional, it would work. Or we could give all classifiers a fixed header that we write, containing the common metadata, followed by the actual model itself.

On Mon, Jun 6, 2011 at 3:17 AM, Ted Dunning wrote:

> You have to remember that mapping. You will have created it when you
> encoded the target variable.
>
> This is occasionally a nasty problem. I have considered adding the ability
> to record a dictionary in the classification models, but have not done so.
>
> What interface would you like to see?
>
> Hector, you might like a vote on this. What do you think?
>
> Jeff, what do you think about the impact on the clustering/classification
> unification?
>
> On Sat, Jun 4, 2011 at 10:39 PM, XiaoboGu wrote:
>
> > How can I find the map between original target labels and the encoded
> > target codes?
> >
>

--
Yee Yang Li Hector
http://hectorgon.blogspot.com/ (tech + travel)
http://hectorgon.com (book reviews)
