The abstraction is getting difficult, so let me get a little more specific. Y is an industry code, and there are many of them. For each data row (which obviously has more than just one predicate) I have an industry code. My original thought was that I could have a prior based on the industry. I could use data like:

A,solvent,dust,code=111222
A,insecticide,code=111312
…
B,solvent,diesel,code=111222
...

The problem is that I would then be using the industry distribution from my training set, not from the census. By the “best value” I mean that, when classifying an example the model has not seen before, I would like the model to classify based on the prior: if p(A|Y)=0.8, select A with p=0.8.

Dan
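A minimal sketch of the idea described above: train on the small sample, then swap the training-set class prior for a census-derived prior conditioned on the industry code at prediction time. The class names, probabilities, and industry code below are illustrative, not the actual data.

# Prior correction: p_adj(c | x, y) is proportional to
#   p_model(c | x) * p_census(c | y) / p_train(c)
def reweight_by_prior(p_model, p_train, p_census_given_y):
    """Correct a classifier's posterior for a different, Y-dependent class prior."""
    unnorm = {c: p_model[c] * p_census_given_y[c] / p_train[c] for c in p_model}
    z = sum(unnorm.values())
    return {c: v / z for c, v in unnorm.items()}

# Base classifier's output for one row, e.g. predicates {solvent, dust}
p_model = {"A": 0.55, "B": 0.45}
# Class frequencies in the small, biased training set
p_train = {"A": 2.0 / 3.0, "B": 1.0 / 3.0}
# Census-derived P(class | industry code) for this row's code, e.g. 111222
p_census_111222 = {"A": 0.8, "B": 0.2}

print(reweight_by_prior(p_model, p_train, p_census_111222))
# The census prior pulls the posterior toward A for rows with this industry code.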
On Feb 25, 2016, at 12:02 PM, Nishant Kelkar <[email protected]> wrote:

I guess I don't quite understand then. So your training data is small, but you have a potentially high-cardinality feature Y from a separate source (US Census) ... how are you marrying them together, then? As in, how does each row in your small training set get a Y? Is, for example, X a common column between the two sets, where X --> Y is a one-to-many mapping?

As far as using the information provided by Y, I think any model that estimates a joint probability P(Y, X, label) will inadvertently end up using information about P(label | Y), no?

Also, what does the last line in your previous email mean ("If possible I would like to use the best values available.")?

Best,
Nishant

On Thursday, February 25, 2016, Russ, Daniel (NIH/CIT) [E] <[email protected]> wrote:

Yes, but my training data is a small biased sample, whereas feature “Y” holds population values (actually taken from the US Census, so a very large sample). If possible I would like to use the best values available.

Daniel Russ, Ph.D.
Staff Scientist, Division of Computational Bioscience
Center for Information Technology
National Institutes of Health
U.S. Department of Health and Human Services
12 South Drive
Bethesda, MD 20892-5624

On Feb 25, 2016, at 11:29 AM, Nishant Kelkar <[email protected]> wrote:

Hi Dan,

Can't you call (A, Q) A', (A, R) A'', and so on, and just treat them as separate labels altogether? Your classifier can then learn using these "fake" labels. You can then keep an in-memory map of what each fake label corresponds to in reality (A'', for example, corresponds to (A, R)).

Best Regards,
Nishant Kelkar

On Thursday, February 25, 2016, Russ, Daniel (NIH/CIT) [E] <[email protected]> wrote:

I am not sure I understand. When I think of the kernel trick, I think of converting a linear decision boundary into a higher-order decision boundary (i.e. r <- x^2 + y^2, giving a circular decision boundary). Maybe I am missing something? I'll look into this a bit more.

Dan

On Feb 25, 2016, at 11:11 AM, Alexander Wallin <[email protected]> wrote:

Can't you make a compound feature (or features), i.e. use the kernel trick?

Alexander

On Feb 25, 2016, at 17:06, Russ, Daniel (NIH/CIT) [E] <[email protected]> wrote:

Hi,

Is it possible to change the prior based on a feature? For example, if I have the following data (very simplified):

Class, Predicates
A, X
A, X
B, X

You would expect class A 2/3 of the time when the only feature is predicate X. However, let's say I know another feature Y that can take values {Q, R, S}, with P(A|Q)=0.8, P(A|R)=0.1, P(A|S)=0.3. Is there any way to add feature Y to the classifier, taking advantage of this information?

Thanks,
Dan
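Taking the toy numbers from the original question above, a naive conditional-independence combination gives one concrete way to fold feature Y in: P(class | X, Y) is proportional to P(class | X) * P(class | Y) / P(class). The sketch below assumes only two classes (so P(B|Q) = 0.2) and looks at a row with Y = Q; it is illustrative rather than a prescribed method.

# Every training row contains predicate X, so P(class | X) = P(class) and
# the extra feature Y carries all of the additional information.
p_class   = {"A": 2.0 / 3.0, "B": 1.0 / 3.0}   # from the three training rows
p_given_x = {"A": 2.0 / 3.0, "B": 1.0 / 3.0}   # X appears in every row
p_given_q = {"A": 0.8, "B": 0.2}               # the stated P(class | Y=Q)

unnorm = {c: p_given_x[c] * p_given_q[c] / p_class[c] for c in p_class}
z = sum(unnorm.values())
print({c: v / z for c, v in unnorm.items()})   # approximately {'A': 0.8, 'B': 0.2}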
