Hi Pavel,
first of all, I would include an intercept term in the model. The
intercept absorbs the base rate, i.e. the proportion of positive
examples in the training set.
Second, for getting calibrated probabilities out of the downsampled
model, I can think of two ways:
1. Use another set of input data to measure the observed maximum
Oops, hit enter too early...
Just wanted to say that those are the two ways I'm thinking of right now,
since I am facing a similar challenge. I'm thankful for any suggestions or
comments.
Cheers,
Johannes
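
Assuming the setup here is the usual one where the negative class was
downsampled before training, the standard prior correction (an intercept
shift, not something spelled out in the thread) can be sketched like this:

```python
import math

def recalibrate(p_sampled, neg_keep_rate):
    """Map a probability from a model trained on negative-downsampled data
    back to a calibrated probability for the full population.

    Keeping only a fraction `neg_keep_rate` of the negatives inflates the
    odds by 1 / neg_keep_rate, so we multiply the odds back down.
    """
    odds = p_sampled / (1.0 - p_sampled) * neg_keep_rate
    return odds / (1.0 + odds)

# For a logistic model this is equivalent to shifting the intercept once
# instead of fixing every prediction: b0_corrected = b0 + log(neg_keep_rate)
corrected = recalibrate(0.5, 0.1)
```

With no downsampling (neg_keep_rate = 1.0) the correction is a no-op; the
smaller the kept fraction of negatives, the more the raw probabilities get
pulled down.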
On Thu, Dec 27, 2012 at 3:13 PM, Johannes Schulte
johannes.schu...@gmail.com wrote:
Hi Pavel,
Random low-dimensional projections tend to look like normal distributions.
This is the central limit theorem at work. I think it is hard to diagnose
anything from this.
On the other hand, projections against the principal components tend to
show more structure.
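
To illustrate the point above with a small numpy sketch (the data,
dimensions, and separation measure are made up for the demo): a random
projection of clustered high-dimensional data smears into a single
Gaussian-looking blob, while the top principal component keeps the
bimodal cluster structure.

```python
import numpy as np

rng = np.random.default_rng(0)
# two well-separated clusters in 500 dimensions
d, n = 500, 400
centers = np.zeros((2, d))
centers[1, :50] = 4.0
X = np.vstack([rng.normal(c, 1.0, size=(n // 2, d)) for c in centers])
X = X - X.mean(axis=0)

# random unit direction: cluster structure mostly washes out
r = rng.normal(size=d)
r /= np.linalg.norm(r)
proj_random = X @ r

# top principal component: the between-cluster direction dominates
cov = X.T @ X / n
_, v = np.linalg.eigh(cov)          # eigenvalues in ascending order
proj_pca = X @ v[:, -1]

def separation(p):
    """Distance between cluster means in units of the overall spread."""
    a, b = p[: n // 2], p[n // 2:]
    return abs(a.mean() - b.mean()) / p.std()
```

Plotting histograms of `proj_random` and `proj_pca` shows the difference
directly; `separation` just quantifies it.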
On Thu, Dec 27, 2012 at 11:53 AM,
This paper is probably of interest for this problem:
http://research.microsoft.com/apps/pubs/default.aspx?id=122779
On Thu, Dec 27, 2012 at 6:14 AM, Johannes Schulte
johannes.schu...@gmail.com wrote:
You have a sort of regression problem here.
Add features of each item if you can. Then add day-of-week, weekend or
holiday features. Fit your regression.
Can you say the size of your data?
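
A minimal sketch of the calendar features suggested above (the exact
encoding is my own choice, not something prescribed in the thread):

```python
from datetime import date

def calendar_features(d, holidays=frozenset()):
    """Encode a date as a day-of-week one-hot plus weekend and holiday flags."""
    dow = d.weekday()  # Monday = 0 ... Sunday = 6
    feats = [1.0 if i == dow else 0.0 for i in range(7)]
    feats.append(1.0 if dow >= 5 else 0.0)        # weekend flag
    feats.append(1.0 if d in holidays else 0.0)   # holiday flag
    return feats

# e.g. calendar_features(date(2012, 12, 27)) encodes a Thursday, not a weekend
```

Concatenate these with the per-item features and fit whatever regression
you like on top.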
On Thu, Dec 27, 2012 at 7:26 AM, Matt Mitchell goodie...@gmail.com wrote:
I'm looking for a way to
I have fixed the vectorizer in knn. Available from [0]
as org.apache.mahout.knn.Vectorize20NewsGroups
Typical invocation would have these command line options:
lic subject false 1000 ../20news-bydate-train
These are:
- term weighting code. lic = log(tf) * IDF with cosine normalization
-
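
For reference, the "lic" weighting described above can be sketched like
this in plain Python (I'm assuming the log(tf) part is the common
1 + log(tf) variant and that IDF is log(N / df); Mahout's exact
implementation may differ in those details):

```python
import math
from collections import Counter

def lic_vectorize(docs):
    """Weight each token as (1 + log(tf)) * log(N / df), then cosine-normalize.

    `docs` is a list of token lists; returns one {token: weight} dict per doc.
    """
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = {t: (1.0 + math.log(c)) * math.log(n / df[t]) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        # terms occurring in every document get IDF 0; guard the zero norm
        vectors.append({t: w / norm for t, w in vec.items()} if norm else vec)
    return vectors

vecs = lic_vectorize([["a", "b"], ["a", "c"]])
```

After normalization each document vector has unit length, so dot products
between them are cosine similarities.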