<done_basking>Grant</done_basking>
Here's an interesting piece:
09/07/22 18:23:02 INFO bayes.TestClassifier: Testing:wikipedia/subjects/prepared-test/history.txt
09/07/22 18:23:07 INFO bayes.TestClassifier: history 95.458984375 3910/4096.0
09/07/22 18:23:07 INFO bayes.TestClassifier: --------------
09/07/22 18:23:07 INFO bayes.TestClassifier: Testing:/wikipedia/subjects/prepared-test/science.txt
09/07/22 18:23:08 INFO bayes.TestClassifier: science 15.554072096128172 233/1498.0
09/07/22 18:23:08 INFO bayes.TestClassifier: =======================================================
In other words, I'm really good at predicting History as a category
and really bad at predicting Science.
I think the following might help explain why:
ls -l
total 245360
-rwxrwxrwx 1 grantingersoll staff 89518235 Jul 22 17:53 history.txt*
-rwxrwxrwx 1 grantingersoll staff 36099183 Jul 22 17:53 science.txt*
The history set is more than double the size of the science set, both
by file size and by the number of test examples (4096 vs. 1498).
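To put a number on that imbalance, here is a quick Python sketch that uses only the figures quoted above: the byte counts from the ls listing and the per-category totals from the TestClassifier log. The variable and function names are my own, purely for illustration.

```python
# Figures copied from the ls listing and the TestClassifier log above.
file_bytes = {"history": 89518235, "science": 36099183}
test_examples = {"history": 4096, "science": 1498}


def ratio(counts, a="history", b="science"):
    """Return how many times larger category a is than category b."""
    return counts[a] / counts[b]


size_ratio = ratio(file_bytes)
example_ratio = ratio(test_examples)

# Fraction of all test examples that belong to the larger category.
history_share = test_examples["history"] / sum(test_examples.values())

print(f"history/science by bytes:         {size_ratio:.2f}x")
print(f"history/science by test examples: {example_ratio:.2f}x")
print(f"history share of test examples:   {history_share:.1%}")
```

On these figures history comes out at roughly 2.5 times science by bytes and roughly 2.7 times by test examples, so about 73% of the test examples are history.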
There is obviously a teaching moment here. I know there is a lot out
there about sample sizes, feature selection, etc. Can we boil some of
it down into cogent recommendations for our users?
-Grant
On Jul 22, 2009, at 5:23 PM, Grant Ingersoll wrote:
<basking>Grant</basking>
On Jul 22, 2009, at 4:46 PM, Ted Dunning wrote:
Getting something to run is a big step. It is important to bask in
the glow for a tiny moment.
On Wed, Jul 22, 2009 at 1:05 PM, Grant Ingersoll <[email protected]> wrote:
Confusion Matrix
-------------------------------------------------------
a       b       <--Classified as
3910    186      |  4096    a = history
1265    233      |  1498    b = science
Default Category: unknown: 2
</snip>
At least it's better than 50%, which is presumably a good thing ;-)
I have no clue what the state of the art is these days, but it doesn't
seem _horrendous_ either.
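As a rough sanity check on the "better than 50%" yardstick, the Python sketch below recomputes the usual summary numbers from the confusion matrix above. It assumes rows are the true category and columns are the predicted one (which matches the 3910/4096 and 233/1498 figures), and it leaves out the "Default Category: unknown: 2" line because the output does not say how to count it. All names are illustrative only.

```python
# Confusion matrix copied from the output above.
# Assumption: rows are the true category, columns are the predicted category.
labels = ["history", "science"]
matrix = [
    [3910, 186],   # true history: 3910 classified as history, 186 as science
    [1265, 233],   # true science: 1265 classified as history, 233 as science
]

total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(len(labels)))
accuracy = correct / total

# Baseline: always answer with the most common true category.
majority_baseline = max(sum(row) for row in matrix) / total

recall = {}
precision = {}
for i, label in enumerate(labels):
    row_total = sum(matrix[i])                   # examples truly in this category
    col_total = sum(row[i] for row in matrix)    # examples predicted as this category
    recall[label] = matrix[i][i] / row_total
    precision[label] = matrix[i][i] / col_total

print(f"overall accuracy:  {accuracy:.1%}")
print(f"majority baseline: {majority_baseline:.1%}")
for label in labels:
    print(f"{label}: recall {recall[label]:.1%}, precision {precision[label]:.1%}")
```

On these numbers the overall accuracy is about 74%, only around one point above the roughly 73% you would get by labeling everything as history, so comparing against 50% flatters the result.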
I'd love to see someone validate what I have done. Let me know if you
need more details. I'd also like to know how I can improve it.
--
Ted Dunning, CTO
DeepDyve
--------------------------
Grant Ingersoll
http://www.lucidimagination.com/
Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
using Solr/Lucene:
http://www.lucidimagination.com/search