<done_basking>Grant</done_basking>

Here's an interesting piece:
09/07/22 18:23:02 INFO bayes.TestClassifier: Testing:wikipedia/ subjects/prepared-test/history.txt
09/07/22 18:23:07 INFO bayes.TestClassifier: history 95.458984375 3910/4096.0
09/07/22 18:23:07 INFO bayes.TestClassifier: --------------
09/07/22 18:23:07 INFO bayes.TestClassifier: Testing:/wikipedia/ subjects/prepared-test/science.txt
09/07/22 18:23:08 INFO bayes.TestClassifier: science 15.554072096128172 233/1498.0
09/07/22 18:23:08 INFO bayes.TestClassifier: =======================================================


In other words, I'm really good at predicting History as a category and really bad at predicting Science.
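
To put numbers on that gap, here is a minimal sketch in plain Python (not Mahout code; the function names are made up for illustration). It takes the counts from the confusion matrix reported in this thread and computes per-category accuracy, overall accuracy, and the score you would get by always guessing the larger category.

```python
# Counts copied from the confusion matrix in this thread.
# Outer key = true category, inner key = category the classifier assigned.
confusion = {
    "history": {"history": 3910, "science": 186},
    "science": {"history": 1265, "science": 233},
}

def per_class_recall(matrix):
    """Fraction of each true category's examples that were labeled correctly."""
    return {label: row[label] / sum(row.values()) for label, row in matrix.items()}

def overall_accuracy(matrix):
    """Correctly labeled examples divided by all examples."""
    correct = sum(row[label] for label, row in matrix.items())
    total = sum(sum(row.values()) for row in matrix.values())
    return correct / total

def majority_baseline(matrix):
    """Accuracy of a classifier that always guesses the largest true category."""
    sizes = [sum(row.values()) for row in matrix.values()]
    return max(sizes) / sum(sizes)

recall = per_class_recall(confusion)
accuracy = overall_accuracy(confusion)
baseline = majority_baseline(confusion)

for label, value in recall.items():
    print(f"{label}: {value:.1%}")
print(f"overall: {accuracy:.1%}, always-guess-largest baseline: {baseline:.1%}")
```

With these counts the overall accuracy works out to about 74.1%, while always guessing history would score about 73.2%, so on a test set this skewed the per-category numbers say more than the overall figure does.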

I think the following might help explain why:
ls -l
total 245360
-rwxrwxrwx  1 grantingersoll  staff  89518235 Jul 22 17:53 history.txt*
-rwxrwxrwx  1 grantingersoll  staff  36099183 Jul 22 17:53 science.txt*

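Going by file size, I have roughly two and a half times as much history data as science data (about 89.5 MB vs. 36.1 MB), and my test set is skewed the same way (4096 history examples vs. 1498 science examples).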
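
One common way to deal with this kind of skew is to downsample the larger category before training. The sketch below is plain Python with a hypothetical helper, `balance_by_downsampling`, and toy stand-in documents; it is not a Mahout API, and whether it would actually improve the science numbers here is untested.

```python
import random

def balance_by_downsampling(examples_by_label, seed=0):
    """Trim every category to the size of the smallest one (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    target = min(len(docs) for docs in examples_by_label.values())
    return {
        label: rng.sample(docs, target)  # sample without replacement
        for label, docs in examples_by_label.items()
    }

# Toy stand-ins for the real documents, skewed roughly like the data above.
toy = {
    "history": [f"history doc {i}" for i in range(25)],
    "science": [f"science doc {i}" for i in range(10)],
}

balanced = balance_by_downsampling(toy)
print({label: len(docs) for label, docs in balanced.items()})
```

The trade-off is that downsampling throws away training data from the larger category, so it is worth comparing against the unbalanced run rather than assuming it helps.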

There is obviously a teaching moment here. I know there is a lot out there about sample sizes, feature selection, etc. Can we boil some of it down into a few cogent recommendations for our users?


-Grant

On Jul 22, 2009, at 5:23 PM, Grant Ingersoll wrote:

<basking>Grant</basking>

On Jul 22, 2009, at 4:46 PM, Ted Dunning wrote:

Getting something to run is a big step. It is important to bask in the glow
for a tiny moment.

On Wed, Jul 22, 2009 at 1:05 PM, Grant Ingersoll <[email protected]> wrote:

Confusion Matrix
-------------------------------------------------------
a       b       <--Classified as
3910    186      |  4096        a     = history
1265    233      |  1498        b     = science
Default Category: unknown: 2
</snip>

At least it's better than 50%, which is presumably a good thing ;-) I have
no clue what the state of the art is these days, but it doesn't seem
_horrendous_ either.

I'd love to see someone validate what I have done. Let me know if you need
more details.  I'd also like to know how I can improve it.




--
Ted Dunning, CTO
DeepDyve



--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:
http://www.lucidimagination.com/search
