Take a look at the Apache Mahout project and the books "Mahout in Action" and
"Taming Text". All three have a lot to say about categorizing documents well.
http://mahout.apache.org
http://www.manning.com/owen
http://www.manning.com/ingersoll/

----- Original Message -----
| From: "Jonathan Boston" <[email protected]>
| To: [email protected]
| Sent: Thursday, November 29, 2012 2:44:13 PM
| Subject: Best way to use the Document Categorization
| 
| Hi,
| 
| I'm trying to use the Document Categorization over a large set of
| text and could use some help. I've just briefly looked into MaxEnt
| so I'm unsure of the best approach.
| 
| For my project, the texts are categorized, but some percentage
| (probably around 20%) of them are incorrect. Some of them could also
| legitimately fall under multiple categories. I want to correct the
| category for each text, but only if it has a high probability of
| being categorized correctly. I consider "high" to mean a probability
| > 0.5, with no other category being > 0.25.
| 
| My initial approach has been to try and bootstrap a training set by:
| 
| 1) Randomly picking a text and adding it to the current training set
| 2) Testing each of the texts in the training set with the new model
| 3) Only keeping the newest text if each of the training texts has a
| high probability of matching the initial category that was provided.
| 4) Repeat
| 
| I've had limited success building up a training set, but it takes a
| while to train the model as more records are added to the training
| set.
| 
| Does this seem like a reasonable approach? Will the model perform
| well if there are a few incorrectly categorized texts in the
| training set?
| 
| Thanks.
| 
| -boston
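
For reference, the acceptance rule and the bootstrap loop described in the
quoted message can be sketched roughly as follows. This is only an
illustration under stated assumptions, not OpenNLP or Mahout code: the names
is_confident, bootstrap, train and max_tries are hypothetical, train stands
in for whatever trainer is actually used (it is assumed to return a function
mapping a text to per-category probabilities), and "randomly picking" is
assumed to mean sampling without replacement.

```python
import random
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]                    # (text, provided category)
Model = Callable[[str], Dict[str, float]]    # text -> {category: probability}


def is_confident(probs: Dict[str, float], label: str,
                 high: float = 0.5, other_max: float = 0.25) -> bool:
    """True if `label` scores above `high` and no other category exceeds `other_max`."""
    if probs.get(label, 0.0) <= high:
        return False
    return all(p <= other_max for cat, p in probs.items() if cat != label)


def bootstrap(pool: List[Example],
              train: Callable[[List[Example]], Model],
              max_tries: int, seed: int = 0) -> List[Example]:
    """Grow a training set one randomly chosen text at a time (steps 1-4)."""
    rng = random.Random(seed)
    candidates = list(pool)
    rng.shuffle(candidates)                  # random order, no repeats (assumption)
    kept: List[Example] = []
    for candidate in candidates[:max_tries]:
        trial = kept + [candidate]           # step 1: add a random text
        model = train(trial)                 # hypothetical trainer, retrained each time
        # steps 2-3: every training text must still match its provided category
        if all(is_confident(model(text), label) for text, label in trial):
            kept = trial
    return kept                              # step 4 is the loop itself
```

Written this way, the loop retrains once per candidate on a set that keeps
growing, which is consistent with the slowdown described in the message.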
