That does not seem scalable or flexible.

We have to assume our training sample stays accurate for some (possibly very long) period of time, and as information changes we cannot update the model to compensate or help it "learn"? If we wanted to add a new category, we'd have to discard the existing model, append the new category's training data to the original stream, and build an entirely new model?

As I mentioned, I am entirely new to NLP, but I want to ask: are these limitations specific to OpenNLP, or are they common among other libraries and technologies that implement document categorizers?

I have not looked "under the hood" of the anti-spam (Bayesian) categorizers, but don't they have a mechanism to keep "learning" as the user provides ongoing samples (spam/ham in that case)?
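
To make the question concrete, here is a minimal sketch of why a count-based Bayesian filter can keep learning. It is not taken from OpenNLP or from any real spam filter; the class name IncrementalNaiveBayes and its methods are hypothetical, and it assumes simple whitespace tokenization with add-one smoothing. The point is only that the "model" is a set of counters, so a new sample (or a new category) just increments them.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Illustrative only: a tiny multinomial naive Bayes whose model is nothing but counts. */
class IncrementalNaiveBayes {
    private final Map<String, Integer> docsPerCategory = new HashMap<>();
    private final Map<String, Map<String, Integer>> tokenCounts = new HashMap<>();
    private final Map<String, Integer> tokensPerCategory = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    /** Learning one more sample only increments counters; nothing is rebuilt. */
    public void learn(String category, String text) {
        docsPerCategory.merge(category, 1, Integer::sum);
        totalDocs++;
        Map<String, Integer> counts =
                tokenCounts.computeIfAbsent(category, c -> new HashMap<>());
        for (String token : tokenize(text)) {
            counts.merge(token, 1, Integer::sum);
            tokensPerCategory.merge(category, 1, Integer::sum);
            vocabulary.add(token);
        }
    }

    /** Returns the category with the highest log-probability, or null if untrained. */
    public String classify(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        String[] tokens = tokenize(text);
        for (String category : docsPerCategory.keySet()) {
            // Log prior: share of training documents seen for this category.
            double score = Math.log(docsPerCategory.get(category) / (double) totalDocs);
            Map<String, Integer> counts = tokenCounts.get(category);
            int total = tokensPerCategory.getOrDefault(category, 0);
            for (String token : tokens) {
                int count = counts.getOrDefault(token, 0);
                // Add-one smoothing; the extra +1 reserves a slot for unseen tokens.
                score += Math.log((count + 1.0) / (total + vocabulary.size() + 1.0));
            }
            if (score > bestScore) {
                bestScore = score;
                best = category;
            }
        }
        return best;
    }

    private static String[] tokenize(String text) {
        String trimmed = text.toLowerCase().trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }
}
```

With this kind of model, calling learn("news", ...) after the fact adds a category without touching the counts already gathered for "spam" and "ham". Whether a given library exposes its model this way is a separate question.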

Thanks for the reply.  Hope to learn more.

-AJ


On 9/10/2015 4:02 AM, Joern Kottmann wrote:
Hello,

you can train a model only once. After it is trained it is not possible to
continue with the training by adding more samples to it.

You need to create a stream of DocumentSample objects, and a straightforward
way to do that is to collect them all in a collection and then create a
stream from that collection.
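
Since the model itself cannot be extended, one way to still work in batches is to persist the samples rather than the model, and rebuild the model from the full set whenever a batch arrives. The sketch below is plain Java and does not call OpenNLP; TrainingCorpus and its methods are hypothetical names, and it assumes category names contain no whitespace. Each sample is stored as one line, category first and then the text, which I believe matches the one-document-per-line layout the doccat trainer reads, but please check that against your OpenNLP version.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: keeps every sample so a model can be rebuilt from all of them. */
class TrainingCorpus {
    private final Path file;

    TrainingCorpus(Path file) {
        this.file = file;
    }

    /** Appends one sample as a single line: category, a space, then the flattened text. */
    void append(String category, String text) throws IOException {
        // Collapse line breaks and runs of whitespace so one sample stays on one line.
        String line = category + " " + text.replaceAll("\\s+", " ").trim()
                + System.lineSeparator();
        Files.write(file, line.getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    /** Reads back every sample appended so far as {category, text} pairs. */
    List<String[]> readAll() throws IOException {
        List<String[]> samples = new ArrayList<>();
        if (!Files.exists(file)) {
            return samples;
        }
        for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
            if (line.trim().isEmpty()) {
                continue;
            }
            String[] parts = line.split(" ", 2);
            samples.add(new String[] { parts[0], parts.length > 1 ? parts[1] : "" });
        }
        return samples;
    }
}
```

After each batch, the pairs returned by readAll() would be turned into DocumentSample objects, gathered in a collection, and passed as a stream to the trainer to produce a fresh model. That step is left out here because the exact constructor and train signatures differ between OpenNLP releases.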

HTH,
Jörn

On Wed, Sep 9, 2015 at 9:51 PM, AJ Weber <[email protected]> wrote:

So I'm just getting started with OpenNLP and trying to spin up the DocCat.

I would like to process a series of files in batches to train the document
categorizer.

I assume it is possible to loop through documents:

1) extract the text (will probably try Tika for this), and then
2) send the DocumentSample to the categorizer to add to the model?

I see how I can create a DocumentSample from a category (I will know this
as part of the batch args) and the extracted text.  However, I cannot
figure out how to incrementally add that sample to a new (or existing)
model for additional "training".

Obviously, I would like to then save the model between batches so I can
either use it for categorization or incrementally add more DocumentSample
objects to it for further training at some later time.

Does anyone have a java snippet I could look at to help me get started?

Thank you!

-AJ


