You can do all sorts of things. I have now implemented a version that uses ThreadLocals. It works fine, but quite frankly, it's a pain in the butt. The world has been moving to multi-threaded for a long time now, and I think it's a very reasonable assumption that a simple tool like a POS tagger is thread safe, without me as an API user having to think about it.
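
To illustrate, roughly this kind of ThreadLocal wrapper (a simplified sketch; the wrapper class and its names are made up here, only POSModel, POSTaggerME and tag() are actual OpenNLP API):

import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;

// Sketch of a ThreadLocal-based wrapper: the model is shared across threads,
// and each thread lazily gets its own POSTaggerME instance.
public class ThreadLocalPosTagger {

    private final POSModel model;

    // one tagger per thread, created on first use; note that in a thread
    // pool these instances live as long as the pool threads do
    private final ThreadLocal<POSTaggerME> tagger =
            ThreadLocal.withInitial(() -> new POSTaggerME(model));

    public ThreadLocalPosTagger(POSModel model) {
        this.model = model;
    }

    public String[] tag(String[] tokens) {
        return tagger.get().tag(tokens);
    }
}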

When I ran the OpenNLP stack multi-threaded and saw the exceptions, I read the signs and figured out what the issue was. Not everybody will be able to do that; they will just see that it crashes and move on to a different tool. If the POS tagger cannot be made thread safe, that's what I will do, actually.

But, that's just my opinion. If your approach works for you, that's great.


On 11/01/2017 16:05, Cohan Sujay Carlos wrote:
I meant:

a)  Instantiate the components in a local scope, so that their references
live on the call (thread) stack.


On Wed, Jan 11, 2017 at 8:33 PM, Cohan Sujay Carlos <co...@aiaioo.com>
wrote:

Control over threading is not required to "share the model between
threads and create one instance of the component per thread".

One could use a scope where variable references are guaranteed to be
stored in the call stack (say method-local variables in Java).

You could then:

a)  Instantiate the components on the call stack.
b)  Instantiate the models in constructors or the factory methods of a
singleton.

If one were using OpenNLP in a Tomcat webapp, for instance, I believe one
could use this approach.
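
A minimal sketch of that approach, assuming a lazily loaded singleton model (the class name and model path below are only illustrative; POSModel and POSTaggerME are the real OpenNLP classes):

import java.io.IOException;
import java.io.InputStream;

import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;

// The heavyweight POSModel is loaded once and shared; the lightweight
// POSTaggerME exists only as a local variable on the calling thread's stack.
public final class PosTaggingService {

    private static volatile POSModel model;

    private static POSModel getModel() throws IOException {
        if (model == null) {
            synchronized (PosTaggingService.class) {
                if (model == null) {
                    // example path; load the model however the webapp does
                    try (InputStream in = PosTaggingService.class
                            .getResourceAsStream("/en-pos-maxent.bin")) {
                        model = new POSModel(in);
                    }
                }
            }
        }
        return model;
    }

    public static String[] tag(String[] tokens) throws IOException {
        // a fresh component per call, so it is never shared between threads
        POSTaggerME tagger = new POSTaggerME(getModel());
        return tagger.tag(tokens);
    }
}

The idea being that the expensive state lives in the shared model, so creating a tagger per call (or per request) stays comparatively cheap.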

Cohan Sujay Carlos


On Wed, Jan 11, 2017 at 7:08 PM, Thilo Goetz <twgo...@gmx.de> wrote:

Correct me if I'm wrong, but that approach only works if you control the
thread creation yourself. In my case, for example, I was using Scala's
parallel collection API, and had no control over the threading. I will
usually want to create one service that does tokenization or POS tagging or
whatever, which can be accessed by many threads. I don't want to have to
mess around with an object pool, or thread locals, or anything like that,
especially since there is really no good reason for it, IMHO. You could very
easily just return the probabilities together with the spans, and whoever
doesn't need them can ignore them. Or have two methods, one with probabilities, one
without. Maybe it's just where I'm coming from, but I fail to see the
advantages of the current approach.

--Thilo



On 11/01/2017 13:58, Joern Kottmann wrote:

Hello Thilo,

I am interested in your opinion about how this is done currently.
We say: "Share the model between threads and create one instance of the
component per thread".

Wouldn't that work well in your use case?

Jörn



On Wed, Jan 11, 2017 at 11:05 AM, Thilo Goetz <twgo...@gmx.de> wrote:

Hi,

In a recent project, I was using SentenceDetectorME, TokenizerME and
POSTaggerME. It turns out that none of those is thread safe. This is
because the classification probabilities for the last tag() call (for
example) are stored in a member variable and can be retrieved by a
separate API call.

I'm planning to build thread safe versions for myself, and I'd be happy to
contribute a patch if there is interest. This could be done as a
conservative extension with an additional method such as tagReentrant,
where the old API calls would continue to work as before and would still
not be thread safe. Alternatively, one could remodel the API so that
everything was thread safe, but that would break backwards compatibility.
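
To make that concrete, a purely hypothetical sketch of what the extension could look like from the caller's side (TagResult and tagReentrant are not part of the current OpenNLP API; they only illustrate the idea):

// Hypothetical result type: tags and their probabilities come back together
// from a single call, so nothing has to be kept in a member variable.
public final class TagResult {

    private final String[] tags;
    private final double[] probs;

    public TagResult(String[] tags, double[] probs) {
        this.tags = tags.clone();
        this.probs = probs.clone();
    }

    public String[] getTags()  { return tags.clone(); }
    public double[] getProbs() { return probs.clone(); }
}

// Proposed addition to POSTaggerME (signature only, not existing API):
//   public TagResult tagReentrant(String[] tokens);
//
// The existing tag(String[]) / probs() pair would keep working as before,
// while callers that need thread safety would use the one-shot variant.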

Final question: if I do this for the classes mentioned above, are there
other tools that should be made thread safe while we're at it?

Opinions?

--Thilo




