On Mar 14, 2011, at 8:10 AM, Jörn Kottmann wrote:

> On 3/9/11 3:33 PM, Grant Ingersoll wrote:
>> On Mar 7, 2011, at 7:16 AM, Jörn Kottmann wrote:
>>
>>> On 3/6/11 1:37 PM, Grant Ingersoll wrote:
>>>> On Mar 5, 2011, at 2:13 PM, Jörn Kottmann wrote:
>>>>> I actually tried to ask how you would do that. I don't think it is
>>>>> super simple. Can you please briefly explain what you have in mind?
>>>>
>>>> From the looks of it, we'd just need to return the bestSequence object
>>>> (or some larger containing object) out to the user and not use it (or
>>>> other pieces that may change) as a member variable. Granted, I'm still
>>>> learning the code, so I may be misreading some things. From the looks
>>>> of it, though, simply changing the tag method to return the
>>>> bestSequence would let the user make the appropriate calls to get the
>>>> best outcome and the probabilities (or the probs() method could take
>>>> in the bestSequence object if you wanted to keep that convenience).
>>>>
>>>> I suppose I should just work up a patch; it would be a lot easier than
>>>> discussing it in the abstract.
>>>
>>> There is also a cache which would then have to be created per call; we
>>> need to do some measuring of how expensive that is compared to the
>>> current solution.
>>>
>>> The POS Tagger should also use the new feature generation code we made
>>> for the name finder, but that is not thread safe by design, because it
>>> has state. The state is necessary to support per-document features like
>>> we have in the name finder.
>>>
>>> Do you think making the name finder and other components thread safe in
>>> the same way is also possible?
>>
>> Not sure. I only noticed it in the POS tagger.
>>
>>> Right now we have the same thread-safety convention for all components,
>>> which I like because it is easy for someone new to learn. When it is
>>> mixed, e.g. the POS Tagger thread safe and the name finder not, people
>>> will get confused.
>>
>> It is no doubt a hard problem.
>> There is always this tradeoff between easy to learn and fast, it seems.
>> In my experience, most programmers aren't good at concurrent programming
>> (and I certainly don't claim to be either), so it is hard to get right.
>> I think one of the big wins for us could be to make OpenNLP really fast,
>> which will increase its viability and attract others.
>
> Making OpenNLP much faster is of course good. When we discuss performance
> changes we also need to know how much each change would speed things up.
> In my eyes the most to gain is currently in optimizing the feature
> generation, making the caching more efficient, etc.
Agreed. If others aren't aware of it, YourKit gives free Open Source
licenses for its profiler to Apache committers; details are on their
website.

> How much faster do you think the POS Tagger will be with your proposed
> change?

This change isn't about performance, it's about thread safety. Like I
said, instead of talking about it, I'll put up a patch as soon as I get
some spare time.
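For readers following along, the pattern Grant proposes can be sketched as
follows. This is a hypothetical illustration, not the actual OpenNLP API:
the class and method names (TagResult, StatelessTagger) are invented, and a
dummy decode stands in for the real beam search. The point is that tag()
returns all per-call results instead of stashing bestSequence in a member
variable, so a single tagger instance has no mutable state and can be shared
across threads.

```java
import java.util.Arrays;
import java.util.List;

// Immutable result object returned per call, replacing the shared
// bestSequence member variable.
final class TagResult {
    private final List<String> outcomes;
    private final double[] probs;

    TagResult(List<String> outcomes, double[] probs) {
        this.outcomes = outcomes;
        this.probs = probs;
    }

    List<String> getOutcomes() { return outcomes; }

    // Defensive copy so callers cannot mutate the stored probabilities.
    double[] getProbs() { return probs.clone(); }
}

final class StatelessTagger {
    // No bestSequence field: all per-call state lives on the stack,
    // so concurrent calls to tag() cannot interfere with each other.
    TagResult tag(String[] tokens) {
        // Dummy decode standing in for the real beam-search over the model.
        String[] outcomes = new String[tokens.length];
        double[] probs = new double[tokens.length];
        Arrays.fill(outcomes, "NN");
        Arrays.fill(probs, 0.9);
        return new TagResult(Arrays.asList(outcomes), probs);
    }
}
```

As the thread notes, the open cost question is the cache: with this design
any decoding cache must be allocated per call (or made thread-local), which
is exactly what Jörn suggests measuring against the current solution.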
