Hey James,

On 17/03/12 02:49, James Kosin wrote:
If you two could test the latest out, I'd appreciate it.
Especially any performance issues, if possible.  I'm trying to be sure I
haven't turned this into an N^2-type problem again.  If so, I'll need to
re-open the JIRA and fix the Index class to handle case sensitivity.
Well, the dictionary name finder is indeed a bit slower now, but that is fine with me... I tested it with a relatively big test corpus and the dictionary lookup took 2-3 seconds more than the maxent model. Now, even though that is counter-intuitive (you expect the iterative search to be extremely fast), I simply don't care at this point - it is not a problem for me! The fact that it finds multi-word tokens is the most important fix, with case-sensitivity coming second for me (I can always lowercase my dictionary)...
(b)  I tried to fix the DictionaryNameFinder.... woops, I refactored
incorrectly.  Unfortunately, you two are but a few that use the
DictionaryNameFinder.
Maybe we are just a few because the DictionaryNameFinder never quite worked as advertised... I do expect more people to start using it, especially if it can be integrated with the maxent model (from the evaluator's point of view)... It is the easiest way to improve one's results without cheating!

Thanks for your patience, testing and posting to the list.

Don't mention it man... Thank YOU for addressing it! :-)

I can see clearly now my mistakes.
That is what this is all about! Good stuff...

The code currently in SVN trunk has sort-of a compromise until we get
the Index working again properly with case sensitivity.  It contains
code that will keep trying longer token entries as long as the current
length is less than the maximum held in the dictionary.  This allows the
DictionaryNameFinder's find() method to work; but, we have a small
performance penalty due to the way the find() method isn't caring what
words it adds to the token strings.
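
To make sure I understand the compromise described above, here is a minimal, self-contained sketch of the idea as I read it. It is not the code in SVN trunk: the class LongestMatchSketch and its add()/find() methods are made up for illustration and do not use the real Dictionary or Index classes. From each start position it keeps trying longer token sequences, up to the longest entry held in the (hypothetical) dictionary, and keeps the longest match. Matching is case-insensitive because everything is lower-cased first, which is an assumption on my part.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class LongestMatchSketch {
    // Hypothetical stand-in for the dictionary: entries are lower-cased token lists.
    private final Set<List<String>> entries = new HashSet<>();
    // Token count of the longest entry; bounds how far find() extends a candidate.
    private int maxTokenCount = 0;

    public void add(String... tokens) {
        entries.add(normalize(Arrays.asList(tokens)));
        maxTokenCount = Math.max(maxTokenCount, tokens.length);
    }

    private static List<String> normalize(List<String> tokens) {
        List<String> out = new ArrayList<>(tokens.size());
        for (String t : tokens) {
            out.add(t.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    // Returns {start, end} token offsets (end exclusive) of the matches found.
    public List<int[]> find(String[] sentence) {
        List<int[]> spans = new ArrayList<>();
        List<String> tokens = normalize(Arrays.asList(sentence));
        int start = 0;
        while (start < tokens.size()) {
            int bestEnd = -1;
            int limit = Math.min(tokens.size(), start + maxTokenCount);
            // Keep trying longer candidates; remember the longest one that matches.
            for (int end = start + 1; end <= limit; end++) {
                if (entries.contains(tokens.subList(start, end))) {
                    bestEnd = end;
                }
            }
            if (bestEnd > 0) {
                spans.add(new int[] {start, bestEnd});
                start = bestEnd; // continue after the match
            } else {
                start++;
            }
        }
        return spans;
    }
}
```

In this sketch every start position costs up to maxTokenCount lookups whether or not the first token appears in any entry, which is presumably where the small runtime penalty comes from.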

So, by 'performance penalty' you mean runtime performance as opposed to accuracy, yes? This is exactly what I confirmed above... It takes roughly twice as long for the dictionary to do its job as it takes the maxent model... Nevertheless, what comes back are the correct, case-insensitive named entities, so all is good (at least for me!)...

I'm going to look at possible solutions to getting the Index working
again properly with the DictionaryNameFinder... but, it will take some time.

Excellent... Is there any way for me to find out when you fix it? I mean, will you post anything here, or is there a JIRA I can start "watching"?

Also, if I manage to 'hack' the evaluator to take into account both the maxent model and dictionary findings to improve the statistics, is that something you would consider adding to OpenNLP? There is a JIRA for it from last year, which I voted for and commented on... I'm not sure if you've seen it...
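
Just to make concrete what I have in mind, a rough sketch follows. Nothing here is existing OpenNLP code: SpanMergeSketch, merge() and overlaps() are hypothetical names, spans are plain {start, end} token offsets, and letting the model win on overlaps is only one possible policy that I am assuming for the example.

```java
import java.util.ArrayList;
import java.util.List;

public class SpanMergeSketch {
    // Spans are {start, end} token offsets, end exclusive.
    static boolean overlaps(int[] a, int[] b) {
        return a[0] < b[1] && b[0] < a[1];
    }

    // Keeps every model span and adds only those dictionary spans
    // that do not overlap any model span (assumed policy: model wins).
    static List<int[]> merge(List<int[]> modelSpans, List<int[]> dictSpans) {
        List<int[]> merged = new ArrayList<>(modelSpans);
        for (int[] d : dictSpans) {
            boolean clash = false;
            for (int[] m : modelSpans) {
                if (overlaps(d, m)) {
                    clash = true;
                    break;
                }
            }
            if (!clash) {
                merged.add(d);
            }
        }
        // Return the combined spans in sentence order.
        merged.sort((x, y) -> Integer.compare(x[0], y[0]));
        return merged;
    }
}
```

The evaluator would then score the merged spans instead of the model's spans alone.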

Thanks again for the patch (regardless of the compromise)...:-)

Jim

