Aaah, OK, I see what you mean! That makes perfect sense. Thanks for
being so thorough in explaining.
I will check first thing tomorrow morning to see which version was
returning multi-word entities properly. In fact, I specifically
remember getting back both "folic acid" and "valproic acid" in the
little paragraph I posted. Anyway, I'll let you know how I get on.
Jim
On 15/03/12 23:48, James Kosin wrote:
Jim,
The hashcode is used to look up and compare items more quickly in Java.
Basically, if two hashcodes match, Java knows there is a strong
possibility the two entries are the same and goes on to compare them
with equals(). The downside is that the hashcode for a dictionary entry
is based on the entire set of tokens in the entry, and Java won't try
comparing two items if their hashcodes differ. It is a commonly used
optimization.
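To illustrate the point about hashcodes, here is a minimal sketch using plain java.util collections as a stand-in. It is not the OpenNLP Dictionary implementation; the class name HashLookupSketch and the helper buildEntries() are made up for the example, and entries are modelled simply as lists of tokens.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class HashLookupSketch {

    // Hypothetical stand-in for a dictionary: each entry is a list of tokens.
    static Set<List<String>> buildEntries() {
        Set<List<String>> entries = new HashSet<>();
        entries.add(Arrays.asList("folic", "acid"));
        entries.add(Arrays.asList("valproic", "acid"));
        return entries;
    }

    public static void main(String[] args) {
        Set<List<String>> entries = buildEntries();

        List<String> full = Arrays.asList("folic", "acid");
        List<String> firstTokenOnly = Arrays.asList("folic");

        // List.hashCode() is computed from every element in the list,
        // so the two-token entry and its one-token prefix hash differently.
        System.out.println("hash of full entry:  " + full.hashCode());
        System.out.println("hash of first token: " + firstTokenOnly.hashCode());

        // The set only calls equals() on candidates found via the hash,
        // so the one-token prefix is never matched against the full entry.
        System.out.println("contains full entry:  " + entries.contains(full));
        System.out.println("contains first token: " + entries.contains(firstTokenOnly));
    }
}
```

In this sketch a lookup with only {"folic"} fails even though {"folic", "acid"} is present, which is the same kind of miss as the one discussed here: a hash-based lookup on a partial token list gives no hint that a longer entry starts with those tokens.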
In 1.5.3 we fixed a few more issues with the Dictionary to properly
handle the words and case sensitivity. I also made some changes to take
out a small section in the DictionaryNameFinder's find() method that
used an Index created to determine if we should look and add another
word. I may have refactored this incorrectly and need to come up with a
better solution.
We have several possibilities to fix and address this issue. However,
some of them could make this an N^2 problem again for the code. I'm
trying to avoid that and fix the problem correctly. Maybe I shouldn't
have used hashcode so freely, but it was how I found the problem. The
hashcode for {"folic", "acid"} is different from that for {"folic"}, so
the Dictionary doesn't bother comparing the two. One possibility is to
have the entries for {"folic", "acid"} and {"folic"} be the same; the
only drawback is that we lose resolution in finding specific names.
Another possible solution would be to keep a max_token_count for the
Dictionary to represent the number of tokens that the
DictionaryNameFinder would try to put together in the find() method...
limiting the greediness to the longest token-list in the dictionary.
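A rough sketch of the max_token_count idea, under stated assumptions: the dictionary is modelled as a Set of token lists, and the class MaxTokenCountSketch with its methods maxTokenCount() and find() is hypothetical, not the actual DictionaryNameFinder code. At each position it tries the longest candidate first, but never more tokens than the longest dictionary entry.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class MaxTokenCountSketch {

    // Number of tokens in the longest entry (the "max_token_count").
    static int maxTokenCount(Set<List<String>> entries) {
        int max = 0;
        for (List<String> entry : entries) {
            max = Math.max(max, entry.size());
        }
        return max;
    }

    // Returns {start, endExclusive} index pairs for the longest match at each position.
    static List<int[]> find(String[] tokens, Set<List<String>> entries) {
        List<String> tokenList = Arrays.asList(tokens);
        int maxLen = maxTokenCount(entries);
        List<int[]> spans = new ArrayList<>();
        int start = 0;
        while (start < tokens.length) {
            int matchedEnd = -1;
            // Never look further ahead than the longest dictionary entry.
            int limit = Math.min(tokens.length, start + maxLen);
            for (int end = limit; end > start; end--) {
                // The whole candidate is hashed and compared as one unit.
                if (entries.contains(tokenList.subList(start, end))) {
                    matchedEnd = end;
                    break;
                }
            }
            if (matchedEnd != -1) {
                spans.add(new int[] {start, matchedEnd});
                start = matchedEnd; // continue after the matched name
            } else {
                start++;
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        Set<List<String>> entries = new HashSet<>();
        entries.add(Arrays.asList("folic", "acid"));
        entries.add(Arrays.asList("valproic", "acid"));

        String[] tokens = "take folic acid and valproic acid daily".split(" ");
        for (int[] span : find(tokens, entries)) {
            System.out.println(span[0] + ".." + span[1]);
        }
    }
}
```

With this approach the number of lookups per sentence is bounded by the sentence length times max_token_count, so it avoids going back to an N^2 search while still finding multi-word entries. Case handling is left out of the sketch.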
Could you check with 1.5.2 to see if you can find multi-word entities,
with or without case sensitivity, to verify? If so, that narrows it down
to the changes I made in the trunk.
Thanks,
James
On 3/15/2012 5:39 AM, Jim - FooBar(); wrote:
So the problem is all in the hashcode...
Does that relate to the question I posted yesterday? I'm a bit
confused. How is .hashCode() related to not finding multi-word
entities? Also, what happened between versions 1.5.2 and the 1.5.3
snapshot? I do remember being able to find multi-word entities at
some point (I think with 1.5.2).
Jim