If I broke something, let me know so I can restore the original
functionality, even if it takes some minor changes to incorporate.
I'll go ahead and add a variable to the dictionary to store the longest
token count, just in case I go that route. The Index was a good idea,
but it doesn't carry the dictionary's case sensitivity over to its
members.
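
Here is a rough sketch of the longest-token-count idea, not the actual
patch. BoundedDictionarySketch, put(), getMaxTokenCount() and find() are
hypothetical names made up for illustration, and tokens are matched
exactly, so case handling is left out entirely.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a dictionary; not the OpenNLP Dictionary class.
class BoundedDictionarySketch {
    private final Set<List<String>> entries = new HashSet<>();
    private int maxTokenCount = 0;

    // Store an entry and remember the length of the longest one seen so far.
    void put(String... tokens) {
        entries.add(Arrays.asList(tokens));
        maxTokenCount = Math.max(maxTokenCount, tokens.length);
    }

    int getMaxTokenCount() {
        return maxTokenCount;
    }

    // Returns {start, end} pairs (end exclusive) for the longest match at each position.
    List<int[]> find(String[] sentence) {
        List<String> tokens = Arrays.asList(sentence);
        List<int[]> found = new ArrayList<>();
        int start = 0;
        while (start < tokens.size()) {
            // Never try a candidate longer than the longest dictionary entry.
            int longest = Math.min(maxTokenCount, tokens.size() - start);
            int matchedEnd = -1;
            for (int len = longest; len >= 1; len--) {
                if (entries.contains(tokens.subList(start, start + len))) {
                    matchedEnd = start + len;
                    break;
                }
            }
            if (matchedEnd != -1) {
                found.add(new int[] {start, matchedEnd});
                start = matchedEnd; // continue after the matched name
            } else {
                start++;
            }
        }
        return found;
    }
}
```

With entries {"folic", "acid"}, {"valproic", "acid"} and {"folic"}, the
longest count is 2, so each position costs at most two lookups and the
work stays proportional to sentence length times the longest entry
rather than growing as N^2.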
James
On 3/15/2012 8:06 PM, Jim - FooBar(); wrote:
> aaaaa ok I see what you mean! that makes perfect sense - thanks for
> being so thorough in explaining...
>
> I will check first thing tomorrow morning to see which version was the
> one returning multi-word entities properly... In fact I specifically
> remember getting back both "folic acid" and "valproic acid" in the
> little paragraph I posted... anyway, I'll let you know how I get on...
>
> Jim
>
>
>
> On 15/03/12 23:48, James Kosin wrote:
>> Jim,
>>
>> Hash codes let Java's hash-based collections look up and compare items
>> more quickly. Basically, if two hash codes match, there is a strong
>> possibility the two entries are the same, and only then is a full
>> comparison made. The bad side is that the hash code for a dictionary
>> entry is based on the entire set of tokens in the entry. This means
>> Java won't try comparing two items if their hash codes differ. It is
>> a commonly used optimization.
>>
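
A minimal sketch of the effect described above, assuming the dictionary
behaves like a java.util.HashSet of token lists. That is an assumption
for illustration only; TokenListHashDemo and buildDictionary() are
hypothetical names and the real Dictionary class is more involved.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical demo class; not part of OpenNLP.
class TokenListHashDemo {
    // Build a hash-based set whose members are whole token lists.
    static Set<List<String>> buildDictionary() {
        Set<List<String>> dict = new HashSet<>();
        dict.add(Arrays.asList("folic", "acid"));
        dict.add(Arrays.asList("valproic", "acid"));
        return dict;
    }

    public static void main(String[] args) {
        Set<List<String>> dict = buildDictionary();
        List<String> partial = Arrays.asList("folic");
        List<String> full = Arrays.asList("folic", "acid");

        // A List's hash code is computed from every element, so these differ.
        System.out.println(partial.hashCode() == full.hashCode()); // false

        // The one-token prefix is not found; only the complete entry is.
        System.out.println(dict.contains(partial)); // false
        System.out.println(dict.contains(full));    // true
    }
}
```

So a lookup on {"folic"} alone gives no hint that a longer entry
starting with "folic" exists, which is why the finder needs some other
signal to decide whether to append the next word.
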
>> In 1.5.3 we fixed a few more issues with the Dictionary so that it
>> properly handles words and case sensitivity. I also removed a small
>> section of the DictionaryNameFinder's find() method that used an Index
>> to determine whether we should look ahead and add another word. I may
>> have refactored this wrong and need to come up with a better
>> solution.
>>
>> We have several possibilities to fix and address this issue. However,
>> some of them involve possibly making this an N^2 problem again for the
>> code. I'm trying to avoid that and fix the problem correctly. Maybe I
>> shouldn't have used the hash code so freely, but it was how I found the
>> problem. The hash code for {"folic", "acid"} is different from that for
>> {"folic"}, so the Dictionary doesn't bother comparing the two. One
>> possibility is to have the entries for {"folic", "acid"} and {"folic"}
>> hash the same; the only drawback is that we lose resolution in finding
>> specific names.
>> Another possible solution would be to keep a max_token_count for the
>> Dictionary, representing the number of tokens that the
>> DictionaryNameFinder would try to put together in the find() method,
>> limiting the greediness to the longest token list in the dictionary.
>>
>> Could you check with 1.5.2 to see whether you can find multi-word
>> entries, with or without case sensitivity? If so, that narrows the
>> problem down to the changes I made in the trunk.
>>
>> Thanks,
>> James
>>
>> On 3/15/2012 5:39 AM, Jim - FooBar(); wrote:
>>> So the problem is all in the hashcode...
>>>
>>> Does that relate to the question I posted yesterday? I'm a bit
>>> confused... How is .hashCode() related to not finding multi-word
>>> entities? And also, what happened between versions 1.5.2 and the
>>> 1.5.3 snapshot? I do remember being able to find multi-word entities
>>> at some point (I think with 1.5.2)...
>>>
>>> Jim
>>>
>>>
>