Re: Fwd: Re: Some questions about Dictionary and DictionaryNameFinder

Jim - FooBar(); Fri, 24 Feb 2012 09:21:30 -0800

Aaaaa ok, so instead of

<entry>
<token>Peginterferon alfa-2a</token>
</entry>


I should have (2 tokens):

<entry>
<token>Peginterferon</token>
<token>alfa-2a</token>
</entry>

I see... I'll produce a new Dictionary and let you know how I got on. Thanks a lot!
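For illustration only, here is a minimal Python sketch of that conversion. It assumes plain whitespace tokenization, which has to agree with how the text passed to the name finder is tokenized. The function name build_dictionary_xml is hypothetical and not part of OpenNLP; the DictionaryBuilder tool mentioned further down the thread remains the supported route.

```python
from xml.sax.saxutils import escape


def build_dictionary_xml(names, case_sensitive=False):
    """Return dictionary XML with one <token> per whitespace-separated word.

    Hypothetical helper: it only mimics the layout of the output.xml shown
    in this thread and assumes whitespace tokenization.
    """
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<dictionary case_sensitive="%s">' % str(case_sensitive).lower(),
    ]
    for name in names:
        tokens = name.split()
        if not tokens:
            continue  # skip blank lines in the input
        lines.append("<entry>")
        for token in tokens:
            # Escape &, < and > so the output stays well-formed XML.
            lines.append("<token>%s</token>" % escape(token))
        lines.append("</entry>")
    lines.append("</dictionary>")
    return "\n".join(lines)


xml_text = build_dictionary_xml(["Cetuximab", "Peginterferon alfa-2a"])
print(xml_text)
```

The multi-word name ends up as one entry with two token elements instead of a single token containing a space.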


Jim


On 24/02/12 12:48, Jörn Kottmann wrote:

Ahh, yes that is why it does not match multi-token entries.
In the posted dictionary two tokens are encoded as one.

Jörn

On 02/24/2012 12:39 PM, [email protected] wrote:

Jim,

The format is wrong. We already asked you to try using the
DictionaryBuilder tool:

input.txt:
--------
Lepirudin
Cetuximab
Dornase Alfa
Denileukin diftitox
Etanercept
Bivalirudin
Leuprolide
Peginterferon alfa-2a
Alteplase
--------

command:

bin/opennlp DictionaryBuilder -inputFile input.txt -outputFile output.xml -encoding <encoding of inputFile>

output.xml
------
<?xml version="1.0" encoding="UTF-8"?>
<dictionary case_sensitive="false">
<entry>
<token>Etanercept</token>
</entry>
<entry>
<token>Dornase</token>
<token>Alfa</token>
</entry>
<entry>
<token>Peginterferon</token>
<token>alfa-2a</token>
</entry>
<entry>
<token>Alteplase</token>
</entry>
<entry>
<token>Leuprolide</token>
</entry>
<entry>
<token>Denileukin</token>
<token>diftitox</token>
</entry>
<entry>
<token>Bivalirudin</token>
</entry>
<entry>
<token>Cetuximab</token>
</entry>
<entry>
<token>Lepirudin</token>
</entry>
</dictionary>
------

Regards,
William

On Fri, Feb 24, 2012 at 8:38 AM, Jim - FooBar(); <[email protected]> wrote:

On 24/02/12 05:09, James Kosin wrote:
Jim,

Maybe the problem is how you have created the dictionary.  The
DictionaryNameFinder's find() method is a greedy method that will match
as many tokens as possible.
If it isn't matching more than one token, then that is probably all the
dictionary contains per entry.
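
To make the greedy matching described above concrete, here is a small Python sketch of a longest-match lookup over token sequences. It is not OpenNLP's implementation; find_names and the sample data are assumptions made up for illustration. It shows why an entry whose single token contains a space can never match tokenized text.

```python
def find_names(tokens, entries, case_sensitive=False):
    """Greedy longest-match lookup.

    tokens  -- the tokenized sentence (list of strings)
    entries -- dictionary entries, each a tuple of tokens
    Returns (start, end) spans with end exclusive.
    Illustrative sketch only, not the OpenNLP algorithm.
    """
    normalize = (lambda s: s) if case_sensitive else str.lower
    known = {tuple(normalize(t) for t in entry) for entry in entries}
    if not known:
        return []
    max_len = max(len(entry) for entry in known)

    spans = []
    i = 0
    while i < len(tokens):
        match_end = None
        # Try the longest candidate first, then shrink the window.
        for j in range(min(len(tokens), i + max_len), i, -1):
            if tuple(normalize(t) for t in tokens[i:j]) in known:
                match_end = j
                break
        if match_end is None:
            i += 1
        else:
            spans.append((i, match_end))
            i = match_end
    return spans


sentence = "Patients received dornase alfa and Cetuximab".split()

# One token per word: the two-word name can match.
good_entries = [("Dornase", "Alfa"), ("Cetuximab",)]
# Two words squeezed into one token: no single input token equals it.
bad_entries = [("Dornase Alfa",), ("Cetuximab",)]

good_spans = find_names(sentence, good_entries)
bad_spans = find_names(sentence, bad_entries)
print(good_spans, bad_spans)
```

With one token per word the lookup finds both names; with the squeezed entry only the single-word name is found, which is the symptom reported in this thread.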

Look at the simple example in DictionaryNameFinderTest.java, in the
opennlp.tools.namefind test package of the source distribution.

There is a good example there.

James

Hi James,

Well, I created the dictionary manually... basically I extracted all the
drug names from drugbank.xml and wrote them to a txt file (one entry per
line). Then I processed that text file to produce the XML version of the
proper dictionary. What I have after doing all that is a file with
contents of the type:

<?xml version="1.0" encoding="UTF-8"?>
<dictionary case_sensitive="false">
<entry><token>Lepirudin</token></entry>
<entry><token>Cetuximab</token></entry>
<entry><token>Dornase Alfa</token></entry>
<entry><token>Denileukin diftitox</token></entry>
<entry><token>Etanercept</token></entry>
<entry><token>Bivalirudin</token></entry>
<entry><token>Leuprolide</token></entry>
<entry><token>Peginterferon alfa-2a</token></entry>
<entry><token>Alteplase</token></entry>
......
......
......etc etc

As you can see, some drugs are multi-word entities, and the first
character of each word is capitalized. Whenever I call the find() method,
all I'm getting are the exact matches, which means that the case-sensitivity
setting does not work either!!! For example, I'm getting "Cetuximab" but not
"cetuximab"... so the problem is twofold. Firstly, and more importantly, I
cannot find multi-word entities even though they do exist in the dictionary
and the test data. Secondly, even though I'm setting case_sensitive="false"
in both the XML file and the constructor of the DictionaryNameFinder, the
actual results that I'm getting are always case-sensitive!!!

Can you see any problems with the xml file?

Jim
