Jim,

The format is wrong: each word of a multi-word entry needs its own
<token> element. We already asked you to try using the DictionaryBuilder
tool:

input.txt:
--------
Lepirudin
Cetuximab
Dornase Alfa
Denileukin diftitox
Etanercept
Bivalirudin
Leuprolide
Peginterferon alfa-2a
Alteplase
--------

command:
bin/opennlp DictionaryBuilder -inputFile input.txt -outputFile output.xml -encoding <encoding of inputFile>

output.xml
------
<?xml version="1.0" encoding="UTF-8"?>
<dictionary case_sensitive="false">
  <entry>
    <token>Etanercept</token>
  </entry>
  <entry>
    <token>Dornase</token>
    <token>Alfa</token>
  </entry>
  <entry>
    <token>Peginterferon</token>
    <token>alfa-2a</token>
  </entry>
  <entry>
    <token>Alteplase</token>
  </entry>
  <entry>
    <token>Leuprolide</token>
  </entry>
  <entry>
    <token>Denileukin</token>
    <token>diftitox</token>
  </entry>
  <entry>
    <token>Bivalirudin</token>
  </entry>
  <entry>
    <token>Cetuximab</token>
  </entry>
  <entry>
    <token>Lepirudin</token>
  </entry>
</dictionary>
------

Regards,
William

On Fri, Feb 24, 2012 at 8:38 AM, Jim - FooBar(); <[email protected]> wrote:

> On 24/02/12 05:09, James Kosin wrote:
>
>> Jim,
>>
>> Maybe the problem is how you have created the dictionary. The
>> DictionaryNameFinder's find() method is a greedy method that will match
>> as many tokens as possible.
>> If it isn't matching more than one token, then that is probably all the
>> dictionary contains per entry.
>>
>> Look at the simple example in the test packages for
>> opennlp.tools.namefind DictionaryNameFinderTest.java in the source
>> packages.
>>
>> It has a good example.
>>
>> James
>>
>
> Hi James,
>
> Well, I created the dictionary manually. Basically, I extracted all the
> drug names from drugbank.xml and wrote them to a txt file (one entry per
> line). Then I processed that text file in order to produce the xml version
> of the proper dictionary.
> What I have after doing all that is a file with contents of the type:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <dictionary case_sensitive="false">
> <entry><token>Lepirudin</token></entry>
> <entry><token>Cetuximab</token></entry>
> <entry><token>Dornase Alfa</token></entry>
> <entry><token>Denileukin diftitox</token></entry>
> <entry><token>Etanercept</token></entry>
> <entry><token>Bivalirudin</token></entry>
> <entry><token>Leuprolide</token></entry>
> <entry><token>Peginterferon alfa-2a</token></entry>
> <entry><token>Alteplase</token></entry>
> ......
> ......
> ......etc etc
>
> As you can see, some drugs are multi-word entities, and the first
> character of each word is capitalized. Whenever I call the find() method,
> all I'm getting are the exact matches, which means that the
> case-sensitivity setting does not work either!!! For example, I'm getting
> "Cetuximab" but not "cetuximab". So the problem is twofold. Firstly, and
> more importantly, I cannot find multi-word entities even though they do
> exist in the dictionary and the test data. Secondly, even though I'm
> setting case_sensitive="false" in both the xml file and the constructor
> of the DictionaryNameFinder, the actual results that I'm getting are
> always case-sensitive!!!
>
> Can you see any problems with the xml file?
>
> Jim
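For illustration, here is a minimal Java sketch of the text-to-xml conversion step, assuming entries are split into tokens on whitespace. It is not the DictionaryBuilder implementation and does not use the OpenNLP API; the names DictionaryXmlSketch, toEntry and toDictionaryXml are made up for this example. It only shows the shape of the fix: a multi-word name becomes one <entry> with one <token> per word.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of OpenNLP: converts one-entry-per-line text
// into dictionary XML with one <token> element per whitespace-separated word.
class DictionaryXmlSketch {

    // Escape the characters that are special in XML text content.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Build one <entry>; assumption: tokens are separated by whitespace.
    static String toEntry(String line) {
        StringBuilder sb = new StringBuilder("<entry>");
        for (String token : line.trim().split("\\s+")) {
            sb.append("<token>").append(escape(token)).append("</token>");
        }
        return sb.append("</entry>").toString();
    }

    // Wrap all non-blank lines in a <dictionary> element.
    static String toDictionaryXml(List<String> lines) {
        StringBuilder sb = new StringBuilder();
        sb.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        sb.append("<dictionary case_sensitive=\"false\">\n");
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines in the input file
            }
            sb.append(toEntry(line)).append('\n');
        }
        sb.append("</dictionary>\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> names = new ArrayList<>();
        names.add("Lepirudin");
        names.add("Dornase Alfa");
        names.add("Peginterferon alfa-2a");
        System.out.print(toDictionaryXml(names));
    }
}
```

With this sketch, "Dornase Alfa" ends up as a single entry holding two tokens, which is the structure shown in output.xml above, rather than one token containing a space.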
