[ 
https://issues.apache.org/jira/browse/OPENNLP-702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

rhead updated OPENNLP-702:
--------------------------

    Description: 
Here's my dictionary:
{code:xml}
<?xml version="1.0" encoding="UTF-8"?>
<dictionary case_sensitive="false">
  <entry>
    <token>vitamin</token>
    <token>b12</token>
  </entry>
  <entry>
    <token>vitamin</token>
    <token>b</token>
  </entry>
  <entry>
    <token>john</token>
    <token>doe</token>
  </entry>
  <entry>
    <token>john</token>
    <token>d</token>
  </entry>
</dictionary>
{code}

When run on this sentence using a DictionaryNameFinder: {quote}My name is john
doe, aka john d. I like vitamin b12.{quote}

The following names are found: {quote}john doe, john d, vitamin b{quote}

As you can see, when the 2nd token ends in a number, the longest match
(vitamin b12) is discarded and only the shorter entry (vitamin b) is reported.

(Originally from: 
http://mail-archives.apache.org/mod_mbox/opennlp-users/201406.mbox/%3C1402268906.31205.YahooMailNeo%40web121102.mail.ne1.yahoo.com%3E)
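
To make the expected behavior concrete, here is a minimal, self-contained sketch of greedy longest-match lookup over a token array. It is not OpenNLP's DictionaryNameFinder; the class name LongestMatchSketch and its find method are hypothetical and exist only for illustration. It also shows one possible explanation that this report does not confirm: if the tokenizer in use splits "b12" into "b" and "12", the two-token entry "vitamin b12" can never line up with the input, so only "vitamin b" matches.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; this is NOT OpenNLP's DictionaryNameFinder.
final class LongestMatchSketch {

    // The four dictionary entries from the report, each stored as a token list.
    static final Set<List<String>> ENTRIES = new HashSet<>(Arrays.asList(
            Arrays.asList("vitamin", "b12"),
            Arrays.asList("vitamin", "b"),
            Arrays.asList("john", "doe"),
            Arrays.asList("john", "d")));

    // Longest entry in the dictionary above, measured in tokens.
    static final int MAX_ENTRY_LENGTH = 2;

    // At each position, try the longest candidate first and keep the first hit.
    static List<String> find(String... tokens) {
        List<String> found = new ArrayList<>();
        int i = 0;
        while (i < tokens.length) {
            int matched = 0;
            int longest = Math.min(MAX_ENTRY_LENGTH, tokens.length - i);
            for (int len = longest; len >= 1 && matched == 0; len--) {
                List<String> candidate = new ArrayList<>();
                for (int k = i; k < i + len; k++) {
                    // Lower-case to mimic case_sensitive="false".
                    candidate.add(tokens[k].toLowerCase(Locale.ROOT));
                }
                if (ENTRIES.contains(candidate)) {
                    found.add(String.join(" ", candidate));
                    matched = len;
                }
            }
            // Skip past the match, or advance one token if nothing matched.
            i += Math.max(matched, 1);
        }
        return found;
    }

    public static void main(String[] args) {
        // "b12" kept as a single token: the longer entry is found.
        System.out.println(find("I", "like", "vitamin", "b12", "."));
        // "b12" split into "b" and "12" (assumed tokenizer behavior):
        // only the shorter entry can match.
        System.out.println(find("I", "like", "vitamin", "b", "12", "."));
    }
}
```

If the second call reflects what the tokenizer actually produces for this sentence, the result reported above would follow from tokenization rather than from the match-selection logic, which may be why both Name Finder and Tokenizer are listed as components.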

  was:
Here's my dictionary:

<?xml version="1.0" encoding="UTF-8"?>
<dictionary case_sensitive="false">
  <entry>
    <token>vitamin</token>
    <token>b12</token>
  </entry>
  <entry>
    <token>vitamin</token>
    <token>b</token>
  </entry>
  <entry>
    <token>john</token>
    <token>doe</token>
  </entry>
  <entry>
    <token>john</token>
    <token>d</token>
  </entry>
</dictionary>

When run on this sentence using a DictionaryNameFinder: My name is john doe,
aka john d. I like vitamin b12.

The following tokens are found: john doe, john d, vitamin b

As you can see, when the 2nd token ends in a number, the longest match is 
discarded.

(Originally from: 
http://mail-archives.apache.org/mod_mbox/opennlp-users/201406.mbox/%3C1402268906.31205.YahooMailNeo%40web121102.mail.ne1.yahoo.com%3E)


> DictionaryNameFinder Not Finding Longest Match When Name Ends in a Number
> -------------------------------------------------------------------------
>
>                 Key: OPENNLP-702
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-702
>             Project: OpenNLP
>          Issue Type: Bug
>          Components: Name Finder, Tokenizer
>         Environment: Darwin Kernel Version 12.5.0
>            Reporter: rhead



--
This message was sent by Atlassian JIRA
(v6.2#6252)
