Tri,

Unfortunately, it depends on the input language.  The only thing I've
found is that it may be better to identify the tokens that are
punctuation.  One hint is that most punctuation tokens are a single
character wide.  But again, that may not hold, depending on the
encoding and the punctuation.  Words are usually a bit longer.
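
A rough sketch of the subtraction idea, in Python for brevity and
independent of both OpenNLP and Lucene.  It assumes a token counts as a
word when it contains at least one letter or digit (a Unicode-aware
variant of the pattern you quoted) and that everything else is
punctuation.  The helper names and the sample tokens are made up for
illustration.

```python
import re

# Assumption: a token is a "word" if it contains at least one letter or
# digit in any script; all other tokens are treated as punctuation.
WORD = re.compile(r"[^\W_]")


def word_positions(tokens):
    """Map each token index to its word position, or None for punctuation."""
    positions = []
    count = 0
    for token in tokens:
        if WORD.search(token):
            positions.append(count)
            count += 1
        else:
            positions.append(None)
    return positions


def span_to_word_positions(tokens, start, end):
    """Convert a token span [start, end) into the word positions it covers."""
    positions = word_positions(tokens)
    return [p for p in positions[start:end] if p is not None]


# Hypothetical tokenizer output; a name finder would report (9, 11) for
# "Bill Gates" here, counting the two commas as tokens.
tokens = ["Barack", "Obama", ",", "president", "of", "the", "US", ",",
          "met", "Bill", "Gates", "."]
print(span_to_word_positions(tokens, 9, 11))
```

The result only lines up with the positions on the Lucene side if the
analyzer there splits words the same way and does not drop or skip stop
words, so treat it as a starting point rather than a guaranteed mapping.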

James

On 11/5/2011 2:14 PM, Tri Nguyen wrote:
> Thank you James,
> I skip the tokens that do not match the pattern ".*[A-Za-z0-9]+.*",
> and in the cases I checked it works.
> A token that does not match that pattern may be punctuation. Is that
> pattern enough to cover a keyword?
> Can we integrate Lucene and OpenNLP so that the keyword position and
> the Named Entity position are compatible?
>
>
> On Sun, Nov 6, 2011 at 12:22 AM, James Kosin <[email protected]> wrote:
>
>> Tri,
>>
>> You could just subtract the number of punctuation tokens from the
>> offsets you get.
>> On 11/5/2011 1:08 PM, Tri Nguyen wrote:
>>> On Sat, Nov 5, 2011 at 11:30 PM, Jörn Kottmann <[email protected]> wrote:
>>>> On 11/5/11 4:53 PM, Tri Nguyen wrote:
>>>>
>>>>> Obama is correct, but Bill Gates is not, since NameFinderME returns
>>>>> the token index (position in the token array), not the keyword
>>>>> position (the keyword position in the text). I want to make it work
>>>>> with the keyword position in Lucene.
>>>>>
>>>> What is a keyword position?
>>>>
>>> It is the order of the word in the text.
>>> Ex:
>>> Barack: 0
>>> Obama: 1
>>> president: 3
>>> US: 5
>>> he: 6
>>> 1961: 11
>>> Bill: 12
>>>
>>>> Jörn
>>>>
>>
