I don't know; I have never worked with Asian languages,
but it would of course be nice to improve our support in this area.
Basic tasks such as sentence detection and tokenization
are of particular interest to many users.
Jörn
On 03/22/2012 12:22 AM, James Kosin wrote:
Jorn,
If there isn't anything for Korean, I could put something together.
Only problem would be getting free text.
I can start looking if needed.
James
On 3/21/2012 2:38 PM, Jörn Kottmann wrote:
Here is a paper which describes Chinese sentence segmentation:
www.aclweb.org/anthology/P/P11/P11-2111.pdf
There they say that commas can be an end-of-sentence marker as well,
but they are ambiguous.
So we would need to add the comma as an EOS character and
create a new feature generator to handle it.
Are there any free training data sets which could be used?
Jörn
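
A minimal sketch of the idea above, in plain Java and without using the OpenNLP API: the full-width comma is treated as an additional candidate end-of-sentence character, and a few context features are emitted for each candidate so that a classifier could later decide whether the comma really ends a sentence. The class name EosCandidateSketch, the feature names, and the chosen character set are hypothetical and only for illustration; this is not OpenNLP's actual feature generator.

```java
import java.util.ArrayList;
import java.util.List;

class EosCandidateSketch {

    // Assumed candidate set: ideographic full stop, full-width '!' and '?',
    // plus the full-width comma, which is the ambiguous case.
    static final String EOS_CHARS = "\u3002\uFF01\uFF1F\uFF0C";

    /** Returns the index of every character that might end a sentence. */
    public static List<Integer> candidatePositions(String text) {
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            if (EOS_CHARS.indexOf(text.charAt(i)) >= 0) {
                positions.add(i);
            }
        }
        return positions;
    }

    /** Hypothetical context features for the candidate at index pos. */
    public static List<String> features(String text, int pos) {
        List<String> feats = new ArrayList<>();
        feats.add("eos=" + text.charAt(pos));
        feats.add("prev=" + (pos > 0 ? String.valueOf(text.charAt(pos - 1)) : "<start>"));
        feats.add("next=" + (pos + 1 < text.length()
                ? String.valueOf(text.charAt(pos + 1)) : "<end>"));
        return feats;
    }

    public static void main(String[] args) {
        // A short Chinese example with one comma and one full stop,
        // written with Unicode escapes to keep the source ASCII.
        String text = "\u6211\u6765\u4E86\uFF0C\u4ED6\u8D70\u4E86\u3002";
        for (int pos : candidatePositions(text)) {
            System.out.println(pos + " " + features(text, pos));
        }
    }
}
```

Each candidate would still need a label (sentence end or not) from annotated training data, which is why the question about free data sets matters.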
On 03/21/2012 03:34 PM, Joern Kottmann wrote:
Wikipedia says: "Languages like Japanese and Chinese have unambiguous
sentence-ending markers."
In this case we might be able to write a rule-based sentence detector
for these languages?
Jörn
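
A rule-based detector along these lines could be sketched as follows. This is only an illustration under stated assumptions: the class name RuleBasedCjkSplitter is hypothetical, the terminator set (ideographic full stop, full-width exclamation and question marks) and the set of closing quotes and brackets are assumed rather than taken from any specification, and no OpenNLP API is used.

```java
import java.util.ArrayList;
import java.util.List;

class RuleBasedCjkSplitter {

    // Assumed sentence-ending marks: ideographic full stop, full-width '!' and '?'.
    private static final String TERMINATORS = "\u3002\uFF01\uFF1F";
    // Assumed closing quotes and brackets that stay attached to the sentence.
    private static final String CLOSERS = "\u300D\u300F\u201D\uFF09";

    /** Splits text after each run of terminators and trailing closers. */
    public static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        int start = 0;
        int i = 0;
        while (i < text.length()) {
            if (TERMINATORS.indexOf(text.charAt(i)) >= 0) {
                int end = i + 1;
                // Absorb repeated terminators and any closing quote or bracket.
                while (end < text.length()
                        && (TERMINATORS.indexOf(text.charAt(end)) >= 0
                            || CLOSERS.indexOf(text.charAt(end)) >= 0)) {
                    end++;
                }
                String sentence = text.substring(start, end).trim();
                if (!sentence.isEmpty()) {
                    sentences.add(sentence);
                }
                start = end;
                i = end;
            } else {
                i++;
            }
        }
        // Keep any trailing text that has no terminator.
        String rest = text.substring(start).trim();
        if (!rest.isEmpty()) {
            sentences.add(rest);
        }
        return sentences;
    }

    public static void main(String[] args) {
        // Two Japanese sentences, written with Unicode escapes to keep the source ASCII.
        String text = "\u4ECA\u65E5\u306F\u6674\u308C\u3067\u3059\u3002"
                + "\u660E\u65E5\u306F\u96E8\u3067\u3059\u304B\uFF1F";
        for (String sentence : split(text)) {
            System.out.println(sentence); // one sentence per line
        }
    }
}
```

Printing one sentence per line is deliberate: the output could serve as a first draft of training data, which would still need a manual check before use.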
On Wed, Mar 21, 2012 at 3:18 PM, [email protected] wrote:
Hi
There is a Thai model for the sentence detector. I don't know who created it,
but someone on the list knows and can point to an article about it.
What I can say is that OpenNLP had to be customized to work with Thai,
including the EOS characters, which are ' ' and '\n':
http://svn.apache.org/viewvc/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/sentdetect/lang/th/SentenceContextGenerator.java?view=markup
William
On Wed, Mar 21, 2012 at 8:05 AM, Jim - FooBar(); <[email protected]> wrote:
> Basically you need to know the punctuation signs indicating end of
> sentence or find someone who does... then use regex to split the sentences
> at those signs! It's not gonna be perfect - you may have to pass it once or
> twice with your own eyes to make sure everything is ok before training.
> Everything depends on the language and how ambiguous its punctuation is.
>
>
> Jim
>
> On 20/03/12 18:38, Jairo Sarabia wrote:
>
>> Hi all,
>>
>> I see there aren't any Sentence Detect Models for Asian languages in the
>> OpenNLP repository, and I need them.
>> I have to train Sentence Detect Models for Chinese, Japanese and Korean,
>> but I don't know these languages.
>> How could I get the training data files for these languages?
>>
>> Thanks in advance!
>>
>> Jairo Sarabia
>>
>>
>