Re: Asian Sentence Detector Models

wl.gao.tkl Wed, 21 Mar 2012 22:14:57 -0700

Both languages use a small circle like "。" to signal the end of a sentence.

-----Original Message-----
From: James Kosin
Sent: Thursday, March 22, 2012 1:52 PM
To: [email protected]
Subject: Re: Asian Sentence Detector Models

Hi,

Do Chinese and/or Japanese also use word endings to signal the end of a
sentence or thought, or do they use punctuation?

Thanks,
James

On 3/21/2012 10:43 PM, wl-gao wrote:

I am Chinese, living in Japan...

Sent from my iPod

On 2012/03/22, at 8:42, James Kosin <[email protected]> wrote:

Don't worry,
Korean at least has patterns that mark the end of a sentence, or really a
thought: specific word endings signal that the thought is complete.

James

On 3/21/2012 7:37 PM, Jörn Kottmann wrote:

I don't know, I never worked with Asian languages,
but it would of course be nice to improve our support in this area.
Especially the basic tasks like sentence detection and
tokenization are of great interest for many.

Jörn


On 03/22/2012 12:22 AM, James Kosin wrote:

Jörn,

If there isn't anything for Korean, I could put something together.
Only problem would be getting free text.
I can start looking if needed.

James

On 3/21/2012 2:38 PM, Jörn Kottmann wrote:

Here is a paper which describes Chinese sentence segmentation:
www.aclweb.org/anthology/P/P11/P11-2111.pdf

There they say that commas can be an end-of-sentence marker as well,
but they are ambiguous.

So we would need to add the comma as an EOS character, and
we should create a new feature generator to disambiguate it.

Are there any free training data sets which could be used?

Jörn
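To illustrate the idea of treating the comma as an ambiguous EOS character, here is a minimal Python sketch (not OpenNLP code) that lists every candidate position and collects a few context features a classifier could use. The function name, the feature names, and the window size are all assumptions made for illustration.

```python
# Candidate end-of-sentence characters. The fullwidth comma is included as an
# ambiguous candidate; a trained model would decide whether it ends a sentence.
EOS_CHARS = "。！？，"


def candidate_features(text, window=2):
    """Yield (index, features) for every candidate EOS character in text.

    The feature names here are hypothetical, chosen only for illustration.
    """
    for i, ch in enumerate(text):
        if ch not in EOS_CHARS:
            continue
        features = {
            "eos": ch,                                 # the candidate character
            "prefix": text[max(0, i - window):i],      # characters just before it
            "suffix": text[i + 1:i + 1 + window],      # characters just after it
            "is_comma": ch == "，",                     # marks the ambiguous case
        }
        yield i, features


if __name__ == "__main__":
    for index, feats in candidate_features("我去了，他没来。"):
        print(index, feats)
```

Each candidate would then be labelled as a sentence break or not in the training data, so the model learns when a comma really ends a sentence.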


On 03/21/2012 03:34 PM, Joern Kottmann wrote:

Wikipedia says: "Languages like Japanese and Chinese have unambiguous
sentence-ending markers."
In this case we might be able to write a rule based sentence detector
for these languages?

Jörn
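A rule-based detector of this kind could look roughly like the Python sketch below. It assumes that "。", "！" and "？" always end a sentence and that closing quotes or brackets directly after them belong to the same sentence; both the character sets and the function name are assumptions, and the rule will mis-split quotations embedded in a longer sentence.

```python
import re

# A sentence is either: any run of text, one or more terminators, and any
# closing quotes/brackets that follow; or trailing text with no terminator.
_SENTENCE = re.compile(r"[^。！？]*[。！？]+[」』）”’]*|[^。！？]+\Z")


def detect_sentence_spans(text):
    """Return (start, end) character offsets for each detected sentence."""
    spans = []
    for match in _SENTENCE.finditer(text):
        start, end = match.span()
        # Trim surrounding whitespace so spans cover only the sentence text.
        while start < end and text[start].isspace():
            start += 1
        while end > start and text[end - 1].isspace():
            end -= 1
        if start < end:
            spans.append((start, end))
    return spans


if __name__ == "__main__":
    sample = "「もう帰ります。」彼はそう言った。"
    for start, end in detect_sentence_spans(sample):
        print(start, end, sample[start:end])
```

Returning offsets rather than strings keeps the original text intact, which makes it easier to compare the rules against a statistical model later.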

On Wed, Mar 21, 2012 at 3:18 PM, [email protected] wrote:

    Hi,

    There is a Thai model for the sentence detector. I don't know who
    created it, but someone on the list knows and can point to an article
    about it.
    What I can say is that OpenNLP had to be customized to work with Thai,
    including the EOS characters, which are ' ' and '\n':

    http://svn.apache.org/viewvc/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/sentdetect/lang/th/SentenceContextGenerator.java?view=markup

    William


    On Wed, Mar 21, 2012 at 8:05 AM, Jim - FooBar(); <[email protected]> wrote:

Basically you need to know the punctuation signs indicating the end of a
sentence, or find someone who does, then use a regex to split the sentences
at those signs! It's not gonna be perfect - you may have to go over it once or
twice with your own eyes to make sure everything is ok before training.
Everything depends on the language and how ambiguous its punctuation is.

Jim

On 20/03/12 18:38, Jairo Sarabia wrote:

Hi all,

I see there aren't Sentence Detector models for Asian languages in the
OpenNLP repository, and I need them.
I have to train Sentence Detector models for Chinese, Japanese and Korean,
but I don't know these languages.
How could I get training data files for these languages?

Thanks in advance!

Jairo Sarabia
