Both languages use a small circle like "。" to signal the end of a sentence.
-----Original Message-----
From: James Kosin
Sent: Thursday, March 22, 2012 1:52 PM
To: [email protected]
Subject: Re: Asian Sentence Detector Models
Hi,
Do Chinese and/or Japanese also use word endings to signal the end of a
sentence or thought, or do they use punctuation?
Thanks,
James
On 3/21/2012 10:43 PM, wl-gao wrote:
I am Chinese, living in Japan...
On 2012/03/22, at 8:42, James Kosin <[email protected]> wrote:
Don't worry,
Korean at least has patterns that mark the end of a sentence, or really
of a thought. Specific word endings signal that the thought is
complete.
James
On 3/21/2012 7:37 PM, Jörn Kottmann wrote:
I don't know, I never worked with Asian languages,
but it would of course be nice to improve our support in this area.
Especially the basic tasks like sentence detection and
tokenization are of great interest to many.
Jörn
On 03/22/2012 12:22 AM, James Kosin wrote:
Jorn,
If there isn't anything for Korean, I could put something together.
Only problem would be getting free text.
I can start looking if needed.
James
On 3/21/2012 2:38 PM, Jörn Kottmann wrote:
Here is a paper which describes Chinese sentence segmentation:
www.aclweb.org/anthology/P/P11/P11-2111.pdf
There they say that commas can be an end-of-sentence marker as well,
but they are ambiguous.
So we would need to add the comma as an EOS char, and
we should create a new feature generator.
Are there any free training data sets which could be used?
Jörn
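A minimal sketch, in Python, of what treating the comma as an ambiguous
end-of-sentence candidate could look like. It only lists candidate
positions and builds a few context features for a classifier to decide
on. The names EOS_CANDIDATES, candidate_positions and context_features,
and the feature strings themselves, are assumptions made for
illustration; this is not OpenNLP's feature generator.

```python
# Candidate end-of-sentence characters. The full-width comma is included
# as an ambiguous candidate; the exact set is an assumption.
EOS_CANDIDATES = frozenset("。！？，")


def candidate_positions(text):
    """Return the index of every character that might end a sentence."""
    return [i for i, ch in enumerate(text) if ch in EOS_CANDIDATES]


def context_features(text, pos, window=2):
    """Build simple string features around the candidate at `pos`.

    The feature names are hypothetical; a real feature generator would
    likely use richer context.
    """
    features = ["eos=" + text[pos]]
    # Characters immediately before and after the candidate.
    features.append("prefix=" + text[max(0, pos - window):pos])
    features.append("suffix=" + text[pos + 1:pos + 1 + window])
    if pos + 1 >= len(text):
        features.append("at_end")
    return features


if __name__ == "__main__":
    sample = "我来了，他走了。"
    for pos in candidate_positions(sample):
        print(pos, context_features(sample, pos))
```

A trained model would then label each candidate as a boundary or not,
which is where the ambiguity of the comma gets resolved.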
On 03/21/2012 03:34 PM, Joern Kottmann wrote:
Wikipedia says: "Languages like Japanese and Chinese have unambiguous
sentence-ending markers."
In this case we might be able to write a rule-based sentence detector
for these languages?
Jörn
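A rough sketch of such a rule-based detector, in Python, returning
character spans. It assumes that "。", "！" and "？" always end a
sentence and that closing quotes or brackets directly after them belong
to the same sentence. Both the character sets and the function name
sentence_spans are assumptions for illustration.

```python
# Assumed unambiguous sentence-ending marks.
EOS_CHARS = frozenset("。！？")
# Assumed closing quotes/brackets that stay attached to the sentence.
CLOSERS = frozenset("」』）”’")


def sentence_spans(text):
    """Return (start, end) character spans, end exclusive."""
    spans = []
    start = 0
    i = 0
    n = len(text)
    while i < n:
        if text[i] in EOS_CHARS:
            end = i + 1
            # Absorb repeated marks and closing quotes/brackets.
            while end < n and (text[end] in EOS_CHARS or text[end] in CLOSERS):
                end += 1
            spans.append((start, end))
            # Skip whitespace before the next sentence starts.
            while end < n and text[end].isspace():
                end += 1
            start = end
            i = end
        else:
            i += 1
    # Keep any trailing text that has no sentence-ending mark.
    if start < n and text[start:].strip():
        spans.append((start, len(text.rstrip())))
    return spans


if __name__ == "__main__":
    sample = "今日は晴れ。 明日は雨！？"
    for begin, end in sentence_spans(sample):
        print(sample[begin:end])
```

One known limit of these rules: a quoted sentence embedded in a longer
one is cut right after the closing quote, so a detector like this would
still need checking against real text.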
On Wed, Mar 21, 2012 at 3:18 PM, [email protected] wrote:
Hi
There is a Thai model for the sentence detector. I don't know who
created it, but someone from the list knows and can point to some
article about it.
What I can say is that OpenNLP had to be customized to work with Thai,
including the EOS characters, which are ' ' and '\n':
http://svn.apache.org/viewvc/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/sentdetect/lang/th/SentenceContextGenerator.java?view=markup
William
On Wed, Mar 21, 2012 at 8:05 AM, Jim - FooBar(); <[email protected]> wrote:
Basically you need to know the punctuation signs indicating the end of
a sentence, or find someone who does, then use a regex to split the
sentences at those signs. It's not going to be perfect: you may have to
go over the result once or twice with your own eyes to make sure
everything is OK before training.
Everything depends on the language and how ambiguous its punctuation is.
Jim
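A minimal sketch of that regex approach in Python. It writes one
sentence per line, which is the layout OpenNLP expects for sentence
detector training data. The mark set in EOS_MARKS and the function
names are assumptions; the marks should be checked by someone who knows
the language.

```python
import re

# Assumed sentence-ending marks; adjust per language.
EOS_MARKS = "。！？!?"

_CLASS = re.escape(EOS_MARKS)
# A sentence is a run of non-mark characters followed by any run of marks.
_SENTENCE_RE = re.compile(rf"[^{_CLASS}]+[{_CLASS}]*")


def split_sentences(text):
    """Split text after each run of end-of-sentence marks."""
    return [s.strip() for s in _SENTENCE_RE.findall(text) if s.strip()]


def to_training_text(text):
    """Format the split sentences one per line for manual review."""
    return "\n".join(split_sentences(text)) + "\n"


if __name__ == "__main__":
    print(to_training_text("今日は晴れです。明日は雨ですか？はい。"), end="")
```

The output still needs the manual pass described above, since the split
is purely mechanical.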
On 20/03/12 18:38, Jairo Sarabia wrote:
Hi all,
I see there aren't any Sentence Detect Models for Asian languages in
the OpenNLP repository, and I need them.
I have to train Sentence Detect Models for Chinese, Japanese and Korean,
but I don't know these languages.
How could I get training data files for these languages?
Thanks in advance!
Jairo Sarabia