Wikipedia says: "Languages like Japanese and Chinese have unambiguous sentence-ending markers." In that case, could we write a rule-based sentence detector for these languages?
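
A minimal sketch of what such a rule-based detector could look like, in Python rather than as part of OpenNLP. The set of end-of-sentence marks (`EOS_CHARS`), the set of closing quotes (`CLOSERS`), and the function name `split_sentences` are assumptions made for illustration; none of them come from OpenNLP, and a real detector would need the mark list checked by someone who knows the language.

```python
import re

# Assumption: these full-width marks are treated as unambiguous sentence enders.
# The list is illustrative and not taken from OpenNLP.
EOS_CHARS = "。！？"

# Assumption: closing quotes/brackets that directly follow an end mark
# are kept with the sentence they close.
CLOSERS = "」』）”’"

# Either: text up to and including one or more end marks (plus any closers),
# or: a trailing fragment that has no end mark at all.
_SENTENCE_RE = re.compile(
    r"[^{eos}]*[{eos}]+[{closers}]*|[^{eos}]+$".format(eos=EOS_CHARS, closers=CLOSERS)
)


def split_sentences(text):
    """Split text at the end-of-sentence marks, dropping empty pieces."""
    pieces = (m.group(0).strip() for m in _SENTENCE_RE.finditer(text))
    return [p for p in pieces if p]


if __name__ == "__main__":
    for sentence in split_sentences("今日は晴れです。明日は雨でしょうか？はい。"):
        print(sentence)
```

Output from a splitter like this would still need a manual pass before it is used as training data, since quoted speech and similar cases are split purely on the marks.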

Jörn

On Wed, Mar 21, 2012 at 3:18 PM, [email protected] <[email protected]> wrote:

> Hi
>
> There is a Thai model for the sentence detector. I don't know who created it,
> but someone from the list knows and can point to some article about it.
> What I can say is that OpenNLP had to be customized to work with Thai,
> including the EOS characters, which are ' ' and '\n'.
>
> http://svn.apache.org/viewvc/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/sentdetect/lang/th/SentenceContextGenerator.java?view=markup
>
> William
>
> On Wed, Mar 21, 2012 at 8:05 AM, Jim - FooBar(); <[email protected]> wrote:
>
> > Basically you need to know the punctuation signs indicating end of
> > sentence, or find someone who does. Then use a regex to split the sentences
> > at those signs! It's not gonna be perfect; you may have to go over it once
> > or twice with your own eyes to make sure everything is OK before training.
> > Everything depends on the language and how ambiguous its punctuation is.
> >
> > Jim
> >
> > On 20/03/12 18:38, Jairo Sarabia wrote:
> >
> >> Hi all,
> >>
> >> I see there aren't sentence detector models for Asian languages in the
> >> OpenNLP repository, and I need them.
> >> I have to train sentence detector models for Chinese, Japanese and Korean,
> >> but I don't know these languages.
> >> How could I get the training data files for these languages?
> >>
> >> Thanks in advance!
> >>
> >> Jairo Sarabia
