Don't worry, Korean at least has patterns that mark the end of a sentence, or really of a thought. Specific word endings signal that the thought is complete.

James

On 3/21/2012 7:37 PM, Jörn Kottmann wrote:
> I don't know, I never worked with Asian languages,
> but it would of course be nice to improve our support in this area.
> Especially the basic tasks like sentence detection and
> tokenization are of great interest for many.
>
> Jörn
>
>
> On 03/22/2012 12:22 AM, James Kosin wrote:
>> Jorn,
>>
>> If there isn't anything for Korean, I could put something together.
>> Only problem would be getting free text.
>> I can start looking if needed.
>>
>> James
>>
>> On 3/21/2012 2:38 PM, Jörn Kottmann wrote:
>>> Here is a paper which describes Chinese sentence segmentation:
>>> www.aclweb.org/anthology/P/P11/P11-2111.pdf
>>>
>>> There they say that commas can be an end-of-sentence marker as well,
>>> but they are ambiguous.
>>>
>>> So we would need to add it as an eos char and
>>> we should create a new feature generator.
>>>
>>> Are there any free training data sets which could be used?
>>>
>>> Jörn
>>>
>>>
>>> On 03/21/2012 03:34 PM, Joern Kottmann wrote:
>>>> Wikipedia says: "Languages like Japanese and Chinese have unambiguous
>>>> sentence-ending markers."
>>>> In this case we might be able to write a rule based sentence detector
>>>> for these languages?
>>>>
>>>> Jörn
>>>>
>>>> On Wed, Mar 21, 2012 at 3:18 PM, [email protected]
>>>> <mailto:[email protected]> <[email protected]
>>>> <mailto:[email protected]>> wrote:
>>>>
>>>>     Hi
>>>>
>>>>     There is a Thai model for sentence detector. I don't know who created it,
>>>>     but someone from the list knows and can point to some article about it.
>>>>     What I can say is that OpenNLP had to be customized to work with Thai,
>>>>     including the EOS Characters that are ' ' and '\n'
>>>>
>>>>     http://svn.apache.org/viewvc/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/sentdetect/lang/th/SentenceContextGenerator.java?view=markup
>>>>
>>>>     William
>>>>
>>>>
>>>>     On Wed, Mar 21, 2012 at 8:05 AM, Jim - FooBar();
>>>>     <[email protected]<mailto:[email protected]>>wrote:
>>>>
>>>>     > Basically you need to know the punctuation signs indicating end of
>>>>     > sentence or find someone who does...then use regex to split the sentences
>>>>     > at those signs! it's not gonna be perfect - you may have to pass it once or
>>>>     > twice with your own eyes to make sure everything is ok before training.
>>>>     > everything depends on the language and how ambiguous punctuation it has.
>>>>     >
>>>>     >
>>>>     > Jim
>>>>     >
>>>>     > On 20/03/12 18:38, Jairo Sarabia wrote:
>>>>     >
>>>>     >> Hi all,
>>>>     >>
>>>>     >> I see there aren't Sentence Detect Models for Asian languages in openNLP
>>>>     >> repository and I need these ones.
>>>>     >> I've to train Sentence Detect Models for Chinese, Japanese and Korean
>>>>     >> languages, but I don't know these languages.
>>>>     >> How coud I get the data train files for these languages?
>>>>     >>
>>>>     >> Thanks in advance!,
>>>>     >>
>>>>     >> Jairo Sarabia
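
For reference, the approach Jim describes in the thread (split at known end-of-sentence signs with a regex, then review the result by eye before training) could look roughly like the sketch below. It is only an illustration under a few assumptions: the set of marks in EOS_CHARS is a guess at the "unambiguous" Chinese/Japanese markers, the function names are made up for this example, and the comma is deliberately left out because the thread notes it is ambiguous. Korean and Thai are not covered here.

```python
import re

# Assumption: these full-width marks are treated as unambiguous sentence ends.
# The comma is left out on purpose, since it is ambiguous in Chinese.
EOS_CHARS = "。！？"

# A run of EOS characters, plus any closing quotes or brackets right after it.
_EOS_PATTERN = re.compile(r"([" + EOS_CHARS + r"]+[」』”’）)]*)")


def split_sentences(text):
    """Split text after each run of EOS characters; return non-empty sentences."""
    sentences = []
    current = ""
    # re.split with a capturing group keeps the delimiters at odd indices.
    for i, piece in enumerate(_EOS_PATTERN.split(text)):
        current += piece
        if i % 2 == 1:  # piece is an EOS run, so the sentence ends here
            if current.strip():
                sentences.append(current.strip())
            current = ""
    if current.strip():  # trailing text that has no EOS mark
        sentences.append(current.strip())
    return sentences


def to_training_lines(text):
    """Join the sentences one per line, for manual review before training."""
    return "\n".join(split_sentences(text))


if __name__ == "__main__":
    print(to_training_lines("今日は晴れです。明日は雨ですか？はい！"))
```

The one-sentence-per-line output is meant as raw material for the manual pass Jim mentions; it should still be checked by someone who reads the language before it is used as training data.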
