I am Chinese, living in Japan...

Sent from my iPod
On 2012/03/22, at 8:42, James Kosin <[email protected]> wrote:

> Don't worry,
> Korean at least has patterns to the end of a sentence or really a
> thought.... They have specific endings to the words that key an end of
> the thought.
>
> James
>
> On 3/21/2012 7:37 PM, Jörn Kottmann wrote:
>> I don't know, I never worked with Asian languages,
>> but it would of course be nice to improve our support in this area.
>> Especially the basic tasks like sentence detection and
>> tokenization are of great interest for many.
>>
>> Jörn
>>
>> On 03/22/2012 12:22 AM, James Kosin wrote:
>>> Jorn,
>>>
>>> If there isn't anything for Korean, I could put something together.
>>> Only problem would be getting free text.
>>> I can start looking if needed.
>>>
>>> James
>>>
>>> On 3/21/2012 2:38 PM, Jörn Kottmann wrote:
>>>> Here is a paper which describes Chinese sentence segmentation:
>>>> www.aclweb.org/anthology/P/P11/P11-2111.pdf
>>>>
>>>> There they say that commas can be an end-of-sentence marker as well,
>>>> but they are ambiguous.
>>>>
>>>> So we would need to add it as an eos char and
>>>> we should create a new feature generator.
>>>>
>>>> Are there any free training data sets which could be used?
>>>>
>>>> Jörn
>>>>
>>>> On 03/21/2012 03:34 PM, Joern Kottmann wrote:
>>>>> Wikipedia says: "Languages like Japanese and Chinese have unambiguous
>>>>> sentence-ending markers."
>>>>> In this case we might be able to write a rule based sentence detector
>>>>> for these languages?
>>>>>
>>>>> Jörn
>>>>>
>>>>> On Wed, Mar 21, 2012 at 3:18 PM, [email protected] wrote:
>>>>>
>>>>> Hi
>>>>>
>>>>> There is a Thai model for sentence detector. I don't know who created it,
>>>>> but someone from the list knows and can point to some article about it.
>>>>> What I can say is that OpenNLP had to be customized to work with Thai,
>>>>> including the EOS Characters that are ' ' and '\n'
>>>>>
>>>>> http://svn.apache.org/viewvc/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/sentdetect/lang/th/SentenceContextGenerator.java?view=markup
>>>>>
>>>>> William
>>>>>
>>>>> On Wed, Mar 21, 2012 at 8:05 AM, Jim - FooBar(); <[email protected]> wrote:
>>>>>
>>>>>> Basically you need to know the punctuation signs indicating end of
>>>>>> sentence or find someone who does...then use regex to split the sentences
>>>>>> at those signs! it's not gonna be perfect - you may have to pass it once or
>>>>>> twice with your own eyes to make sure everything is ok before training.
>>>>>> everything depends on the language and how ambiguous punctuation it has.
>>>>>>
>>>>>> Jim
>>>>>>
>>>>>> On 20/03/12 18:38, Jairo Sarabia wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I see there aren't Sentence Detect Models for Asian languages in openNLP
>>>>>>> repository and I need these ones.
>>>>>>> I've to train Sentence Detect Models for Chinese, Japanese and Korean
>>>>>>> languages, but I don't know these languages.
>>>>>>> How coud I get the data train files for these languages?
>>>>>>>
>>>>>>> Thanks in advance!,
>>>>>>>
>>>>>>> Jairo Sarabia
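
For reference, the regex approach Jim describes in the quoted thread (split at the end-of-sentence signs, then check the result by eye before training) could look roughly like the sketch below. It is only an illustration, not anything from OpenNLP: the names split_sentences and to_training_lines are made up, and the set of end-of-sentence characters is an assumption that suits Chinese and Japanese text. It deliberately leaves out the comma, which the paper Jörn links calls ambiguous, and it will wrongly split quotations that contain an end mark in the middle.

```python
import re

# Assumption: sentence-ending marks for Chinese/Japanese text, i.e. the
# ideographic full stop, the fullwidth exclamation and question marks,
# and their ASCII counterparts. Adjust for your corpus.
EOS_CHARS = "。！？!?"

# Assumption: closing quotes/brackets that may directly follow an end mark
# and should stay attached to the sentence they close.
CLOSERS = "」』”’）)"

_SENTENCE_RE = re.compile(
    r"[^{eos}]*[{eos}]+[{closers}]*|[^{eos}]+$".format(
        eos=re.escape(EOS_CHARS), closers=re.escape(CLOSERS)
    )
)


def split_sentences(text):
    """Split text after each run of end marks (plus trailing closers).

    Any trailing text without an end mark is returned as a final sentence.
    """
    sentences = []
    for match in _SENTENCE_RE.finditer(text):
        sentence = match.group().strip()
        if sentence:
            sentences.append(sentence)
    return sentences


def to_training_lines(text):
    """Join the split sentences one per line, ready for manual review."""
    return "\n".join(split_sentences(text))


if __name__ == "__main__":
    print(to_training_lines("今天天气很好。你去哪里？我不知道"))
```

The one-sentence-per-line output is, as far as I know, the layout the OpenNLP sentence detector trainer expects, but as Jim says it still needs a manual pass before it is used as training data.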
