Hi,

Do Chinese and/or Japanese also use word endings to signal the end of a sentence or thought, or do they use punctuation?
Thanks,
James

On 3/21/2012 10:43 PM, wl-gao wrote:
> I am Chinese, living in Japan...
>
> Sent from my iPod
>
> On 2012/03/22, at 8:42, James Kosin <[email protected]> wrote:
>
>> Don't worry,
>> Korean at least has patterns to the end of a sentence or really a
>> thought... They have specific endings to the words that mark the end of
>> the thought.
>>
>> James
>>
>> On 3/21/2012 7:37 PM, Jörn Kottmann wrote:
>>> I don't know, I have never worked with Asian languages,
>>> but it would of course be nice to improve our support in this area.
>>> Especially the basic tasks like sentence detection and
>>> tokenization are of great interest for many.
>>>
>>> Jörn
>>>
>>>
>>> On 03/22/2012 12:22 AM, James Kosin wrote:
>>>> Jörn,
>>>>
>>>> If there isn't anything for Korean, I could put something together.
>>>> The only problem would be getting free text.
>>>> I can start looking if needed.
>>>>
>>>> James
>>>>
>>>> On 3/21/2012 2:38 PM, Jörn Kottmann wrote:
>>>>> Here is a paper which describes Chinese sentence segmentation:
>>>>> www.aclweb.org/anthology/P/P11/P11-2111.pdf
>>>>>
>>>>> There they say that commas can be an end-of-sentence marker as well,
>>>>> but they are ambiguous.
>>>>>
>>>>> So we would need to add it as an EOS char and
>>>>> we should create a new feature generator.
>>>>>
>>>>> Are there any free training data sets which could be used?
>>>>>
>>>>> Jörn
>>>>>
>>>>>
>>>>> On 03/21/2012 03:34 PM, Joern Kottmann wrote:
>>>>>> Wikipedia says: "Languages like Japanese and Chinese have unambiguous
>>>>>> sentence-ending markers."
>>>>>> In this case we might be able to write a rule-based sentence detector
>>>>>> for these languages?
>>>>>>
>>>>>> Jörn
>>>>>>
>>>>>> On Wed, Mar 21, 2012 at 3:18 PM, [email protected] wrote:
>>>>>>
>>>>>> Hi
>>>>>>
>>>>>> There is a Thai model for the sentence detector. I don't know who
>>>>>> created it, but someone from the list knows and can point to some
>>>>>> article about it. What I can say is that OpenNLP had to be
>>>>>> customized to work with Thai, including the EOS characters,
>>>>>> which are ' ' and '\n'.
>>>>>>
>>>>>> http://svn.apache.org/viewvc/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/sentdetect/lang/th/SentenceContextGenerator.java?view=markup
>>>>>>
>>>>>> William
>>>>>>
>>>>>> On Wed, Mar 21, 2012 at 8:05 AM, Jim - FooBar();
>>>>>> <[email protected]> wrote:
>>>>>>
>>>>>>> Basically you need to know the punctuation signs indicating end of
>>>>>>> sentence, or find someone who does... then use a regex to split the
>>>>>>> sentences at those signs! It's not gonna be perfect - you may have
>>>>>>> to pass over it once or twice with your own eyes to make sure
>>>>>>> everything is ok before training. Everything depends on the
>>>>>>> language and how ambiguous its punctuation is.
>>>>>>>
>>>>>>> Jim
>>>>>>>
>>>>>>> On 20/03/12 18:38, Jairo Sarabia wrote:
>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> I see there aren't any sentence detection models for Asian
>>>>>>>> languages in the OpenNLP repository, and I need them.
>>>>>>>> I have to train sentence detection models for Chinese, Japanese
>>>>>>>> and Korean, but I don't know these languages.
>>>>>>>> How could I get the training data files for these languages?
>>>>>>>>
>>>>>>>> Thanks in advance!
>>>>>>>>
>>>>>>>> Jairo Sarabia
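
For reference, a minimal rule-based splitter along the lines Jörn and Jim describe could look like the sketch below. It is plain Java; the class name is made up, and treating only 。！？ (plus an optional closing quote or bracket) as unambiguous terminators is an assumption for illustration, not a tested recipe for any of these languages.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Hypothetical helper class, for illustration only.
    public class CjkRuleSplitter {

        // Run of non-terminators, one or more CJK terminators,
        // then an optional closing quote/bracket.
        private static final Pattern EOS =
            Pattern.compile("[^。！？]*[。！？]+[」』）]?");

        public static List<String> split(String text) {
            List<String> sentences = new ArrayList<String>();
            Matcher m = EOS.matcher(text);
            int last = 0;
            while (m.find()) {
                sentences.add(m.group().trim());
                last = m.end();
            }
            // Keep any trailing text that has no terminator.
            if (last < text.length() && !text.substring(last).trim().isEmpty()) {
                sentences.add(text.substring(last).trim());
            }
            return sentences;
        }

        public static void main(String[] args) {
            String text = "今日は天気がいいです。散歩に行きますか？はい！";
            for (String s : split(text)) {
                System.out.println(s);
            }
        }
    }

Output of a quick pass over something like that could then be eyeballed and corrected before being used as training material, as Jim suggests; commas and other ambiguous marks would still need the statistical detector.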

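Once a model does exist, applying it would follow the usual OpenNLP 1.5 sentence detector API. This is only a sketch: the model file name zh-sent.bin is hypothetical, since no Chinese (or Japanese/Korean) model is published in the repository yet, and the input string is just a placeholder.

    import java.io.FileInputStream;
    import java.io.InputStream;

    import opennlp.tools.sentdetect.SentenceDetectorME;
    import opennlp.tools.sentdetect.SentenceModel;

    public class SentenceDetectDemo {
        public static void main(String[] args) throws Exception {
            // "zh-sent.bin" is a placeholder file name.
            InputStream modelIn = new FileInputStream("zh-sent.bin");
            try {
                SentenceModel model = new SentenceModel(modelIn);
                SentenceDetectorME detector = new SentenceDetectorME(model);
                String[] sentences = detector.sentDetect("...中文文本...");
                for (String s : sentences) {
                    System.out.println(s);
                }
            } finally {
                modelIn.close();
            }
        }
    }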