Thank you. I've decided to also try a simpler rule-based approach, and it performs quite well. Anyway, this discussion was very useful to me.
Cheers,
Riccardo

2013/3/29 William Colen <[email protected]>

> Looking again at your sample, I believe you won't be able to have good
> results using the standard learnable OpenNLP Sentence Detector, and maybe
> any other ready-to-use tool. Your segmentation relies on some language
> knowledge that is hidden at this level of processing. Maybe you will have
> to combine sentence segmentation with POS tagging, or clause
> categorization, to have good results.
>
> On Tue, Mar 26, 2013 at 10:30 AM, Jörn Kottmann <[email protected]>
> wrote:
>
> > Hello,
> >
> > the sentence detector only considers EOS chars as potential
> > sentence boundaries. It should not be difficult to extend/modify it so
> > that locations detected by user code are used for the split decision.
> >
> > The iterations parameter specifies the maximum number of iterations for
> > an iterative machine learning algorithm, and cutoff removes features
> > which did not occur at least n times in the training data.
> >
> > Jörn
> >
> > On 03/26/2013 01:52 PM, Riccardo Tasso wrote:
> >
> >> Thank you Jörn, in fact the results improved a lot:
> >> Precision: 0.5325131810193322
> >> Recall: 0.4745497259201253
> >> F-Measure: 0.5018633540372671
> >>
> >> I guess the splitter could have better results if it were able to
> >> detect parenthetic structure such as:
> >> some text - speech - other text
> >> which in my dataset is split as:
> >> some text
> >> - speech -
> >> other text
> >> Is it possible?
> >>
> >> Another optimization would be to detect end-of-sentence symbols
> >> longer than one character, for example "...".
> >>
> >> Can you tell me more about the following parameters?
> >>
> >> - iterations
> >> - cutoff
> >>
> >> Is there any guideline on how to tune them?
> >>
> >> Cheers,
> >> Riccardo
> >>
> >> 2013/3/26 Jörn Kottmann <[email protected]>
> >>
> >>> On 03/26/2013 08:40 AM, Riccardo Tasso wrote:
> >>>
> >>>> Is the Sentence Detector able to split also on non dot characters?
> >>>> In my case there should be also other characters delimiting the end
> >>>> of a segment, such as: colon (:), dash (-), various kinds of
> >>>> quotation marks (", `, ', ...).
> >>>
> >>> The Sentence Detector can only split on end-of-sentence characters.
> >>> By default these are . ! ? but with 1.5.3 you can set them during
> >>> training to your custom set. There is a command line argument for it
> >>> on the Sentence Detector Trainer, have a look at the help.
> >>>
> >>> If you don't want to compile yourself, use the 1.5.3 RC2 which we are
> >>> currently testing.
> >>>
> >>> Jörn
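
As a rough illustration of the rule-based direction mentioned at the top of this thread, here is a minimal Python sketch that splits on a configurable set of end-of-segment characters and treats a run such as "..." as a single boundary. It does not use OpenNLP and is not the actual rule set used above; the function name `split_segments` and the default delimiter set are assumptions made only for this example.

```python
import re


def split_segments(text, delimiters=(".", "!", "?", ":")):
    """Split text after any run of delimiter characters.

    `delimiters` is a collection of single characters (hypothetical
    default chosen for illustration). A run such as "..." or "?!" is
    treated as one boundary and stays attached to the segment it closes.
    """
    # Build a character class; re.escape keeps "." or "-" literal.
    char_class = "[" + "".join(re.escape(d) for d in delimiters) + "]"
    # Lazily consume text up to and including a delimiter run, or up to
    # the end of the input when no delimiter remains.
    pattern = re.compile(r".+?(?:%s+|$)" % char_class, re.DOTALL)
    segments = (m.group().strip() for m in pattern.finditer(text))
    # Drop pieces that are empty after trimming whitespace.
    return [s for s in segments if s]


if __name__ == "__main__":
    for segment in split_segments("Wait... what? Fine: ok"):
        print(segment)
```

A sketch like this also splits inside abbreviations and decimals such as "3.14", and it does not handle the parenthetic "- speech -" case discussed in the thread; those would need extra rules or a learned model.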
