The corpus used for cTAKES sentence detection combines two sources: some Mayo Clinic clinical notes that were manually separated into sentences, and the Penn Treebank (Wall Street Journal).
-- James

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of John Green
Sent: Monday, August 26, 2013 11:46 AM
To: [email protected]
Subject: Re: apostrophe and sentence detector

Just out of curiosity, how was the training data originally built? I mean, who separated the lines? By hand? Regex?

Question two: has anyone made attempts at adding Project Gutenberg to the training data for things like sentence detection? Wide variety of punctuation in the years a lot of those books were written.

Trying to piece together how it all works,
JG

--
Sent from Mailbox for iPhone

On Mon, Aug 26, 2013 at 12:35 PM, Tim Miller <[email protected]> wrote:
> Ah, so we might suspect that some of those 7 lines in the file were
> indeed followed by newlines in the original training data. In the
> absence of more/better training data which would help us learn this I
> think it would be reasonable to restore the list of sentence-breaking
> characters to not include apostrophe. Seems like it is rare for a
> sentence to end on it, and my preference is to accidentally call 2
> sentences one sentence, rather than splitting one sentence in the
> middle. I think it's probably better for downstream processing.
> Just my .02,
> Tim
> On 08/26/2013 12:29 PM, Masanz, James J. wrote:
>> The training data is one sentence per line.
>> That's how you feed data to the sentence detector.
>>
>> -----Original Message-----
>> From: [email protected]
>> [mailto:[email protected]] On Behalf
>> Of Tim Miller
>> Sent: Monday, August 26, 2013 11:12 AM
>> To: [email protected]
>> Subject: Re: apostrophe and sentence detector
>>
>>
>> On 08/26/2013 12:05 PM, Masanz, James J. wrote:
>>> The recently rebuilt sentence detector (currently in trunk and the 3.1.0
>>> branch) is sometimes taking the apostrophe as a sentence break where the
>>> ctakes-3.0.0-incubating model didn't.
>>>
>>> The training data used for the recently rebuilt model contains only 7
>>> lines that end with an apostrophe (single quote).
>> Do you mean 7 sentences that end in a single apostrophe or 7 lines? The
>> sentence detector will currently break on newlines no matter what, so
>> the important number is how many sentences end mid-line with an
>> apostrophe, right?
>> Tim
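
For anyone who wants to reproduce the "7 lines" count discussed in this thread, here is a minimal sketch, assuming the training data is a plain-text file with one sentence per line, as described above. The function name and the sample sentences are hypothetical and are not part of cTAKES.

```python
from typing import Iterable


def count_lines_ending_with(lines: Iterable[str], char: str = "'") -> int:
    """Count non-empty lines whose last non-whitespace character is `char`."""
    count = 0
    for line in lines:
        stripped = line.rstrip()  # drop the trailing newline and spaces
        if stripped and stripped.endswith(char):
            count += 1
    return count


# Hypothetical sample standing in for a one-sentence-per-line training file.
sample = [
    "The patient denies chest pain.",
    "She described the pain as 'sharp'",
    "Patient's mother has diabetes.",
    "",
]

print(count_lines_ending_with(sample))  # prints 1
```

To run it on a real file, pass an open file handle instead of the list, for example `count_lines_ending_with(open(path, encoding="utf-8"))`, where `path` is wherever your copy of the training data lives. Since each line in that format is one sentence, a line-final apostrophe is also a sentence-final one.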
