I'd have to disagree that it is a subset of the "English language" found in
books -- for one thing, one finds a great many more sentence fragments and
lists in clinical records. I have no doubt that training on Gutenberg would
yield a reliable sentence detector, but I fear that detector would be
unlikely to perform much better than the existing one.
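
If anyone does want to experiment along those lines, a minimal training
sketch against the OpenNLP 1.5-era API might look like the following. The
file name is a placeholder, and it assumes the one-sentence-per-line format
James describes further down the thread; treat it as a sketch, not as how
the shipped model was actually built:

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.nio.charset.Charset;

import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.sentdetect.SentenceSample;
import opennlp.tools.sentdetect.SentenceSampleStream;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class TrainSentenceModel {
    public static void main(String[] args) throws Exception {
        // Hypothetical training file: one sentence per line, e.g. Gutenberg text
        // mixed with whatever clinical sentences one is allowed to use.
        ObjectStream<String> lines = new PlainTextByLineStream(
                new FileInputStream("sentences-one-per-line.txt"),
                Charset.forName("UTF-8"));
        ObjectStream<SentenceSample> samples = new SentenceSampleStream(lines);

        SentenceModel model;
        try {
            // useTokenEnd = true, no abbreviation dictionary, default maxent params
            model = SentenceDetectorME.train("en", samples, true, null,
                    TrainingParameters.defaultParams());
        } finally {
            samples.close();
        }

        FileOutputStream out = new FileOutputStream("sd-experimental.bin");
        try {
            model.serialize(out);
        } finally {
            out.close();
        }
    }
}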

To be honest, I've started to develop more and more concern about some of
the models used and the training data behind them. The structure of
clinical records varies dramatically between institutions (and, of course,
even between departments at a single institution). I've found that I have
to remain vigilant about the quality of sentence detection in just about
everything I run. This might be unavoidable, but perhaps what we need is an
annotated set of clinical documents culled from a variety of institutions.
Probably pie in the sky, though ;)
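
For what it's worth, the vigilance I mean mostly amounts to a quick sanity
check along these lines (the model path and sample note are made up, and
this goes through plain OpenNLP rather than the full cTAKES pipeline):

import java.io.FileInputStream;

import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;

public class EyeballSentences {
    public static void main(String[] args) throws Exception {
        // Placeholder path; point it at whichever sentence model is being tested.
        FileInputStream modelIn = new FileInputStream("sd-experimental.bin");
        SentenceModel model;
        try {
            model = new SentenceModel(modelIn);
        } finally {
            modelIn.close();
        }

        SentenceDetectorME detector = new SentenceDetectorME(model);

        // A made-up fragment in the style of a clinical note.
        String note = "Pt c/o chest pain x2 days. Meds: ASA 81mg, lisinopril 10mg. "
                + "Follow up in 2 wks.";
        for (String sentence : detector.sentDetect(note)) {
            System.out.println("[" + sentence + "]");
        }
    }
}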

Karthik





--
Karthik Sarma
UCLA Medical Scientist Training Program Class of 20??
Member, UCLA Medical Imaging & Informatics Lab
Member, CA Delegation to the House of Delegates of the American Medical
Association
ksa...@ksarma.com
gchat: ksa...@gmail.com
linkedin: www.linkedin.com/in/ksarma


On Mon, Aug 26, 2013 at 12:22 PM, John Green <john.travis.gr...@gmail.com> wrote:

> Karthik, well said. There are many differences. I wonder, what do you
> think about the logical division of the two sets? Do they share a domain? Is
> one a subset of the other? I would propose that it wouldn't be unreasonable
> to think of clinical notes as a subset of the English language. It
> seems to me that Gutenberg is a fairly good average of that English language,
> so the superset could contribute to the recognition of the subset.
>
>
>
>
>
>     JG
>
>
>
>
>
>     —
> Sent from Mailbox for iPhone
>
> On Mon, Aug 26, 2013 at 2:07 PM, Masanz, James J. <masanz.ja...@mayo.edu>
> wrote:
>
> > The corpus used for cTAKES sentence detection is a combination of some
> Mayo Clinic clinical notes that were manually separated into sentences and
> the Penn Treebank (Wall Street Journal).
> > -- James
> > -----Original Message-----
> > From: dev-return-1889-Masanz.James=mayo....@ctakes.apache.org [mailto:
> dev-return-1889-Masanz.James=mayo....@ctakes.apache.org] On Behalf Of
> John Green
> > Sent: Monday, August 26, 2013 11:46 AM
> > To: dev@ctakes.apache.org
> > Subject: Re: apostrophe and sentence detector
> > Just out of curiosity, how was the training data originally built? I
> mean, who separated the lines? By hand? Regex?
> >
> >
> >     Question two: has anyone made attempts at adding Project Gutenberg
> to the training data for things like sentence detection? There's a wide
> variety of punctuation from the years in which a lot of those books were written.
> >
> >
> >     Trying to piece together how it all works,
> >     JG
> >
> >
> >     —
> > Sent from Mailbox for iPhone
> > On Mon, Aug 26, 2013 at 12:35 PM, Tim Miller
> > <timothy.mil...@childrens.harvard.edu> wrote:
> >> Ah, so we might suspect that some of those 7 lines in the file were
> >> indeed followed by newlines in the original training data. In the
> >> absence of more/better training data which would help us learn this, I
> >> think it would be reasonable to restore the list of sentence-breaking
> >> characters so it does not include the apostrophe. It seems rare for a
> >> sentence to end on it, and my preference is to accidentally call 2
> >> sentences one sentence, rather than splitting one sentence in the
> >> middle. I think it's probably better for downstream processing.
> >> Just my .02,
> >> Tim
> >> On 08/26/2013 12:29 PM, Masanz, James J. wrote:
> >>> The training data is one sentence per line.
> >>> That's how you feed data to the sentence detector.
> >>>
> >>> -----Original Message-----
> >>> From: dev-return-1884-Masanz.James=mayo....@ctakes.apache.org [mailto:
> dev-return-1884-Masanz.James=mayo....@ctakes.apache.org] On Behalf Of Tim
> Miller
> >>> Sent: Monday, August 26, 2013 11:12 AM
> >>> To: dev@ctakes.apache.org
> >>> Subject: Re: apostrophe and sentence detector
> >>>
> >>>
> >>> On 08/26/2013 12:05 PM, Masanz, James J. wrote:
> >>>> The recently rebuilt sentence detector (currently in trunk and the
> 3.1.0 branch) is sometimes taking the apostrophe as a sentence break where
> the ctakes-3.0.0-incubating model didn't.
> >>>>
> >>>> The training data used for the recently rebuilt model contains only
> 7 lines that end with an apostrophe (single quote).
> >>> Do you mean 7 sentences that end in a single apostrophe or 7 lines? The
> >>> sentence detector will currently break on newlines no matter what, so
> >>> the important number is how many sentences end mid-line with an
> >>> apostrophe, right?
> >>> Tim
>
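
One footnote on the sentence-breaking characters discussed in the quoted
thread above: if I'm reading the OpenNLP 1.5.3 API right, you can pin the
end-of-sentence character set explicitly at training time via
SentenceDetectorFactory, which is one way to keep the apostrophe from ever
being considered a break candidate, regardless of what the training data
happens to contain. cTAKES wraps its own classes around the detector, so
take this purely as an illustration of the idea, not as the actual cTAKES
configuration:

import opennlp.tools.sentdetect.SentenceDetectorFactory;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.sentdetect.SentenceSample;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.TrainingParameters;

public class TrainWithoutApostropheBreaks {
    // Explicit end-of-sentence characters: period, question mark, exclamation
    // point -- but no apostrophe, so it is never treated as a break candidate.
    private static final char[] EOS_CHARS = { '.', '?', '!' };

    public static SentenceModel train(ObjectStream<SentenceSample> samples)
            throws Exception {
        SentenceDetectorFactory factory = new SentenceDetectorFactory(
                "en", true /* useTokenEnd */, null /* no abbreviation dict */,
                EOS_CHARS);
        return SentenceDetectorME.train("en", samples, factory,
                TrainingParameters.defaultParams());
    }
}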
