How about this idea for the training/test set:

1) Start with a document with NO newlines. Perhaps the entire document is 
just a single paragraph. 
2) Any sentence detector should then be able to parse it correctly, 
which gives us gold-standard sentence boundaries. 
3) Deterministically add newlines to the document: some after 
punctuation, some after a word, some after a sentence fragment. (A rough 
sketch follows.)
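
A rough sketch of (3) in Java; the injection rules and the 10% wrap rate 
are made up, just to make the idea concrete:

  import java.util.Random;

  /** Synthesizes newline-noisy text from a newline-free document. */
  public class NewlineInjector {

      private final Random rng;

      public NewlineInjector(long seed) {
          this.rng = new Random(seed);  // fixed seed keeps output deterministic
      }

      /**
       * Adds a newline after some tokens: always after sentence-final
       * punctuation, occasionally mid-sentence to mimic hard-wrapped notes.
       */
      public String inject(String singleParagraph) {
          StringBuilder out = new StringBuilder();
          for (String token : singleParagraph.split(" ")) {
              out.append(token);
              boolean endsSentence = token.endsWith(".")
                      || token.endsWith("!") || token.endsWith("?");
              if (endsSentence || rng.nextDouble() < 0.1) {
                  out.append('\n');
              } else {
                  out.append(' ');
              }
          }
          return out.toString();
      }
  }

Since the true boundaries are known before injection, the output plus the 
original sentence spans label every newline as either a real break or a wrap.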

Jejo 

On Sep 29, 2014, at 3:43 PM, Chen, Pei <pei.c...@childrens.harvard.edu> wrote:

> Assuming we have a representative training set, are there any objections if 
> we default cTAKES to this SentenceAnnotator + Model?
> For the upcoming release:
> - Consolidate the existing sentence detector and the ytex sentence detector 
> into this new one? 
> - Add a config parameter to still allow forcing a hard break on 
> newline chars (see the sketch below).  That way, we won't have to maintain 
> multiple sentence annotators and it'll be less confusing for new users...
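> 
> For the override, something like this uimaFIT-style parameter could work 
> (the name and default here are just a suggestion, not what's in the patch): 
> 
>   // Hypothetical parameter, for illustration only.
>   public static final String PARAM_NEWLINE_IS_HARD_BREAK = "NewlineIsHardBreak";
> 
>   @ConfigurationParameter(
>       name = PARAM_NEWLINE_IS_HARD_BREAK,
>       mandatory = false,
>       defaultValue = "false",
>       description = "If true, always end a sentence at a newline instead of "
>           + "letting the model decide.")
>   private boolean newlineIsHardBreak;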
> 
> --Pei 
> 
> 
>> -----Original Message-----
>> From: Miller, Timothy [mailto:timothy.mil...@childrens.harvard.edu]
>> Sent: Monday, September 29, 2014 2:47 PM
>> To: dev@ctakes.apache.org
>> Subject: Re: sentence detector model
>> 
>> That does sound like it would be useful, since MIMIC does have both kinds of
>> linebreak styles in different notes. If I did some annotations on such a
>> dataset, would it be redistributable, say on the PhysioNet website? I believe
>> the ShARe project has a download site there (it is a layer of annotations on
>> MIMIC). Another option would be you posting your raw data there, and I
>> could post offset-based annotations on a public repo like GitHub.
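>> 
>> For instance, the sentence annotations could ship as standoff tuples keyed 
>> by note id and character offsets (format invented here, just to illustrate):
>> 
>>   note-0001.txt   0    74   SENTENCE
>>   note-0001.txt   75   112  SENTENCE
>> 
>> Anyone who already has the raw notes under the DUA could re-project the 
>> spans locally, so the text itself never needs to be redistributed.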
>> Tim
>> 
>> 
>> On 09/29/2014 01:54 PM, Peter Szolovits wrote:
>>> I have a set of about 27K documents from MIMIC (circa 2009) in which I
>>> have replaced the weird PHI markers by synthesized pseudonymous data.
>>> These have natural sentence breaks (typically in the middle of lines),
>>> normal paragraph structure, bulleted lists, etc.  Assuming it goes to
>>> people who have signed the MIMIC DUA, I could provide these if you are
>>> interested.  --Pete Sz.
>>> 
>>> On Sep 29, 2014, at 1:37 PM, Miller, Timothy
>>> <timothy.mil...@childrens.harvard.edu> wrote:
>>> 
>>>> Some of them are a bit artificial for this task, with notes being
>>>> annotated as one sentence per line and offset punctuation. I think
>>>> maybe the 2008 and 2009 data might have original formatting though,
>>>> with newlines not always breaking sentences. That has certain
>>>> advantages over raw MIMIC for training, since the PHI isn't so weirdly
>>>> formatted, but then again it is not a mix of styles (that is,
>>>> newline always terminates a sentence vs. sometimes terminates a
>>>> sentence). I think it would still have to be paired with another
>>>> dataset to be a representative sample.
>>>> Tim
>>>> 
>>>> On 09/29/2014 01:24 PM, vijay garla wrote:
>>>>> Why not use the i2b2 corpora?
>>>>> 
>>>>> On Monday, September 29, 2014, Dligach, Dmitriy <
>>>>> dmitriy.dlig...@childrens.harvard.edu> wrote:
>>>>> 
>>>>>> Maybe creating a made-up set of sentences would be an option? That
>>>>>> way we could agree on the annotation of concrete cases. Although
>>>>>> this would be more of a unit test than a corpus.
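>>>>>> 
>>>>>> For example, straight against the OpenNLP API (the model path and
>>>>>> the expected split here are placeholders):
>>>>>> 
>>>>>>   import java.io.FileInputStream;
>>>>>>   import opennlp.tools.sentdetect.SentenceDetectorME;
>>>>>>   import opennlp.tools.sentdetect.SentenceModel;
>>>>>>   import org.junit.Test;
>>>>>>   import static org.junit.Assert.assertEquals;
>>>>>> 
>>>>>>   public class SentenceDetectorCasesTest {
>>>>>>       @Test
>>>>>>       public void newlineInsideSentenceIsAWrapNotABreak() throws Exception {
>>>>>>           SentenceModel model =
>>>>>>               new SentenceModel(new FileInputStream("sd-med.bin"));
>>>>>>           SentenceDetectorME detector = new SentenceDetectorME(model);
>>>>>>           String note = "Patient denies chest\npain. Will follow up.";
>>>>>>           // Two sentences: the newline above should not force a third.
>>>>>>           assertEquals(2, detector.sentDetect(note).length);
>>>>>>       }
>>>>>>   }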
>>>>>> 
>>>>>> Dima
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Sep 27, 2014, at 12:15, Miller, Timothy <
>>>>>> timothy.mil...@childrens.harvard.edu> wrote:
>>>>>> 
>>>>>>> I've just been using the OpenNLP command-line cross validator on the
>>>>>>> small dataset I annotated (along with some eyeballing). It would be
>>>>>>> cool if there were a standard clinical resource available for this
>>>>>>> task, but I hadn't considered it much because the data I annotated
>>>>>>> pulls from multiple datasets, and the process of arranging with
>>>>>>> different institutions to make something like that available would
>>>>>>> probably be a nightmare.
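>>>>>>> 
>>>>>>> (For reference, the invocation is along these lines; the file name is
>>>>>>> a placeholder and flags may vary by OpenNLP version:
>>>>>>> 
>>>>>>>   opennlp SentenceDetectorCrossValidator -lang en -data sentences.train \
>>>>>>>       -encoding UTF-8
>>>>>>> 
>>>>>>> where sentences.train holds one sentence per line, with a blank line
>>>>>>> between documents.)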
>>>>>>> Tim
>>>>>>> 
>>>>>>> Sent from my iPad. Sorry about the typos.
>>>>>>> 
>>>>>>>> On Sep 27, 2014, at 12:16 PM, "Dligach, Dmitriy" <
>>>>>>>> dmitriy.dlig...@childrens.harvard.edu> wrote:
>>>>>>>> Tim, thanks for working on this!
>>>>>>>> 
>>>>>>>> Question: do we have some formal way of evaluating the sentence
>>>>>>>> detector? Maybe we should come up with some dev set that would
>>>>>>>> include examples from MIMIC...
>>>>>>>> Dima
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On Sep 27, 2014, at 8:57, Miller, Timothy <
>>>>>>>>> timothy.mil...@childrens.harvard.edu> wrote:
>>>>>>>>> I have been working on the sentence detector newline issue,
>>>>>>>>> training a model to probabilistically split sentences on newlines
>>>>>>>>> rather than forcing sentence breaks. I have checked in a model to
>>>>>>>>> the repo under ctakes-core-res. I also attached a patch to
>>>>>>>>> ctakes-core to the JIRA issue:
>>>>>>>>> 
>>>>>>>>> https://issues.apache.org/jira/browse/CTAKES-41
>>>>>>>>> 
>>>>>>>>> for people to test. The status of my testing is that it doesn't
>>>>>>>>> seem to break on notes where cTAKES worked well before (those
>>>>>>>>> where newlines are always sentence breaks), and it is a slight
>>>>>>>>> improvement on notes where newlines may or may not be sentence
>>>>>>>>> breaks. Once the change is checked in, we can continue improving
>>>>>>>>> the model by adding more data and features, but the first hurdle
>>>>>>>>> I'd like to get past is making sure it runs well enough on the
>>>>>>>>> type of data that the old model worked well on. Let me know if
>>>>>>>>> you have any questions.
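>>>>>>>>> 
>>>>>>>>> Roughly, the trick is to make '\n' a candidate end-of-sentence
>>>>>>>>> character so the classifier scores each newline instead of always
>>>>>>>>> breaking on it. A sketch against the OpenNLP API (details may
>>>>>>>>> differ from the actual patch):
>>>>>>>>> 
>>>>>>>>>   import java.io.IOException;
>>>>>>>>>   import opennlp.tools.sentdetect.SentenceDetectorFactory;
>>>>>>>>>   import opennlp.tools.sentdetect.SentenceDetectorME;
>>>>>>>>>   import opennlp.tools.sentdetect.SentenceModel;
>>>>>>>>>   import opennlp.tools.sentdetect.SentenceSample;
>>>>>>>>>   import opennlp.tools.util.ObjectStream;
>>>>>>>>>   import opennlp.tools.util.TrainingParameters;
>>>>>>>>> 
>>>>>>>>>   public class NewlineAwareTrainer {
>>>>>>>>>       /** 'samples' streams gold SentenceSamples, e.g. parsed from
>>>>>>>>>        *  one-sentence-per-line training data. */
>>>>>>>>>       public static SentenceModel train(
>>>>>>>>>               ObjectStream<SentenceSample> samples) throws IOException {
>>>>>>>>>           // Newline joins the usual punctuation as a candidate
>>>>>>>>>           // boundary; the model decides per occurrence.
>>>>>>>>>           char[] eosChars = { '.', '!', '?', '\n' };
>>>>>>>>>           SentenceDetectorFactory factory =
>>>>>>>>>               new SentenceDetectorFactory("en", true, null, eosChars);
>>>>>>>>>           return SentenceDetectorME.train("en", samples, factory,
>>>>>>>>>               TrainingParameters.defaultParams());
>>>>>>>>>       }
>>>>>>>>>   }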
>>>>>>>>> Thanks
>>>>>>>>> Tim
>>> 
> 
