That does sound useful, since MIMIC has both kinds of line-break styles
in different notes. If I annotated such a dataset, would the result be
redistributable, say on the PhysioNet website? I believe the ShARe
project has a download site there (it is a layer of annotations on
MIMIC). Another option would be for you to post your raw data there, and
I could post offset-based annotations on a public repo like GitHub.
Tim
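
As a rough illustration of the offset-based option, here is a minimal
Python sketch. The note text is made up (it is not MIMIC data), and the
names `Span` and `sentences_from_offsets` are hypothetical. The point is
that the public annotation layer carries only character offsets, so the
sentences can be recovered only by someone who already holds the raw text.

```python
from typing import List, Tuple

# (begin, end) character offsets into the raw note, end exclusive.
Span = Tuple[int, int]


def sentences_from_offsets(text: str, spans: List[Span]) -> List[str]:
    """Recover sentence strings from raw text plus stand-off offsets."""
    sentences = []
    for begin, end in spans:
        if not (0 <= begin < end <= len(text)):
            raise ValueError(f"span ({begin}, {end}) is outside the text")
        sentences.append(text[begin:end])
    return sentences


# Made-up note: the first newline falls mid-sentence, the second newline
# ends a sentence.
note = "Pt seen in clinic\ntoday. No acute distress.\nFollow up in 2 weeks."

# The only part that would need to be published: offsets, no note text.
spans = [(0, 24), (25, 43), (44, 65)]

for sentence in sentences_from_offsets(note, spans):
    print(repr(sentence))
```

Because the offsets index into the exact character sequence, both sides
would need to agree on one canonical copy of each note, including its
line-ending convention.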


On 09/29/2014 01:54 PM, Peter Szolovits wrote:
> I have a set of about 27K documents from MIMIC (circa 2009) in which I have 
> replaced the weird PHI markers by synthesized pseudonymous data.  These have 
> natural sentence breaks (typically in the middle of lines), normal paragraph 
> structure, bulleted lists, etc.  Assuming it goes to people who have signed 
> the MIMIC DUA, I could provide these if you are interested.  --Pete Sz.
>
> On Sep 29, 2014, at 1:37 PM, Miller, Timothy 
> <timothy.mil...@childrens.harvard.edu> wrote:
>
>> Some of them are a bit artificial for this task, with notes being
>> annotated as one sentence per line and offset punctuation. I think maybe
>> the 2008 and 2009 data might have original formatting though, with
>> newlines not always breaking sentences. That has certain advantages over
>> raw MIMIC for training since the PHI isn't so weirdly formatted, but
>> then again it is not a mix of styles (that is, newline always terminates
>> a sentence vs. newline sometimes terminates a sentence). I think it would
>> still have to be paired with another dataset to be a representative sample.
>> Tim
>>
>> On 09/29/2014 01:24 PM, vijay garla wrote:
>>> Why not use the i2b2 corpora?
>>>
>>> On Monday, September 29, 2014, Dligach, Dmitriy <
>>> dmitriy.dlig...@childrens.harvard.edu> wrote:
>>>
>>>> Maybe creating a made-up set of sentences would be an option? That way we
>>>> could agree on the annotation of concrete cases. Although this would be
>>>> more of a unit test than a corpus.
>>>>
>>>> Dima
>>>>
>>>>
>>>>
>>>>
>>>> On Sep 27, 2014, at 12:15, Miller, Timothy <
>>>> timothy.mil...@childrens.harvard.edu> wrote:
>>>>
>>>>> I've just been using the OpenNLP command line cross validator on the
>>>>> small dataset I annotated (along with some eyeballing). It would be cool if
>>>>> there were a standard clinical resource available for this task, but I
>>>>> hadn't considered it much because the data I annotated pulls from multiple
>>>>> datasets and the process of arranging with different institutions to make
>>>>> something like that available would probably be a nightmare.
>>>>> Tim
>>>>>
>>>>> Sent from my iPad. Sorry about the typos.
>>>>>
>>>>>> On Sep 27, 2014, at 12:16 PM, "Dligach, Dmitriy" <
>>>>>> dmitriy.dlig...@childrens.harvard.edu> wrote:
>>>>>> Tim, thanks for working on this!
>>>>>>
>>>>>> Question: do we have some formal way of evaluating the sentence
>>>>>> detector? Maybe we should come up with some dev set that would include
>>>>>> examples from MIMIC...
>>>>>> Dima
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>> On Sep 27, 2014, at 8:57, Miller, Timothy <
>>>>>>> timothy.mil...@childrens.harvard.edu> wrote:
>>>>>>> I have been working on the sentence detector newline issue, training a
>>>>>>> model to probabilistically split sentences on newlines rather than forcing
>>>>>>> sentence breaks. I have checked in a model to the repo under
>>>>>>> ctakes-core-res. I also attached a patch for ctakes-core to the JIRA issue:
>>>>>>> https://issues.apache.org/jira/browse/CTAKES-41
>>>>>>>
>>>>>>> for people to test. The status of my testing is that it doesn't seem
>>>>>>> to break on notes where cTAKES worked well before (those where newlines are
>>>>>>> always sentence breaks), and is a slight improvement on notes where
>>>>>>> newlines may or may not be sentence breaks. Once the change is checked in
>>>>>>> we can continue improving the model by adding more data and features, but
>>>>>>> the first hurdle I'd like to get past is making sure it runs well enough on
>>>>>>> the type of data that the old model worked well on. Let me know if you have
>>>>>>> any questions.
>>>>>>> Thanks
>>>>>>> Tim
>
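
On the question raised in the thread about a formal way to evaluate the
sentence detector, one common approach is to score predicted sentence-end
offsets against a gold dev set. The sketch below is a minimal version
under stated assumptions: the function name `boundary_scores` and the
example offsets are hypothetical, and this is not necessarily the metric
the OpenNLP cross validator reports.

```python
from typing import Dict, Iterable, Tuple

# (begin, end) character offsets, end exclusive.
Span = Tuple[int, int]


def boundary_scores(gold: Iterable[Span], predicted: Iterable[Span]) -> Dict[str, float]:
    """Precision, recall and F1 over sentence-end offsets."""
    gold_ends = {end for _, end in gold}
    pred_ends = {end for _, end in predicted}
    true_positives = len(gold_ends & pred_ends)
    precision = true_positives / len(pred_ends) if pred_ends else 0.0
    recall = true_positives / len(gold_ends) if gold_ends else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up example: the gold annotation treats the newline at offset 17 as
# falling mid-sentence; a detector that forces a break at every newline
# adds a spurious boundary there.
gold = [(0, 24), (25, 43), (44, 65)]
forced_newline_breaks = [(0, 17), (18, 24), (25, 43), (44, 65)]

print(boundary_scores(gold, forced_newline_breaks))
```

In this example the forced-newline detector finds every gold boundary, so
recall stays perfect, but the extra break lowers precision. That is the
kind of difference a dev set mixing both newline styles would expose.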
