Why not use the i2b2 corpora?

On Monday, September 29, 2014, Dligach, Dmitriy <dmitriy.dlig...@childrens.harvard.edu> wrote:
> Maybe creating a made-up set of sentences would be an option? That way we
> could agree on the annotation of concrete cases. Although this would be
> more of a unit test than a corpus.
>
> Dima
>
> On Sep 27, 2014, at 12:15, Miller, Timothy
> <timothy.mil...@childrens.harvard.edu> wrote:
>
> > I've just been using the opennlp command line cross validator on the
> > small dataset i annotated (along with some eyeballing). It would be cool
> > if there was a standard clinical resource available for this task, but I
> > hadn't considered it much because the data I annotated pulls from
> > multiple datasets and the process of arranging with different
> > institutions to make something like that available would probably be a
> > nightmare.
> > Tim
> >
> > Sent from my iPad. Sorry about the typos.
> >
> >> On Sep 27, 2014, at 12:16 PM, "Dligach, Dmitriy"
> >> <dmitriy.dlig...@childrens.harvard.edu> wrote:
> >>
> >> Tim, thanks for working on this!
> >>
> >> Question: do we have some formal way of evaluating the sentence
> >> detector? Maybe we should come up with some dev set that would include
> >> examples from mimic...
> >>
> >> Dima
> >>
> >>> On Sep 27, 2014, at 8:57, Miller, Timothy
> >>> <timothy.mil...@childrens.harvard.edu> wrote:
> >>>
> >>> I have been working on the sentence detector newline issue, training a
> >>> model to probabilistically split sentences on newlines rather than
> >>> forcing sentence breaks. I have checked in a model to the repo under
> >>> ctakes-core-res. I also attached a patch to ctakes-core to the jira
> >>> issue:
> >>> https://issues.apache.org/jira/browse/CTAKES-41
> >>>
> >>> for people to test. The status of my testing is that it doesn't seem
> >>> to break on notes where ctakes worked well before (those where
> >>> newlines are always sentence breaks), and is a slight improvement on
> >>> notes where newlines may or may not be sentence breaks. Once the
> >>> change is checked in we can continue improving the model by adding
> >>> more data and features, but the first hurdle I'd like to get past is
> >>> making sure it runs well enough on the type of data that the old model
> >>> worked well on. Let me know if you have any questions.
> >>>
> >>> Thanks
> >>> Tim
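
For anyone who wants to try the "made-up set of sentences" idea from this thread, here is a rough sketch of what such a unit-test-style check might look like. It is only an illustration, not the cTAKES or OpenNLP evaluation code: the note text, the gold segmentation, and the helper names (`boundary_offsets`, `score`, `split_on_newlines`) are all invented for this example, and the baseline simply mimics the old behaviour of forcing a sentence break at every newline.

```python
def boundary_offsets(sentences, text):
    """Return the set of character offsets at which each sentence ends in text."""
    offsets = set()
    cursor = 0
    for sent in sentences:
        start = text.index(sent, cursor)  # raises ValueError if a sentence is not found
        cursor = start + len(sent)
        offsets.add(cursor)
    return offsets


def score(gold, predicted):
    """Precision, recall and F1 over sentence-end offsets."""
    true_pos = len(gold & predicted)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def split_on_newlines(text):
    """Hypothetical baseline: treat every newline as a sentence break."""
    return [line for line in text.split("\n") if line.strip()]


# Made-up note: the first newline falls inside a sentence, the second ends one.
TEXT = "Patient denies chest pain or\nshortness of breath.\nVitals stable."
GOLD = ["Patient denies chest pain or\nshortness of breath.", "Vitals stable."]

gold = boundary_offsets(GOLD, TEXT)
predicted = boundary_offsets(split_on_newlines(TEXT), TEXT)
precision, recall, f1 = score(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

On this toy note the forced-newline baseline finds both true boundaries but also adds a spurious one after "or", so recall is perfect while precision drops. A detector that decides probabilistically at newlines could be swapped in for `split_on_newlines` and scored the same way.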