Re: sentence splitter & forks/branches

Miller, Timothy Sat, 18 Jan 2014 05:03:35 -0800

Sorry Paula, it's been a busy few weeks. I'm sure everyone else has been
busy as well.


I'm sorry to say I think at this point it might be difficult to get the
exact fix you want out of the module. It works in 2 parts I believe:
1) Identify cue words
2) Classify entities given the identified cue words.

You fixed 1) so that it recognizes your cue word, but if 2) uses a
machine learning model it may still get the wrong outcome sometimes, and
that can be hard to fix. The model obviously wouldn't have seen any
training examples using that keyword, though I would have thought there
might be some cases it would get right using other features.

If you've tried a bunch of different examples and it seems like it can't
get any of them right with new cue words, then there are a few things
you might consider as next steps:

1) Write your own rule-based analysis engine to follow the existing
assertion module and use some simple algorithm to link your cue words
with nearby entities.
2) Acquire training data and try to re-train the assertion module with
your cue word additions. I believe they used the i2b2 challenge 2010
concept assertion dataset which is available with a data use agreement.
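
As a rough illustration of option 1), here is a minimal stand-alone
sketch in plain Java. It is not cTAKES or UIMA code: the class
CueWindowNegation, the Entity type, the link method and the token window
are all hypothetical names and assumptions made up for this example. The
rule is simply that an entity counts as negated if one of your cue words
appears within a few tokens before it in the same sentence.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical stand-alone sketch; not part of cTAKES. */
final class CueWindowNegation {

    /** An entity mention as a token span [beginToken, endToken) in one sentence. */
    static final class Entity {
        final String text;
        final int beginToken;
        final int endToken;
        boolean negated;

        Entity(String text, int beginToken, int endToken) {
            this.text = text;
            this.beginToken = beginToken;
            this.endToken = endToken;
        }
    }

    /**
     * Marks an entity as negated when a cue word occurs among the
     * `window` tokens immediately before the entity's first token.
     */
    static void link(List<String> tokens, Set<String> cues,
                     List<Entity> entities, int window) {
        for (Entity e : entities) {
            int from = Math.max(0, e.beginToken - window);
            for (int i = from; i < e.beginToken; i++) {
                if (cues.contains(tokens.get(i).toLowerCase(Locale.ROOT))) {
                    e.negated = true;
                    break;
                }
            }
        }
    }

    public static void main(String[] args) {
        // Assumed input: one already-tokenized sentence and one entity span.
        List<String> tokens = Arrays.asList("patient", "denies", "any", "chest", "pain");
        Set<String> cues = new HashSet<>(Arrays.asList("denies", "without"));
        List<Entity> entities = new ArrayList<>();
        entities.add(new Entity("chest pain", 3, 5));

        link(tokens, cues, entities, 3);
        for (Entity e : entities) {
            System.out.println(e.text + " negated=" + e.negated);
        }
    }
}
```

In a real analysis engine the tokens and entities would come from the
CAS after the existing assertion module has run, and the window size
would need tuning on your own notes.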

Hope this helps,
Tim



On 01/17/2014 10:46 PM, digital paula wrote:
>
>
> Hello again cTAKES Community,
> I thought that adding the sentence splitter (with newline
> sentence-continuation recognition) would be as simple as adding the
> sectionizer annotator to the Eclipse environment was. I see from VJ's note
> that it's not that simple; my understanding is that the standard clinical
> pipeline requires the assertion and dependency parsers. I've explored a bit
> of the changes needed, and at least for assertion it looks like
> SentenceDetector, SentenceSpan, and likely the SingleDocumentProcessor from
> the MITRE jar will need to be modified to recognize multi-line sentences,
> so that the assertion and dependency parsers can be kept in the pipeline.
> I would love to devote the time needed to fix the sentence splitter to
> recognize multi-line sentences, but I need to focus on hacking my way
> through the cue word issue because I've been left in the lurch with no
> response to my posts  :-(((((
> Regards,
> Paula
>  
>> Date: Wed, 15 Jan 2014 14:53:17 -0500
>> Subject: Re: sentence splitter & forks/branches
>> From: [email protected]
>> To: [email protected]
>>
>> It is unfortunately not that trivial, as allowing newlines within sentences
>> requires changes to the assertion and dependency parser modules.
>>
>> If you're not using those AEs you could theoretically build the ytex
>> branch, and just add  ctakes-ytex-uima.jar and
>> ctakes-ytex-uima\desc\analysis_engine\SentenceDetectorAnnotator.xml to your
>> existing ctakes install (haven't tried it, but it should work).
>>
>> -vj
>>
>>
>> On Wed, Jan 15, 2014 at 1:57 PM, Lingren, Todd <[email protected]>wrote:
>>
>>> I have a general question about forks, specifically the YTEX branch that
>>> Vijay mentions.
>>> If I wanted to implement just the sentence splitter from YTEX into a
>>> currently existing 3.1 install, how would I do that? Is it possible? Or do
>>> I have to switch over completely to run from YTEX branch?
>>>
>>> Todd Lingren
>>> Biomedical Informatics
>>> Cincinnati Children's Hospital
>>> [email protected]
>>> 513-803-9032
>>>
>>>
>>> -----Original Message-----
>>> From: vijay garla [mailto:[email protected]]
>>> Sent: Wednesday, January 15, 2014 11:34 AM
>>> To: [email protected]
>>> Subject: Re: svn commit: r1551805 -
>>> /ctakes/branches/ytex/ctakes-assertion/src/main/java/org/apache/ctakes/assertion/medfacts/i2b2/api/CharacterOffsetToLineTokenConverterCtakesImpl.java
>>>
>>> The issue is indeed the sentence splitter - negation is limited to words
>>> within the sentence, and if newlines are considered sentence boundaries, it
>>> doesn't work properly (splitting on newlines breaks many other things as
>>> well).  The YTEX branch includes a sentence splitter that does not
>>> automatically split sentences on newlines.
>>>
>>> best,
>>>
>>> vj
>>>
>>>
>>> On Wed, Jan 15, 2014 at 10:03 AM, Masanz, James J. <[email protected]
>>>> wrote:
>>>> Hi Paula,
>>>>
>>>> The sentence detector in 3.1.0 and 3.1.1 (and previous releases)
>>>> assumes sentences don't cross line boundaries.
>>>> OpenNLP is used to find sentence breaks, but then if newlines are
>>>> found, those are also set (within cTAKES, not OpenNLP) to be sentence
>>>> breaks.
>>>> (just FYI I haven't had a chance to look at the ytex branch, which the
>>>> subject commit is about)
>>>>
>>>> -- James
>>>>
>>>> -----Original Message-----
>>>> From: [email protected] [mailto:
>>>> [email protected]] On Behalf Of
>>>> digital paula
>>>> Sent: Tuesday, January 14, 2014 10:25 PM
>>>> To: [email protected]
>>>> Subject: RE: svn commit: r1551805 -
>>>> /ctakes/branches/ytex/ctakes-assertion/src/main/java/org/apache/ctakes
>>>> /assertion/medfacts/i2b2/api/CharacterOffsetToLineTokenConverterCtakes
>>>> Impl.java
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Hello cTAKES Developer Community,
>>>> I'm a little behind on reading posts... this one is from last month.
>>>> I think this issue is already addressed in the current release? I'm
>>>> still running the previous release, 3.1.0.
>>>> I just noticed something interesting: the negation didn't take when the
>>>> negated term is on a different line. I removed all carriage returns
>>>> from the narratives, and negation picked it up as long as the text is
>>>> treated as one long string. To better explain what I mean, here are two
>>>> narrative comments:
>>>>
>>>> 1.  patient did not have diabetes
>>>> 2. patient did not have
>>>> diabetes
>>>>
>>>> Number 1 above got negated but number 2 did not. This might be related
>>>> to the issue with the sectionizer. I noticed that when I treated the
>>>> narrative as one string, the sectionizer never crashed with the NPE.
>>>> The sectionizer serves no purpose if the narrative is one string, but
>>>> this is helping me pinpoint the problem.
>>>>
>>>> Regards,
>>>> Paula
>>>>
>>>>
>>>>> Date: Thu, 19 Dec 2013 11:04:57 -0500
>>>>> Subject: Re: FW: svn commit: r1551805 -
>>>> /ctakes/branches/ytex/ctakes-assertion/src/main/java/org/apache/ctakes
>>>> /assertion/medfacts/i2b2/api/CharacterOffsetToLineTokenConverterCtakes
>>>> Impl.java
>>>>> From: [email protected]
>>>>> To: [email protected]
>>>>>
>>>>> Hi Pei,
>>>>>
>>>>> I'm not sure if that would solve the problem: the change in the ytex
>>>>> branch causes newlines to be ignored (i.e. not treated as a token).
>>>>> trunk's sentence splitter splits sentences on newlines, so newlines
>>>>> would never be found in a sentence.  However, if we had a reproducer
>>>>> we could check it fairly easily in the ytex branch.
>>>>>
>>>>> Best,
>>>>>
>>>>> VJ
>>>>>
>>>>>
>>>>> On Thu, Dec 19, 2013 at 10:15 AM, Chen, Pei
>>>>> <[email protected]>wrote:
>>>>>
>>>>>> Vj,
>>>>>> Do you think this is what was causing the NPE's [1]?
>>>>>> If so, shall we make the same fix in trunk?
>>>>>> --Pei
>>>>>>
>>>>>> [1]
>>>>>>
>>>> http://mail-archives.apache.org/mod_mbox/ctakes-dev/201309.mbox/%3C924
>>>> DE05C19409B438EB81DE683A942D9105A93CB%40CHEXMBX1A.CHBOSTON.ORG%3E
>>>>>> -----Original Message-----
>>>>>> From: [email protected] [mailto:[email protected]]
>>>>>> Sent: Tuesday, December 17, 2013 9:15 PM
>>>>>> To: [email protected]
>>>>>> Subject: svn commit: r1551805 -
>>>>>>
>>>> /ctakes/branches/ytex/ctakes-assertion/src/main/java/org/apache/ctakes
>>>> /assertion/medfacts/i2b2/api/CharacterOffsetToLineTokenConverterCtakes
>>>> Impl.java
>>>>>> Author: vjapache
>>>>>> Date: Wed Dec 18 02:14:13 2013
>>>>>> New Revision: 1551805
>>>>>>
>>>>>> URL: http://svn.apache.org/r1551805
>>>>>> Log:
>>>>>> add support for sentences that contain newline tokens.
>>>>>>
>>>>>> Modified:
>>>>>>
>>>>>>
>>>> ctakes/branches/ytex/ctakes-assertion/src/main/java/org/apache/ctakes/
>>>> assertion/medfacts/i2b2/api/CharacterOffsetToLineTokenConverterCtakesI
>>>> mpl.java
>>>>>> Modified:
>>>>>>
>>>> ctakes/branches/ytex/ctakes-assertion/src/main/java/org/apache/ctakes/
>>>> assertion/medfacts/i2b2/api/CharacterOffsetToLineTokenConverterCtakesI
>>>> mpl.java
>>>>>> URL:
>>>>>>
>>>> http://svn.apache.org/viewvc/ctakes/branches/ytex/ctakes-assertion/src
>>>> /main/java/org/apache/ctakes/assertion/medfacts/i2b2/api/CharacterOffs
>>>> etToLineTokenConverterCtakesImpl.java?rev=1551805&r1=1551804&r2=155180
>>>> 5&view=diff
>>>>>>
>>>> ======================================================================
>>>> ========
>>>>>> --- ctakes/branches/ytex/ctakes-assertion/src/main/java/org/apache/ctakes/assertion/medfacts/i2b2/api/CharacterOffsetToLineTokenConverterCtakesImpl.java (original)
>>>>>> +++ ctakes/branches/ytex/ctakes-assertion/src/main/java/org/apache/ctakes/assertion/medfacts/i2b2/api/CharacterOffsetToLineTokenConverterCtakesImpl.java Wed Dec 18 02:14:13 2013
>>>>>> @@ -32,8 +32,8 @@ import org.apache.uima.jcas.tcas.Annotat
>>>>>>  import org.mitre.medfacts.i2b2.api.ApiConcept;
>>>>>>  import org.mitre.medfacts.zoner.CharacterOffsetToLineTokenConverter;
>>>>>>  import org.mitre.medfacts.zoner.LineAndTokenPosition;
>>>>>> -
>>>>>>  import org.apache.ctakes.typesystem.type.syntax.BaseToken;
>>>>>> +import org.apache.ctakes.typesystem.type.syntax.NewlineToken;
>>>>>>  import org.apache.ctakes.typesystem.type.textspan.Sentence;
>>>>>>
>>>>>>  public class CharacterOffsetToLineTokenConverterCtakesImpl implements CharacterOffsetToLineTokenConverter
>>>>>> @@ -78,11 +78,13 @@ public class CharacterOffsetToLineTokenC
>>>>>>           for (Annotation current : annotationIndex)
>>>>>>           {
>>>>>>                   BaseToken bt = (BaseToken)current;
>>>>>> -                 int begin = bt.getBegin();
>>>>>> -                 int end = bt.getEnd();
>>>>>> -
>>>>>> -                 tokenBeginEndTreeSet.add(begin);
>>>>>> -                 tokenBeginEndTreeSet.add(end);
>>>>>> +                 // filter out NewlineToken
>>>>>> +                 if (!(bt instanceof NewlineToken)) {
>>>>>> +                         int begin = bt.getBegin();
>>>>>> +                         int end = bt.getEnd();
>>>>>> +                         tokenBeginEndTreeSet.add(begin);
>>>>>> +                         tokenBeginEndTreeSet.add(end);
>>>>>> +                 }
>>>>>>           }
>>>>>>    }
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>
>>>>
>>>>
>>>
>
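
The newline behavior discussed in the thread above (newlines treated as
sentence breaks in 3.1.x, versus a splitter that does not automatically
split on newlines) can be illustrated with a small stand-alone sketch in
plain Java. Everything here is an assumption made up for illustration:
the class NewlineSplitDemo and its methods are hypothetical, the
punctuation-based splitter is only a naive stand-in for OpenNLP and does
not use its API, and the substring check is a crude proxy for "negation
is limited to words within the sentence".

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-alone sketch; not cTAKES or OpenNLP code. */
final class NewlineSplitDemo {

    /** Naive stand-in for a sentence detector: split after ., ! or ?. */
    static List<String> splitOnPunctuation(String text) {
        List<String> out = new ArrayList<>();
        for (String s : text.split("(?<=[.!?])\\s+")) {
            if (!s.trim().isEmpty()) {
                out.add(s.trim());
            }
        }
        return out;
    }

    /**
     * Splits on punctuation first. If newlineIsBoundary is true, every
     * newline inside a sentence becomes an additional sentence break;
     * otherwise the newline stays inside the sentence.
     */
    static List<String> split(String text, boolean newlineIsBoundary) {
        List<String> out = new ArrayList<>();
        for (String s : splitOnPunctuation(text)) {
            if (!newlineIsBoundary) {
                out.add(s);
                continue;
            }
            for (String line : s.split("\\r?\\n")) {
                if (!line.trim().isEmpty()) {
                    out.add(line.trim());
                }
            }
        }
        return out;
    }

    /** True if some single sentence contains both the cue and the entity. */
    static boolean cueAndEntityShareSentence(List<String> sentences,
                                             String cue, String entity) {
        for (String s : sentences) {
            if (s.contains(cue) && s.contains(entity)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String note = "patient did not have\ndiabetes";
        for (boolean newlineIsBoundary : new boolean[] {true, false}) {
            List<String> sentences = split(note, newlineIsBoundary);
            System.out.println("newlineIsBoundary=" + newlineIsBoundary
                    + " sentences=" + sentences.size()
                    + " cueReachesEntity="
                    + cueAndEntityShareSentence(sentences, "not", "diabetes"));
        }
    }
}
```

With the newline treated as a boundary, the cue "not" and the entity
"diabetes" end up in different sentences, which matches Paula's second
example not being negated.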
