Re: NPE after upgrade in DefaultJCASTermAnnotator [EXTERNAL]

2017-10-02 Thread Jeff Headley
Thank you Sean. That helped to figure out what we did. Not quite sure where
we went wrong but now at least we know the cause. So a long time ago in our
project using ctakes, we emptied out the
tables CUI_TERMS, RXNORM, PREFTERM, and TUI and then loaded them with the
values we wanted. Worked great. Now in the new version
the 
/desc/ctakes-clinical-pipeline/desc/analysis_engine/AggregatePlaintextFastUMLSProcessor.xml
engine seems to be
using 
/resources/org/apache/ctakes/dictionary/lookup/fast/sno_rx_16ab/sno_rx_16ab
and that seems to be where things went sideways. If I don't mess with the
db and keep the original, no errors.

So somewhere in this if statement at line 102 in DefaultJCasTermAnnotator:
if ( hitTokens[ hit ].equals( allTokens.get( i ).getText() )
  || hitTokens[ hit ].equals( allTokens.get( i ).getVariant() )
) {

It's expecting to never have a null, and I suspect we are leaving
something null somewhere that really shouldn't be null. If it's obvious
where I've gone wrong, the assistance would be appreciated. Otherwise, I'll
get it figured out eventually. I suspect it's possibly because we never did
anything with the SNOMEDCT_US in the prior version.
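
A note on the condition quoted above: String.equals(null) returns false rather than throwing, so a null coming back from getText() or getVariant() would not by itself raise the NPE. The null is more likely the receiver, i.e. hitTokens[ hit ] (a dictionary token) or the element returned by allTokens.get( i ). The following is a minimal sketch of a null-tolerant comparison, not the cTAKES code; the class NullSafeTermMatch and its method signature are assumptions made for illustration.

```java
public class NullSafeTermMatch {

    /**
     * Returns true when the dictionary token equals the text or the variant
     * of a document token. Any argument may be null; a null dictionary token
     * never matches. (Hypothetical helper, not part of cTAKES.)
     */
    static boolean isTermMatch(String hitToken, String text, String variant) {
        if (hitToken == null) {
            // This is the case that would throw in hitToken.equals(...).
            return false;
        }
        // String.equals(null) is false, so null text or variant is safe here.
        return hitToken.equals(text) || hitToken.equals(variant);
    }

    public static void main(String[] args) {
        System.out.println(isTermMatch("didanosine", "didanosine", null)); // prints true
        System.out.println(isTermMatch(null, "didanosine", "didanosine")); // prints false
    }
}
```

A guard like this only hides the symptom. If the custom tables do contain null or empty term text, cleaning those rows is probably the better fix.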

On Mon, Oct 2, 2017 at 10:47 AM, Finan, Sean <
sean.fi...@childrens.harvard.edu> wrote:

> Hi Jeff,
>
> I have no problem running on your example "DIDANOSINE, 250MG (PO Capsule
> Delayed Release)" or any other text.
>
> I don't know how you are running ctakes through com.clientproject.ctakes.
> processors.CommandLineProcessor, so I don't know how closely the standard
> pipeline approximates yours.
>
> Sean
>
> -Original Message-
> From: Jeff Headley [mailto:jeffun...@gmail.com]
> Sent: Sunday, October 01, 2017 11:31 PM
> To: dev@ctakes.apache.org
> Subject: NPE after upgrade in DefaultJCASTermAnnotator [EXTERNAL]
>
> After upgrading our project to version 4, we are getting a NPE from cTAKES.
> The text that was being processed was DIDANOSINE, 250MG (PO Capsule
> Delayed Release), though it seems to be happening to us no matter what text
> we submit.  The stack trace is below. Any help would be appreciated as I'm
> at a loss as to what we might be doing wrong if this is not a bug in cTAKES.
>
> Thank you,
> Jeff
>
> Oct 01, 2017 11:10:16 PM
> org.apache.uima.analysis_engine.impl.AggregateAnalysisEngine_impl
> processAndOutputNewCASes(273)
> SEVERE: Exception occurred
> org.apache.uima.analysis_engine.AnalysisEngineProcessException: Annotator
> processing failed.
> at
> org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.
> callAnalysisComponentProcess(PrimitiveAnalysisEngine_impl.java:412)
> at
> org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.
> processAndOutputNewCASes(PrimitiveAnalysisEngine_impl.java:314)
> at
> org.apache.uima.analysis_engine.asb.impl.ASB_impl$AggregateCasIterator.
> processUntilNextOutputCas(ASB_impl.java:570)
> at
> org.apache.uima.analysis_engine.asb.impl.ASB_impl$
> AggregateCasIterator.<init>(ASB_impl.java:412)
> at
> org.apache.uima.analysis_engine.asb.impl.ASB_impl.
> process(ASB_impl.java:344)
> at
> org.apache.uima.analysis_engine.impl.AggregateAnalysisEngine_impl.
> processAndOutputNewCASes(AggregateAnalysisEngine_impl.java:265)
> at
> org.apache.uima.analysis_engine.impl.AnalysisEngineImplBase.process(
> AnalysisEngineImplBase.java:269)
> at
> org.apache.uima.analysis_engine.impl.AnalysisEngineImplBase.process(
> AnalysisEngineImplBase.java:284)
> at
> com.clientproject.ctakes.processors.CommandLineProcessor.processLine(
> CommandLineProcessor.java:163)
> at
> java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.
> java:1374)
> at
> java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.
> java:580)
> at
> com.clientproject.ctakes.processors.CommandLineProcessor.run(
> CommandLineProcessor.java:114)
> at com.clientproject.ctakes.App.main(App.java:109)
> Caused by: java.lang.NullPointerException at
> org.apache.ctakes.dictionary.lookup2.ae.DefaultJCasTermAnnotator.
> isTermMatch(DefaultJCasTermAnnotator.java:102)
> at
> org.apache.ctakes.dictionary.lookup2.ae.DefaultJCasTermAnnotator.
> findTerms(DefaultJCasTermAnnotator.java:79)
> at
> org.apache.ctakes.dictionary.lookup2.ae.AbstractJCasTermAnnotator.
> findTerms(AbstractJCasTermAnnotator.java:236)
> at
> org.apache.ctakes.dictionary.lookup2.ae.AbstractJCasTermAnnotator.
> processWindow(AbstractJCasTermAnnotator.java:219)
> at
> org.apache.ctakes.dictionary.lookup2.ae.AbstractJCasTermAnnotator.process(
> AbstractJCasTermAnnotator.java:156)
> at
> org.apache.uima.analysis_component.JCasAnnotator_ImplBase.process(
> JCasAnnotator_ImplBase.java:48)
> at
> org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.
> callAnalysisComponentProcess(PrimitiveAnalysisEngine_impl.java:396)
> ... 12 more
>


Re: Missing resources for script that extracts markables from a corpus for analysis [EXTERNAL]

2017-10-02 Thread Miller, Timothy
Yeah, it might be nice to build a lucene index of all the sample notes in the 
ctakes-example module. I'll create a jira for it but probably won't be able to 
get to it right away.
Tim


From: Alexandru Zbarcea 
Sent: Monday, October 2, 2017 5:31 PM
To: Apache cTAKES Dev
Subject: Re: Missing resources for script that extracts markables from a corpus 
for analysis [EXTERNAL]

Hi Tim,

I understand, makes sense. Is it possible to anonymize the data you have or
come up with a separate body of test data to generate a Lucene index and
unit test the code? I think this would have the double benefit of the code
being tested and showing dev/users how the code is supposed to be used.

What do you think?

Alex


On Mon, Oct 2, 2017 at 9:45 AM, Miller, Timothy <
timothy.mil...@childrens.harvard.edu> wrote:

> Thanks Alex,
> This code is for processing a clinical text data corpus stored as a
> lucene index -- data that cannot be redistributed for privacy reasons.
> Since it's so related to the coref stuff I thought it should go
> alongside the coreference module. But maybe it makes more sense as an
> external project since it can't really function without externally
> created resources -- what do you think?
> Tim
>
>
> On Sun, 2017-10-01 at 19:54 -0400, Alexandru Zbarcea wrote:
> > Hi,
> >
> > I was trying to do a UTest for the
> > org.apache.ctakes.coreference.data.PrintMimicMarkables (recently
> > added),
> > but I couldn't find any of the existing resources that can be used
> > for
> > this. Can anyone help me pointing to a resource (Lucene index)
> > folder.
> >
> > org.apache.ctakes.coreference.data.PrintMimicMarkables \
> >
> > /home/alex/projects/apache/ctakes/ctakes-dictionary-lookup-
> > res/target/classes/org/apache/ctakes/dictionary/lookup/rxnorm_index
> > \
> > index.out
> >
> > I was trying with the following lucene folder/resource:
> > ./ctakes-coreference-
> > res/src/main/resources/org/apache/ctakes/coreference/models/index_med
> > _5k
> >
> > And also the dictionaries:
> > ./ctakes-dictionary-lookup-
> > res/src/main/resources/org/apache/ctakes/dictionary/lookup/snomed-
> > like_codes_sample
> > ./ctakes-dictionary-lookup-
> > res/src/main/resources/org/apache/ctakes/dictionary/lookup/assertion_
> > cue_phrase_index
> > ./ctakes-dictionary-lookup-
> > res/src/main/resources/org/apache/ctakes/dictionary/lookup/OrangeBook
> > ./ctakes-dictionary-lookup-
> > res/src/main/resources/org/apache/ctakes/dictionary/lookup/snomed-
> > like_sample
> > ./ctakes-dictionary-lookup-
> > res/src/main/resources/org/apache/ctakes/dictionary/lookup/drug_index
> >
> > Any execution looks like:
> > 01 Oct 2017 19:50:19  INFO ConstituencyParser - Initializing
> > parser...
> > Oct 01, 2017 7:50:20 PM
> > org.apache.uima.collection.impl.cpm.engine.ArtifactProducer process
> > WARNING: Got Exception. (Thread Name: [CollectionReader Thread]::)
> > Message:
> > docID must be >= 0 and < maxDoc=5000 (got docID=5000)
> > Oct 01, 2017 7:50:20 PM
> > org.apache.uima.collection.impl.cpm.engine.ArtifactProducer run(820)
> > WARNING: docID must be >= 0 and < maxDoc=5000 (got docID=5000)
> > java.lang.IllegalArgumentException: docID must be >= 0 and <
> > maxDoc=5000
> > (got docID=5000)
> > at
> > org.apache.lucene.index.BaseCompositeReader.readerIndex(BaseComposite
> > Reader.java:152)
> > at
> > org.apache.lucene.index.BaseCompositeReader.document(BaseCompositeRea
> > der.java:115)
> > at org.apache.lucene.index.IndexReader.document(IndexReader.java:436)
> > at
> > org.apache.ctakes.core.cr.LuceneCollectionReader.getNext(LuceneCollec
> > tionReader.java:90)
> > at
> > org.apache.uima.collection.impl.cpm.engine.ArtifactProducer.readNext(
> > ArtifactProducer.java:494)
> > at
> > org.apache.uima.collection.impl.cpm.engine.ArtifactProducer.run(Artif
> > actProducer.java:711)
> >
> > Collection process complete called, closing file writer.
> >
> > I appreciate any of your help,
> > Alex


Re: Missing resources for script that extracts markables from a corpus for analysis [EXTERNAL]

2017-10-02 Thread Alexandru Zbarcea
Hi Tim,

I understand, makes sense. Is it possible to anonymize the data you have or
come up with a separate body of test data to generate a Lucene index and
unit test the code? I think this would have the double benefit of the code
being tested and showing dev/users how the code is supposed to be used.

What do you think?

Alex


On Mon, Oct 2, 2017 at 9:45 AM, Miller, Timothy <
timothy.mil...@childrens.harvard.edu> wrote:

> Thanks Alex,
> This code is for processing a clinical text data corpus stored as a
> lucene index -- data that cannot be redistributed for privacy reasons.
> Since it's so related to the coref stuff I thought it should go
> alongside the coreference module. But maybe it makes more sense as an
> external project since it can't really function without externally
> created resources -- what do you think?
> Tim
>
>
> On Sun, 2017-10-01 at 19:54 -0400, Alexandru Zbarcea wrote:
> > Hi,
> >
> > I was trying to do a UTest for the
> > org.apache.ctakes.coreference.data.PrintMimicMarkables (recently
> > added),
> > but I couldn't find any of the existing resources that can be used
> > for
> > this. Can anyone help me pointing to a resource (Lucene index)
> > folder.
> >
> > org.apache.ctakes.coreference.data.PrintMimicMarkables \
> >
> > /home/alex/projects/apache/ctakes/ctakes-dictionary-lookup-
> > res/target/classes/org/apache/ctakes/dictionary/lookup/rxnorm_index
> > \
> > index.out
> >
> > I was trying with the following lucene folder/resource:
> > ./ctakes-coreference-
> > res/src/main/resources/org/apache/ctakes/coreference/models/index_med
> > _5k
> >
> > And also the dictionaries:
> > ./ctakes-dictionary-lookup-
> > res/src/main/resources/org/apache/ctakes/dictionary/lookup/snomed-
> > like_codes_sample
> > ./ctakes-dictionary-lookup-
> > res/src/main/resources/org/apache/ctakes/dictionary/lookup/assertion_
> > cue_phrase_index
> > ./ctakes-dictionary-lookup-
> > res/src/main/resources/org/apache/ctakes/dictionary/lookup/OrangeBook
> > ./ctakes-dictionary-lookup-
> > res/src/main/resources/org/apache/ctakes/dictionary/lookup/snomed-
> > like_sample
> > ./ctakes-dictionary-lookup-
> > res/src/main/resources/org/apache/ctakes/dictionary/lookup/drug_index
> >
> > Any execution looks like:
> > 01 Oct 2017 19:50:19  INFO ConstituencyParser - Initializing
> > parser...
> > Oct 01, 2017 7:50:20 PM
> > org.apache.uima.collection.impl.cpm.engine.ArtifactProducer process
> > WARNING: Got Exception. (Thread Name: [CollectionReader Thread]::)
> > Message:
> > docID must be >= 0 and < maxDoc=5000 (got docID=5000)
> > Oct 01, 2017 7:50:20 PM
> > org.apache.uima.collection.impl.cpm.engine.ArtifactProducer run(820)
> > WARNING: docID must be >= 0 and < maxDoc=5000 (got docID=5000)
> > java.lang.IllegalArgumentException: docID must be >= 0 and <
> > maxDoc=5000
> > (got docID=5000)
> > at
> > org.apache.lucene.index.BaseCompositeReader.readerIndex(BaseComposite
> > Reader.java:152)
> > at
> > org.apache.lucene.index.BaseCompositeReader.document(BaseCompositeRea
> > der.java:115)
> > at org.apache.lucene.index.IndexReader.document(IndexReader.java:436)
> > at
> > org.apache.ctakes.core.cr.LuceneCollectionReader.getNext(LuceneCollec
> > tionReader.java:90)
> > at
> > org.apache.uima.collection.impl.cpm.engine.ArtifactProducer.readNext(
> > ArtifactProducer.java:494)
> > at
> > org.apache.uima.collection.impl.cpm.engine.ArtifactProducer.run(Artif
> > actProducer.java:711)
> >
> > Collection process complete called, closing file writer.
> >
> > I appreciate any of your help,
> > Alex


LuceneDictionaryImpl for cTAKES-4.0.0

2017-10-02 Thread Iker Huerga
Hi,

We have a custom, decent-sized dictionary (~1.4M concepts) in a Lucene index.

I'd like to have an implementation of AbstractJCasTermAnnotator, e.g.
DefaultJCas, finding terms from the Lucene index directly. I can think of
two options, but I'd like to get everyone's input:

1- Create an hsql db containing a dictionary using an approach similar
to org.apache.ctakes.gui.dictionary.DictionaryBuilder and then some sort of
LuceneConceptFactory extending AbstractConceptFactory

2- Creating a new Dictionary Lookup, e.g. LuceneJCasTermAnnotation, similar
to DefaultJCasTermAnnotator with the signature of the findTerms method
something like this

void findTerms( IndexSearcher searcher, List allTokens)

I've seen that for cTAKES v3 there was something similar in
the LuceneDictionaryImpl, but that doesn't seem to work with the Fast
Dictionary Lookup approach for cTAKES-4.0.0.

Thanks in advance for any ideas or suggestions!
Iker
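
Option 2 could be shaped roughly as follows. This is a sketch under stated assumptions, not the cTAKES or Lucene API: TermIndex, Hit, inMemory and this findTerms signature are hypothetical, and the in-memory map stands in for whatever query an IndexSearcher would run. Candidates are keyed on their first token purely for simplicity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class IndexBackedLookupSketch {

    /** Hypothetical stand-in for an index query: candidate terms by first token. */
    interface TermIndex {
        List<String[]> candidatesStartingWith(String firstToken);
    }

    /** A matched term: token span [begin, end) and the matched term text. */
    static final class Hit {
        final int begin;
        final int end;
        final String term;

        Hit(int begin, int end, String term) {
            this.begin = begin;
            this.end = end;
            this.term = term;
        }
    }

    /** Scans the window token by token and verifies each candidate in place. */
    static List<Hit> findTerms(TermIndex index, List<String> allTokens) {
        List<Hit> hits = new ArrayList<>();
        for (int i = 0; i < allTokens.size(); i++) {
            for (String[] candidate : index.candidatesStartingWith(allTokens.get(i))) {
                if (matchesAt(candidate, allTokens, i)) {
                    hits.add(new Hit(i, i + candidate.length, String.join(" ", candidate)));
                }
            }
        }
        return hits;
    }

    /** True when every candidate token equals the window token at the same offset. */
    static boolean matchesAt(String[] candidate, List<String> tokens, int start) {
        if (start + candidate.length > tokens.size()) {
            return false;
        }
        for (int k = 0; k < candidate.length; k++) {
            if (!candidate[k].equals(tokens.get(start + k))) {
                return false;
            }
        }
        return true;
    }

    /** In-memory TermIndex used only to exercise the sketch. */
    static TermIndex inMemory(String... terms) {
        Map<String, List<String[]>> byFirst = new HashMap<>();
        for (String term : terms) {
            String[] tokens = term.split(" ");
            byFirst.computeIfAbsent(tokens[0], key -> new ArrayList<>()).add(tokens);
        }
        return first -> byFirst.getOrDefault(first, Collections.emptyList());
    }

    public static void main(String[] args) {
        TermIndex index = inMemory("hepatocellular carcinoma", "thalomid");
        List<String> window = Arrays.asList("treatment", "of", "hepatocellular", "carcinoma");
        for (Hit hit : findTerms(index, window)) {
            System.out.println(hit.begin + "-" + hit.end + " " + hit.term);
        }
    }
}
```

With ~1.4M concepts, issuing one index query per token may be slow, so caching or batching the candidate lookups would be worth measuring before committing to this shape.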


Re: CTAKES-460: coreference Test should not be part of main [EXTERNAL]

2017-10-02 Thread Alexandru Zbarcea
Thank you Tim

Alex


On Oct 2, 2017 10:43, "Miller, Timothy" <
timothy.mil...@childrens.harvard.edu> wrote:

Thanks Alex, I've committed this patch.
I unfortunately looked at the wrong tab when typing my commit message
and committed it with the wrong issue number (459).

Tim

On Mon, 2017-10-02 at 08:17 -0400, Alexandru Zbarcea wrote:
> Hi,
>
> I have refactored a main class that should have been a UTest.
> https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.or
> g_jira_browse_CTAKES-
> 2D460=DwIBaQ=qS4goWBT7poplM69zy_3xhKwEW14JZMSdioCoppxeFU=Heup-
> IbsIg9Q1TPOylpP9FE4GTK-
> OqdTDRRNQXipowRLRjx0ibQrHEo8uYx6674h=T0fckwyf1n_TXQgdwCR5YlQItLlxMx
> 9nU_S5EUx1Iu0=f5ZcQqm3Dbk91cdhymh20-kg5cyZGoHPFjK0x9ZH32k=
>
> This moves the test code from src/main to src/test and also adds some
> refactoring.
>
> No impact. Can easily be merged.
>
> Alex


RE: Enabling drugner pipeline and identifying dates [EXTERNAL] [SUSPICIOUS] [SUSPICIOUS]

2017-10-02 Thread Finan, Sean
Hi Tim,

The coreference question (just a question) was for a different item altogether. 
 Sorry for any confusion.  The reason that I CC:d you ...

From Gandhi:
> Interestingly even I was able to generate [Sean's coref output] using  piper 
> GUI by  having only that single line - " The patient started study treatment 
> of Thalomid 200mg (days 1-21), and Epirubicin, 20 mg/m2 (days 1, 8, and 15) 
> on 06/07/02 for the treatment of hepatocellular carcinoma. " in the input 
> file.
> But when I change the input file content with the following lines:   [Full
> paragraph (below), single-sentence in middle]  The co-reference superscript is
> lost by then.

Sean's answer:
> Ctakes is a system with many moving parts.  Things that precede or follow
> your original example sentence will change the evaluation of that sentence.
> With the pipeline you are using and the full note, you should see a number
> (mine is 4) next to the first "thalomid" in the original example sentence.  If
> you click that number you should see (to the right) 4 instances of "thalomid".
> Tim can correct me here, but maybe the coreference module ranked the links
> between "thalomid" as much higher than the rank between "study treatment of
> thalomid 200mg" and "the treatment of hepatocellular carcinoma" and discarded
> the encapsulating treatment texts from markables?  It is probably more complex
> than that.

Sean

"This patient is participating in a Non-IND study; Protocol CG-000424: "Phase 
I/II of Thalidomide and Epirubicin in Patients with Unresectable or Metastatic 
Hepatocellular Carcinoma". Information has been received from the investigator
regarding an 82 year-old male patient who had gastrointestinal bleeding while 
on Thalomid, Epirubicin, and Coumadin. He had a past medical history of 
diverticulosis in 03/02 and a right atrial clot from intraventricular catheter 
(IVC) for which he was started on Coumadin. During the hospitalization for a 
right atrial clot in 03/02 hepatocellular carcinoma was first noted and he was 
referred to an oncologist.  The patient started study treatment of Thalomid 
200mg (days 1-21), and Epirubicin, 20 mg/m2 (days 1, 8, and 15) on 06/07/02 for 
the treatment of hepatocellular carcinoma.  He was concomitantly receiving 
Cardura, Ambien (for insomnia), Megace, Coumadin, and Oxycodone. This patient 
presented to the emergency room with the chief complaint of hematochezia. He 
reported noticing bright red blood and small clots mixed in with his stool. On 
07/13/02, he was admitted due to gastrointestinal bleed.  The physician ordered 
2 large bore intravenous lines and planned to transfuse for hematocrit less 
than 30%. Due to the  INR (international normalized ratio) level of 3.0, 
Coumadin was held. He was also noted to have bilateral lower extremity edema 
with dyspnea on exertion.  On 07/13/02, he had a chest X-ray PA and lateral 
done that showed no evidence of acute pneumonia or congestive heart failure.  
On 07/14/02, he underwent  an ultrasound which was negative for deep vein 
thrombosis. This patient did not take Thalomid on the day of his admittance to 
the hospital, but resumed treatment shortly after with no return of symptoms. 
On 07/15/02, he was discharged in stable condition. There have been no further 
reports of bleeding at this time. The doctor has assessed the hematochezia as
related to Coumadin treatment and previously diagnosed diverticulosis, and not 
to protocol therapy with Thalomid and Epirubicin. Additional information
received from the investigator on 27Aug02 reveals that this male patient began 
on 07Jun02 two cycles of therapy with Thalidomide and Epirubicin.  His post 
cycle two computed tomography scans revealed increase in size of liver lesion 
with development of multiple new satellite nodules.  On 29Jul02, the 
investigator removed this patient from protocol for progressive disease and 
recommended hospice care.  After seeking a second opinion from two other 
institutions, this patient was admitted to hospice on 05Aug02.  On 20Aug02, the 
investigator noted that this patient was suffering worsening fatigue and got 
tired getting out of his chair.  On 25Aug02, this patient died due to disease 
progression.  The investigator assessed the death as not related to study 
treatment and expected"




-Original Message-
From: Miller, Timothy [mailto:timothy.mil...@childrens.harvard.edu] 
Sent: Monday, October 02, 2017 10:36 AM
To: dev@ctakes.apache.org
Subject: Re: Enabling drugner pipeline and identifying dates [EXTERNAL] 
[SUSPICIOUS] [SUSPICIOUS]

My bad, I didn't read too closely and thought this was going to be a
coreference patch. I don't know this FSM code that well, so I am not an
expert. My biggest concern at a glance: these additions help find more
true positives (as in your examples), but can we verify that they
won't create false positives?
Tim


On Fri, 2017-09-29 at 06:25 +, Gandhi Rajan Natarajan wrote:
> Hi Sean,
>
> Thanks again for

Re: CTAKES-460: coreference Test should not be part of main [EXTERNAL]

2017-10-02 Thread Miller, Timothy
Thanks Alex, I've committed this patch.
I unfortunately looked at the wrong tab when typing my commit message
and committed it with the wrong issue number (459).

Tim

On Mon, 2017-10-02 at 08:17 -0400, Alexandru Zbarcea wrote:
> Hi,
> 
> I have refactored a main class that should have been a UTest.
> https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.or
> g_jira_browse_CTAKES-
> 2D460=DwIBaQ=qS4goWBT7poplM69zy_3xhKwEW14JZMSdioCoppxeFU=Heup-
> IbsIg9Q1TPOylpP9FE4GTK-
> OqdTDRRNQXipowRLRjx0ibQrHEo8uYx6674h=T0fckwyf1n_TXQgdwCR5YlQItLlxMx
> 9nU_S5EUx1Iu0=f5ZcQqm3Dbk91cdhymh20-kg5cyZGoHPFjK0x9ZH32k= 
> 
> This moves the test code from src/main to src/test and also adds some
> refactoring.
> 
> No impact. Can easily be merged.
> 
> Alex

Re: Enabling drugner pipeline and identifying dates [EXTERNAL] [SUSPICIOUS]

2017-10-02 Thread Miller, Timothy
My bad, I didn't read too closely and thought this was going to be a
coreference patch. I don't know this FSM code that well, so I am not an
expert. My biggest concern at a glance: these additions help find more
true positives (as in your examples), but can we verify that they
won't create false positives?
Tim
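
One way to approach that question is to run any new pattern against examples that must not match as well as ones that must. The sketch below uses a regular expression rather than the cTAKES FSM classes, so it only illustrates the check; DosePerAreaCheck, its unit list and the choice of sample strings (taken from the note quoted in this thread) are assumptions.

```java
import java.util.regex.Pattern;

public class DosePerAreaCheck {

    // A number, an optional space, a mass unit, a slash, then a per-area
    // or per-weight unit. The unit list is illustrative, not exhaustive.
    private static final Pattern DOSE =
            Pattern.compile("\\d+(?:\\.\\d+)?\\s?(?:mg|mcg|g)/(?:m2|kg)");

    /** True only when the whole string is a dose such as "20 mg/m2". */
    static boolean isDose(String text) {
        return text != null && DOSE.matcher(text).matches();
    }

    public static void main(String[] args) {
        String[] shouldMatch = { "20 mg/m2" };
        // Dates and plain doses from the same note must stay unmatched.
        String[] shouldNotMatch = { "06/07/02", "200mg", "1-21" };

        for (String s : shouldMatch) {
            System.out.println(s + " -> " + isDose(s));
        }
        for (String s : shouldNotMatch) {
            System.out.println(s + " -> " + isDose(s));
        }
    }
}
```

The same idea carries over to a unit test around the modified FSM: keep a list of strings that previously produced no measurement annotation and assert that they still produce none.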


On Fri, 2017-09-29 at 06:25 +, Gandhi Rajan Natarajan wrote:
> Hi Sean,
> 
> Thanks again for the response. I guess it's a mistake on my side that
> I didn't send the complete text. Did you mean that with the text I
> sent, the co-reference superscript-1 will be lost?
> 
> Also as per your advice, We have created an issue  - https://urldefen
> se.proofpoint.com/v2/url?u=https-
> 3A__issues.apache.org_jira_browse_CTAKES-
> 2D459=DwIFAg=qS4goWBT7poplM69zy_3xhKwEW14JZMSdioCoppxeFU=Heup-
> IbsIg9Q1TPOylpP9FE4GTK-
> OqdTDRRNQXipowRLRjx0ibQrHEo8uYx6674h=0kLxqu0Xu_2pjzCrVwxC4cd_1ubh_g
> nqCIxz6hOzUUQ=Tihsi1dyNHsqsYbwyClGANfqk2Ov2nfQL2YuIV1L0CI=   for
> measurement FSM changes and attached the modified file changes. Could
> someone have a look and let us know your thoughts, please?
> 
> Regards,
> Gandhi
> 
> 
> -Original Message-
> From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu]
> Sent: Thursday, September 28, 2017 8:21 PM
> To: dev@ctakes.apache.org
> Cc: Miller, Timothy 
> Subject: RE: Enabling drugner pipeline and identifying dates
> [EXTERNAL] [SUSPICIOUS]
> 
> Hi Gandhi,
> 
> I don't recall you sending me that entire snippet of text.  I think
> that I only had your single example sentence.
> You have discovered one of the quirks of software: "change the data,
> change the result."
> Ctakes is a system with many moving parts.  Things that precede or
> follow your original example sentence will change the evaluation of
> that sentence.
> With the pipeline you are using and the full note, you should see a
> number (mine is 4) next to the first "thalomid" in the original
> example sentence.  If you click that number you should see (to the
> right) 4 instances of "thalomid".
> Tim can correct me here, but maybe the coreference module ranked the
> links between "thalomid" as much higher than the rank between "study
> treatment of thalomid 200mg" and "the treatment of hepatocellular
> carcinoma" and discarded the encapsulating treatment texts from
> markables?  It is probably more complex than that.
> 
> > 
> > we have also made some code changes in MeasurementFSM.java to
> > identify certain measurements like '20 mg/m2' which was not
> > identified out of the box.  Should we send the code changes to you
> > so that you can consider the same to be productized ? Please
> > advise."
> I don't know if you've noticed the recent emails on the dev list
> involving Alexandru Zbarcea.  Alex has been creating or commenting on
> Jira items and attaching code for  fixes and enhancements.  This is a
> widely used process and is fairly easy to follow.   I think that the
> following links are relevant:
> Working with issues:  https://urldefense.proofpoint.com/v2/url?u=http
> s-3A__confluence.atlassian.com_jiracoreserver073_working-2Dwith-
> 2Dissues-
> 2D861257307.html=DwIFAg=qS4goWBT7poplM69zy_3xhKwEW14JZMSdioCoppxe
> FU=Heup-IbsIg9Q1TPOylpP9FE4GTK-
> OqdTDRRNQXipowRLRjx0ibQrHEo8uYx6674h=0kLxqu0Xu_2pjzCrVwxC4cd_1ubh_g
> nqCIxz6hOzUUQ=Fo-LGlsEfYJpgYcWvrDmor0B3YGxx5brZLelntVMxrU= 
> Creating patches:   https://urldefense.proofpoint.com/v2/url?u=https-
> 3A__confluence.atlassian.com_crucible_creating-2Dpatch-2Dfiles-2Dfor-
> 2Dpre-2Dcommit-2Dreviews-
> 2D298977458.html=DwIFAg=qS4goWBT7poplM69zy_3xhKwEW14JZMSdioCoppxe
> FU=Heup-IbsIg9Q1TPOylpP9FE4GTK-
> OqdTDRRNQXipowRLRjx0ibQrHEo8uYx6674h=0kLxqu0Xu_2pjzCrVwxC4cd_1ubh_g
> nqCIxz6hOzUUQ=wVhEQCU73iEplHm34bO2AtgaDUpjAvrFe4GFx5b6pYo= 
> Attaching files:   https://urldefense.proofpoint.com/v2/url?u=https-3
> A__confluence.atlassian.com_jiracorecloud_attaching-2Dfiles-2Dand-
> 2Dscreenshots-2Dto-2Dissues-
> 2D765593805.html=DwIFAg=qS4goWBT7poplM69zy_3xhKwEW14JZMSdioCoppxe
> FU=Heup-IbsIg9Q1TPOylpP9FE4GTK-
> OqdTDRRNQXipowRLRjx0ibQrHEo8uYx6674h=0kLxqu0Xu_2pjzCrVwxC4cd_1ubh_g
> nqCIxz6hOzUUQ=eO_HZCkkeOg8jF3CMYnMxttXRHSM16qdwPl5nTW48zQ= 
> 
> I don't know if you have a jira account and permissions for the
> ctakes project.  An administrator may need to set that up for you.
> 
> Thanks,
> Sean
> 
> -Original Message-
> From: Gandhi Rajan Natarajan [mailto:gandhi.natara...@arisglobal.com]
> Sent: Thursday, September 28, 2017 4:09 AM
> To: dev@ctakes.apache.org
> Subject: RE: Enabling drugner pipeline and identifying dates
> [EXTERNAL] [SUSPICIOUS]
> 
> Hi Sean,
> 
> Thanks for the response. I was able to see the co-reference
> superscript using the html file that you sent. Interestingly even I
> was able to generate the sample HTML using  piper GUI by  having only
> that single line - " The patient started study treatment of Thalomid
> 200mg (days 1-21), and Epirubicin, 20 mg/m2 (days 1, 8, and 15) on
> 06/07/02 for the 

Re: Missing resources for script that extracts markables from a corpus for analysis [EXTERNAL]

2017-10-02 Thread Miller, Timothy
Thanks Alex,
This code is for processing a clinical text data corpus stored as a
lucene index -- data that cannot be redistributed for privacy reasons.
Since it's so related to the coref stuff I thought it should go
alongside the coreference module. But maybe it makes more sense as an
external project since it can't really function without externally
created resources -- what do you think?
Tim


On Sun, 2017-10-01 at 19:54 -0400, Alexandru Zbarcea wrote:
> Hi,
> 
> I was trying to do a UTest for the
> org.apache.ctakes.coreference.data.PrintMimicMarkables (recently
> added),
> but I couldn't find any of the existing resources that can be used
> for
> this. Can anyone help me pointing to a resource (Lucene index)
> folder.
> 
> org.apache.ctakes.coreference.data.PrintMimicMarkables \
> 
> /home/alex/projects/apache/ctakes/ctakes-dictionary-lookup-
> res/target/classes/org/apache/ctakes/dictionary/lookup/rxnorm_index
> \
> index.out
> 
> I was trying with the following lucene folder/resource:
> ./ctakes-coreference-
> res/src/main/resources/org/apache/ctakes/coreference/models/index_med
> _5k
> 
> And also the dictionaries:
> ./ctakes-dictionary-lookup-
> res/src/main/resources/org/apache/ctakes/dictionary/lookup/snomed-
> like_codes_sample
> ./ctakes-dictionary-lookup-
> res/src/main/resources/org/apache/ctakes/dictionary/lookup/assertion_
> cue_phrase_index
> ./ctakes-dictionary-lookup-
> res/src/main/resources/org/apache/ctakes/dictionary/lookup/OrangeBook
> ./ctakes-dictionary-lookup-
> res/src/main/resources/org/apache/ctakes/dictionary/lookup/snomed-
> like_sample
> ./ctakes-dictionary-lookup-
> res/src/main/resources/org/apache/ctakes/dictionary/lookup/drug_index
> 
> Any execution looks like:
> 01 Oct 2017 19:50:19  INFO ConstituencyParser - Initializing
> parser...
> Oct 01, 2017 7:50:20 PM
> org.apache.uima.collection.impl.cpm.engine.ArtifactProducer process
> WARNING: Got Exception. (Thread Name: [CollectionReader Thread]::)
> Message:
> docID must be >= 0 and < maxDoc=5000 (got docID=5000)
> Oct 01, 2017 7:50:20 PM
> org.apache.uima.collection.impl.cpm.engine.ArtifactProducer run(820)
> WARNING: docID must be >= 0 and < maxDoc=5000 (got docID=5000)
> java.lang.IllegalArgumentException: docID must be >= 0 and <
> maxDoc=5000
> (got docID=5000)
> at
> org.apache.lucene.index.BaseCompositeReader.readerIndex(BaseComposite
> Reader.java:152)
> at
> org.apache.lucene.index.BaseCompositeReader.document(BaseCompositeRea
> der.java:115)
> at org.apache.lucene.index.IndexReader.document(IndexReader.java:436)
> at
> org.apache.ctakes.core.cr.LuceneCollectionReader.getNext(LuceneCollec
> tionReader.java:90)
> at
> org.apache.uima.collection.impl.cpm.engine.ArtifactProducer.readNext(
> ArtifactProducer.java:494)
> at
> org.apache.uima.collection.impl.cpm.engine.ArtifactProducer.run(Artif
> actProducer.java:711)
> 
> Collection process complete called, closing file writer.
> 
> I appreciate any of your help,
> Alex
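
The exception text in the quoted log says that document 5000 was requested from an index holding 5000 documents, whose valid IDs run from 0 to 4999. One possible cause, which the thread does not confirm, is a reader loop that runs one step past the end. Below is a minimal sketch of a bounded cursor that uses no Lucene classes; BoundedDocCursor and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NoSuchElementException;

public class BoundedDocCursor {
    private final int maxDoc; // number of documents; valid IDs are 0 .. maxDoc - 1
    private int next = 0;     // next document ID to hand out

    BoundedDocCursor(int maxDoc) {
        this.maxDoc = maxDoc;
    }

    /** True while there is still an unread document ID below maxDoc. */
    boolean hasNext() {
        return next < maxDoc; // strict comparison: maxDoc itself is out of range
    }

    /** Returns the next valid document ID, or throws if the index is exhausted. */
    int nextDocId() {
        if (!hasNext()) {
            throw new NoSuchElementException("no document with ID " + next);
        }
        return next++;
    }

    /** Collects every valid document ID for an index of the given size. */
    static List<Integer> allDocIds(int maxDoc) {
        BoundedDocCursor cursor = new BoundedDocCursor(maxDoc);
        List<Integer> ids = new ArrayList<>();
        while (cursor.hasNext()) {
            ids.add(cursor.nextDocId());
        }
        return ids;
    }

    public static void main(String[] args) {
        System.out.println(allDocIds(3)); // prints [0, 1, 2]
    }
}
```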

CTAKES-460: coreference Test should not be part of main

2017-10-02 Thread Alexandru Zbarcea
Hi,

I have refactored a main class that should have been a UTest.
https://issues.apache.org/jira/browse/CTAKES-460

This moves the test code from src/main to src/test and also adds some
refactoring.

No impact. Can easily be merged.

Alex