Yeah, it might be nice to build a Lucene index of all the sample notes in the
ctakes-example module. I'll create a Jira issue for it, but I probably won't be
able to get to it right away.
Tim

________________________________________
From: Alexandru Zbarcea <al...@apache.org>
Sent: Monday, October 2, 2017 5:31 PM
To: Apache cTAKES Dev
Subject: Re: Missing resources for script that extracts markables from a corpus 
for analysis [EXTERNAL]

Hi Tim,

I understand, makes sense. Is it possible to anonymize the data you have or
come up with a separate body of test data to generate a Lucene index and
unit test the code? I think this would have the double benefit of the code
being tested and showing dev/users how the code is supposed to be used.

What do you think?

Alex


On Mon, Oct 2, 2017 at 9:45 AM, Miller, Timothy <
timothy.mil...@childrens.harvard.edu> wrote:

> Thanks Alex,
> This code is for processing a clinical text corpus stored as a
> Lucene index -- data that cannot be redistributed for privacy reasons.
> Since it's so related to the coref stuff I thought it should go
> alongside the coreference module. But maybe it makes more sense as an
> external project since it can't really function without externally
> created resources -- what do you think?
> Tim
>
>
> On Sun, 2017-10-01 at 19:54 -0400, Alexandru Zbarcea wrote:
> > Hi,
> >
> > I was trying to write a unit test for
> > org.apache.ctakes.coreference.data.PrintMimicMarkables (recently added),
> > but I couldn't find an existing resource that can be used for this.
> > Can anyone point me to a suitable resource (Lucene index) folder?
> >
> > org.apache.ctakes.coreference.data.PrintMimicMarkables \
> >     /home/alex/projects/apache/ctakes/ctakes-dictionary-lookup-res/target/classes/org/apache/ctakes/dictionary/lookup/rxnorm_index \
> >     index.out
> >
> > I tried the following Lucene folder/resource:
> > ./ctakes-coreference-res/src/main/resources/org/apache/ctakes/coreference/models/index_med_5k
> >
> > And also the dictionaries:
> > ./ctakes-dictionary-lookup-res/src/main/resources/org/apache/ctakes/dictionary/lookup/snomed-like_codes_sample
> > ./ctakes-dictionary-lookup-res/src/main/resources/org/apache/ctakes/dictionary/lookup/assertion_cue_phrase_index
> > ./ctakes-dictionary-lookup-res/src/main/resources/org/apache/ctakes/dictionary/lookup/OrangeBook
> > ./ctakes-dictionary-lookup-res/src/main/resources/org/apache/ctakes/dictionary/lookup/snomed-like_sample
> > ./ctakes-dictionary-lookup-res/src/main/resources/org/apache/ctakes/dictionary/lookup/drug_index
> >
> > Every run produces output like this:
> > 01 Oct 2017 19:50:19  INFO ConstituencyParser - Initializing parser...
> > Oct 01, 2017 7:50:20 PM org.apache.uima.collection.impl.cpm.engine.ArtifactProducer process
> > WARNING: Got Exception. (Thread Name: [CollectionReader Thread]::) Message: docID must be >= 0 and < maxDoc=5000 (got docID=5000)
> > Oct 01, 2017 7:50:20 PM org.apache.uima.collection.impl.cpm.engine.ArtifactProducer run(820)
> > WARNING: docID must be >= 0 and < maxDoc=5000 (got docID=5000)
> > java.lang.IllegalArgumentException: docID must be >= 0 and < maxDoc=5000 (got docID=5000)
> > at org.apache.lucene.index.BaseCompositeReader.readerIndex(BaseCompositeReader.java:152)
> > at org.apache.lucene.index.BaseCompositeReader.document(BaseCompositeReader.java:115)
> > at org.apache.lucene.index.IndexReader.document(IndexReader.java:436)
> > at org.apache.ctakes.core.cr.LuceneCollectionReader.getNext(LuceneCollectionReader.java:90)
> > at org.apache.uima.collection.impl.cpm.engine.ArtifactProducer.readNext(ArtifactProducer.java:494)
> > at org.apache.uima.collection.impl.cpm.engine.ArtifactProducer.run(ArtifactProducer.java:711)
> >
> > Collection process complete called, closing file writer.
> >
> > I appreciate any help,
> > Alex
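
A note on the exception in the trace above: the reader asked for docID=5000
from an index whose maxDoc is 5000, so the valid IDs run from 0 to 4999. One
plausible cause, not verified against the LuceneCollectionReader source, is an
iteration bound that is off by one. The Java sketch below illustrates only the
bounds check; DocCursor and its in-memory list are hypothetical stand-ins for
illustration, not cTAKES or Lucene classes.

```java
import java.util.Arrays;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical stand-in for a reader that walks an index by document ID. */
final class DocCursor {
    private final List<String> docs; // assumption: stands in for the Lucene index
    private int nextId = 0;          // next document ID to hand out

    DocCursor(List<String> docs) {
        this.docs = docs;
    }

    /** Number of addressable documents; valid IDs are 0 .. maxDoc() - 1. */
    int maxDoc() {
        return docs.size();
    }

    /** Strict comparison: a <= bound here would allow a request for ID == maxDoc(). */
    boolean hasNext() {
        return nextId < maxDoc();
    }

    /** Returns the document at the current ID and advances the cursor. */
    String getNext() {
        if (!hasNext()) {
            throw new NoSuchElementException("no document with ID " + nextId);
        }
        return docs.get(nextId++);
    }

    public static void main(String[] args) {
        DocCursor cursor = new DocCursor(Arrays.asList("note 0", "note 1", "note 2"));
        while (cursor.hasNext()) {
            System.out.println(cursor.getNext());
        }
    }
}
```

A reader of this shape that used a `<=` bound would request an ID equal to
maxDoc, which is the condition Lucene rejects in the trace.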
