Yeah, it might be nice to build a Lucene index of all the sample notes in the ctakes-example module. I'll create a Jira for it but probably won't be able to get to it right away.

Tim
________________________________________
From: Alexandru Zbarcea <al...@apache.org>
Sent: Monday, October 2, 2017 5:31 PM
To: Apache cTAKES Dev
Subject: Re: Missing resources for script that extracts markables from a corpus for analysis [EXTERNAL]

Hi Tim,

I understand, makes sense. Is it possible to anonymize the data you have or come up with a separate body of test data to generate a Lucene index and unit test the code? I think this would have the double benefit of the code being tested and showing dev/users how the code is supposed to be used.

What do you think?

Alex

On Mon, Oct 2, 2017 at 9:45 AM, Miller, Timothy <timothy.mil...@childrens.harvard.edu> wrote:

> Thanks Alex,
> This code is for processing a clinical text data corpus stored as a
> lucene index -- data that cannot be redistributed for privacy reasons.
> Since it's so related to the coref stuff I thought it should go
> alongside the coreference module. But maybe it makes more sense as an
> external project since it can't really function without externally
> created resources -- what do you think?
> Tim
>
>
> On Sun, 2017-10-01 at 19:54 -0400, Alexandru Zbarcea wrote:
> > Hi,
> >
> > I was trying to do a UTest for the
> > org.apache.ctakes.coreference.data.PrintMimicMarkables (recently added),
> > but I couldn't find any of the existing resources that can be used for
> > this. Can anyone help me pointing to a resource (Lucene index) folder.
> >
> > org.apache.ctakes.coreference.data.PrintMimicMarkables \
> > /home/alex/projects/apache/ctakes/ctakes-dictionary-lookup-res/target/classes/org/apache/ctakes/dictionary/lookup/rxnorm_index \
> > index.out
> >
> > I was trying with the following lucene folder/resource:
> > ./ctakes-coreference-res/src/main/resources/org/apache/ctakes/coreference/models/index_med_5k
> >
> > And also the dictionaries:
> > ./ctakes-dictionary-lookup-res/src/main/resources/org/apache/ctakes/dictionary/lookup/snomed-like_codes_sample
> > ./ctakes-dictionary-lookup-res/src/main/resources/org/apache/ctakes/dictionary/lookup/assertion_cue_phrase_index
> > ./ctakes-dictionary-lookup-res/src/main/resources/org/apache/ctakes/dictionary/lookup/OrangeBook
> > ./ctakes-dictionary-lookup-res/src/main/resources/org/apache/ctakes/dictionary/lookup/snomed-like_sample
> > ./ctakes-dictionary-lookup-res/src/main/resources/org/apache/ctakes/dictionary/lookup/drug_index
> >
> > Any execution looks like:
> > 01 Oct 2017 19:50:19 INFO ConstituencyParser - Initializing parser...
> > Oct 01, 2017 7:50:20 PM org.apache.uima.collection.impl.cpm.engine.ArtifactProducer process
> > WARNING: Got Exception. (Thread Name: [CollectionReader Thread]::)
> > Message:
> > docID must be >= 0 and < maxDoc=5000 (got docID=5000)
> > Oct 01, 2017 7:50:20 PM org.apache.uima.collection.impl.cpm.engine.ArtifactProducer run(820)
> > WARNING: docID must be >= 0 and < maxDoc=5000 (got docID=5000)
> > java.lang.IllegalArgumentException: docID must be >= 0 and < maxDoc=5000 (got docID=5000)
> > at org.apache.lucene.index.BaseCompositeReader.readerIndex(BaseCompositeReader.java:152)
> > at org.apache.lucene.index.BaseCompositeReader.document(BaseCompositeReader.java:115)
> > at org.apache.lucene.index.IndexReader.document(IndexReader.java:436)
> > at org.apache.ctakes.core.cr.LuceneCollectionReader.getNext(LuceneCollectionReader.java:90)
> > at org.apache.uima.collection.impl.cpm.engine.ArtifactProducer.readNext(ArtifactProducer.java:494)
> > at org.apache.uima.collection.impl.cpm.engine.ArtifactProducer.run(ArtifactProducer.java:711)
> >
> > Collection process complete called, closing file writer.
> >
> > I appreciate any of your help,
> > Alex
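
For anyone reading the trace above: the exception message states the invariant directly. With maxDoc=5000 the valid docIDs are 0 through 4999, so a request for docID=5000 is one past the end. The sketch below only illustrates that bound in plain Java. DocIdCursor and allDocIds are hypothetical names made up for this example; this is not the code in LuceneCollectionReader, and the thread does not establish why the reader asked for docID 5000 in the first place.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NoSuchElementException;

/**
 * Hypothetical cursor over Lucene-style document IDs.
 * Illustration only; not the cTAKES LuceneCollectionReader.
 */
final class DocIdCursor {
    private final int maxDoc; // valid IDs are 0 .. maxDoc - 1
    private int next = 0;     // next ID to hand out

    DocIdCursor(int maxDoc) {
        if (maxDoc < 0) {
            throw new IllegalArgumentException("maxDoc must be >= 0");
        }
        this.maxDoc = maxDoc;
    }

    /** True while at least one valid docID remains. */
    boolean hasNext() {
        return next < maxDoc;
    }

    /** Returns the next docID, refusing to step past maxDoc - 1. */
    int next() {
        if (!hasNext()) {
            throw new NoSuchElementException("no docID left; maxDoc=" + maxDoc);
        }
        return next++;
    }

    /** Collects every valid docID for an index of the given size. */
    static List<Integer> allDocIds(int maxDoc) {
        DocIdCursor cursor = new DocIdCursor(maxDoc);
        List<Integer> ids = new ArrayList<>();
        while (cursor.hasNext()) {
            ids.add(cursor.next());
        }
        return ids;
    }

    public static void main(String[] args) {
        List<Integer> ids = allDocIds(5000);
        // prints "5000 ids, last=4999"
        System.out.println(ids.size() + " ids, last=" + ids.get(ids.size() - 1));
    }
}
```

A caller that checks hasNext() before every next() never produces the ID that the trace complains about; whether the real reader skips such a check or is affected by something else is not something this thread settles.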