I had in mind the notes in: /ctakes-examples-res/src/main/resources/org/apache/ctakes/examples/notes/rtf
which I believe are the fake notes Dr. John Green wrote for us. I don't know why they are rtf but they are nice, non-toy-length notes.
Tim

________________________________________
From: Alexandru Zbarcea <al...@apache.org>
Sent: Tuesday, October 3, 2017 5:32 PM
To: Apache cTAKES Dev
Subject: Re: Missing resources for script that extracts markables from a corpus for analysis [EXTERNAL]

Hi Tim,

That's great news. If you think there are sample notes that can be used, I can start working on the Lucene index and slowly build the UTest for them. I have created CTAKES-462[1] where we can track this work.

Looking into the ctakes-examples-res, what I can find is:

$ find . -type f | grep -v "\.class" | grep -v "\.iml" | grep -v "\.jar" | grep -v "\.rtf" | grep -v "\.xml" | grep -v "\.bsv" | grep -v "\.piper"
./main/resources/org/apache/ctakes/examples/notes/pain_no_swelling.txt
./main/resources/org/apache/ctakes/examples/notes/claudication
./main/resources/org/apache/ctakes/examples/notes/shark_bite.txt
./main/resources/org/apache/ctakes/examples/notes/edge_cases_plaintext_1.txt
./main/resources/org/apache/ctakes/examples/notes/dr_nutritious_1.txt
./main/resources/org/apache/ctakes/examples/notes/right_knee_arthroscopy
./main/resources/org/apache/ctakes/examples/notes/SampleInputRadiologyNotes.txt
./main/resources/org/apache/ctakes/examples/notes/smoker/doc1_07543210_sample_past_smoker.txt
./main/resources/org/apache/ctakes/examples/notes/smoker/doc2_07543210_sample_past_smoker.txt
./main/resources/org/apache/ctakes/examples/notes/smoker/doc2_07543210_sample_current.txt
./main/resources/org/apache/ctakes/examples/notes/smoker/doc1_07543210_sample_unknown.txt
./main/resources/org/apache/ctakes/examples/notes/smoker/doc1_07543210_sample_current.txt
./main/resources/org/apache/ctakes/examples/notes/mother_goose/README
./main/resources/org/apache/ctakes/examples/notes/mother_goose/OneMistyMoistyMorning.txt
./main/resources/org/apache/ctakes/examples/notes/dr_nutritious_2.txt
./main/resources/org/apache/ctakes/examples/annotation/anafora_annotated/Peds_RoutBirthNote_1/Peds_RoutBirthNote_1
./main/resources/org/apache/ctakes/examples/annotation/anafora_annotated/VascSurg_AAA_Leak_1/VascSurg_AAA_Leak_1
./main/resources/org/apache/ctakes/examples/annotation/anafora_annotated/Peds_Dysphagia_1/Peds_Dysphagia_1
./main/resources/org/apache/ctakes/examples/annotation/anafora_annotated/OBGYN_LaborProgressNote_1/OBGYN_LaborProgressNote_1
./main/resources/org/apache/ctakes/examples/annotation/anafora_annotated/OBGYN_IUD_1/OBGYN_IUD_1
./main/resources/org/apache/ctakes/examples/annotation/anafora_annotated/OBGYN_HysterectomyAndBSO_1/OBGYN_HysterectomyAndBSO_1
./main/resources/org/apache/ctakes/examples/annotation/anafora_annotated/VascSurg_FollowUp_1/VascSurg_FollowUp_1
./main/resources/org/apache/ctakes/examples/annotation/anafora_annotated/OBGYN_PROMCheck_1/OBGYN_PROMCheck_1
./main/resources/org/apache/ctakes/examples/annotation/anafora_annotated/OBGYN_Gen_Abscess_1/OBGYN_Gen_Abscess_1
./main/resources/org/apache/ctakes/examples/annotation/anafora_annotated/Peds_FebrileSez_1/Peds_FebrileSez_1
./main/resources/org/apache/ctakes/examples/annotation/anafora_annotated/VascSurg_RO_AAA_1/VascSurg_RO_AAA_1
./main/resources/org/apache/ctakes/examples/annotation/anafora_annotated/VascSurg_RO_DVT_1/VascSurg_RO_DVT_1
./main/resources/org/apache/ctakes/examples/annotation/anafora_annotated/GenSurg_UmbilicalHernia_1/GenSurg_UmbilicalHernia_1
./main/resources/org/apache/ctakes/examples/annotation/anafora_annotated/VascSurg_PVD_1/VascSurg_PVD_1
./main/resources/org/apache/ctakes/examples/annotation/anafora_annotated/OBGYN_MVAPrego_1/OBGYN_MVAPrego_1

Which notes do you think I should start with (all)?

Alex

[1] - https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_jira_browse_CTAKES-2D462&d=DwIBaQ&c=qS4goWBT7poplM69zy_3xhKwEW14JZMSdioCoppxeFU&r=Heup-IbsIg9Q1TPOylpP9FE4GTK-OqdTDRRNQXipowRLRjx0ibQrHEo8uYx6674h&m=COSkyBpYGrcp_hTAFRRfTx8JCwHAzxTM3GMiXKrSbnE&s=jOmot_onPFb31eg689D0ihb5Y4dZTzKcQ40vMCW0Bgk&e=

On Mon, Oct 2, 2017 at 6:46 PM, Miller, Timothy <Timothy.Miller@childrens.harvard.edu> wrote:

> Yeah, it might be nice to build a lucene index of all the sample notes in
> the ctakes-example module. I'll create a jira for it but probably won't be
> able to get to it right away.
> Tim
>
> ________________________________________
> From: Alexandru Zbarcea <al...@apache.org>
> Sent: Monday, October 2, 2017 5:31 PM
> To: Apache cTAKES Dev
> Subject: Re: Missing resources for script that extracts markables from a
> corpus for analysis [EXTERNAL]
>
> Hi Tim,
>
> I understand, makes sense. Is it possible to anonymize the data you have or
> come up with a separate body of test data to generate a Lucene index and
> unit test the code? I think this would have the double benefit of the code
> being tested and showing dev/users how the code is supposed to be used.
>
> What do you think?
>
> Alex
>
>
> On Mon, Oct 2, 2017 at 9:45 AM, Miller, Timothy <
> timothy.mil...@childrens.harvard.edu> wrote:
>
> > Thanks Alex,
> > This code is for processing a clinical text data corpus stored as a
> > lucene index -- data that cannot be redistributed for privacy reasons.
> > Since it's so related to the coref stuff I thought it should go
> > alongside the coreference module. But maybe it makes more sense as an
> > external project since it can't really function without externally
> > created resources -- what do you think?
> > Tim
> >
> >
> > On Sun, 2017-10-01 at 19:54 -0400, Alexandru Zbarcea wrote:
> > > Hi,
> > >
> > > I was trying to do a UTest for the
> > > org.apache.ctakes.coreference.data.PrintMimicMarkables (recently added),
> > > but I couldn't find any existing resource that can be used for this.
> > > Can anyone help by pointing me to a resource (Lucene index) folder?
> > >
> > > org.apache.ctakes.coreference.data.PrintMimicMarkables \
> > > /home/alex/projects/apache/ctakes/ctakes-dictionary-lookup-res/target/classes/org/apache/ctakes/dictionary/lookup/rxnorm_index \
> > > index.out
> > >
> > > I was trying with the following lucene folder/resource:
> > > ./ctakes-coreference-res/src/main/resources/org/apache/ctakes/coreference/models/index_med_5k
> > >
> > > And also the dictionaries:
> > > ./ctakes-dictionary-lookup-res/src/main/resources/org/apache/ctakes/dictionary/lookup/snomed-like_codes_sample
> > > ./ctakes-dictionary-lookup-res/src/main/resources/org/apache/ctakes/dictionary/lookup/assertion_cue_phrase_index
> > > ./ctakes-dictionary-lookup-res/src/main/resources/org/apache/ctakes/dictionary/lookup/OrangeBook
> > > ./ctakes-dictionary-lookup-res/src/main/resources/org/apache/ctakes/dictionary/lookup/snomed-like_sample
> > > ./ctakes-dictionary-lookup-res/src/main/resources/org/apache/ctakes/dictionary/lookup/drug_index
> > >
> > > Any execution looks like:
> > > 01 Oct 2017 19:50:19 INFO ConstituencyParser - Initializing parser...
> > > Oct 01, 2017 7:50:20 PM org.apache.uima.collection.impl.cpm.engine.ArtifactProducer process
> > > WARNING: Got Exception. (Thread Name: [CollectionReader Thread]::)
> > > Message:
> > > docID must be >= 0 and < maxDoc=5000 (got docID=5000)
> > > Oct 01, 2017 7:50:20 PM org.apache.uima.collection.impl.cpm.engine.ArtifactProducer run(820)
> > > WARNING: docID must be >= 0 and < maxDoc=5000 (got docID=5000)
> > > java.lang.IllegalArgumentException: docID must be >= 0 and < maxDoc=5000 (got docID=5000)
> > > at org.apache.lucene.index.BaseCompositeReader.readerIndex(BaseCompositeReader.java:152)
> > > at org.apache.lucene.index.BaseCompositeReader.document(BaseCompositeReader.java:115)
> > > at org.apache.lucene.index.IndexReader.document(IndexReader.java:436)
> > > at org.apache.ctakes.core.cr.LuceneCollectionReader.getNext(LuceneCollectionReader.java:90)
> > > at org.apache.uima.collection.impl.cpm.engine.ArtifactProducer.readNext(ArtifactProducer.java:494)
> > > at org.apache.uima.collection.impl.cpm.engine.ArtifactProducer.run(ArtifactProducer.java:711)
> > >
> > > Collection process complete called, closing file writer.
> > >
> > > I appreciate any of your help,
> > > Alex
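
A note on the exception quoted above: the message says a document with docID=5000 was requested from an index whose maxDoc is 5000, so the request was one past the last valid id (valid ids run from 0 to maxDoc-1). The thread does not say why LuceneCollectionReader asked for that id, so what follows is only a minimal Java sketch of the bound itself, not a diagnosis. FakeIndex, readAll, and DocIdBoundsSketch are hypothetical names made up for illustration; they are not Lucene or cTAKES classes.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration only: a hypothetical stand-in, not the Lucene or cTAKES API. */
public class DocIdBoundsSketch {

    /** Fake index that enforces the same bound as the message in the trace. */
    static final class FakeIndex {
        private final int maxDoc;

        FakeIndex(int maxDoc) {
            this.maxDoc = maxDoc;
        }

        int maxDoc() {
            return maxDoc;
        }

        String document(int docID) {
            // Valid ids are 0 .. maxDoc-1; anything else is rejected.
            if (docID < 0 || docID >= maxDoc) {
                throw new IllegalArgumentException(
                        "docID must be >= 0 and < maxDoc=" + maxDoc
                                + " (got docID=" + docID + ")");
            }
            return "doc-" + docID;
        }
    }

    /** Reads every document, stopping before maxDoc rather than at it. */
    static List<String> readAll(FakeIndex index) {
        List<String> docs = new ArrayList<>();
        for (int docID = 0; docID < index.maxDoc(); docID++) {
            docs.add(index.document(docID));
        }
        return docs;
    }

    public static void main(String[] args) {
        FakeIndex index = new FakeIndex(5000);
        System.out.println("read " + readAll(index).size() + " documents");
        try {
            index.document(index.maxDoc()); // one past the last valid id
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this sketch the loop reads all 5000 fake documents without error, and only the explicit request for id 5000 reproduces a message of the same shape as the one in the trace.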