Re: Missing resources for script that extracts markables from a corpus for analysis [EXTERNAL]
Hi Tim,

Because LuceneIndex is touched in several places in the code, I started by refactoring LuceneIndexReaderResourceImpl (see CTAKES-464 [1]).

If you have time, could you also check CTAKES-334 [2]? I have started treating it as a prerequisite, because the patch provided there actually makes the tests pass (provided UMLS credentials are also available).

Alex

[1] - https://issues.apache.org/jira/browse/CTAKES-464
[2] - https://issues.apache.org/jira/browse/CTAKES-334

On Wed, Oct 4, 2017 at 8:15 AM, Alexandru Zbarcea wrote:
> Thanks Tim,
>
> I will let you know about the progress.
>
> Alex
Re: Missing resources for script that extracts markables from a corpus for analysis [EXTERNAL]
Thanks Tim,

I will let you know about the progress.

Alex

On Oct 4, 2017 06:34, "Miller, Timothy" <timothy.mil...@childrens.harvard.edu> wrote:
> I had in mind the notes in:
> /ctakes-examples-res/src/main/resources/org/apache/ctakes/examples/notes/rtf
>
> which I believe are the fake notes Dr. John Green wrote for us. I don't
> know why they are rtf but they are nice, non-toy-length notes.
> Tim
Re: Missing resources for script that extracts markables from a corpus for analysis [EXTERNAL]
I had in mind the notes in:
/ctakes-examples-res/src/main/resources/org/apache/ctakes/examples/notes/rtf

which I believe are the fake notes Dr. John Green wrote for us. I don't know why they are rtf but they are nice, non-toy-length notes.
Tim

From: Alexandru Zbarcea
Sent: Tuesday, October 3, 2017 5:32 PM
To: Apache cTAKES Dev
Subject: Re: Missing resources for script that extracts markables from a corpus for analysis [EXTERNAL]

Hi Tim,

That's great news. If you think there are sample notes that can be used, I can start working on the Lucene index and slowly build the unit tests for them.

I have created CTAKES-462 [1] where we can track this work.

Looking into ctakes-examples-res, what I can find is:
$ find . -type f | grep -v "\.class" | grep -v "\.iml" | grep -v "\.jar" | grep -v "\.rtf" | grep -v "\.xml" | grep -v "\.bsv" | grep -v "\.piper"
./main/resources/org/apache/ctakes/examples/notes/pain_no_swelling.txt
./main/resources/org/apache/ctakes/examples/notes/claudication
./main/resources/org/apache/ctakes/examples/notes/shark_bite.txt
./main/resources/org/apache/ctakes/examples/notes/edge_cases_plaintext_1.txt
./main/resources/org/apache/ctakes/examples/notes/dr_nutritious_1.txt
./main/resources/org/apache/ctakes/examples/notes/right_knee_arthroscopy
./main/resources/org/apache/ctakes/examples/notes/SampleInputRadiologyNotes.txt
./main/resources/org/apache/ctakes/examples/notes/smoker/doc1_07543210_sample_past_smoker.txt
./main/resources/org/apache/ctakes/examples/notes/smoker/doc2_07543210_sample_past_smoker.txt
./main/resources/org/apache/ctakes/examples/notes/smoker/doc2_07543210_sample_current.txt
./main/resources/org/apache/ctakes/examples/notes/smoker/doc1_07543210_sample_unknown.txt
./main/resources/org/apache/ctakes/examples/notes/smoker/doc1_07543210_sample_current.txt
./main/resources/org/apache/ctakes/examples/notes/mother_goose/README
./main/resources/org/apache/ctakes/examples/notes/mother_goose/OneMistyMoistyMorning.txt
./main/resources/org/apache/ctakes/examples/notes/dr_nutritious_2.txt
./main/resources/org/apache/ctakes/examples/annotation/anafora_annotated/Peds_RoutBirthNote_1/Peds_RoutBirthNote_1
./main/resources/org/apache/ctakes/examples/annotation/anafora_annotated/VascSurg_AAA_Leak_1/VascSurg_AAA_Leak_1
./main/resources/org/apache/ctakes/examples/annotation/anafora_annotated/Peds_Dysphagia_1/Peds_Dysphagia_1
./main/resources/org/apache/ctakes/examples/annotation/anafora_annotated/OBGYN_LaborProgressNote_1/OBGYN_LaborProgressNote_1
./main/resources/org/apache/ctakes/examples/annotation/anafora_annotated/OBGYN_IUD_1/OBGYN_IUD_1
./main/resources/org/apache/ctakes/examples/annotation/anafora_annotated/OBGYN_HysterectomyAndBSO_1/OBGYN_HysterectomyAndBSO_1
./main/resources/org/apache/ctakes/examples/annotation/anafora_annotated/VascSurg_FollowUp_1/VascSurg_FollowUp_1
./main/resources/org/apache/ctakes/examples/annotation/anafora_annotated/OBGYN_PROMCheck_1/OBGYN_PROMCheck_1
./main/resources/org/apache/ctakes/examples/annotation/anafora_annotated/OBGYN_Gen_Abscess_1/OBGYN_Gen_Abscess_1
./main/resources/org/apache/ctakes/examples/annotation/anafora_annotated/Peds_FebrileSez_1/Peds_FebrileSez_1
./main/resources/org/apache/ctakes/examples/annotation/anafora_annotated/VascSurg_RO_AAA_1/VascSurg_RO_AAA_1
./main/resources/org/apache/ctakes/examples/annotation/anafora_annotated/VascSurg_RO_DVT_1/VascSurg_RO_DVT_1
./main/resources/org/apache/ctakes/examples/annotation/anafora_annotated/GenSurg_UmbilicalHernia_1/GenSurg_UmbilicalHernia_1
./main/resources/org/apache/ctakes/examples/annotation/anafora_annotated/VascSurg_PVD_1/VascSurg_PVD_1
./main/resources/org/apache/ctakes/examples/annotation/anafora_annotated/OBGYN_MVAPrego_1/OBGYN_MVAPrego_1

Which notes do you think I should start with (all)?

Alex

[1] - https://issues.apache.org/jira/browse/CTAKES-462

On Mon, Oct 2, 2017 at 6:46 PM, Miller, Timothy wrote:
> Yeah, it might be nice to build a lucene index of all the sample notes in
> the ctakes-example module. I'll create a jira for it but probably won't be
> able to get to it right away.
> Tim
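
For anyone who would rather collect the candidate notes from Java (for example inside a unit test) than from the shell, the find/grep listing in this message could be approximated as below. This is only a sketch: NoteLister and its methods are hypothetical names, not part of cTAKES, and it filters on file-name suffix, whereas grep -v matches the pattern anywhere in the path, so results may differ for unusual directory names.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper (not part of cTAKES): lists candidate note files under a directory. */
class NoteLister {

    // The same extensions that the grep -v chain in the message filters out.
    private static final List<String> EXCLUDED = Arrays.asList(
            ".class", ".iml", ".jar", ".rtf", ".xml", ".bsv", ".piper");

    /** True when the file name ends with one of the excluded extensions. */
    static boolean isExcluded(Path file) {
        String name = file.getFileName().toString();
        return EXCLUDED.stream().anyMatch(name::endsWith);
    }

    /** Walks the tree below root and returns every regular file that is not excluded. */
    static List<Path> listNotes(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)
                        .filter(p -> !isExcluded(p))
                        .sorted()
                        .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Defaults to the current directory, like "find ." in the message.
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        listNotes(root).forEach(System.out::println);
    }
}
```

Note that this mirrors the original command, so it also skips the rtf notes; dropping ".rtf" from the EXCLUDED list would include them.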
Re: Missing resources for script that extracts markables from a corpus for analysis [EXTERNAL]
Yeah, it might be nice to build a lucene index of all the sample notes in the ctakes-example module. I'll create a jira for it but probably won't be able to get to it right away.
Tim

From: Alexandru Zbarcea
Sent: Monday, October 2, 2017 5:31 PM
To: Apache cTAKES Dev
Subject: Re: Missing resources for script that extracts markables from a corpus for analysis [EXTERNAL]

Hi Tim,

I understand, makes sense. Is it possible to anonymize the data you have or come up with a separate body of test data to generate a Lucene index and unit test the code? I think this would have the double benefit of the code being tested and showing dev/users how the code is supposed to be used. What do you think?

Alex
Re: Missing resources for script that extracts markables from a corpus for analysis [EXTERNAL]
Hi Tim,

I understand, makes sense. Is it possible to anonymize the data you have or come up with a separate body of test data to generate a Lucene index and unit test the code? I think this would have the double benefit of the code being tested and showing dev/users how the code is supposed to be used. What do you think?

Alex

On Mon, Oct 2, 2017 at 9:45 AM, Miller, Timothy <timothy.mil...@childrens.harvard.edu> wrote:
> Thanks Alex,
> This code is for processing a clinical text data corpus stored as a
> lucene index -- data that cannot be redistributed for privacy reasons.
> Since it's so related to the coref stuff I thought it should go
> alongside the coreference module. But maybe it makes more sense as an
> external project since it can't really function without externally
> created resources -- what do you think?
> Tim
Re: Missing resources for script that extracts markables from a corpus for analysis [EXTERNAL]
Thanks Alex,
This code is for processing a clinical text data corpus stored as a lucene index -- data that cannot be redistributed for privacy reasons. Since it's so related to the coref stuff I thought it should go alongside the coreference module. But maybe it makes more sense as an external project since it can't really function without externally created resources -- what do you think?
Tim

On Sun, 2017-10-01 at 19:54 -0400, Alexandru Zbarcea wrote:
> Hi,
>
> I was trying to write a unit test for
> org.apache.ctakes.coreference.data.PrintMimicMarkables (recently added),
> but I couldn't find any existing resource that can be used for this.
> Can anyone point me to a resource (Lucene index) folder?
>
> org.apache.ctakes.coreference.data.PrintMimicMarkables \
>   /home/alex/projects/apache/ctakes/ctakes-dictionary-lookup-res/target/classes/org/apache/ctakes/dictionary/lookup/rxnorm_index \
>   index.out
>
> I was trying with the following lucene folder/resource:
> ./ctakes-coreference-res/src/main/resources/org/apache/ctakes/coreference/models/index_med_5k
>
> And also the dictionaries:
> ./ctakes-dictionary-lookup-res/src/main/resources/org/apache/ctakes/dictionary/lookup/snomed-like_codes_sample
> ./ctakes-dictionary-lookup-res/src/main/resources/org/apache/ctakes/dictionary/lookup/assertion_cue_phrase_index
> ./ctakes-dictionary-lookup-res/src/main/resources/org/apache/ctakes/dictionary/lookup/OrangeBook
> ./ctakes-dictionary-lookup-res/src/main/resources/org/apache/ctakes/dictionary/lookup/snomed-like_sample
> ./ctakes-dictionary-lookup-res/src/main/resources/org/apache/ctakes/dictionary/lookup/drug_index
>
> Every run looks like this:
> 01 Oct 2017 19:50:19 INFO ConstituencyParser - Initializing parser...
> Oct 01, 2017 7:50:20 PM org.apache.uima.collection.impl.cpm.engine.ArtifactProducer process
> WARNING: Got Exception. (Thread Name: [CollectionReader Thread]::)
> Message:
> docID must be >= 0 and < maxDoc=5000 (got docID=5000)
> Oct 01, 2017 7:50:20 PM org.apache.uima.collection.impl.cpm.engine.ArtifactProducer run(820)
> WARNING: docID must be >= 0 and < maxDoc=5000 (got docID=5000)
> java.lang.IllegalArgumentException: docID must be >= 0 and < maxDoc=5000 (got docID=5000)
>     at org.apache.lucene.index.BaseCompositeReader.readerIndex(BaseCompositeReader.java:152)
>     at org.apache.lucene.index.BaseCompositeReader.document(BaseCompositeReader.java:115)
>     at org.apache.lucene.index.IndexReader.document(IndexReader.java:436)
>     at org.apache.ctakes.core.cr.LuceneCollectionReader.getNext(LuceneCollectionReader.java:90)
>     at org.apache.uima.collection.impl.cpm.engine.ArtifactProducer.readNext(ArtifactProducer.java:494)
>     at org.apache.uima.collection.impl.cpm.engine.ArtifactProducer.run(ArtifactProducer.java:711)
>
> Collection process complete called, closing file writer.
>
> I appreciate any help,
> Alex
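
A note on the exception quoted above: the message says document 5000 was requested from an index whose maxDoc is 5000, and in Lucene maxDoc is one more than the largest possible document number, so the valid IDs here are 0 to 4999. One plausible reading is that the reader stepped one past the last document. The sketch below only illustrates that boundary with an in-memory stand-in; BoundedDocReader and its members are hypothetical names, and this is not the actual LuceneCollectionReader code, which is not shown in this thread.

```java
import java.util.Arrays;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical stand-in for an index-backed reader; not the cTAKES LuceneCollectionReader. */
class BoundedDocReader {
    private final List<String> docs; // stands in for the documents stored in the index
    private int nextDocId = 0;       // ID of the next document to hand out

    BoundedDocReader(List<String> docs) {
        this.docs = docs;
    }

    /** Number of documents; valid IDs are 0 .. maxDoc() - 1. */
    int maxDoc() {
        return docs.size();
    }

    /** True only while the next ID is strictly below maxDoc(). */
    boolean hasNext() {
        return nextDocId < maxDoc();
    }

    /** Returns the next document, refusing to read at or past maxDoc(). */
    String getNext() {
        if (!hasNext()) {
            throw new NoSuchElementException(
                    "docID " + nextDocId + " is out of range, maxDoc=" + maxDoc());
        }
        return docs.get(nextDocId++);
    }

    public static void main(String[] args) {
        BoundedDocReader reader =
                new BoundedDocReader(Arrays.asList("note 0", "note 1", "note 2"));
        int count = 0;
        while (reader.hasNext()) { // the loop ends after the last valid ID
            reader.getNext();
            count++;
        }
        System.out.println("read " + count + " of " + reader.maxDoc() + " documents");
    }
}
```

A unit test over a small sample index could assert the same thing: reading exactly maxDoc documents succeeds, and hasNext is false afterwards.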