Thanks, Sean and Peter. I tried rewriting the XML file in an editor
(removing the non-UTF-8 BOM), and also with

sed '1s/^\xEF\xBB\xBF//'

Neither of these seems to help.
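
One way to confirm whether any bytes still sit in front of the XML declaration is to inspect the raw start of the file. The sketch below is only an illustration of that check; `describe_prefix` and `describe_file` are hypothetical helper names and are not part of cTAKES or Xerces.

```python
import codecs

# Known byte order marks. UTF-32 entries come first so that a UTF-32 LE BOM
# (FF FE 00 00) is not mistaken for a UTF-16 LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32 LE BOM"),
    (codecs.BOM_UTF32_BE, "UTF-32 BE BOM"),
    (codecs.BOM_UTF8, "UTF-8 BOM"),
    (codecs.BOM_UTF16_LE, "UTF-16 LE BOM"),
    (codecs.BOM_UTF16_BE, "UTF-16 BE BOM"),
]


def describe_prefix(data: bytes) -> str:
    """Describe whatever precedes the first '<' in the given bytes."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    idx = data.find(b"<")
    if idx == 0:
        return "clean"
    if idx == -1:
        return "no '<' found"
    return "unexpected leading bytes: " + repr(data[:idx])


def describe_file(path: str) -> str:
    """Read the first 64 bytes of a file in binary mode and describe them."""
    with open(path, "rb") as handle:
        return describe_prefix(handle.read(64))
```

Calling `describe_file` on each input file (for example testpatient_cn_1.xml) should return "clean" if the file really starts with `<`. Anything else would suggest the BOM removal did not take effect, or that the parser is being handed something other than the XML file.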

I'm not sure whether the issue is with the XML files or the third-party XML
parser (Xerces).
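
A rough way to separate the two possibilities is to feed the same bytes to a different XML parser and see whether it also rejects them. The sketch below uses Python's standard-library parser, which is an assumption made for illustration: it is not Xerces, it does not validate against NotesIIST_RTF.DTD, and it may be more or less lenient than Xerces. `parse_error` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def parse_error(data: bytes) -> Optional[str]:
    """Return None if the bytes are well-formed XML, else the parser message."""
    try:
        ET.fromstring(data)
    except ET.ParseError as exc:
        return str(exc)
    return None


def parse_error_in_file(path: str) -> Optional[str]:
    """Run the same well-formedness check on a file read in binary mode."""
    with open(path, "rb") as handle:
        return parse_error(handle.read())
```

If a second parser accepts the files, the content itself is probably well-formed, and it may be worth checking what is actually being passed to the parser in the pipeline.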

I'd appreciate any insights. Processing CDA documents is critical for us -- we 
are converting 350M notes into CDA.

Thanks!

Masoud



________________________________
From: Peter Abramowitsch <[email protected]>
Sent: Wednesday, December 18, 2019 3:33 PM
To: [email protected]
Subject: Re: cTAKES handling HL7 CDA Level 1 [EXTERNAL] [SUSPICIOUS]

The problem could be a non UTF8 BOM character as the first character in a
file.  Try opening the XML file in a unicode agnostic editor that allows
for different encodings  and then re-write it in US ASCII.

https://en.wikipedia.org/wiki/Byte_order_mark
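
As an alternative to an editor, the rewrite described above can be sketched in a few lines. This assumes the input is UTF-8 (with or without a BOM), which may not hold for every file; `to_ascii_xml` is a hypothetical helper name. Non-ASCII characters are replaced with numeric character references, which is safe in element text and attribute values but not in element names or comments.

```python
def to_ascii_xml(data: bytes) -> bytes:
    """Drop a leading UTF-8 BOM, if any, and re-encode the text as US-ASCII."""
    # "utf-8-sig" decodes UTF-8 and silently removes a leading BOM when present.
    text = data.decode("utf-8-sig")
    # Characters outside ASCII become references such as &#233;.
    return text.encode("ascii", "xmlcharrefreplace")


def rewrite_file(src: str, dst: str) -> None:
    """Write an ASCII-only, BOM-free copy of src to dst."""
    with open(src, "rb") as handle:
        cleaned = to_ascii_xml(handle.read())
    with open(dst, "wb") as handle:
        handle.write(cleaned)
```

Because ASCII is a subset of UTF-8, an existing `encoding="utf-8"` declaration in the file remains accurate after the rewrite.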

Peter

On Wed, Dec 18, 2019 at 11:31 AM Finan, Sean <
[email protected]> wrote:

> Sorry - I missed this:
> > I'm using the two CDA files that come with the cTAKES package
> (testpatient_cn_2.xml and testpatient_cn_1.xml compatible with
> NotesIIST_RTF.DTD)
>
> Those files -should- be ok as they were originally used to test the CDA
> workflow.
>
> The code for CdaCasInitializer and ClinicalNotePreProcessor hasn't changed
> since 2015.
>
> The actual error is coming from the 3rd party xml parser (xerces):
> Caused by: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1;
> Content is not allowed in prolog.
>         at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown
> Source)
>
> I am not sure what would be causing this.
>
> I don't run CDA, so I can't speak to the operational status of those
> components or the pipeline in general.
>
> Does anybody else out there use CDA?
>
> Sean
>
>
> ________________________________________
> From: Finan, Sean <[email protected]>
> Sent: Wednesday, December 18, 2019 2:22 PM
> To: [email protected]
> Subject: Re: cTAKES handling HL7 CDA Level 1 [EXTERNAL] [SUSPICIOUS]
>
>
> Hi Masoud,
>
> I am not an xml expert, so take this with a grain of salt.
>
> I think that something is wrong/unmatched with the first line of your xml
> document.
> Make sure that the first line is something like:
> <?xml version="1.0" encoding="utf-8"?>
>
> Sean
>
> ________________________________________
> From: Masoud Rouhizadeh <[email protected]>
> Sent: Wednesday, December 18, 2019 1:47 PM
> To: [email protected]
> Subject: Re: cTAKES handling HL7 CDA Level 1 [EXTERNAL]
>
>
> Hi all,
>
> I'm using the cTAKES user installation to process CDA documents with
> AggregateCdaProcessor.xml and AggregateCdaUMLSProcessor.xml located in
> /desc/ctakes-clinical-pipeline/desc/analysis_engine/
>
> My script to call this is
>
> java -Dctakes.umlsuser= -Dctakes.umlspw= -cp
> $CTAKES_HOME/lib/*:$CTAKES_HOME/desc/:$CTAKES_HOME/resources/
> -Dlog4j.configuration=file:$CTAKES_HOME/config/log4j.xml -Xms2g -Xmx3g
> org.apache.ctakes.core.cpe.CmdLineCpeRunner
> $CTAKES_HOME/desc/ctakes-clinical-pipeline/desc/collection_processing_engine/test_cda_masoud.xml
>
> test_cda_masoud.xml has a proper path to CDA input and output. I'm using
> the two CDA files that come with the cTAKES package (testpatient_cn_2.xml
> and testpatient_cn_1.xml compatible with NotesIIST_RTF.DTD).
>
> Unfortunately, it seems that CdaCasInitializer cannot run, and I get the
> attached errors. I get the same errors when using the GUI with
> AggregateCdaProcessor AE
>
> - Am I missing something obvious?
> - Does cTAKES *User* installation handle CDA documents?
> - Is org.apache.ctakes.core.cpe.CmdLineCpeRunner an appropriate pipeline
> for CdaCasInitializer?
>
> Thank you so much for your help in advance.
>
> Masoud
>
>
>
>
>
>
>
> On 11/8/19, 8:30 AM, "Finan, Sean" <[email protected]>
> wrote:
>
>
>     Hi Masoud,
>
>     I think that the CdaCasInitializer is at least 10 years old.  I would
> not expect it to conform to any recent standards.
>
>     Does anybody else have a reader or transformer that can handle HL7 CDA
> r2?
>
>     Sean
>
>     p.s.
>     If anybody is involved with HL7 International, you may want to get
> some movement on addressing the typo on the page header(s):
>
>     Section 1a: Clinical Document Architcture (CDA®)
>
>     ________________________________________
>     From: Masoud Rouhizadeh <[email protected]>
>     Sent: Thursday, November 7, 2019 5:59 PM
>     To: [email protected]
>     Subject: cTAKES handling HL7 CDA Level 1 [EXTERNAL]
>
>     Dear cTAKES developer mailing list,
>
>     We have been working on a project at Hopkins for converting
> Epic-generated RTF notes into Clinical Document Architecture Level One.
>
>     We have been using HL7 CDA® Release 2 Schema, and now we plan to use
> cTAKES for concept extraction from those documents. The CDA Schema and
> examples can be found here
>
> https://www.hl7.org/implement/standards/product_brief.cfm?product_id=7
>
>     In the cTAKES documentation, I see that CdaCasInitializer "does not
> handle all CDA documents. The CDA document must conform to the DTD
> resources/cda/NotesIIST_RTF.DTD."
>
>     Has anyone tested and evaluated cTAKES' ability to consume HL7 CDA
> Level 1 Release 2 documents?
>
>     Thank you,
>     Masoud
>
>     ----
>     Masoud Rouhizadeh, PhD
>     Faculty - Division of Health Science Informatics (DHSI)
>     NLP Lead - Institute for Clinical and Translational Research (ICTR)
>     Johns Hopkins University School of Medicine
>
> https://www.cs.jhu.edu/~mrou/
>
>
>
>
>
