I implemented a fairly general XML collection reader built on a SAX parser.
It takes a handler resource that implements the logic needed to deal with
the idiosyncrasies of different encoding schemes.

It was originally based on DKPro's XML readers, which are also very easy to
adapt to different formats.

Mine is available here:
https://github.com/jtatria/lector/tree/master/src/main/java/edu/columbia/incite/uima/io

It uses two components to implement a given format's logic: A "TextFilter"
resource to normalize SOFA text from XML character data and a
"MappingProvider" that implements the logic needed to process XML elements
(typically by mapping them to UIMA annotations).
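
To give a rough idea of how the two pieces fit together, here is a minimal
sketch. It is not the actual lector code: the interfaces, method signatures
and the XmlReaderSketch class below are simplified, hypothetical stand-ins,
and plain offset spans stand in for UIMA annotations. Only the JDK's SAX
classes are real APIs.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class XmlReaderSketch {

    /** Hypothetical text filter: turns raw XML character data into document text. */
    interface TextFilter { String normalize(String chars); }

    /** Hypothetical mapping: names the annotation type for an element, or null to skip it. */
    interface MappingProvider { String typeFor(String element, Attributes attrs); }

    /** Plain offset span standing in for a UIMA annotation. */
    static final class Span {
        final String type; final int begin; final int end;
        Span(String type, int begin, int end) { this.type = type; this.begin = begin; this.end = end; }
        @Override public String toString() { return type + "[" + begin + "," + end + ")"; }
    }

    /** Document text plus the spans collected while parsing. */
    static final class Result {
        final String text; final List<Span> spans;
        Result(String text, List<Span> spans) { this.text = text; this.spans = spans; }
    }

    static Result read(String xml, TextFilter filter, MappingProvider mapping) {
        final StringBuilder text = new StringBuilder();    // normalized document text
        final StringBuilder pending = new StringBuilder(); // raw character data not yet filtered
        final List<Span> spans = new ArrayList<>();
        final Deque<Span> open = new ArrayDeque<>();       // currently open elements

        DefaultHandler handler = new DefaultHandler() {
            // SAX may deliver one text node in several characters() calls,
            // so buffer the chunks and filter them at element boundaries.
            private void flush() {
                if (pending.length() > 0) {
                    text.append(filter.normalize(pending.toString()));
                    pending.setLength(0);
                }
            }
            @Override public void startElement(String uri, String local, String qName, Attributes attrs) {
                flush();
                // Remember where the element starts; the end offset is not known yet.
                open.push(new Span(mapping.typeFor(qName, attrs), text.length(), -1));
            }
            @Override public void characters(char[] ch, int start, int length) {
                pending.append(ch, start, length);
            }
            @Override public void endElement(String uri, String local, String qName) {
                flush();
                Span s = open.pop();
                if (s.type != null) {
                    spans.add(new Span(s.type, s.begin, text.length()));
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
        return new Result(text.toString(), spans);
    }

    /** Small demo input: collapse whitespace, map <title> and <p>, ignore everything else. */
    static Result demo() {
        String xml = "<doc><title>Trial  A</title><p>Some   text</p></doc>";
        return read(xml,
            chars -> chars.replaceAll("\\s+", " "),
            (element, attrs) -> element.equals("title") ? "Title"
                              : element.equals("p") ? "Paragraph" : null);
    }

    public static void main(String[] args) {
        Result r = demo();
        System.out.println(r.text);
        System.out.println(r.spans);
    }
}
```

In a real reader the text would go into the CAS as the SOFA string and each
span would become an annotation of the mapped type. Supporting a new format
then only means supplying a different filter and mapping.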

If you have already coded all the logic for dealing with your source
material, it should not be too hard to adapt it for use with these
components.

Hope it is of some use. I'd be happy to answer any questions you may have.

best,
jta

On Fri, Feb 22, 2019 at 9:17 AM Bonnie MacKellar <bkmackel...@gmail.com>
wrote:

> Thanks so much!
>
> Bonnie MacKellar
>
> On Fri, Feb 22, 2019 at 7:03 AM Erik Fäßler <erik.faess...@uni-jena.de>
> wrote:
>
> > Hey,
> >
> > just wanted to say that I didn’t get around to making the component
> > available yet; I will do so first thing next week!
> >
> > Best,
> >
> > Erik
> >
> > > On 20. Feb 2019, at 19:47, Bonnie MacKellar <bkmackel...@gmail.com>
> > > wrote:
> > >
> > > Hi,
> > > Yes, we are using that format. I have a parser that I wrote, but it
> > > isn't integrated into UIMA. It runs separately and loads the full
> > > clinical trial data into a triplestore (Stardog). I would be interested
> > > in your system since I am not really familiar with how to write file
> > > readers in the UIMA framework. Perhaps I can merge my parser into it and
> > > end up with just the right thing. If you can make it available, I would
> > > definitely be interested. I will take a look at the other links as well.
> > > Thanks!!
> > >
> > > Bonnie MacKellar
> > >
> > > On Wed, Feb 20, 2019 at 3:54 AM Erik Fäßler <erik.faess...@uni-jena.de>
> > > wrote:
> > >
> > >> Dear Bonnie,
> > >>
> > >> are you talking about the clinical trial XML format used by
> > >> ClinicalTrials.gov <http://clinicaltrials.org/> by any chance?
> > >> If so, I did create a UIMA reader for these data. It's not perfect but
> > >> perhaps enough for your purposes, and you might want to enhance it.
> > >> Please let me know if you would be interested in that. I did not get
> > >> around to making it publicly available yet but could do so quickly.
> > >>
> > >> To answer the general question to the best of my knowledge:
> > >> There is no such thing as a general XML reader built into the UIMA
> > >> framework. For all non-trivial formats, a specific reader is necessary.
> > >> This also holds true with regard to the employed type system.
> > >> That being said, there are UIMA readers that try to serve as a general
> > >> XML reading facility, e.g. the “XML Reader” from our lab (JULIELab,
> > >> https://github.com/JULIELab/jcore-base/tree/master/jcore-xml-reader).
> > >> However, in my experience XML inputs come in a lot of different forms
> > >> that are often not suited to a generic approach, which is why I
> > >> wrote quite a few UIMA readers for specific XML formats in the past.
> > >>
> > >> Hope that helps,
> > >>
> > >> Erik
> > >>
> > >>> On 20. Feb 2019, at 01:13, Bonnie MacKellar <bkmackel...@gmail.com>
> > >>> wrote:
> > >>>
> > >>> This is probably a very naive question, but I can't seem to find
> > >>> anything about this. I currently have a lot of XML files (clinical
> > >>> trial descriptions). My current workflow is to run a preprocessor that
> > >>> parses the XML and generates text files in a simple format. I then run
> > >>> these files in a UIMA pipeline, using FileCollectionReader to load the
> > >>> text files, RUTA to parse the simple format, the Metamap annotator to do
> > >>> some UMLS annotations, and finally I have a writer that generates RDF
> > >>> triples from the UIMA annotations and loads the triples into a database.
> > >>> This has worked but is clunky, especially the preprocessing. I feel like
> > >>> there has to be a better way. Is there any support for reading XML
> > >>> files, or do I need to write my own CollectionReader? Are there any
> > >>> other tools within UIMA for handling XML text?
> > >>>
> > >>> thanks,
> > >>> Bonnie MacKellar
> > >>
> > >>
> >
> >
>


-- 
entia non sunt multiplicanda praeter necessitatem
