Leah,

If you know Python, feel free to use the simple class I wrote to parse
cTAKES XMI files, attached. It only pulls out the information I needed for
my use case, so you may need to adapt it.

Best,
Alden

On Tue, Jan 8, 2019 at 9:53 AM Smith, Lincoln <[email protected]>
wrote:

> I don't know of anything other than parsing the XML output to look for your
> preferred terminology and CUIs of interest in the text. It's not overly
> difficult in R if you Google some of the XML parsing examples. Lincoln
>
>
>
> Lincoln Smith, MD, MS
>
> Director, Analytic Enablement
>
> Customer Engagement & Insight
>
> 412-544-8043
>
> [email protected]
>
>
>
> *From:* Baas,Leah [mailto:[email protected]]
> *Sent:* Tuesday, January 08, 2019 9:44 AM
> *To:* [email protected]
> *Subject:* [EXTERNAL] Filtering Annotated Files
>
>
>
> To whom it may concern,
>
>
>
> Hello! I am a student researcher who is new to NLP and cTAKES. I am trying
> to use cTAKES to extract clinical text indicative of BRCA mutations, and
> I’m feeling a bit lost. I’ve described my current progress below, and I'm
> wondering if you can guide me to the next step:
>
>
>
> So far, I’ve been able to create .xml files for each subject in my
> dataset, run the files through the default clinical pipeline, and view the
> annotated output files in the CVD. However, my goal is to “filter” the
> annotations for concepts relevant to BRCA mutations (such as UMLS CUIs and
> SNOMED CT terms), and this is where I’m getting stuck. Is there a way to
> isolate these specific concepts within the cTAKES system? Or does this
> require post-processing using a different platform?
>
>
>
> Thanks for entertaining my amateur question!
>
>
>
> Leah Baas
>
>
>


-- 

Alden Gordon
Director of Data Science & Analytics
(860) 402-6572



rubiconmd.com <https://www.rubiconmd.com/>
from lxml import etree
import numpy as np
import pandas as pd


class Parser:
    def __init__(self):
        """
        Parses XMI files output by cTAKES. Returns a pandas DataFrame of relevant attributes for
        each annotation, e.g. CUI, preferred text, negation.
        """
        self.elements = None

    def _read_file(self, path):
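        # Parse the XMI document and keep its top-level elements for later filtering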
        raw = etree.parse(path)
        root = raw.getroot()
        self.elements = list(root)  # getchildren() is deprecated; list() gives the same child elements

    def _parse_tag(self, tag):
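        # Strip the namespace portion, e.g. '{namespace}MedicationMention' -> 'MedicationMention'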
        start = tag.find('}') + 1
        return tag[start:]

    def _parse_elements(self, elements):
        # Create list of (attribute name, value) for each element
        elem_tups = [list(zip(*elem.items())) for elem in elements]
        # Pull the attribute names from the first element
        cols = elem_tups[0][0]
        # Create list of just the attribute values for each element
        data = [elem_tup[1] for elem_tup in elem_tups]
        # Create dataframe with data for each element
        df = pd.DataFrame(data=data, columns=cols)
        # Add a column for the tag, which can include semantic type (disease, symptom, etc)
        df['tag'] = [self._parse_tag(elem.tag) for elem in elements]
        return self._clean_elements(df)

    def _clean_elements(self, element_df):
        # If 'id' is already a plain column, it's not the XMI id; drop it so the rename below doesn't collide
        if 'id' in element_df.columns:
            element_df.drop('id', axis=1, inplace=True)
        # Rename the XMI id column, which by default is "{http://www.omg.org/XMI}id"
        element_df = element_df.rename(columns={"{http://www.omg.org/XMI}id": 'id'})
        element_df['id'] = element_df['id'].astype(int)
        return element_df

    def _parse_concepts(self):
        # Get all concepts (elements in the refsem namespace)
        concepts = [el for el in self.elements if el.prefix == 'refsem']
        concepts = self._parse_elements(concepts)
        cols = ['id', 'cui', 'tui', 'preferredText']
        return concepts[cols]

    def _parse_semantic_elems(self):
        # Semantic elements have textsem prefix and reference some concept
        sems = [el for el in self.elements if el.prefix == 'textsem' and el.get('ontologyConceptArr')]
        semantics_df = self._parse_elements(sems)
        cols = ['id', 'begin', 'end', 'ontologyConceptArr', 'polarity', 'uncertainty', 'conditional',
                'generic', 'subject', 'historyOf', 'tag']
        semantics_df = semantics_df[cols]
        return self._clean_semantic_elems(semantics_df)

    def _clean_semantic_elems(self, semantics_df):
        rename_map = {
            'ontologyConceptArr': 'concept_id',
            'preferredText': 'base',
            'polarity': 'negation',
            'historyOf': 'history_of',
            'tag': 'type'
        }
        semantics_df = semantics_df.rename(columns=rename_map)
        # Remove mention from types, e.g. DiseaseDisorderMention -> DiseaseDisorder
        semantics_df['type'] = semantics_df['type'].str.replace('Mention', '')
        # Convert polarity (1 = asserted, -1 = negated) to a boolean negation flag
        semantics_df['negation'] = semantics_df['negation'].replace(['1', '-1'], [False, True])
        # Convert other integers-as-text columns to boolean
        cols = ['uncertainty', 'history_of']
        semantics_df[cols] = semantics_df[cols].astype(int).astype(bool)
        # Convert other bools-as-text columns ("true"/"false") to boolean without using eval
        cols = ['conditional', 'generic']
        semantics_df[cols] = semantics_df[cols].applymap(lambda s: s.strip().lower() == 'true')
        return self._unstack_concept_ids(semantics_df)

    def _unstack_concept_ids(self, semantics_df):
        # ontologyConceptArr is a space-delimited list of concept ids. Duplicate each row
        # once per id so the rows can be joined to the concept rows below
        lst_col = 'concept_id'
        semantics_df[lst_col] = semantics_df[lst_col].str.split(' ')
        df = pd.DataFrame({
            # For each column other than the list column, repeat each row once for each item in the list column
            col: np.repeat(
                semantics_df[col].values, semantics_df[lst_col].str.len()
            )
            for col in semantics_df.columns.difference([lst_col])
        })

        # Add unstacked list column to the new dataframe
        df = df.assign(**{lst_col: np.concatenate(semantics_df[lst_col].values)})
        df['concept_id'] = df['concept_id'].astype(int)
        # Apply original column order
        return df[semantics_df.columns.tolist()]

    def _merge(self, concepts_df, semantics_df):
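        # Join each semantic mention to the concept it references via concept_id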
        merged = concepts_df.merge(
            semantics_df.drop('id', axis=1), left_on='id', right_on='concept_id'
        )
        return merged.drop(['id', 'concept_id'], axis=1)

    def parse(self, file_path):
        """
        Parse a cTAKES XMI output file.

        Parameters:
            file_path: path to a single cTAKES XMI output file.
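
        Returns:
            pandas DataFrame with one row per (mention, concept) pair, including
            CUI, TUI, preferred text, span offsets, and negation/uncertainty flags.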
        """
        self._read_file(file_path)
        concepts = self._parse_concepts()
        semantic_elems = self._parse_semantic_elems()
        return self._merge(concepts, semantic_elems).drop_duplicates()
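

# Usage sketch: one way to filter the parsed annotations for concepts of interest,
# e.g. BRCA-related CUIs. The file path and CUI values below are placeholders;
# substitute your own XMI output and the CUIs relevant to your project.
if __name__ == '__main__':
    parser = Parser()
    annotations = parser.parse('output/patient_note.xmi')  # placeholder path

    # Keep only rows whose CUI is in the set of interest
    cuis_of_interest = {'C0000001', 'C0000002'}  # placeholder CUIs
    hits = annotations[annotations['cui'].isin(cuis_of_interest)]

    # Optionally drop negated mentions
    hits = hits[~hits['negation'].astype(bool)]

    print(hits[['cui', 'preferredText', 'begin', 'end', 'type']])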
