Leah,
If you know Python, feel free to use the simple class I wrote to parse
cTAKES XMI files (attached). It only pulls out the information I needed for
my use case, so you may need to adapt it.
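If it helps, the core idea (scanning the XMI for annotations carrying CUIs you care about) can be sketched in a few lines. This is a rough, standalone illustration, not the attached class: the XMI fragment, the namespace URIs, and the CUI set below are made up for the example (swap in the BRCA-related CUIs you actually need), and it uses the standard-library ElementTree rather than lxml.

```python
import xml.etree.ElementTree as ET

# Made-up fragment shaped roughly like cTAKES XMI output: UmlsConcept
# elements carry cui / preferredText attributes.
xmi = """<xmi:XMI xmlns:xmi="http://www.omg.org/XMI"
                  xmlns:refsem="http:///org/apache/ctakes/typesystem/type/refsem.ecore">
  <refsem:UmlsConcept xmi:id="101" cui="C0006142" preferredText="Malignant neoplasm of breast"/>
  <refsem:UmlsConcept xmi:id="102" cui="C0011849" preferredText="Diabetes Mellitus"/>
</xmi:XMI>"""

# Placeholder set of CUIs of interest -- substitute your own.
cuis_of_interest = {'C0006142'}

root = ET.fromstring(xmi)
# Walk every element and keep those whose cui attribute is in the set
hits = [
    (el.get('cui'), el.get('preferredText'))
    for el in root.iter()
    if el.get('cui') in cuis_of_interest
]
print(hits)
```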
Best,
Alden
On Tue, Jan 8, 2019 at 9:53 AM Smith, Lincoln <[email protected]>
wrote:
> I don't know of anything other than parsing the XML text to look for your
> preferred terminology and CUIs of interest. It's not overly
> difficult in R if you Google some of their XML parsing examples. Lincoln
>
>
>
> Lincoln Smith, MD, MS
>
> Director, Analytic Enablement
>
> Customer Engagement & Insight
>
> 412-544-8043
>
> [email protected]
>
>
>
> *From:* Baas,Leah [mailto:[email protected]]
> *Sent:* Tuesday, January 08, 2019 9:44 AM
> *To:* [email protected]
> *Subject:* [EXTERNAL] Filtering Annotated Files
>
>
>
> To whom it may concern,
>
>
>
> Hello! I am a student researcher who is new to NLP and cTAKES. I am trying
> to use cTAKES to extract clinical text indicative of BRCA mutations, and
> I’m feeling a bit lost. I’ve described my current progress below, and I’m
> wondering if you can guide me to the next step:
>
>
>
> So far, I’ve been able to create .xml files for each subject in my
> dataset, run the files through the default clinical pipeline, and view the
> annotated output files in the CVD. However, my goal is to “filter” the
> annotations for concepts relevant to BRCA mutations (such as UMLS CUIs and
> SNOMED CT terms), and this is where I’m getting stuck. Is there a way to
> isolate these specific concepts within the cTAKES system? Or does this
> require post-processing using a different platform?
>
>
>
> Thanks for entertaining my amateur question!
>
>
>
> Leah Baas
>
>
>
> -----------------------------------------------------------------------
> Confidentiality Notice: This e-mail message, including any attachments,
> is for the sole use of the intended recipient(s) and may contain
> privileged and confidential information. Any unauthorized review, use,
> disclosure or distribution is prohibited. If you are not the intended
> recipient, please contact the sender by reply e-mail and destroy
> all copies of the original message.
>
--
Alden Gordon
Director of Data Science & Analytics
(860) 402-6572
rubiconmd.com <https://www.rubiconmd.com/>
from lxml import etree

import numpy as np
import pandas as pd


class Parser:
    """
    Parses XMI files output by cTAKES. Returns a pandas DataFrame of relevant
    attributes for each annotation, e.g. CUI, preferred text, negation.
    """

    def __init__(self):
        self.elements = None

    def _read_file(self, path):
        raw = etree.parse(path)
        root = raw.getroot()
        self.elements = list(root)

    def _parse_tag(self, tag):
        # Strip the namespace, e.g. '{...}DiseaseDisorderMention' -> 'DiseaseDisorderMention'
        start = tag.find('}') + 1
        return tag[start:]

    def _parse_elements(self, elements):
        # Create a list of (attribute names, attribute values) for each element
        elem_tups = [list(zip(*elem.items())) for elem in elements]
        # Pull the attribute names from the first element
        cols = elem_tups[0][0]
        # Create a list of just the attribute values for each element
        data = [elem_tup[1] for elem_tup in elem_tups]
        # Create a dataframe with the data for each element
        df = pd.DataFrame(data=data, columns=cols)
        # Add a column for the tag, which can include the semantic type (disease, symptom, etc.)
        df['tag'] = [self._parse_tag(elem.tag) for elem in elements]
        return self._clean_elements(df)

    def _clean_elements(self, element_df):
        # If 'id' is already a column, it's the wrong one. Drop it.
        if 'id' in element_df.columns:
            element_df = element_df.drop('id', axis=1)
        # Rename the id column, which by default is "{http://www.omg.org/XMI}id"
        element_df = element_df.rename(columns={"{http://www.omg.org/XMI}id": 'id'})
        element_df['id'] = element_df['id'].astype(int)
        return element_df

    def _parse_concepts(self):
        # Get all concepts
        concepts = [el for el in self.elements if el.prefix == 'refsem']
        concepts = self._parse_elements(concepts)
        cols = ['id', 'cui', 'tui', 'preferredText']
        return concepts[cols]

    def _parse_semantic_elems(self):
        # Semantic elements have the textsem prefix and reference some concept
        sems = [el for el in self.elements
                if el.prefix == 'textsem' and el.get('ontologyConceptArr')]
        semantics_df = self._parse_elements(sems)
        cols = ['id', 'begin', 'end', 'ontologyConceptArr', 'polarity', 'uncertainty',
                'conditional', 'generic', 'subject', 'historyOf', 'tag']
        semantics_df = semantics_df[cols]
        return self._clean_semantic_elems(semantics_df)

    def _clean_semantic_elems(self, semantics_df):
        rename_map = {
            'ontologyConceptArr': 'concept_id',
            'preferredText': 'base',
            'polarity': 'negation',
            'historyOf': 'history_of',
            'tag': 'type'
        }
        semantics_df = semantics_df.rename(columns=rename_map)
        # Remove 'Mention' from types, e.g. DiseaseDisorderMention -> DiseaseDisorder
        semantics_df['type'] = semantics_df['type'].str.replace('Mention', '')
        # Convert the polarity column from 1 vs. -1 to boolean
        semantics_df['negation'] = semantics_df['negation'].replace(['1', '-1'], [False, True])
        # Convert other integers-as-text columns to boolean
        cols = ['uncertainty', 'history_of']
        semantics_df[cols] = semantics_df[cols].astype(int).astype(bool)
        # Convert bools-as-text columns ('true'/'false') to boolean without eval()
        cols = ['conditional', 'generic']
        semantics_df[cols] = semantics_df[cols].applymap(lambda s: s.lower() == 'true')
        return self._unstack_concept_ids(semantics_df)

    def _unstack_concept_ids(self, semantics_df):
        # The concept IDs column can be a space-delimited list. Duplicate rows for
        # each item in the list to facilitate joining to the concept rows.
        lst_col = 'concept_id'
        semantics_df[lst_col] = semantics_df[lst_col].str.split(' ')
        df = pd.DataFrame({
            # For each other column, repeat each row once per item in the list column
            col: np.repeat(semantics_df[col].values, semantics_df[lst_col].str.len())
            for col in semantics_df.columns.difference([lst_col])
        })
        # Add the unstacked list column to the new dataframe
        df = df.assign(**{lst_col: np.concatenate(semantics_df[lst_col].values)})
        df['concept_id'] = df['concept_id'].astype(int)
        # Restore the original column order
        return df[semantics_df.columns.tolist()]

    def _merge(self, concepts_df, semantics_df):
        merged = concepts_df.merge(
            semantics_df.drop('id', axis=1), left_on='id', right_on='concept_id'
        )
        return merged.drop(['id', 'concept_id'], axis=1)

    def parse(self, file_path):
        """
        Parse a cTAKES XMI output file.

        Parameters:
            file_path: path to a single cTAKES XMI output file.
        """
        self._read_file(file_path)
        concepts = self._parse_concepts()
        semantic_elems = self._parse_semantic_elems()
        return self._merge(concepts, semantic_elems).drop_duplicates()
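Once parsed, filtering for concepts of interest is a one-liner on the resulting DataFrame. A rough sketch of that step follows: the DataFrame here is hand-built to mirror a few of the parser's output columns (since running the class requires a real XMI file), and the CUI set is a placeholder to be replaced with the BRCA-related CUIs you care about.

```python
import pandas as pd

# Hand-built stand-in for Parser().parse(...) output (columns abbreviated)
annotations = pd.DataFrame({
    'cui': ['C0006142', 'C0011849', 'C0006142'],
    'preferredText': ['Malignant neoplasm of breast', 'Diabetes Mellitus',
                      'Malignant neoplasm of breast'],
    'negation': [False, False, True],
})

# Placeholder set of CUIs; swap in the UMLS CUIs relevant to your study
cuis_of_interest = {'C0006142'}

# Keep only non-negated mentions of the target concepts
hits = annotations[annotations['cui'].isin(cuis_of_interest) & ~annotations['negation']]
print(hits)
```

Dropping the `~annotations['negation']` term keeps negated mentions too, which may matter if you want to distinguish "BRCA mutation present" from "BRCA mutation ruled out."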