I started trying out the script last week.

First bit of feedback: it was taking a lot longer than the estimated 20 minutes. 
Just building the first intermediate table took longer than that.

I have since lost my session/context (had to reboot my desktop... sigh...)

I guess my high-order feedback is: which end is up? What's the goal here? I was 
expecting something related to cohort characterization, but if it is related, I 
don't understand how. Perhaps it's clear/obvious to others.

The script comment begins:
I originally started out writing a cohort-characterization script that could be
adapted to the needs of many different research projects just by altering the
cohort selection criteria (since most cohort characterization needs seem to
follow the same basic pattern).

However, from UTHSCSA's own trial-and-error and from discussions on the mailing
lists it has become clear that for multi-site counts to get non-misleading
results we need to have a very detailed characterization of each site's I2B2
semantics within its CRC (data repository) cell.
I'm interested to know more about the trial-and-error steps.

I've seen pretty good cohort characterization progress w.r.t. breast cancer and 
ALS based on HERON data builder results (which are analogous to i2b2 PDO 
results as seen in the timeline plug-in). I don't understand why that approach 
isn't suitable.

The script comment continues:

* To create a sort of empirical data dictionary we can all refer to in
  constructing federated queries that will run correctly at GPC sites as they
  *currently* are, so that cohort characterization can proceed in parallel with
  data standardization.
How is this "empirical data dictionary" different from babel? We've been 
proceeding in parallel, no? We do manual terminology alignment in some cases 
(e.g. breast cancer procedures for #119) while we work out the details of 
automated terminology alignment (#160).

* To help those working on data standardization see what else needs to be done
  and where it needs to be done the most.
* To make certain data quality issues easier to spot.

Data quality issues such as... what?

We certainly have a long list of things to be done for data standardization. 
Can you give an example of how this script helps with determining where it 
needs to be done most?

About the items below...

For domain, why not the first part of the concept path?

I wonder how free text got into tval_char; the HERON ETL scripts are fairly 
conservative about avoiding free text.

Structuring a list of recommendations around modifier_cd and such seems odd to 
me.

--
Dan


________________________________
From: gpc-dev-boun...@listserv.kumc.edu [gpc-dev-boun...@listserv.kumc.edu] on 
behalf of Alex Bokov [bo...@uthscsa.edu]
Sent: Monday, October 06, 2014 12:57 PM
To: gpc-dev@listserv.kumc.edu
Subject: Example (Re: Empirical Data Dictionary)

Kind thanks to KUMC and the sites that volunteered to test.

With help from Wisconsin and MCW so far, I have a lot of revisions to add to 
the original script that make the output smaller, the syntax less 
Oracle-specific, and eliminate or scrub certain fields. If you're not one of 
the test sites, you might want to hold out for the revised script that I'll 
send out after having had time to digest the initial reports. In response to 
some very valid concerns raised, the raw count tables will not be released 
without explicit consent from the originating sites. Instead, I propose to 
append to the enclosed table (currently representing UTHSCSA data) and send 
that out, along with other summary metrics that people might suggest as useful 
to data standardization and quality control. Is everyone okay with that?

I'll have analogous summaries of the other three tables, but those are still in 
preparation.

Here is the information I was able to obtain from running the scripts I sent 
earlier on our own data in San Antonio. In 33 lines we get a wealth of hints 
about how to make queries work as intended at our site, and things we need to 
improve. Part of this was done by aggregation of OUTPUT_CON_MOD within SQL, but 
part of it was just manual tinkering with autofilters on a spreadsheet and 
describing what I saw. Here is what each column means and why one should care:

  *   site: Site for which this row of data was generated
  *   domain: A consistent substring shared by a group of related CONCEPT_CDs 
(concept codes). These should all be listed in the YOURMETADATA.SCHEMES table, 
but I2B2 has no way of enforcing this rule, so instead I got these from the 
CONCEPT_CDs that are actually there.
  *   tval_char: Usually this field contains categorical values 
(ventral/dorsal/lateral, left/right, high/middle/low/below-threshold, etc.), but 
it often ends up with a lot of free-text values dumped into it. In San Antonio 
data, COMPONENT_ID facts use it as a free-text dump, FLOW_MEAS_ID facts use it 
as a timestamp (but the time portions are all 0's), and MEDICATION_ID facts 
use it to indicate dose ranges (I still need to look into whether these co-occur 
with NVAL_NUM fields, and if so, how often the values fail to match). 
Recommendations: this field might be useful for chronologically ordering 
multi-day FLOW_MEAS_ID observations and for quality-checking MEDICATION_ID 
observations. Further study of lab-value-specific usages of TVAL_CHAR may be 
needed before lab values that use this field can be relied on either as raw data 
or as filtering criteria. Certain values in this field can indicate a numeric 
fact, but VALTYPE_CD is a more reliable indicator.
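
To make the MEDICATION_ID dose-range check concrete, here is a minimal sketch 
(Python with sqlite3 standing in for the CRC; the three rows and the 
"low-high" TVAL_CHAR format are invented for illustration, not actual UTHSCSA 
data) of flagging facts whose NVAL_NUM disagrees with the TVAL_CHAR range:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE observation_fact (
    patient_num INT, concept_cd TEXT, tval_char TEXT, nval_num REAL)""")
# Invented MEDICATION_ID facts: tval_char holds a dose range "low-high"
con.executemany("INSERT INTO observation_fact VALUES (?,?,?,?)", [
    (1, "MEDICATION_ID:100", "10-20", 15.0),   # nval_num inside the range
    (2, "MEDICATION_ID:100", "10-20", 50.0),   # nval_num outside the range
    (3, "MEDICATION_ID:200", "5-5",   5.0),    # degenerate range, consistent
])

def range_mismatches(con):
    """Return facts where NVAL_NUM falls outside the TVAL_CHAR dose range."""
    rows = con.execute("""SELECT patient_num, concept_cd, tval_char, nval_num
                          FROM observation_fact
                          WHERE tval_char LIKE '%-%'""").fetchall()
    bad = []
    for pn, cd, tval, nval in rows:
        lo, hi = (float(x) for x in tval.split("-", 1))
        if not (lo <= nval <= hi):
            bad.append((pn, cd, tval, nval))
    return bad

print(range_mismatches(con))  # only patient 2's fact should be flagged
```

The same comparison could of course be done in SQL once the actual range 
format is pinned down; the point is just that the check is mechanical.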
  *   modifier_cd: This column lists the observed rules for what (if any) 
MODIFIER_CDs can occur in observations from each domain. Currently the only 
domains in San Antonio data that use modifiers are diagnoses (ICD9 and DX_ID), 
allergies, medications, procedure orders, and family history. Labs also use 
modifiers, but these are rarely useful since they are hard-coded to be medians 
and last observations regardless of the size, time-span, or intended analysis 
of the lab value. Recommendations: For queries on diagnoses, be mindful of 
whether you want only medical history, only active diagnoses, both, or just 
primary diagnoses [which in turn raises the question of how to reliably 
identify those]. Either way, to avoid duplicate counts, be sure to filter on 
MODIFIER_CD. For queries on a specific allergic reaction, use MODIFIER_CD. For 
queries on an allergen regardless of reaction type, leave MODIFIER_CD out and 
use the appropriate SELECT DISTINCT ... GROUP BY ... subquery. For medications, 
the Inpatient/Outpatient/Historical/PRN/Other set of modifiers can co-exist 
with each other and with the Dose/Dose_MAR ones on the same drug during the 
same encounter. Pulling only facts with Dose or Dose_MAR modifiers might be 
enough to prevent redundancies, but I'm not yet certain. For procedure orders, 
filter on MODIFIER_CD = '@' if you want all procedures, or on 
MODIFIER_CD = 'PROCORDERS:XX' if you want just procedures of type XX, but do 
not ignore these modifiers or you will get redundant values. For family 
history, always use modifiers and treat the combination of modifier and concept 
code as a distinct column; or, if you don't care what type of family history, 
just filter on MODIFIER_CD != 'FamRelation:18' and then use the appropriate 
SELECT DISTINCT ... GROUP BY ... subquery (or, for confirmed negative family 
histories, MODIFIER_CD = 'FamRelation:18').
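
The duplicate-count hazard can be illustrated with a toy example (Python with 
sqlite3; the 'DX:ACTIVE'/'DX:PRIMARY' modifier codes and the rows are invented 
stand-ins, not our actual values): one diagnosis stored once per applicable 
modifier inflates a naive count unless you pin MODIFIER_CD or count DISTINCT 
patients.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE observation_fact (
    patient_num INT, concept_cd TEXT, modifier_cd TEXT)""")
# One patient, one ICD9 diagnosis, stored once per applicable modifier
con.executemany("INSERT INTO observation_fact VALUES (?,?,?)", [
    (1, "ICD9:250.00", "@"),
    (1, "ICD9:250.00", "DX:ACTIVE"),    # hypothetical modifier codes
    (1, "ICD9:250.00", "DX:PRIMARY"),
])

naive = con.execute("""SELECT COUNT(*) FROM observation_fact
                       WHERE concept_cd = 'ICD9:250.00'""").fetchone()[0]
pinned = con.execute("""SELECT COUNT(*) FROM observation_fact
                        WHERE concept_cd = 'ICD9:250.00'
                          AND modifier_cd = '@'""").fetchone()[0]
distinct = con.execute("""SELECT COUNT(DISTINCT patient_num)
                          FROM observation_fact
                          WHERE concept_cd = 'ICD9:250.00'""").fetchone()[0]
print(naive, pinned, distinct)  # 3 1 1
```

Either pinning MODIFIER_CD or counting distinct patients fixes the count; which 
one is right depends on whether you also want to filter on the modifier's 
meaning.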
  *   valtype_cd: Only lab results (COMPONENT_ID), flowsheets (FLOW_MEAS_ID), 
meds (MEDICATION_ID), vitals (PAT_ENC), PACK_PER_DAY, and TOBACCO_USED_YEARS 
use this field, but the latter two always have the value set to 'N'. 
Recommendations: this field can be used to streamline lengthy queries and for 
quality control.
     *   Only COMPONENT_ID facts use both 'N' and 'T' values in this field. 
When it's 'N' then TVAL_CHAR is always one of E, G, GE, L, or LE and NVAL_NUM 
has a numeric value. When it's 'T' then TVAL_CHAR will have a comment-like 
value and NVAL_NUM will be NULL.
     *   For FLOW_MEAS_ID facts, if this field is 'N' then the TVAL_CHAR field 
will always be 'E' and there will almost always be a value in the NVAL_NUM 
field; if this field is 'D' then TVAL_CHAR will contain a datestamp as 
described above and NVAL_NUM will be blank; if this field is '@' then TVAL_CHAR 
will also be '@' and NVAL_NUM will be blank.
     *   For MEDICATION_ID, if this field is 'N' then the MODIFIER_CD will 
always be in the 'MedObs:Dose|%' class, TVAL_CHAR will be 'E', and NVAL_NUM 
will contain a numeric value. Otherwise this field is blank and MODIFIER_CD 
will be in the 'MedObs:[Inpatient|Outpatient|Historical|PRN|Other]' class, 
TVAL_CHAR will be a number, number-range, or '@', and NVAL_NUM will be empty.
     *   For PAT_ENC, if this field is 'N' then TVAL_CHAR is 'E' and NVAL_NUM 
has a numeric value. The other permitted value seems to be '@' in which case 
TVAL_CHAR is also '@' and NVAL_NUM is blank.
     *   PACK_PER_DAY and TOBACCO_USED_YEARS always have this field set to 'N'.
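
Rules like these can be turned into automated checks. A sketch (Python with 
sqlite3 and invented rows; in practice this would run against the real 
OBSERVATION_FACT) that verifies the COMPONENT_ID rule above, i.e. that 
VALTYPE_CD = 'N' implies TVAL_CHAR in (E, G, GE, L, LE) and a non-NULL 
NVAL_NUM:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE observation_fact (
    concept_cd TEXT, valtype_cd TEXT, tval_char TEXT, nval_num REAL)""")
con.executemany("INSERT INTO observation_fact VALUES (?,?,?,?)", [
    ("COMPONENT_ID:1", "N", "E",  7.2),          # conforms
    ("COMPONENT_ID:1", "T", "see note", None),   # conforms (text fact)
    ("COMPONENT_ID:2", "N", "high", None),       # violates the stated rule
])

# Count COMPONENT_ID facts that claim to be numeric but break the rule
violations = con.execute("""
    SELECT COUNT(*) FROM observation_fact
    WHERE concept_cd LIKE 'COMPONENT_ID:%'
      AND valtype_cd = 'N'
      AND (tval_char NOT IN ('E','G','GE','L','LE') OR nval_num IS NULL)
""").fetchone()[0]
print(violations)  # 1
```

A nonzero count would mean either the ETL drifted or the empirical rule was 
wrong, and both are worth knowing.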
  *   valueflag_cd: This field is not currently used at San Antonio and all the 
values are either NULL, '@', or 'null'.
  *   location_cd: This field is not currently used at San Antonio and all the 
values are NULL.
  *   units_cd: Many different units, used only by COMPONENT_ID, FLOW_MEAS_ID, 
and MEDICATION_ID. PACK_PER_DAY and TOBACCO_USED_YEARS also specify units, but 
those are always 'Packs'. Recommendations: append the units to the name of the 
column of interest in the analyzable output, unless it is found that for a 
particular variable the units can vary; in that case, output the units as a 
separate column. Guard against missing or incorrect unit values by looking for 
outliers that deviate by the value of the conversion factor.
     *   COMPONENT_ID uses units or leaves this field blank, seemingly 
independently of VALTYPE_CD, TVAL_CHAR, NVAL_NUM, or MODIFIER_CD.
     *   FLOW_MEAS_ID always uses units when VALTYPE_CD = 'N' and sometimes 
also uses them when it's '@' or 'D'. When units are present there is almost 
always a numeric value in NVAL_NUM.
     *   MEDICATION_ID uses units or leaves this field blank, seemingly 
independently of VALTYPE_CD, TVAL_CHAR, NVAL_NUM, or MODIFIER_CD.
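
The "deviate by the conversion factor" guard might look something like this 
(pure Python; the weights, the kg-to-lb factor of 2.2, and the tolerance are 
all illustrative assumptions):

```python
import statistics

# Invented weight observations: most in kg, one recorded in lb without
# the units field being updated (off by the kg->lb factor of ~2.2).
values = [70.0, 82.5, 65.0, 74.0, 163.0]

def unit_outliers(vals, factor=2.2, tol=0.25):
    """Flag values whose ratio to the median is near a unit-conversion
    factor (or its inverse), suggesting a mislabeled unit."""
    med = statistics.median(vals)
    flagged = []
    for v in vals:
        ratio = v / med
        if (abs(ratio - factor) < tol * factor
                or abs(ratio - 1 / factor) < tol / factor):
            flagged.append(v)
    return flagged

print(unit_outliers(values))  # flags 163.0, roughly 2.2x the median
```

Per-variable (e.g. per COMPONENT_ID) medians would be needed in practice, 
since mixing variables makes the median meaningless.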
  *   nval_num: COMPONENT_ID, FLOW_MEAS_ID, MEDICATION_ID, PAT_ENC, 
PACK_PER_DAY, and TOBACCO_USED_YEARS are the domains that use this field. When 
this field is used, it contains a numeric value. When VALTYPE_CD = 'N', this 
field is always used for COMPONENT_ID, MEDICATION_ID, and PAT_ENC facts, and 
usually used for the other domains. Recommendations: this is where to find the 
raw data for most properly recorded numeric facts, and VALTYPE_CD can generally 
be relied on to find these facts.
  *   concept_path: Only the beginning of the concept path is shown here, to 
give you an idea of where to look for it in the ontology. There may be 
additional paths that are not shown. Recommendations: for quality control, 
check for concepts that don't exist in CONCEPT_DIMENSION.
  *   modifier_path: Only the beginning of the modifier path is shown here, to 
give you an idea of where to look for it in the ontology. There may be 
additional paths that are not shown. Recommendations: for quality control, 
check for modifiers that don't exist in MODIFIER_DIMENSION.
  *   modifier_name: A modifier name, for ease of interpretation.

Warning: The above is just for San Antonio data. I don't know how much of it 
generalizes to other sites' data and how much is site-specific. I2B2 design 
documents only say how these fields ought to behave, not how we're actually 
using them. The same is true for the intended behavior of the ETLs of 
individual sites. The only way I can think of to make sure is to do counts on 
the actual co-occurrence of these fields in the raw OBSERVATION_FACT entries, 
as was done above. As more sites run this and contribute their output, we will 
start to be able to draw inferences about what a federated query should do in 
order to run as intended at each site.
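
For concreteness, the kind of co-occurrence count described above could look 
roughly like this (Python with sqlite3 and a few invented rows; the real 
script runs against OBSERVATION_FACT and derives the domain from CONCEPT_CD):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE observation_fact (
    concept_cd TEXT, valtype_cd TEXT, tval_char TEXT, nval_num REAL)""")
con.executemany("INSERT INTO observation_fact VALUES (?,?,?,?)", [
    ("COMPONENT_ID:1", "N", "E", 7.2),
    ("COMPONENT_ID:2", "N", "E", 3.1),
    ("COMPONENT_ID:3", "T", "free text", None),
    ("FLOW_MEAS_ID:9", "D", "2014-01-01 00:00:00", None),
])

# Profile how each domain (the prefix of CONCEPT_CD) actually uses the
# value columns: one row per observed combination, with a fact count.
profile = con.execute("""
    SELECT substr(concept_cd, 1, instr(concept_cd, ':') - 1) AS domain,
           valtype_cd,
           CASE WHEN nval_num IS NULL THEN 'null' ELSE 'numeric' END AS nval,
           COUNT(*) AS n
    FROM observation_fact
    GROUP BY 1, 2, 3
    ORDER BY 1, 2
""").fetchall()
for row in profile:
    print(row)
```

Each output row is one empirically observed usage pattern, which is exactly 
the kind of row the summary tables aggregate across sites.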


_______________________________________________
Gpc-dev mailing list
Gpc-dev@listserv.kumc.edu
http://listserv.kumc.edu/mailman/listinfo/gpc-dev
