I started trying out the script last week. First bit of feedback: it was taking a lot longer than the estimated 20 minutes. Just building the first intermediate table took longer than that.
I have since lost my session/context (had to reboot my desktop... sigh...). I guess my high-order feedback is: which end is up? What's the goal here? I was expecting something related to cohort characterization, but if it is related, I don't understand how. Perhaps it's clear/obvious to others.

The script comment begins:

> I originally started out writing a cohort-characterization script that could be adapted to the needs of many different research projects just by altering the cohort selection criteria (since most cohort characterization needs seem to follow the same basic pattern). However, from UTHSCSA's own trial-and-error and from discussions on the mailing lists it has become clear that for multi-site counts to get non-misleading results we need to have a very detailed characterization of each site's I2B2 semantics within its CRC (data repository) cell.

I'm interested to know more about the trial-and-error steps. I've seen pretty good cohort characterization progress w.r.t. breast cancer and ALS based on HERON data builder results (which are analogous to i2b2 PDO results as seen in the timeline plug-in). I don't understand why that approach isn't suitable.

The script comment continues:

> * To create a sort of empirical data dictionary we can all refer to in constructing federated queries that will run correctly at GPC sites as they *currently* are, so that cohort characterization can proceed in parallel with data standardization.

How is this "empirical data dictionary" different from babel? We've been proceeding in parallel, no? We do manual terminology alignment in some cases (e.g. breast cancer procedures for #119) while we work out the details of automated terminology alignment (#160).

> * To help those working on data standardization see what else needs to be done and where it needs to be done the most.
> * To make certain data quality issues easier to spot.

Data quality issues such as... what?
We certainly have a long list of things to be done for data standardization. Can you give an example of how this script helps with determining where it needs to be done most?

About the items below...

For domain, why not the first part of the concept path?

I wonder how free text got into tval_char; the HERON ETL scripts are fairly conservative about avoiding free text.

Structuring a list of recommendations around modifier_cd and such seems odd to me.

-- Dan

________________________________
From: gpc-dev-boun...@listserv.kumc.edu [gpc-dev-boun...@listserv.kumc.edu] on behalf of Alex Bokov [bo...@uthscsa.edu]
Sent: Monday, October 06, 2014 12:57 PM
To: gpc-dev@listserv.kumc.edu
Subject: Example (Re: Empirical Data Dictionary)

Kind thanks to KUMC and the sites that volunteered to test. With help from Wisconsin and MCW so far, I have a lot of revisions to add to the original script that make the output smaller, the syntax less Oracle-specific, and eliminate or scrub certain fields. If you're not one of the test sites, you might want to hold off for the revised script that I'll send out after having had time to digest the initial reports.

In response to some very valid concerns raised, the raw count tables will not be released without explicit consent from the originating sites. Instead, I propose to append to the enclosed table (currently representing UTHSCSA data) and send that out, along with other summary metrics that people might suggest as useful to data standardization and quality control. Is everyone okay with that? I'll have analogous summaries of the other three tables, but those are still in preparation.

Here is the information I was able to obtain from running the scripts I sent earlier on our own data in San Antonio. In 33 lines we get a wealth of hints about how to make queries work as intended at our site, and things we need to improve.
Part of this was done by aggregation of OUTPUT_CON_MOD within SQL, but part of it was just manual tinkering with autofilters on a spreadsheet and describing what I saw. Here is what each column means and why one should care:

* site: Site for which this row of data was generated.

* domain: A consistent substring shared by a group of related CONCEPT_CDs (concept codes). These should all be listed in the YOURMETADATA.SCHEMES table, but I2B2 has no way of enforcing this rule, so instead I got these from the CONCEPT_CDs that are actually there.

* tval_char: Usually this field contains categorical values (ventral/dorsal/lateral, left/right, high/middle/low/below-threshold, etc.), but it often ends up with a lot of free-text values dumped into it. In San Antonio data, COMPONENT_ID facts use it as a free-text dump, FLOW_MEAS_ID facts use it as a time-stamp (but the time portions are all 0's), and MEDICATION_ID facts use it to indicate dose ranges (I still need to look into whether these co-occur with NVAL_NUM fields, and if so, how often the values fail to match). Recommendations: this field might be useful for chronologically ordering multi-day FLOW_MEAS_ID observations and quality-checking MEDICATION_ID observations. Further study of lab-value-specific usages of TVAL_CHAR may be needed before lab values that use this field can be relied on either as raw data or as filtering criteria. Certain values in this field can indicate a numeric fact, but VALTYPE_CD is a more reliable indicator.

* modifier_cd: Here are listed the observed rules for which (if any) MODIFIER_CDs can occur in observations from each domain. Currently the only domains in San Antonio data that use modifiers are diagnoses (ICD9 and DX_ID), allergies, medications, procedure orders, and family history. Labs also use modifiers, but these are rarely useful since they are hard-coded to be medians and last observations regardless of the size, time-span, or intended analysis of the lab value.
Recommendations:

  * For queries on diagnoses, be mindful of whether you want only medical history, only active diagnoses, both, or just primary diagnoses [which in turn raises the question of how to reliably identify those]. Either way, to avoid duplicate counts, be sure to filter on MODIFIER_CD.

  * For queries on a specific allergic reaction, use MODIFIER_CD. For queries on an allergen regardless of reaction type, leave MODIFIER_CD out, and use the appropriate SELECT DISTINCT ... GROUP BY ... subquery.

  * For medications, the Inpatient/Outpatient/Historical/PRN/Other set of modifiers can co-exist with each other and with the Dose/Dose_MAR ones on the same drug during the same encounter. Pulling only facts with Dose or Dose_MAR modifiers might be enough to prevent redundancies, but I'm not yet certain.

  * For procedure orders, filter on MODIFIER_CD = '@' if you want all procedures, or on MODIFIER_CD = 'PROCORDERS:XX' if you want just procedures of type XX, but do not ignore these modifiers or you will get redundant values.

  * For family history, always use modifiers and treat the combination of modifier and concept code as a distinct column, OR if you don't care about the type of family history, just filter on MODIFIER_CD != 'FamRelation:18' and then use the appropriate SELECT DISTINCT ... GROUP BY ... subquery (or, for confirmed negative family histories, MODIFIER_CD = 'FamRelation:18').

* valtype_cd: Only lab results (COMPONENT_ID), flowsheets (FLOW_MEAS_ID), meds (MEDICATION_ID), vitals (PAT_ENC), PACK_PER_DAY, and TOBACCO_USED_YEARS use this field, but the latter two always have the value set to 'N'. Recommendations: can be used to streamline lengthy queries and for quality control.

  * Only COMPONENT_ID facts use both 'N' and 'T' values in this field. When it's 'N', TVAL_CHAR is always one of E, G, GE, L, or LE and NVAL_NUM has a numeric value. When it's 'T', TVAL_CHAR will have a comment-like value and NVAL_NUM will be NULL.
  * For FLOW_MEAS_ID facts, if this field is 'N' then the TVAL_CHAR field will always be 'E' and there will almost always be a value in the NVAL_NUM field; if this field is 'D' then TVAL_CHAR will contain a datestamp as described above and NVAL_NUM will be blank; if this field is '@' then TVAL_CHAR will also be '@' and NVAL_NUM will be blank.

  * For MEDICATION_ID, if this field is 'N' then MODIFIER_CD will always be in the 'MedObs:Dose|%' class, TVAL_CHAR will be 'E', and NVAL_NUM will contain a numeric value. Otherwise this field is blank and MODIFIER_CD will be in the 'MedObs:[Inpatient|Outpatient|Historical|PRN|Other]' class, TVAL_CHAR will be a number, number-range, or '@', and NVAL_NUM will be empty.

  * For PAT_ENC, if this field is 'N' then TVAL_CHAR is 'E' and NVAL_NUM has a numeric value. The other permitted value seems to be '@', in which case TVAL_CHAR is also '@' and NVAL_NUM is blank.

  * PACK_PER_DAY and TOBACCO_USED_YEARS always have this field set to 'N'.

* valueflag_cd: This field is not currently used at San Antonio and all the values are either NULL, '@', or 'null'.

* location_cd: This field is not currently used at San Antonio and all the values are NULL.

* units_cd: Many different units, used only by COMPONENT_ID, FLO_MEAS_ID, and MEDICATION_ID. PACKS_PER_DAY and TOBACCO_USED_YEARS also specify units, but those are always 'Packs'. Recommendations: append the units to the name of the column of interest in the analyzable output, unless it is found that for a particular variable the units can vary... in that case, output them as a separate column. Guard against missing or incorrect unit values by looking for outliers that deviate by the value of the conversion factor.

  * COMPONENT_ID uses units or leaves this field blank seemingly independently of VALTYPE_CD, TVAL_CHAR, NVAL_NUM, or MODIFIER_CD.

  * FLO_MEAS_ID always uses units when VALTYPE_CD = 'N' and sometimes also uses them when it's '@' or 'D'. When units are present there is almost always a numeric value in NVAL_NUM.

  * MEDICATION_ID uses units or leaves this field blank seemingly independently of VALTYPE_CD, TVAL_CHAR, NVAL_NUM, or MODIFIER_CD.

* nval_num: COMPONENT_ID, FLOW_MEAS_ID, MEDICATION_ID, PAT_ENC, PACK_PER_DAY, and TOBACCO_USED_YEARS are the domains that use this field. When this field is used, it contains a numeric value. When VALTYPE_CD = 'N' this field is always (COMPONENT_ID, MEDICATION_ID, PAT_ENC) or usually (the other domains) used. Recommendations: this is where to find the raw data for most properly recorded numeric facts, and VALTYPE_CD can generally be relied on to find these facts.

* concept_path: Here is shown only the beginning part of the concept path, to give you an idea of where to look for it in the ontology. There may be additional paths that are not shown here. Recommendations: for quality control, check for concepts that don't exist in CONCEPT_DIMENSION.

* modifier_path: Here is shown only the beginning part of the modifier path, to give you an idea of where to look for it in the ontology. There may be additional paths that are not shown here. Recommendations: for quality control, check for modifiers that don't exist in MODIFIER_DIMENSION.

* modifier_name: A modifier name, for ease of interpretation.

Warning: The above is just for San Antonio data. I don't know how much of it generalizes to most data, and how much is site-specific. I2B2 design documents only say how these fields ought to behave, not how we're actually using them. The same is true for the intended behavior of the ETLs of individual sites. The only way I can think of to make sure is to do counts on actual co-occurrence of these fields in the raw OBSERVATION_FACT entries, as was done above. As more sites run this and contribute their output, we will start to be able to draw inferences about what a federated query should do in order to run as intended at each site.
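To make the co-occurrence counting idea concrete, here is a minimal sketch of the kind of aggregation described above. This is NOT the GPC script itself: it builds a toy OBSERVATION_FACT table in SQLite (column names follow the i2b2 CRC star schema, but the sample rows and the "prefix before the colon" domain rule are illustrative assumptions) and tallies how VALTYPE_CD, TVAL_CHAR, and MODIFIER_CD actually co-occur within each domain.

```python
import sqlite3

# Toy stand-in for the i2b2 CRC OBSERVATION_FACT table (subset of columns).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""
    CREATE TABLE observation_fact (
        concept_cd  TEXT,
        modifier_cd TEXT,
        valtype_cd  TEXT,
        tval_char   TEXT,
        nval_num    REAL
    )""")
# Hypothetical sample facts, loosely patterned on the San Antonio
# observations described above (numeric labs, a free-text lab, a dosed
# medication, and an un-dosed PRN medication fact).
cur.executemany(
    "INSERT INTO observation_fact VALUES (?,?,?,?,?)",
    [
        ("COMPONENT_ID:123",  "@",           "N", "E",        5.4),
        ("COMPONENT_ID:124",  "@",           "N", "E",        3.1),
        ("COMPONENT_ID:123",  "@",           "N", "GE",       7.0),
        ("COMPONENT_ID:456",  "@",           "T", "see note", None),
        ("MEDICATION_ID:789", "MedObs:Dose", "N", "E",        10.0),
        ("MEDICATION_ID:789", "MedObs:PRN",  "",  "@",        None),
    ])

# The empirical-data-dictionary move: instead of trusting the design
# documents, count how the value-related fields co-occur per domain
# (here the domain is taken to be the CONCEPT_CD prefix before ':').
rows = cur.execute("""
    SELECT substr(concept_cd, 1, instr(concept_cd, ':') - 1) AS domain,
           valtype_cd,
           tval_char,
           modifier_cd,
           COUNT(*)                  AS n_facts,
           SUM(nval_num IS NOT NULL) AS n_numeric
    FROM observation_fact
    GROUP BY domain, valtype_cd, tval_char, modifier_cd
    ORDER BY domain, n_facts DESC
""").fetchall()
for r in rows:
    print(r)
```

Each output row is one observed usage pattern with its frequency; comparing n_facts against n_numeric flags patterns where VALTYPE_CD = 'N' facts are missing their NVAL_NUM, which is the sort of data quality issue the summary table is meant to surface across sites.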
_______________________________________________ Gpc-dev mailing list Gpc-dev@listserv.kumc.edu http://listserv.kumc.edu/mailman/listinfo/gpc-dev