Here is the updated version of the EDD that I promised last week.

New:
* Obfuscates OBSERVATION_FACT.TVAL_CHAR, PROVIDER_DIMENSION.PROVIDER_PATH, and PROVIDER_DIMENSION.NAME_CHAR so you can still see whether and how these fields get used without seeing any actual text (with the exception of safe, commonly occurring tag-values for TVAL_CHAR such as 'E', '@', 'NONE', etc.).
* Much, much smaller output.
* No more `WITH` statements, to avoid incompatibility with non-Oracle databases (but, sorry, the script now uses NVL, REGEXP_REPLACE, and REGEXP_SUBSTR... I'll look up the Postgres and Microsoft equivalents and update the script unless someone beats me to it; a rough, untested sketch of likely equivalents follows this list).
* Temporary tables that get created also get deleted by the script, except OBS_TEMP, in case you want to explore it some more on your own.
* More detailed explanations and instructions on how you can use these tables internally to hunt down problems, including a breakdown of all visits (supposedly) occurring in each calendar year in your system.
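
For anyone who wants to try porting it before I do, here is a rough, untested sketch of what I believe the equivalents are (I have only run this on Oracle, so treat the PostgreSQL and SQL Server notes as assumptions until someone verifies them):

  -- NVL(x,y) is Oracle-only; COALESCE(x,y) is the ANSI equivalent and works on
  -- Oracle, PostgreSQL, and SQL Server:
  --   NVL(modifier_cd,'@|')  ->  COALESCE(modifier_cd,'@|')
  -- REGEXP_REPLACE exists in PostgreSQL with a compatible basic signature.
  -- Recent PostgreSQL versions also have REGEXP_SUBSTR; older ones would need
  -- SUBSTRING(col FROM 'pattern') for POSIX-regex extraction instead.
  -- SQL Server has no built-in regular-expression functions, so the masking
  -- would need CLR functions or a different approach there.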

Thanks to those who have tested the previous version. If you already sent in output, there is no need to re-send it; I should be able to condense the large output you sent into the shorter output this version generates. If you were put off either by the file size or by the text fields, I believe I have addressed the problems that were found, and I look forward to your further input.

/* Empirical Data Dictionary, v 0.2.0.0 */

/**
I originally started out writing a cohort-characterization script that could be
adapted to the needs of many different research projects just by altering the 
cohort selection criteria (since most cohort characterization needs seem to 
follow the same basic pattern).

However, from UTHSCSA's own trial and error and from discussions on the mailing
lists, it has become clear that for multi-site counts to be non-misleading we
need a very detailed characterization of each site's I2B2 semantics within its
CRC (data repository) cell. The immediate goals of this script are:

* To create a sort of empirical data dictionary we can all refer to in 
  constructing federated queries that will run correctly at GPC sites as they 
  *currently* are, so that cohort characterization can proceed in parallel with 
  data standardization.
* To help those working on data standardization see what else needs to be done
  and where it needs to be done the most.
* To make certain data quality issues easier to spot.

This script ONLY does counts: no individual records are extracted into the
output tables, and no dates, patient_nums, or location information will end
up in them.

This script was tested in SQL Developer but will probably work in
a SQL*Plus console. The runtime is estimated to be about an hour.

INSTRUCTIONS:

1. Set the appropriate values for `dataschema` and `yoursite`. Please leave 
   `conceptinfo` as-is.
2. Run this script as a user who has select access to the schema you specified
   as your `dataschema`. That is the only external schema you'll need for this 
   version of the script. The user running this script will also need 
   permissions to create tables in their own schema. No tables will be created
   anywhere else. For some of the tables, SQL Developer might output
   "Failed: Warning: execution completed with warning".
   This might have something to do with
   https://support.oracle.com/rs?type=bug&id=7526579
   Long story short, the tables do get created, so disregard the warning.
3. After running the script, please either export the tables OUTPUT_CON_MOD,
   OUTPUT_BASIC_DEMOG, OUTPUT_VISIT, and OUTPUT_PROVIDER to .csv files, or view
   them in SQL Developer, select all, and copy-paste into spreadsheets.
4. Email them to bo...@uthscsa.edu 

Thank you very, very much. Please let me know if you want a copy of the GPC-wide
results or if you have questions or suggestions.
**/

/** the name of the schema where you keep your DE-identified data 
    (e.g. OBSERVATION_FACT) **/
define dataschema = BLUEHERONDATA;
/** A string by which to identify your site **/
define yoursite = UTHSCSA;

-- columns used to empirically find out what data type each concept_cd actually takes
define conceptinfo = "valtype_cd,valueflag_cd,location_cd,units_cd";

-- values of TVAL_CHAR that are passed to the output unobfuscated
define std_tvals="'@','E','NEGATIVE','NOT APPLICABLE','G','SEE NOTE','NON-REACTIVE','NO INFO','GE','L','POSITIVE','YES'";

/** temporary table needed for faster creation of the concepts table **/
CREATE TABLE tmpcons AS (
  SELECT ccd
  ,listagg(conpath,';') WITHIN GROUP (ORDER BY NULL) conpath
  FROM (SELECT DISTINCT concept_cd ccd
    ,regexp_substr(concept_path,'\\[^\\]+\\[^\\]+\\[^\\]+\\') conpath
    FROM &dataschema..concept_dimension)
  GROUP BY ccd);
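
/* For illustration only: what the regexp above keeps, applied to a made-up
   concept path (the path below is hypothetical, not from any site's metadata).
   It extracts just the first three levels of the hierarchy:
SELECT regexp_substr('\i2b2\Diagnoses\ICD9\Circulatory system\Hypertension\'
  ,'\\[^\\]+\\[^\\]+\\[^\\]+\\') first_three_levels FROM dual;
   ...which should return '\i2b2\Diagnoses\ICD9\' */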

/* this is the slowest one, sorry about that */    
CREATE TABLE tmpot1 AS (SELECT DISTINCT nvl(Regexp_Substr(obf.Concept_Cd,'^.*:')
  ,concept_cd) domain ,patient_num,start_date,obf.concept_cd,modifier_cd
  /* If TVAL_CHAR is one of the code-like values, we pass it through as-is.
    The upper/lower case is deliberately left as-is to help identify domains 
    where there are prevalent case-clones of what should be the same value */
  ,CASE WHEN upper(tval_char) IN (&std_tvals) THEN tval_char
  ELSE REPLACE(REPLACE(regexp_replace(
  /* However, if TVAL_CHAR is some long, complex string we first replace all the
    numeric characters with 0s, then replace all the alphabetic characters with 
    Xs, and leave all spaces and punctuation as-is. Then, we replace 0.0 and .0
    with 0. So, if you see 0 in the output later on, keep in mind that this 
    encompasses both integer and float values. Also, a datestamp might look like
    0-0-0 0:0XX. Now, we are protected against sensitive information leaking 
    into the output and also against overly large output: for UTHSCSA data, 
    version 0.1.0 of this script produced an OBS_TEMP that was almost 
    70 million lines long. This one produces an OBS_TEMP that is 2658 lines 
    long! */
    regexp_replace(tval_char,'[0-9]+',0),'[[:alpha:]]','X'),'0.0',0),'.0',0)
  END tval_char, &conceptinfo
  /* if NVAL_NUM is not null, we assume it is numeric. The raw value is not 
    returned of course */
  ,CASE WHEN nval_num IS NULL THEN 'None' ELSE 'Numeric' END nval, conpath
  FROM &dataschema..observation_fact obf 
  LEFT JOIN tmpcons ON concept_cd = ccd);
-- 1517.7 seconds
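
/* A quick, optional sanity check of the masking logic above, run on made-up
   sample strings (the literals are hypothetical, not from any site's data;
   this is not part of the pipeline):
SELECT REPLACE(REPLACE(regexp_replace(
    regexp_replace(sample,'[0-9]+',0),'[[:alpha:]]','X'),'0.0',0),'.0',0) masked
  FROM (SELECT 'Smith, John 1/2/2014 10:30AM' sample FROM dual
    UNION ALL SELECT '12.5 mg twice daily' FROM dual);
   ...which should return 'XXXXX, XXXX 0/0/0 0:0XX' and '0 XX XXXXX XXXXX' */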

/* this is the second slowest one, but then it's all faster from there on out */
-- drop table obs_temp;
CREATE TABLE obs_temp AS
/* the tmpot1 temporary table does all the value transformations and exclusion
  of duplicates. The problem we're trying to solve is that TVAL_CHAR sometimes 
  acts as a code and sometimes as a freeform string... and this behavior is
  governed by rules we don't know for certain (these rules are among the things
  we are trying to discover by running this script). So we simplify TVAL_CHAR */
(SELECT domain,modifier_cd,tval_char,&conceptinfo,nval,count(*) n,conpath
  FROM tmpot1 GROUP BY domain,modifier_cd,tval_char,&conceptinfo,nval,conpath);
-- 640.8 seconds
/* Create tmpot2 from OBS_TEMP by keeping only the rare combinations (n <= 10),
  omitting TVAL_CHAR and UNITS_CD, and summing over the individual counts, as
  per above. The ojoin column is for convenience, to join tdeflate, udeflate,
  and cdeflate below without spamming redundant 'ON' clauses. The nvl(...,'@|')
  calls are there so ambiguous strings aren't created when there are adjacent
  NULL values. */
-- drop table tmpot2;
CREATE TABLE tmpot2 AS (
  SELECT domain,modifier_cd,valtype_cd,valueflag_cd,location_cd,nval,sum(n) n
  ,domain||nvl(modifier_cd,'@|')||nvl(valtype_cd,'@|')||nvl(valueflag_cd,'@|')
    ||nvl(location_cd,'@|')||nval ojoin
  FROM obs_temp WHERE n <= 10
  GROUP BY domain,modifier_cd,valtype_cd,valueflag_cd,location_cd,nval);
-- 0.062 seconds, 59 rows
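
/* Why the nvl(...,'@|') padding above matters: when a field is NULL, plain
   concatenation lets two different combinations collide. A made-up
   illustration (not part of the pipeline):
SELECT 'ICD9:'||'X'||NULL ambiguous_a, 'ICD9:'||NULL||'X' ambiguous_b FROM dual;
   ...both come out as 'ICD9:X', whereas with the '@|' padding they would be
   'ICD9:X@|' and 'ICD9:@|X', which no longer collide. */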

/* Many of the combinations in OBS_TEMP are rather rare. We need to know that 
  they can occur, but they might not merit individual attention as much as 
  common occurrences. The main culprits (at least at San Antonio, please let us
  know if you find additional ones) are TVAL_CHAR and UNITS_CD. Since we're also
  aggregating by the first three levels of the concept hierarchy, we do those 
  too. Here we take the rare occurrences and aggregate all their unique 
  TVAL_CHAR, UNITS_CD, and CONPATH values into ';'-delimited strings. The Ns are
  aggregated by summing. We began by creating a temporary table (tmpot2, above)
  of just the rare value combinations, to hopefully speed up the query by
  reducing the search-space. */
/* create from obs_temp by aggregating distinct TVAL_CHAR values */
-- drop table tdeflate;
create table tdeflate as (
  SELECT domain,modifier_cd
  ,listagg(tval_char,';') WITHIN GROUP(ORDER BY NULL) tval_char
  ,valtype_cd,valueflag_cd,location_cd,nval
  /* the following column is created for joining on ojoin, above */
  ,domain||nvl(modifier_cd,'@|')||nvl(valtype_cd,'@|')||nvl(valueflag_cd,'@|')
    ||nvl(location_cd,'@|')||nval tjoin
  FROM (SELECT DISTINCT
    domain,modifier_cd,tval_char,valtype_cd,valueflag_cd,location_cd,nval
    FROM obs_temp)
  GROUP BY domain,modifier_cd,valtype_cd,valueflag_cd,location_cd,nval);
-- 0.327 seconds, 126 rows

/* create from obs_temp by aggregating distinct UNITS_CD values */
create table udeflate as (
  SELECT domain,modifier_cd
  ,listagg(units_cd,';') WITHIN GROUP(ORDER BY NULL) units_cd
  ,valtype_cd,valueflag_cd,location_cd,nval
  /* the following column is created for joining on ojoin, above */
  ,domain||nvl(modifier_cd,'@|')||nvl(valtype_cd,'@|')||nvl(valueflag_cd,'@|')
    ||nvl(location_cd,'@|')||nval ujoin
  FROM (SELECT DISTINCT
    domain,modifier_cd,units_cd,valtype_cd,valueflag_cd,location_cd,nval
    FROM obs_temp)
  GROUP BY domain,modifier_cd,valtype_cd,valueflag_cd,location_cd,nval);
-- 0.08 seconds, 126 rows

/* create from obs_temp by aggregating distinct CONPATH values */
-- drop table cdeflate ;
create table cdeflate AS (
  SELECT domain,modifier_cd
  ,substr(listagg(conpath,';') WITHIN GROUP(ORDER BY NULL),1,100) conpath
  ,valtype_cd,valueflag_cd,location_cd,nval
  /* the following column is created for joining on ojoin, above */
  ,domain||nvl(modifier_cd,'@|')||nvl(valtype_cd,'@|')||nvl(valueflag_cd,'@|')
    ||nvl(location_cd,'@|')||nval cjoin
  FROM (SELECT DISTINCT
    domain,modifier_cd,valtype_cd,valueflag_cd,location_cd,nval,conpath
    FROM obs_temp)
  GROUP BY domain,modifier_cd,valtype_cd,valueflag_cd,location_cd,nval);
-- 0.212 seconds, 126 rows

/* Now, take the non-rare concepts and append to them the above rare ones, 
  with TVAL_CHAR, UNITS_CD, and CONPATH joined back onto them. For MySQL users 
  wondering why we can't just use GROUP_CONCAT(DISTINCT ...), it's because 
  Oracle DOESN'T HAVE anything equivalent to that. I know, right? */
-- drop table deflated;
CREATE TABLE deflated AS (
  SELECT * FROM obs_temp WHERE n > 10
  union all
  SELECT ot2.domain,ot2.modifier_cd,tdeflate.tval_char,ot2.valtype_cd
  ,ot2.valueflag_cd,ot2.location_cd,udeflate.units_cd,ot2.nval,n
  ,REPLACE(conpath,'i2b2','') conpath
  FROM tmpot2 ot2 LEFT JOIN tdeflate ON ojoin = tjoin 
  LEFT JOIN udeflate ON ojoin = ujoin LEFT JOIN cdeflate ON ojoin = cjoin );
-- 0.352 seconds, 2658 rows
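
/* For reference only: on MySQL the DISTINCT-then-LISTAGG workaround above could
   be collapsed into a single step (untested sketch, MySQL syntax, will not run
   on Oracle):
SELECT domain, modifier_cd, GROUP_CONCAT(DISTINCT tval_char SEPARATOR ';') tval_char
  FROM obs_temp
  GROUP BY domain, modifier_cd; -- plus the remaining grouping columns, as above
*/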

/* We're still not done; we need to add some information for interpreting what
  the concept modifiers mean. The tmpmods temporary table maps modifier paths
  and names to MODIFIER_CD. At UTHSCSA, MODIFIER_CD has a one-to-one
  relationship with MODIFIER_PATH and NAME_CHAR, but just in case this is
  violated at other sites, we still do LISTAGG(...) for them */
create table tmpmods as (
  SELECT modifier_cd mcd
  ,listagg(modifier_path,';') WITHIN GROUP(ORDER BY NULL) modpath
  ,listagg(name_char,';') WITHIN GROUP(ORDER BY NULL) modname
  FROM (
  SELECT DISTINCT modifier_cd,modifier_path,name_char
  FROM &dataschema..modifier_dimension WHERE modifier_cd IS NOT NULL) 
  GROUP BY modifier_cd);
-- 0.432 seconds, 94 rows



-- How many rows in obs_temp total, with > 10 occurrences, and with <= 10 occurrences?
SELECT 'Total rows' what, count(*) FROM obs_temp -- 5510
UNION ALL
SELECT 'N > 10' what, count(*) FROM obs_temp WHERE n > 10 -- 2599
UNION ALL
SELECT 'N <= 10' what, count(*) FROM obs_temp WHERE n <= 10 -- 2911
;

/** 
This is a table counting the number of occurrences for each combination of
CONCEPT_CD and its various attributes along with their paths and descriptions 
where available. Some uses for this table:

* What MODIFIER_CDs can accompany a given type of CONCEPT_CD? For example, at
  UTHSCSA if I run...
  SELECT MODIFIER_CD,sum(N) FROM OUTPUT_CON_MOD WHERE DOMAIN = 'ICD9:'
  GROUP BY MODIFIER_CD;
  ...I find that every MODIFIER_CD associated with an 'ICD9:' concept either 
  looks like 'DiagObs:XXX' or 'PROBLEM_STATUS_C:X'. So I only need to worry 
  about understanding and properly filtering on this set of codes to properly 
  search diagnosis facts. I also notice that the single most commonly used 
  MODIFIER_CD is 'DiagObs:MEDICAL_HX', in other words, diagnoses that are part
  of the patient's medical history at the time of the visit, rather than active
  diagnoses. If I don't filter these out, my diagnosis counts are going to be 
  inflated!
* What text values exist for a given type of CONCEPT_CD?
  SELECT TVAL_CHAR,sum(N) FROM OUTPUT_CON_MOD WHERE DOMAIN like '%ALLERGEN:%'
  GROUP BY TVAL_CHAR;
  For UTHSCSA, it turns out that allergy facts don't use TVAL_CHAR, one less 
  thing to worry about. What about medications? Replace '%ALLERGEN:%' with 
  '%MEDICATION_ID:%' in the above query and I find some '@', some 'E', some '0'
  (which means integers of various lengths), some '0-0' (which means integer
  ranges), and several semicolon-separated strings that indicate rare
  combinations of values that have been binned together.
* So, under what circumstances do meds have a numeric range in their TVAL_CHAR?
  SELECT modifier_cd,valtype_cd,valueflag_cd,location_cd,nval,modname,sum(N) 
  FROM output_con_mod WHERE domain LIKE '%MEDICATION_ID:%' AND tval_char = '0-0'
  group by modifier_cd,valtype_cd,valueflag_cd,location_cd,nval,modname;
  At UTHSCSA all of these have modifiers indicating medication orders of various
  types, and for all such observations the NVAL_NUM field is empty. On the other
  hand, if we run the above query with 'E' instead of '0-0' we only get
  observations whose modifiers are dosages, whose VALTYPE_CD is 'N', and whose
  NVAL_NUM fields are not empty (and presumably contain numeric values). From
  this we can conclude that for some medication orders, when the intended dose
  is a range, it ends up in the TVAL_CHAR field instead of the NVAL_NUM field
  where you would normally expect numeric values to end up.
* You can see what types of concepts are the most prevalent at a site, even if
  its metadata tables are out of date, because this table doesn't use metadata;
  it's based directly on counts of OBSERVATION_FACT:
  SELECT domain,SUM(n) N FROM output_con_mod GROUP BY domain ORDER BY N DESC;
  DX_ID type diagnoses are more common than ICD9 type diagnoses (which makes 
  sense because DX_ID are more granular). Observations in the ALLERGEN domain 
  are relatively rare and ones in the MicroPositive domain are very rare 
  (at UTHSCSA).
* You can globally see what types of concepts can occur with what types of 
  modifiers:
  SELECT domain,modifier_cd,SUM(n) n FROM output_con_mod 
  GROUP BY domain,modifier_cd ORDER BY domain;
  Looks like observations in the PAT_ENC domain never have modifiers, and all
  modifiers for the FAMILYHISTORYDIAG domain look like 'FamRelation:%' (and are
  never '@').
* If you have CONPATHs that are null, it may mean that those observations are
  orphaned and not visible from queries that depend on metadata... but now you
  know about their existence and can remedy this, and in the meantime query for
  them in OBSERVATION_FACT:
  SELECT domain,sum(n) FROM output_con_mod WHERE conpath IS NULL
  GROUP BY domain;
  Looks like UTHSCSA has some work to do mapping lab, flowsheet, medication,
  and ICD9 concepts.
**/
/* Okay, NOW we're done. We replace the ubiquitous and therefore uninformative
  i2b2 prefix to save space, because some groups of concepts have many paths 
  and this column can get very long otherwise */
--DROP TABLE output_con_mod;
CREATE TABLE OUTPUT_CON_MOD AS
SELECT domain,modifier_cd
,CASE WHEN LENGTH(tval_char)>200 THEN substr(tval_char,1,200)||'...'
  ELSE tval_char END tval_char
,valtype_cd,valueflag_cd,location_cd
,CASE WHEN LENGTH(units_cd)>200 THEN substr(units_cd,1,200)||'...'
  ELSE units_cd END units_cd
,nval,n
,CASE WHEN LENGTH(regexp_replace(conpath,'\\i2b2\\',''))>200
  THEN substr(regexp_replace(conpath,'\\i2b2\\',''),1,200)||'...'
  ELSE regexp_replace(conpath,'\\i2b2\\','') END conpath
,modpath,modname
FROM deflated LEFT JOIN tmpmods ON modifier_cd = mcd
ORDER BY domain,n DESC
;
-- 0.549 seconds, 2725 rows
/* drop most of the temp tables */
DROP TABLE tmpcons; DROP TABLE tmpot1; DROP TABLE tmpot2; 
DROP TABLE tdeflate; DROP TABLE udeflate; DROP TABLE cdeflate; 
DROP TABLE deflated; DROP TABLE tmpmods;
/* Not dropping OBS_TEMP because it might be useful to you for troubleshooting.
   You can drop it manually, or by uncommenting some or all of the cleanup block
   of commands at the bottom of this script */

/* How many rows? */
SELECT 'All' what,count(*),sum(n) records FROM output_con_mod
UNION ALL
SELECT 'Not distinguishing CONPATHs' what,count(*),sum(n) records FROM (
  SELECT domain,modifier_cd,tval_char,valtype_cd,valueflag_cd,location_cd
  ,units_cd,nval,sum(n) n,modpath,modname FROM output_con_mod
  GROUP BY domain,modifier_cd,tval_char,valtype_cd,valueflag_cd,location_cd
  ,units_cd,nval,modpath,modname)
UNION ALL
SELECT 'Omitting NULL CONPATHs and not distinguishing them' what
  ,count(*),sum(n) records FROM (
  SELECT domain,modifier_cd,tval_char,valtype_cd,valueflag_cd,location_cd
  ,units_cd,nval,sum(n) n,modpath,modname FROM output_con_mod
  WHERE conpath IS NOT NULL
  GROUP BY domain,modifier_cd,tval_char,valtype_cd,valueflag_cd,location_cd
  ,units_cd,nval,modpath,modname)
UNION ALL
SELECT 'Only very large groups of CONPATHs' what,count(*),sum(n) records
  FROM output_con_mod WHERE length(conpath) > 100
;
-- 2725 rows in output_con_mod, covering 68571937 observations
-- 1011 rows disregarding concept paths, covering 68571937 observations
-- 949 rows after first eliminating all null concept paths, covering 65394658 observations
-- 13 rows where the sum of concept path lengths > 100, covering 2242428 observations


/* Some interesting output for your own information...
-- can be used to investigate which types of observations span many different CONPATHs
select * from output_con_mod;
SELECT domain,modifier_cd,tval_char,valtype_cd,valueflag_cd,location_cd
,units_cd,nval,n,substr(conpath,1,500) conpath,modpath,modname
FROM output_con_mod WHERE length(conpath) > 100;

-- can be used to investigate which types of observations aren't mapped to CONPATHs
select * from output_con_mod where conpath is NULL;

-- break down length of conpath by frequency and number of observations
SELECT lc,count(*) nl,sum(n) nn
FROM (SELECT LENGTH(conpath) lc,n FROM output_con_mod)
GROUP BY lc ORDER BY lc;
*/

/** This counts all existing distinct combinations of demographic variables
in patient_dimension and breaks them down by visit year. The first 'year', 9998,
actually represents ALL distinct patients in the database regardless of year
(e.g. total number of living female white english-speakers aged 6-12, total
number of living male white english-speakers aged 6-12, and so on). The rows
for the remaining years are likewise, but count patients from each group seen
during those respective years. There is a method to the madness for how the 
other years were chosen: 1988 is the earliest year any GPC site reports having 
records. 2015 and 2016 haven't happened yet, so if you have any counts in those 
years, you have a good reason to be curious. 9999 and 0 are the 'foo' and 'bar' 
of four digit integers, and might have been used by someone to tag missing or 
otherwise problematic records, so we use 9998 to code our catch-all year 
instead.
Some uses for this table:
* Spotting nonsensical combinations of demographic values, like patients in the 
  1-5 age-range who have a marital status of 'divorced'.
* Spotting negative ages or missing ages.
* Spotting demographic variables that aren't used at a given site... for example
  at UTHSCSA, it looks like we don't currently use RELIGION_CD or INCOME_CD, so
  if someone sends us a query that filters on those, they won't get any results!
* Seeing what terms are valid for the various columns of PATIENT_DIMENSION at 
  each site... if your query filters on `SEX_CD = 'f'` and a given site uses 
  values of 'Male', 'Female', and '@' for SEX_CD, that's another query that
  won't return results.
* Collapsing it into a more manageable pivot table and using it to see which 
  years have the most patients in the demographic groups that interest you. For
  example, here's how you would do an age breakdown for the years 2012-2014:
  SELECT * FROM output_basic_demog;
  WITH dem0 AS (SELECT sex_cd,age_range,yr
    ,sex_cd||' '||age_range joinby 
    ,SUM(n_patients) n_patients FROM 
    output_basic_demog GROUP BY sex_cd,age_range,yr),
    d12 AS (SELECT joinby,n_patients FROM dem0 WHERE yr = 2012),
    d13 AS (SELECT joinby,n_patients FROM dem0 WHERE yr = 2013),
    d14 AS (SELECT joinby,n_patients FROM dem0 WHERE yr = 2014),
    grp AS (SELECT DISTINCT sex_cd,age_range,joinby FROM dem0)
  SELECT age_range,sex_cd
    ,d12.n_patients "2012",d13.n_patients "2013",d14.n_patients "2014" 
    FROM grp LEFT JOIN d12 ON grp.joinby = d12.joinby
    LEFT JOIN d13 ON grp.joinby = d13.joinby
    LEFT JOIN d14 ON grp.joinby = d14.joinby
    order by age_range,sex_cd;
**/

/* temporary tables for creating output_basic_demog, created instead of WITH
  clauses for easier portability to non-Oracle databases */
CREATE TABLE visits0 AS (SELECT DISTINCT patient_num pn,start_date
  ,EXTRACT(YEAR FROM start_date) yr 
  FROM &dataschema..observation_fact);
-- 161.0 seconds, 10339581 rows (not necessarily that many visits!)

CREATE TABLE visits1 AS (SELECT pn,count(*) n_visits,yr FROM visits0
  GROUP BY pn,yr);
-- 24.2 seconds, 5365933 rows

CREATE TABLE visits2 AS (
  SELECT n_visits,yr,language_cd,vital_status_cd,sex_cd,race_cd
  ,marital_status_cd,religion_cd,income_cd
  ,case when AGE_IN_YEARS_NUM is NULL then 'Missing'
   when AGE_IN_YEARS_NUM < 0 then 'Negative'
   when AGE_IN_YEARS_NUM < 1 then '< 1'
   when AGE_IN_YEARS_NUM <= 5 then '1-5'
   when AGE_IN_YEARS_NUM <= 12 then '6-12'
   when AGE_IN_YEARS_NUM <= 18 then '13-18'
   when AGE_IN_YEARS_NUM <= 65 then '19-65'
   WHEN AGE_IN_YEARS_NUM <= 85 THEN '66-85'
   ELSE '>85' END age_range
  FROM visits1
  LEFT JOIN &dataschema..patient_dimension ON pn = patient_num);
-- 30.5 seconds, 5365933 rows
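
/* Optional sanity check (read-only, not part of the pipeline): if the
   AGE_IN_YEARS_NUM values at your site are clean, the 'Negative' and 'Missing'
   buckets should be empty or nearly so:
SELECT age_range, count(*) n_rows FROM visits2 GROUP BY age_range ORDER BY age_range;
*/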

CREATE TABLE visits3 AS (
SELECT sum(n_visits) n_visits,count(*) n_patients,yr,language_cd
  ,vital_status_cd,sex_cd,race_cd,marital_status_cd,religion_cd,income_cd
  ,age_range 
  FROM visits2 GROUP BY yr,language_cd,vital_status_cd,sex_cd,race_cd
  ,marital_status_cd,religion_cd,income_cd,age_range
);
-- 9.9 seconds, 55175 rows

-- this part is a count over all the years
CREATE TABLE allyears AS (SELECT sum(n_visits) n_visits,count(*) n_patients
  ,9998 yr
  ,language_cd,vital_status_cd,sex_cd,race_cd,marital_status_cd,religion_cd
  ,income_cd,age_range 
  FROM visits2 GROUP BY language_cd,vital_status_cd,sex_cd,race_cd
  ,marital_status_cd,religion_cd,income_cd,age_range);
-- 8.7 seconds, 3982 rows
  
-- DROP TABLE output_basic_demog; 
CREATE TABLE OUTPUT_BASIC_DEMOG AS
(SELECT * FROM visits3 WHERE n_patients > 10 
UNION ALL
SELECT * FROM allyears WHERE n_patients > 10);
-- 0.213 seconds, 19957 rows

/* clean up the temporary demog tables */
DROP TABLE visits0; DROP TABLE visits1; DROP TABLE visits2; DROP TABLE visits3;
DROP TABLE allyears;

/* So, when is your data observed? */
SELECT 'Total visits' what, sum(n_visits) N 
FROM output_basic_demog where yr != 9998 -- 20469187
UNION ALL
SELECT 'Visits supposedly in or before 1950' what, sum(n_visits) N
FROM output_basic_demog WHERE yr <= 1950 -- 261554
UNION ALL
SELECT 'Visits supposedly between 1950 and 1998' what, sum(n_visits) N
FROM output_basic_demog WHERE yr >= 1950 AND yr < 1999 -- 1114215
UNION ALL
SELECT 'Visits between 2000 and 2005' what, sum(n_visits) N
FROM output_basic_demog WHERE yr >= 2000 AND yr < 2006 -- 134181
UNION ALL
SELECT 'Visits between 2006 and 2014' what, sum(n_visits) N
FROM output_basic_demog WHERE yr >= 2006 AND yr <= 2014 -- 8632197
UNION ALL
SELECT 'Visits supposedly in the future' what, sum(n_visits) N
FROM output_basic_demog WHERE yr > 2014 AND yr != 9998 -- 77
;

select yr,sum(n_visits) from output_basic_demog group by yr order by yr;

/** Surveying what, if anything, people are using the columns within
VISIT_DIMENSION for. This might identify sites where it's possible to query by
clinic or by visit type.
**/
create table OUTPUT_VISIT as
select '&yoursite' site,ACTIVE_STATUS_CD,INOUT_CD
,LOCATION_CD,LOCATION_PATH
,min(LENGTH_OF_STAY)+0 minstay,max(LENGTH_OF_STAY)+0 maxstay,count(*) N
from &dataschema..visit_dimension 
group by ACTIVE_STATUS_CD,INOUT_CD,LOCATION_CD,LOCATION_PATH;
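
/* Optional: a quick look at how (or whether) INOUT_CD, the inpatient/outpatient
   flag, is populated at your site (read-only, not part of the pipeline):
SELECT INOUT_CD, sum(N) visits FROM OUTPUT_VISIT GROUP BY INOUT_CD;
*/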

/** If your site has provider information, this counts distinct patients by 
provider. At least, that's what I hope it does: UTHSCSA's PROVIDER_DIMENSION is
currently empty. :-(
As above, we replace all runs of letters with runs of Xs of equal length and all
numbers with 0, leaving spaces and punctuation in place.
**/
-- drop table output_provider;
create table OUTPUT_PROVIDER as
SELECT '&yoursite' site,count(*) patients
,MIN(regexp_replace(regexp_replace(provider_path,'[[:alpha:]]','X'),'[0-9]+',0))
  provider_path
,MIN(regexp_replace(regexp_replace(name_char,'[[:alpha:]]','X'),'[0-9]+',0))
  name_char
from (select distinct provider_id pi,patient_num 
  from &dataschema..observation_fact) obs
left join &dataschema..provider_dimension prv
on obs.pi = prv.provider_id
where obs.pi is not NULL
group by pi;
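
/* Optional: once PROVIDER_DIMENSION is populated, this shows whether the masked
   provider paths follow a consistent pattern, and how many providers share each
   pattern (read-only, not part of the pipeline):
SELECT provider_path, count(*) n_providers FROM OUTPUT_PROVIDER
  GROUP BY provider_path ORDER BY n_providers DESC;
*/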

/*** Thank you for helping us understand how the various fields in
  I2B2 are currently used by each of the sites. Please look over the four tables
  whose names start with OUTPUT_, and once you're satisfied that they are okay 
  to share, dump them to CSV and email them to bo...@uthscsa.edu 
  ...and if you see a problem, DON'T email them to me, but please DO let me know
  what the problem is so I can fix it!
  Food for thought: whatever you currently have in TVAL_CHAR is visible through 
  the ordinary I2B2 web interface in timeline view.
***/

/** to delete the tables not already deleted above, uncomment the following as needed **/
--drop table OBS_TEMP;
--drop table OUTPUT_CON_MOD;
--drop table OUTPUT_BASIC_DEMOG;
--drop table OUTPUT_PROVIDER;
--drop table OUTPUT_VISIT;