Here is the updated version of the EDD that I promised last week.
New:
* Obfuscates OBSERVATION_FACT.TVAL_CHAR,
PROVIDER_DIMENSION.PROVIDER_PATH, and PROVIDER_DIMENSION.NAME_CHAR so
you can still see how and whether these fields get used without seeing
any actual text (with the exception of safe and commonly occurring
tag-values for TVAL_CHAR such as 'E', '@', 'NONE', etc.)
* Much, much smaller output.
* No more `WITH` statements, to avoid incompatibility with non-Oracle
databases (but, sorry, there is now NVL, REGEXP_REPLACE, and
REGEXP_SUBSTR... I'll look up the Postgres and Microsoft equivalents and
update the script unless someone beats me to it)
* All temporary tables the script creates are also deleted by it, except
OBS_TEMP, in case you want to explore it some more on your own.
* More detailed explanations, plus instructions on how you can use these
tables internally to hunt down problems, including a breakdown of all
visits (supposedly) occurring in each calendar year in your system.
Thanks to those who have tested the previous version. If you already sent
in output, no need to re-send; I should be able to condense the big output
you sent into the shorter output this version generates. If you were put
off either by the file size or by the text fields, I believe I have
addressed the problems that were found, and I look forward to your further
input.
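For anyone who wants to verify what leaves their site before running anything: the TVAL_CHAR masking described in the first bullet can be sketched in a few lines of Python (illustrative only; the script itself does this in Oracle with REGEXP_REPLACE, and the whitelist below mirrors the std_tvals define inside the script):

```python
import re

# Code-like values passed through unchanged (copied from the script's std_tvals)
STD_TVALS = {'@', 'E', 'NEGATIVE', 'NOT APPLICABLE', 'G', 'SEE NOTE',
             'NON-REACTIVE', 'NO INFO', 'GE', 'L', 'POSITIVE', 'YES'}

def mask_tval(tval: str) -> str:
    """Mirror of the SQL transformation: keep whitelisted codes as-is
    (case preserved, as in the script), otherwise mask runs of digits to
    a single 0, each letter to X, then collapse 0.0 and .0 down to 0."""
    if tval.upper() in STD_TVALS:
        return tval
    masked = re.sub(r'[0-9]+', '0', tval)      # runs of digits -> single 0
    masked = re.sub(r'[A-Za-z]', 'X', masked)  # each letter -> X
    return masked.replace('0.0', '0').replace('.0', '0')
```

So a lab value like '13.5 mg' comes out as '0 XX', and a datestamp like '2014-01-01 10:30AM' comes out as '0-0-0 0:0XX'; no actual text survives.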
/* Empirical Data Dictionary, v 0.2.0.0 */
/**
I originally started out writing a cohort-characterization script that could be
adapted to the needs of many different research projects just by altering the
cohort selection criteria (since most cohort characterization needs seem to
follow the same basic pattern).
However, from UTHSCSA's own trial-and-error and from discussions on the mailing
lists it has become clear that for multi-site counts to get non-misleading
results we need to have a very detailed characterization of each site's I2B2
semantics within its CRC (data repository) cell. The immediate goals of this
script are:
* To create a sort of empirical data dictionary we can all refer to in
constructing federated queries that will run correctly at GPC sites as they
*currently* are, so that cohort characterization can proceed in parallel with
data standardization.
* To help those working on data standardization see what else needs to be done
and where it needs to be done the most.
* To make certain data quality issues easier to spot.
This script ONLY does counts; no individual records are extracted into the
output tables. Also, no dates, patient_nums, or location information will end
up in the output tables.
This script was tested in SQL*Developer but will probably also work in
a SQL*Plus console. The runtime is estimated to be about an hour.
INSTRUCTIONS:
1. Set the appropriate values for `dataschema` and `yoursite`. Please leave
`conceptinfo` as-is.
2. Run this script as a user who has select access to the schema you specified
as your `dataschema`. That is the only external schema you'll need for this
version of the script. The user running this script will also need
permissions to create tables in their own schema. No tables will be created
anywhere else. For some of the tables, SQL*Developer might output...
"Failed: Warning: execution completed with warning"
This might have something to do with ...
https://support.oracle.com/rs?type=bug&id=7526579
Long story short, the tables do get created, so disregard the warning.
3. After running the script, please either export the tables OUTPUT_CON_MOD,
OUTPUT_BASIC_DEMOG, and OUTPUT_PROVIDER to .csv files or view them in
SQL*Developer, select all, and copy-paste into spreadsheets.
4. Email to bo...@uthscsa.edu
Thank you very, very much. Please let me know if you want a copy of the GPC-wide
results or if you have questions or suggestions.
**/
/** the name of the schema where you keep your DE-identified data
(e.g. OBSERVATION_FACT) **/
define dataschema = BLUEHERONDATA;
/** A string by which to identify your site **/
define yoursite = UTHSCSA;
-- columns used to empirically find out what data-type each concept_cd carries
define conceptinfo = "valtype_cd,valueflag_cd,location_cd,units_cd";
-- values of TVAL_CHAR that are passed to the output unobfuscated
define std_tvals="'@','E','NEGATIVE','NOT APPLICABLE','G','SEE NOTE','NON-REACTIVE','NO INFO','GE','L','POSITIVE','YES'";
/** temporary tables needed for faster creation of the concepts table **/
CREATE TABLE tmpcons AS (
  SELECT ccd, listagg(conpath,';') WITHIN GROUP (ORDER BY NULL) conpath
  FROM (SELECT DISTINCT concept_cd ccd
        ,regexp_substr(concept_path,'\\[^\\]+\\[^\\]+\\[^\\]+\\') conpath
        FROM &dataschema..concept_dimension)
  GROUP BY ccd);
/* this is the slowest one, sorry about that */
CREATE TABLE tmpot1 AS (SELECT DISTINCT nvl(Regexp_Substr(obf.Concept_Cd,'^.*:')
,concept_cd) domain ,patient_num,start_date,obf.concept_cd,modifier_cd
/* If TVAL_CHAR is one of the code-like values, we pass it through as-is.
The upper/lower case is deliberately left as-is to help identify domains
where there are prevalent case-clones of what should be the same value */
,CASE WHEN upper(tval_char) IN (&std_tvals) THEN tval_char
ELSE REPLACE(REPLACE(regexp_replace(
/* However, if TVAL_CHAR is some long, complex string we first replace all the
numeric characters with 0s, then replace all the alphabetic characters with
Xs, and leave all spaces and punctuation as-is. Then, we replace 0.0 and .0
with 0. So, if you see 0 in the output later on, keep in mind that this
encompasses both integer and float values. Also, a datestamp might look like
0-0-0 0:0XX. Now, we are protected against sensitive information leaking
into the output and also against overly large output: for UTHSCSA data,
version 0.1.0 of this script produced an OBS_TEMP that was almost
70 million lines long. This one produces an OBS_TEMP that is 2658 lines
long! */
regexp_replace(tval_char,'[0-9]+',0),'[[:alpha:]]','X'),'0.0',0),'.0',0)
END tval_char, &conceptinfo
/* if NVAL_NUM is not null, we assume it is numeric. The raw value is not
returned of course */
,CASE WHEN nval_num IS NULL THEN 'None' ELSE 'Numeric' END nval, conpath
FROM &dataschema..observation_fact obf
LEFT JOIN tmpcons ON concept_cd = ccd);
-- 1517.7 seconds
/* this is the second slowest one, but then it's all faster from there on out */
-- drop table obs_temp;
CREATE TABLE obs_temp AS
/* the tmpot1 temporary table does all the value transformations and exclusion
of duplicates. The problem we're trying to solve is that TVAL_CHAR sometimes
acts as a code and sometimes as a freeform string... and this behavior is
governed by rules we don't know for certain (these rules are among the things
we are trying to discover by running this script). So we simplify TVAL_CHAR */
(SELECT domain,modifier_cd,tval_char,&conceptinfo,nval,count(*) n,conpath
FROM tmpot1 GROUP BY domain,modifier_cd,tval_char,&conceptinfo,nval,conpath);
-- 640.8 seconds
/* Create tmpot2 from obs_temp by omitting TVAL_CHAR and UNITS_CD and summing
over the individual counts, as per above. The ojoin column is for convenience,
to join tdeflate and udeflate below without spamming redundant 'ON' statements.
The nvl(...,'@|') calls are there so ambiguous strings aren't created when
there are adjacent NULL values */
-- drop table tmpot2;
CREATE TABLE tmpot2 AS
(SELECT domain,modifier_cd,valtype_cd,valueflag_cd,location_cd,nval,sum(n) n
,domain||nvl(modifier_cd,'@|')||nvl(valtype_cd,'@|')||nvl(valueflag_cd,'@|')||nvl(location_cd,'@|')||nval
ojoin
FROM obs_temp where n <=10 GROUP BY
domain,modifier_cd,valtype_cd,valueflag_cd,location_cd,nval);
-- 0.062 seconds, 59 rows
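/* A quick illustration (in Python, names mine, not part of the script) of why
the nvl(...,'@|') placeholders are needed when building ojoin: naively
concatenating nullable fields lets two different rows collide into the same
join key. */

```python
def make_ojoin(domain, modifier_cd, valtype_cd, valueflag_cd, location_cd, nval):
    """Same composite key the script builds with ||nvl(...,'@|'):
    NULL (None) fields become the '@|' placeholder. nval is never NULL
    here (it is always 'None' or 'Numeric'), so it isn't wrapped."""
    parts = (modifier_cd, valtype_cd, valueflag_cd, location_cd)
    return domain + ''.join('@|' if p is None else p for p in parts) + nval

def naive_join(domain, *fields):
    """BAD: NULLs simply vanish, so e.g. ('A','B') and ('AB', NULL)
    concatenate to the same string and distinct rows get joined together."""
    return domain + ''.join(f or '' for f in fields)
```

With the placeholder, ('A','B',NULL,...) and ('AB',NULL,'B',...) stay distinct keys; with the naive concatenation they collide.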
/* Many of the combinations in OBS_TEMP are rather rare. We need to know that
they can occur, but they might not merit individual attention as much as
common occurrences. The main culprits (at least at San Antonio; please let us
know if you find additional ones) are TVAL_CHAR and UNITS_CD. Since we're also
aggregating by the first three levels of the concept hierarchy, we do those
too. Here we take the rare occurrences and aggregate all their unique
TVAL_CHAR, UNITS_CD, and CONPATH values into ';'-delimited strings. The Ns
are aggregated by summing. We begin by creating a temporary table of just the
rare value combinations, to hopefully speed up the query by reducing the
search-space. */
/* create from obs_temp by aggregating distinct TVAL_CHAR values */
-- drop table tdeflate;
create table tdeflate as (
SELECT domain,modifier_cd
,listagg(tval_char,';') WITHIN GROUP(ORDER BY NULL) tval_char
,valtype_cd,valueflag_cd,location_cd,nval
/* the following column is created for joining on ojoin, above */
,domain||nvl(modifier_cd,'@|')||nvl(valtype_cd,'@|')||nvl(valueflag_cd,'@|')||nvl(location_cd,'@|')||nval
tjoin
FROM (SELECT DISTINCT
domain,modifier_cd,tval_char,valtype_cd,valueflag_cd,location_cd,nval
FROM obs_temp)
GROUP BY domain,modifier_cd,valtype_cd,valueflag_cd,location_cd,nval);
-- 0.327 seconds, 126 rows
/* create from obs_temp by aggregating distinct UNITS_CD values */
create table udeflate as (
SELECT domain,modifier_cd
,listagg(units_cd,';') WITHIN GROUP(ORDER BY NULL) units_cd
,valtype_cd,valueflag_cd,location_cd,nval
/* the following column is created for joining on ojoin, above */
,domain||nvl(modifier_cd,'@|')||nvl(valtype_cd,'@|')||nvl(valueflag_cd,'@|')||nvl(location_cd,'@|')||nval
ujoin
FROM (SELECT DISTINCT
domain,modifier_cd,units_cd,valtype_cd,valueflag_cd,location_cd,nval
FROM obs_temp)
GROUP BY domain,modifier_cd,valtype_cd,valueflag_cd,location_cd,nval);
-- 0.08 seconds, 126 rows
/* create from obs_temp by aggregating distinct CONPATH values */
-- drop table cdeflate ;
create table cdeflate AS (
SELECT domain,modifier_cd
,substr(listagg(conpath,';') WITHIN GROUP(ORDER BY NULL),1,100) conpath
,valtype_cd,valueflag_cd,location_cd,nval
/* the following column is created for joining on ojoin, above */
,domain||nvl(modifier_cd,'@|')||nvl(valtype_cd,'@|')||nvl(valueflag_cd,'@|')||nvl(location_cd,'@|')||nval
cjoin
FROM (SELECT DISTINCT
domain,modifier_cd,valtype_cd,valueflag_cd,location_cd,nval,conpath
FROM obs_temp)
GROUP BY domain,modifier_cd,valtype_cd,valueflag_cd,location_cd,nval);
-- 0.212 seconds, 126 rows
/* Now, take the non-rare concepts and append to them the above rare ones,
with TVAL_CHAR, UNITS_CD, and CONPATH joined back onto them. For MySQL users
wondering why we can't just use GROUP_CONCAT(DISTINCT ...), it's because
Oracle DOESN'T HAVE anything equivalent to that. I know, right? */
-- drop table deflated;
CREATE TABLE deflated AS (
SELECT * FROM obs_temp WHERE n > 10
union all
SELECT ot2.domain,ot2.modifier_cd,tdeflate.tval_char,ot2.valtype_cd
,ot2.valueflag_cd,ot2.location_cd,udeflate.units_cd,ot2.nval,n
,REPLACE(conpath,'i2b2','') conpath
FROM tmpot2 ot2 LEFT JOIN tdeflate ON ojoin = tjoin
LEFT JOIN udeflate ON ojoin = ujoin LEFT JOIN cdeflate ON ojoin = cjoin );
-- 0.352 seconds, 2658 rows
/* We're still not done; we need to add some information for interpreting what
the concept modifiers mean. The tmpmods temporary table maps modifier paths
and names to MODIFIER_CD. At UTHSCSA, MODIFIER_CD has a one-to-one
relationship with MODIFIER_PATH and NAME_CHAR, but just in case this is
violated at other sites, we still do LISTAGG(...) for them */
create table tmpmods as (
SELECT modifier_cd mcd
,listagg(modifier_path,';') WITHIN GROUP(ORDER BY NULL) modpath
,listagg(name_char,';') WITHIN GROUP(ORDER BY NULL) modname
FROM (
SELECT DISTINCT modifier_cd,modifier_path,name_char
FROM &dataschema..modifier_dimension WHERE modifier_cd IS NOT NULL)
GROUP BY modifier_cd);
-- 0.432 seconds, 94 rows
-- How many rows in obs_temp total, with > 10 occurrences, and with <= 10
-- occurrences?
SELECT 'Total rows' what, count(*) FROM obs_temp -- 5510
UNION ALL
SELECT 'N > 10' what, count(*) FROM obs_temp WHERE n > 10 -- 2599
UNION ALL
SELECT 'N <= 10' what, count(*) FROM obs_temp WHERE n <=10 --2911
;
/**
This is a table counting the number of occurrences for each combination of
CONCEPT_CD and its various attributes along with their paths and descriptions
where available. Some uses for this table:
* What MODIFIER_CDs can accompany a given type of CONCEPT_CD? For example, at
UTHSCSA, if I run ...
SELECT MODIFIER_CD,sum(N) FROM OUTPUT_CON_MOD WHERE DOMAIN = 'ICD9:'
GROUP BY MODIFIER_CD;
...I find that every MODIFIER_CD associated with an 'ICD9:' concept either
looks like 'DiagObs:XXX' or 'PROBLEM_STATUS_C:X'. So I only need to worry
about understanding and properly filtering on this set of codes to properly
search diagnoses facts. I also notice that the single most commonly used
MODIFIER_CD is 'DiagObs:MEDICAL_HX', in other words, diagnoses that are part
of the patient's medical history at the time of the visit, rather than active
diagnoses. If I don't filter for this, my diagnosis counts are going to be
inflated!
* What text values exist for a given type of CONCEPT_CD?
SELECT TVAL_CHAR,sum(N) FROM OUTPUT_CON_MOD WHERE DOMAIN like '%ALLERGEN:%'
GROUP BY TVAL_CHAR;
For UTHSCSA, it turns out that allergy facts don't use TVAL_CHAR, one less
thing to worry about. What about medications? Replace '%ALLERGEN:%' with
'%MEDICATION_ID:%' in the above query and I find some '@', some 'E', some '0'
which means integers of various lengths, some 0-0, which means integer ranges,
and several semicolon-separated strings that indicate rare combinations of
values that have been binned together.
* So, under what circumstances do meds have a numeric range in their TVAL_CHAR?
SELECT modifier_cd,valtype_cd,valueflag_cd,location_cd,nval,modname,sum(N)
FROM output_con_mod WHERE domain LIKE '%MEDICATION_ID:%' AND tval_char = '0-0'
group by modifier_cd,valtype_cd,valueflag_cd,location_cd,nval,modname;
At UTHSCSA all of these have modifiers indicating medication orders of various
types, and for all such observations the NVAL_NUM field is empty. On the other
hand, if we do the above query with 'E' instead of '0-0' we only get
observations whose modifiers are dosages, whose VALTYPE_CD='N', and whose
NVAL_NUM fields are not empty (presumably containing numeric values). From
this we can conclude that for some medication orders, when the intended dose
is a range, it ends up in the TVAL_CHAR field instead of the NVAL_NUM field
where you would normally expect numeric values to end up.
* You can see what types of concepts are the most prevalent at a site, even if
their metadata tables are out of date, because this table doesn't use
metadata; it's based directly on counts of OBSERVATION_FACT:
SELECT domain,SUM(n) N FROM output_con_mod GROUP BY domain ORDER BY N DESC;
DX_ID type diagnoses are more common than ICD9 type diagnoses (which makes
sense because DX_ID are more granular). Observations in the ALLERGEN domain
are relatively rare and ones in the MicroPositive domain are very rare
(at UTHSCSA).
* You can globally see what types of concepts can occur with what types of
modifiers:
SELECT domain,modifier_cd,SUM(n) n FROM output_con_mod
GROUP BY domain,modifier_cd ORDER BY domain;
Looks like observations in the PAT_ENC domain never have modifiers, and all
modifiers for the FAMILYHISTORYDIAG domain look like 'FamRelation:%' (and are
never '@').
* If you have CONPATHs that are null, it may mean that those observations are
orphaned and not visible from queries that depend on metadata... but now you
know about their existence and can remedy this, and in the meantime query for
them in OBSERVATION_FACT:
SELECT domain,sum(n) FROM output_con_mod WHERE conpath IS NULL GROUP BY
domain;
Looks like UTHSCSA has some work to do mapping lab, flowsheet, medication,
and ICD9 concepts.
**/
/* Okay, NOW we're done. We replace the ubiquitous and therefore uninformative
i2b2 prefix to save space, because some groups of concepts have many paths
and this column can get very long otherwise */
--DROP TABLE output_con_mod;
CREATE TABLE OUTPUT_CON_MOD AS
SELECT domain,modifier_cd
,CASE WHEN LENGTH(tval_char)>200 THEN substr(tval_char,1,200)||'...' ELSE
tval_char END tval_char
,valtype_cd,valueflag_cd,location_cd
,CASE WHEN LENGTH(units_cd)>200 THEN substr(units_cd,1,200)||'...' ELSE
units_cd END units_cd
,nval,n
,CASE WHEN LENGTH(regexp_replace(conpath,'\\i2b2\\',''))>200 THEN
substr(regexp_replace(conpath,'\\i2b2\\',''),1,200)||'...'
ELSE regexp_replace(conpath,'\\i2b2\\','') END conpath
,modpath,modname
FROM deflated LEFT JOIN tmpmods ON modifier_cd = mcd
ORDER BY domain,n DESC
;
-- 0.549 seconds, 2725 rows
/* drop most of the temp tables */
DROP TABLE tmpcons; DROP TABLE tmpot1; DROP TABLE tmpot2;
DROP TABLE tdeflate; DROP TABLE udeflate; DROP TABLE cdeflate;
DROP TABLE deflated; DROP TABLE tmpmods;
/* Not dropping OBS_TEMP because it might be useful to you for troubleshooting;
you can drop it manually, or by uncommenting some or all of the cleanup block
of commands at the bottom of this script */
/* How many rows? */
SELECT 'All' what,count(*),sum(n) records FROM output_con_mod
UNION ALL
select 'Not distinguishing CONPATHs' what,count(*),sum(n) records from (SELECT
domain,modifier_cd,tval_char,valtype_cd,valueflag_cd,location_cd
,units_cd,nval,sum(n) n,modpath,modname FROM output_con_mod
GROUP BY domain,modifier_cd,tval_char,valtype_cd,valueflag_cd,location_cd
,units_cd,nval,modpath,modname)
UNION ALL
select 'Omitting NULL CONPATHs and not distinguishing them' what,
count(*),sum(n) records from (SELECT
domain,modifier_cd,tval_char,valtype_cd,valueflag_cd,location_cd
,units_cd,nval,sum(n) n,modpath,modname FROM output_con_mod
where conpath is not NULL
GROUP BY domain,modifier_cd,tval_char,valtype_cd,valueflag_cd,location_cd
,units_cd,nval,modpath,modname)
UNION ALL
SELECT 'Only very large groups of CONPATHs' what,count(*),sum(n) records FROM
output_con_mod
where length(conpath) > 100
;
-- 2725 rows in output_con_mod, covering 68571937 observations
-- 1011 rows disregarding concept paths, covering 68571937 observations
-- 949 rows after first eliminating all null concept paths, covering 65394658
--   observations
-- 13 rows where the aggregated concept path length > 100, covering 2242428
--   observations
/* Some interesting output for your own information...
-- can be used to investigate which types of observations span many different CONPATHs
select * from output_con_mod;
SELECT domain,modifier_cd,tval_char,valtype_cd,valueflag_cd,location_cd
,units_cd,nval,n,substr(conpath,1,500) conpath,modpath,modname FROM
output_con_mod
where length(conpath) > 100;
-- can be used to investigate which types of observations aren't mapped to CONPATHs
select * from output_con_mod where conpath is NULL;
-- break down length of conpath by frequency and number of observations
SELECT lc,count(*) nl,sum(n) nn FROM (SELECT LENGTH(conpath) lc,n FROM
output_con_mod)
GROUP BY lc ORDER BY lc;
*/
/** This counts all existing distinct combinations of demographic variables
in patient_dimension and breaks them down by visit year. The first 'year', 9998
actually represents ALL distinct patients in the database regardless of year.
(e.g. total number of living female white english-speakers aged 6-12, total
number of living male white english-speakers aged 6-12, and so on). The
remaining rows are likewise, but count patients from each group seen only
during those respective years. There is a method to the madness for how the
other years were chosen: 1988 is the earliest year any GPC site reports having
records. 2015 and 2016 haven't happened yet, so if you have any counts in those
years, you have a good reason to be curious. 9999 and 0 are the 'foo' and 'bar'
of four digit integers, and might have been used by someone to tag missing or
otherwise problematic records, so we use 9998 to code our catch-all year
instead.
some uses for this table:
* Spotting nonsensical combinations of demographic values, like patients in the
1-5 age-range who have a marital status of 'divorced'
* Spotting negative ages or missing ages
* Spotting demographic variables that aren't used at a given site... for example
at UTHSCSA, it looks like we don't currently use RELIGION_CD or INCOME_CD, so
if someone sends us a query that filters on those, they won't get any
results!
* Seeing what terms are valid for the various columns of PATIENT_DIMENSION at
each site... if your query filters on `SEX_CD = 'f'` and a given site uses
values of 'Male', 'Female', and '@' for SEX_CD, that's another query that
won't return results.
* Collapsing it into a more manageable pivot table and using it to see which
years have the most patients in the demographic groups that interest you. For
example, here's how you would do an age breakdown for the years 2012-2014:
SELECT * FROM output_basic_demog;
WITH dem0 AS (SELECT sex_cd,age_range,yr
,sex_cd||' '||age_range joinby
,SUM(n_patients) n_patients FROM
output_basic_demog GROUP BY sex_cd,age_range,yr),
d12 AS (SELECT joinby,n_patients FROM dem0 WHERE yr = 2012),
d13 AS (SELECT joinby,n_patients FROM dem0 WHERE yr = 2013),
d14 AS (SELECT joinby,n_patients FROM dem0 WHERE yr = 2014),
grp AS (SELECT DISTINCT sex_cd,age_range,joinby FROM dem0)
SELECT age_range,sex_cd
,d12.n_patients "2012",d13.n_patients "2013",d14.n_patients "2014"
FROM grp LEFT JOIN d12 ON grp.joinby = d12.joinby
LEFT JOIN d13 ON grp.joinby = d13.joinby
LEFT JOIN d14 ON grp.joinby = d14.joinby
order by age_range,sex_cd;
**/
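/* If you'd rather build that pivot outside the database (say, from the
exported CSV), the same reshaping can be sketched in plain Python; the row
layout below is a made-up miniature of output_basic_demog, not real data. */

```python
from collections import defaultdict

def pivot_by_year(rows, years=(2012, 2013, 2014)):
    """rows: (sex_cd, age_range, yr, n_patients) tuples, i.e. what the dem0
    subquery above returns. Pivots the requested years into columns, like the
    grp LEFT JOIN d12/d13/d14 chain (missing years become 0 here rather than
    NULL). Returns a dict keyed by (age_range, sex_cd)."""
    table = defaultdict(lambda: {y: 0 for y in years})
    for sex_cd, age_range, yr, n in rows:
        cell = table[(age_range, sex_cd)]  # group appears even if only seen in other years
        if yr in cell:
            cell[yr] += n
    return dict(table)
```

The defaultdict lookup plays the role of the grp subquery: every demographic group gets a row, with zeros where a year contributed no patients.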
/* temporary tables for creating output_basic_demog, created to avoid using
Oracle's non-portable WITH syntax */
CREATE TABLE visits0 AS (SELECT DISTINCT patient_num pn,start_date
,EXTRACT(YEAR FROM start_date) yr
FROM &dataschema..observation_fact);
-- 161.0 seconds, 10339581 rows (not necessarily that many visits!)
CREATE TABLE visits1 AS (SELECT pn,count(*) n_visits,yr FROM visits0 GROUP BY
pn,yr);
-- 24.2 seconds, 5365933 rows
CREATE TABLE visits2 AS (SELECT
n_visits,yr,language_cd,vital_status_cd,sex_cd,race_cd
,marital_status_cd,religion_cd,income_cd
,case when AGE_IN_YEARS_NUM is NULL then 'Missing'
when AGE_IN_YEARS_NUM < 0 then 'Negative'
when AGE_IN_YEARS_NUM < 1 then '< 1'
when AGE_IN_YEARS_NUM <= 5 then '1-5'
when AGE_IN_YEARS_NUM <= 12 then '6-12'
when AGE_IN_YEARS_NUM <= 18 then '13-18'
when AGE_IN_YEARS_NUM <= 65 then '19-65'
WHEN AGE_IN_YEARS_NUM <= 85 THEN '66-85'
ELSE '>85' END age_range
FROM visits1
LEFT JOIN &dataschema..patient_dimension ON pn = patient_num);
-- 30.5 seconds, 5365933 rows
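/* As a cross-check, here is a line-for-line Python rendering of the age
bucketing CASE expression above (not part of the script; handy for
sanity-checking exported counts against your own patient ages). */

```python
def age_range(age):
    """Same buckets as the CASE on AGE_IN_YEARS_NUM in visits2."""
    if age is None:
        return 'Missing'
    if age < 0:
        return 'Negative'
    if age < 1:
        return '< 1'
    if age <= 5:
        return '1-5'
    if age <= 12:
        return '6-12'
    if age <= 18:
        return '13-18'
    if age <= 65:
        return '19-65'
    if age <= 85:
        return '66-85'
    return '>85'
```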
CREATE TABLE visits3 AS (
SELECT sum(n_visits) n_visits,count(*) n_patients,yr,language_cd
,vital_status_cd,sex_cd,race_cd,marital_status_cd,religion_cd,income_cd
,age_range
FROM visits2 GROUP BY yr,language_cd,vital_status_cd,sex_cd,race_cd
,marital_status_cd,religion_cd,income_cd,age_range
);
-- 9.9 seconds, 55175 rows
-- this part is a count over all the years
CREATE TABLE allyears AS (SELECT sum(n_visits) n_visits,count(*)
n_patients,9998 yr
,language_cd,vital_status_cd,sex_cd,race_cd,marital_status_cd,religion_cd
,income_cd,age_range
FROM visits2 GROUP BY language_cd,vital_status_cd,sex_cd,race_cd
,marital_status_cd,religion_cd,income_cd,age_range);
-- 8.7 seconds, 3982 rows
-- DROP TABLE output_basic_demog;
CREATE TABLE OUTPUT_BASIC_DEMOG AS
(SELECT * FROM visits3 WHERE n_patients > 10
UNION ALL
SELECT * FROM allyears WHERE n_patients > 10);
-- 0.213 seconds, 19957 rows
/* clean up the temporary demog tables */
DROP TABLE visits0; DROP TABLE visits1; DROP TABLE visits2; DROP TABLE visits3;
DROP TABLE allyears;
/* So, when is your data observed? */
SELECT 'Total visits' what, sum(n_visits) N
FROM output_basic_demog where yr != 9998 -- 20469187
UNION ALL
SELECT 'Visits supposedly in or before 1950' what, sum(n_visits) N
from output_basic_demog where yr <= 1950 -- 261554
UNION ALL
SELECT 'Visits supposedly between 1950 and 1998' what, sum(n_visits) N
from output_basic_demog where yr >= 1950 and yr < 1999 -- 1114215
UNION ALL
SELECT 'Visits between 2000 and 2005' what, sum(n_visits) N
FROM output_basic_demog WHERE yr >= 2000 AND yr < 2006 -- 134181
UNION ALL
SELECT 'Visits between 2006 and present' what, sum(n_visits) N
FROM output_basic_demog WHERE yr >= 2006 AND yr <= 2014 -- 8632197
UNION ALL
SELECT 'Visits supposedly in the future' what, sum(n_visits) N
FROM output_basic_demog WHERE yr > 2014 and yr != 9998 -- 77
;
select yr,sum(n_visits) from output_basic_demog group by yr order by yr;
/** Surveying what, if anything, people are using the columns within
VISIT_DIMENSION for. Might identify sites where it's possible to query by
clinic or by visit type.
**/
create table OUTPUT_VISIT as
select '&yoursite' site,ACTIVE_STATUS_CD,INOUT_CD
,LOCATION_CD,LOCATION_PATH
,min(LENGTH_OF_STAY)+0 minstay,max(LENGTH_OF_STAY)+0 maxstay,count(*) N
from &dataschema..visit_dimension
group by ACTIVE_STATUS_CD,INOUT_CD,LOCATION_CD,LOCATION_PATH;
/** If your site has provider information, this counts distinct patients by
provider. At least, that's what I hope it does: UTHSCSA's PROVIDER_DIMENSION is
currently empty. :-(
We now replace all character strings with runs of Xs of equal lengths, and all
numbers with 0, leaving spaces and punctuation in place.
**/
-- drop table output_provider;
create table OUTPUT_PROVIDER as
SELECT '&yoursite' site,count(*) patients
,MIN(
regexp_replace(regexp_replace(provider_path,'[[:alpha:]]','X'),'[0-9]+',0))
provider_path
,MIN( regexp_replace(regexp_replace(name_char,'[[:alpha:]]','X'),'[0-9]+',0))
name_char
from (select distinct provider_id pi,patient_num
from &dataschema..observation_fact) obs
left join &dataschema..provider_dimension prv
on obs.pi = prv.provider_id
where obs.pi is not NULL
group by pi;
/*** Thank you for helping us understand how the various fields in
I2B2 are currently used by each of the sites. Please look over the four tables
whose names start with OUTPUT_, and once you're satisfied that they are okay
to share, dump them to CSV and email to bo...@uthscsa.edu
...and if you see a problem, DON'T email them to me, but please DO let me know
what the problem is so I can fix it!
Food for thought: whatever you currently have in TVAL_CHAR is visible through
the ordinary I2B2 web interface in timeline view
***/
/** to delete the tables not already deleted above, uncomment the following as
needed**/
--drop table OBS_TEMP;
--drop table OUTPUT_CON_MOD;
--drop table OUTPUT_BASIC_DEMOG;
--drop table OUTPUT_PROVIDER;
--drop table OUTPUT_VISIT;
_______________________________________________
Gpc-dev mailing list
Gpc-dev@listserv.kumc.edu
http://listserv.kumc.edu/mailman/listinfo/gpc-dev