Example (Re: Empirical Data Dictionary)

2014-10-06 Thread Alex Bokov
Kind thanks to KUMC and the sites that volunteered to test.

With help from Wisconsin and MCW so far, I have a lot of revisions to
add to the original script that make the output smaller and the syntax
less Oracle-specific, and that eliminate or scrub certain fields. If you're not
one of the test sites, you might want to hold out for the revised script
that I'll send out after having had time to digest the initial reports.
In response to some very valid concerns raised, the raw count tables
will not be released without explicit consent from the originating
sites. Instead, I propose to append to the enclosed table (currently
representing UTHSCSA data) and send that out, along with other summary
metrics that people might suggest as useful to data standardization and
quality control. Is everyone okay with that?

I'll have analogous summaries of the other three tables, but those are
still in preparation.

Here is the information I was able to obtain from running the scripts I
sent earlier on our own data in San Antonio. In 33 lines we get a wealth
of hints about how to make queries work as intended at our site, and
things we need to improve. Part of this was done by aggregation of
OUTPUT_CON_MOD within SQL, but part of it was just manual tinkering with
autofilters on a spreadsheet and describing what I saw. Here is what
each column means and why one should care:

  * *site*: Site for which this row of data was generated
  * *domain*: A consistent substring shared by a group of related
CONCEPT_CDs (concept codes). These should all be listed in the
YOURMETADATA.SCHEMES table, but I2B2 has no way of enforcing this
rule, so instead I got these from the CONCEPT_CDs that are actually
there.
  * *tval_char*: This field usually contains categorical values
(ventral/dorsal/lateral, left/right,
high/middle/low/below-threshold, etc.), but it often ends up with a
lot of free-text values dumped into it. In San Antonio data,
COMPONENT_ID facts use it as a free-text dump, FLOW_MEAS_ID facts
use it as a time-stamp (but the time portions are all 0's), and
MEDICATION_ID facts use it to indicate dose ranges (still need to
look into whether these co-occur with NVAL_NUM fields, and if so,
how often the values fail to match). _Recommendations:_ field might
be useful for chronologically ordering multi-day FLOW_MEAS_ID
observations and quality-checking MEDICATION_ID observations.
Further study of lab-value-specific usages of TVAL_CHAR may be
needed before lab-values that use this field can be relied on either
as raw data or as filtering criteria. Certain values in this field
can indicate a numeric fact, but VALTYPE_CD is a more reliable
indicator.
  * *modifier_cd*: Here are listed the observed rules for what (if any)
MODIFIER_CDs can occur in observations from each domain. Currently
the only domains in San Antonio data that use modifiers are
diagnoses (ICD9 and DX_ID), allergies, medications, procedure
orders, and family history. Labs also use modifiers, but these are
rarely useful since they are hard-coded to be medians and last
observations regardless of the size, time-span, or intended analysis
of the lab value. _Recommendations_: For queries on diagnoses be
mindful of whether you want only medical history, only active
diagnoses, both, or just primary diagnoses [which in turn raises the
question of how to reliably identify those]. Either way, to avoid
duplicate counts, be sure to filter on MODIFIER_CD. For queries on a
specific allergic reaction, use MODIFIER_CD. For queries on an
allergen regardless of reaction type, leave MODIFIER_CD out, and use
the appropriate SELECT DISTINCT ... GROUP BY ... subquery. For
medications, the Inpatient/Outpatient/Historical/PRN/Other set of
modifiers can co-exist with each other and with the Dose/Dose_MAR
ones on the same drug during the same encounter. Pulling only facts
with Dose or Dose_MAR modifiers /might/ be enough to prevent
redundancies, but this is not yet certain. For procedure orders, filter
on MODIFIER_CD = '@' if you want all procedures, or on
MODIFIER_CD = 'PROCORDERS:XX' if you want just procedures of type XX,
but do not ignore these modifiers or you will get redundant values. For family
history always use modifiers and treat the combination of modifier
and concept code as a distinct column OR if you don't care about
what type of family history, just filter on MODIFIER_CD !=
'FamRelation:18' and then have the appropriate SELECT DISTINCT ...
GROUP BY ... subquery (or for confirmed negative family histories,
MODIFIER_CD = 'FamRelation:18').
  * *valtype_cd*: Only lab results (COMPONENT_ID), flowsheets
(FLOW_MEAS_ID), meds (MEDICATION_ID), vitals (PAT_ENC),
PACK_PER_DAY, and TOBACCO_USED_YEARS use this field but the latter
two always have the value set to 'N'. _Recommendations_: can be used
to 
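The modifier-filtering recommendations above can be sketched concretely. This is a minimal illustration only, using Python's sqlite3 module with an invented, much-simplified OBSERVATION_FACT (the real i2b2 table has many more columns, and the PROC:1234 code and sample rows are made up):

```python
import sqlite3

# In-memory stand-in for a much-simplified i2b2 OBSERVATION_FACT table.
# Column names follow i2b2's CRC schema; the sample rows are invented.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE observation_fact (
    patient_num INTEGER, encounter_num INTEGER,
    concept_cd TEXT, modifier_cd TEXT)""")
# One procedure order can appear twice: once as the '@' row and once with a
# type-specific PROCORDERS: modifier, as described above.
con.executemany("INSERT INTO observation_fact VALUES (?, ?, ?, ?)", [
    (1, 10, 'PROC:1234', '@'),
    (1, 10, 'PROC:1234', 'PROCORDERS:Outpatient'),
    (2, 11, 'PROC:1234', '@'),
])

# A naive count double-counts patient 1's single order:
naive, = con.execute(
    "SELECT COUNT(*) FROM observation_fact WHERE concept_cd = 'PROC:1234'"
).fetchone()

# Filtering on MODIFIER_CD = '@' leaves one row per order:
dedup, = con.execute(
    "SELECT COUNT(*) FROM observation_fact "
    "WHERE concept_cd = 'PROC:1234' AND modifier_cd = '@'"
).fetchone()

print(naive, dedup)  # 3 2
```

The same shape of mistake (and fix) applies to diagnoses, allergies, and family history, with the domain-appropriate MODIFIER_CD filter substituted in.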

RE: Example (Re: Empirical Data Dictionary)

2014-10-06 Thread Dan Connolly
I started trying out the script last week.

1st bit of feedback: It was taking a lot longer than the estimated 20 minutes. 
Just building the 1st intermediate table took longer than that.

I have since lost my session/context (had to reboot my desktop... sigh...)

I guess my high order feedback is: which end is up? What's the goal here? I was 
expecting something related to cohort characterization, but if it is related, I 
don't understand how. Perhaps it's clear/obvious to others.

The script comment begins:
I originally started out writing a cohort-characterization script that could be
adapted to the needs of many different research projects just by altering the
cohort selection criteria (since most cohort characterization needs seem to
follow the same basic pattern).

However, from UTHSCSA's own trial-and-error and from discussions on the mailing
lists it has become clear that for multi-site counts to get non-misleading
results we need to have a very detailed characterization of each site's I2B2
semantics within its CRC (data repository) cell.
I'm interested to know more about the trial-and-error steps.

I've seen pretty good cohort characterization progress w.r.t. breast cancer and 
ALS based on HERON data builder results (which are analogous to i2b2 PDO 
results as seen in the timeline plug-in). I don't understand why that approach 
isn't suitable.

The script comment continues:

* To create a sort of empirical data dictionary we can all refer to in
  constructing federated queries that will run correctly at GPC sites as they
  *currently* are, so that cohort characterization can proceed in parallel with
  data standardization.
How is this empirical data dictionary different from babel? We've been 
proceeding in parallel, no? We do manual terminology alignment in some cases 
(e.g. breast cancer procedures for #119) while we work out the details of 
automated terminology alignment (#160).

* To help those working on data standardization see what else needs to be done
  and where it needs to be done the most.
* To make certain data quality issues easier to spot.

Data quality issues such as... what?

We certainly have a long list of things to be done for data standardization. 
Can you give an example of how this script helps with determining where it 
needs to be done most?

About the items below...

For domain, why not the first part of the concept path?

I wonder how free text got into tval_char; the HERON ETL scripts are fairly 
conservative about avoiding free text.

Structuring a list of recommendations around modifier_cd and such seems odd to 
me.

--
Dan




RE: Updated milestone report from todays GPC-DEV call; 4Public

2014-10-06 Thread Dan Connolly
Please copy gpc-dev when you submit it, John.

--
Dan


From: gpc-dev-boun...@listserv.kumc.edu [gpc-dev-boun...@listserv.kumc.edu] on 
behalf of John Steinmetz [jsteinm...@kumc.edu]
Sent: Tuesday, September 30, 2014 1:24 PM
To: 'Campbell, James R'; gpc-dev@listserv.kumc.edu
Subject: RE: Updated milestone report from todays GPC-DEV call; 4Public

Fantastic.  Thanks to all.

I will get it submitted to PCORI.

John.


From: Campbell, James R [mailto:campb...@unmc.edu]
Sent: Tuesday, September 30, 2014 1:08 PM
To: gpc-dev@listserv.kumc.edu; John Steinmetz
Subject: Updated milestone report from todays GPC-DEV call; 4Public


All

I believe I made all the stated changes to personnel...please check me out



John

use this copy for preparing your report

Jim

The information in this e-mail may be privileged and confidential, intended 
only for the use of the addressee(s) above. Any unauthorized use or disclosure 
of this information is prohibited. If you have received this e-mail by mistake, 
please delete it and immediately contact the sender.
___
Gpc-dev mailing list
Gpc-dev@listserv.kumc.edu
http://listserv.kumc.edu/mailman/listinfo/gpc-dev


RE: Updated milestone report from todays GPC-DEV call; 4Public

2014-10-06 Thread John Steinmetz
Here is the link (https://pcornet.centraldesktop.com/c4gpc/file/34887855/) to 
the deliverable on CDT.  It is one document that contains 2.1, 2.5, and 2.7.

2.3 was previously submitted, and can be found on CDT at this link: 
https://pcornet.centraldesktop.com/c4gpc/file/32487317/.

John.


From: Dan Connolly
Sent: Monday, October 06, 2014 1:24 PM
To: John Steinmetz; 'Campbell, James R'; gpc-dev@listserv.kumc.edu
Subject: RE: Updated milestone report from todays GPC-DEV call; 4Public

Please copy gpc-dev when you submit it, John.

--
Dan

From: 
gpc-dev-boun...@listserv.kumc.edumailto:gpc-dev-boun...@listserv.kumc.edu 
[gpc-dev-boun...@listserv.kumc.edu] on behalf of John Steinmetz 
[jsteinm...@kumc.edu]
Sent: Tuesday, September 30, 2014 1:24 PM
To: 'Campbell, James R'; 
gpc-dev@listserv.kumc.edumailto:gpc-dev@listserv.kumc.edu
Subject: RE: Updated milestone report from todays GPC-DEV call; 4Public
Fantastic.  Thanks to all.

I will get it submitted to PCORI.

John.


From: Campbell, James R [mailto:campb...@unmc.edu]
Sent: Tuesday, September 30, 2014 1:08 PM
To: gpc-dev@listserv.kumc.edumailto:gpc-dev@listserv.kumc.edu; John Steinmetz
Subject: Updated milestone report from todays GPC-DEV call; 4Public


All

I believe I made all the stated changes to personnel...please check me out



John

use this copy for preparing your report

Jim

The information in this e-mail may be privileged and confidential, intended 
only for the use of the addressee(s) above. Any unauthorized use or disclosure 
of this information is prohibited. If you have received this e-mail by mistake, 
please delete it and immediately contact the sender.
___
Gpc-dev mailing list
Gpc-dev@listserv.kumc.edu
http://listserv.kumc.edu/mailman/listinfo/gpc-dev


Re: [gpc-informatics] #158: usable view of LOINC lab terms

2014-10-06 Thread GPC Informatics
#158: usable view of LOINC lab terms
-+
 Reporter:  rwaitman |   Owner:  budh0007
 Type:  enhancement  |  Status:  assigned
 Priority:  major|   Milestone:  data-domains2
Component:  data-stds|  Resolution:
 Keywords:   |  Blocked By:
 Blocking:  68   |
-+
Changes (by dconnolly):

 * milestone:  => data-domains2


--
Ticket URL: 
http://informatics.gpcnetwork.org/trac/Project/ticket/158#comment:3
gpc-informatics http://informatics.gpcnetwork.org/
Greater Plains Network - Informatics


gpc-dev agenda 7 Oct

2014-10-06 Thread Dan Connolly
I don't expect we'll get through all of this, so come prepared with input on 
the order...
Tues 11am CT:

  1.  Convene, take roll, review records and plan next meeting
     *   Meeting ID and access code: 686-845-717 
(https://global.gotomeeting.com/meeting/join/686845717); call +1 (267) 507-0008
     *   meeting notes (#12, 
https://informatics.gpcnetwork.org/trac/Project/ticket/12): 30 Sep notes 
OK? today's scribe: UTHSCSA
     *   roll: all 10 DevTeams 
(https://informatics.gpcnetwork.org/trac/Project/wiki/DevTeams) 
represented? comments on the agenda?
         KUMC, CMH, UIOWA, WISC, MCW, MCRF, UMN, UNMC, UTHSCSA, UTSW
     *   anything to discuss on usable view of LOINC lab terms (#158, 
https://informatics.gpcnetwork.org/trac/Project/ticket/158)?
     *   new/closed/reopened tickets: quite a few
     *   next meeting: 14 Oct. scribe volunteer?
  2.  HackathonTwo? 
(https://informatics.gpcnetwork.org/trac/Project/wiki/HackathonTwo; cf. 
HackathonOne, https://informatics.gpcnetwork.org/trac/Project/wiki/HackathonOne)
  3.  data builder (#87, 
https://informatics.gpcnetwork.org/trac/Project/ticket/87)
     *   demo for Henderson @ uky.edu
     *   sync up with George, Phillip on deployment (#87 etc.)
     *   discuss connections to cohort characterization (MethodsCore, 
https://informatics.gpcnetwork.org/trac/Project/wiki/MethodsCore)
  4.  Empirical Data Dictionary 
(http://listserv.kumc.edu/pipermail/gpc-dev/2014q4/000567.html), Alex 
Bokov (re #132, https://informatics.gpcnetwork.org/trac/Project/ticket/132; 
#106, https://informatics.gpcnetwork.org/trac/Project/ticket/106)
  5.  #173 (https://informatics.gpcnetwork.org/trac/Project/ticket/173, REDCap 
for ALS survey at each GPC site) poll
  6.  KeyGoalTracking 
(https://informatics.gpcnetwork.org/trac/Project/wiki/KeyGoalTracking): 
Breast Cancer Cohort Selection needs Tumor Registry in i2b2 by Nov 15, 
milestone:bc-survey-cohort-def 
(https://informatics.gpcnetwork.org/trac/Project/milestone/bc-survey-cohort-def)
     *   #168 (https://informatics.gpcnetwork.org/trac/Project/ticket/168), 
#169 (https://informatics.gpcnetwork.org/trac/Project/ticket/169), 
#170 (https://informatics.gpcnetwork.org/trac/Project/ticket/170), 
#171 (https://informatics.gpcnetwork.org/trac/Project/ticket/171): tumor 
registry counts not available
     *   #167 (https://informatics.gpcnetwork.org/trac/Project/ticket/167, 
Breast Cancer Cohort Selection Criteria) created by bchrischilles
  7.  KeyGoalTracking: ALS Survey in Dec 2014, milestone:survey-redcap-als 
(https://informatics.gpcnetwork.org/trac/Project/milestone/survey-redcap-als)
     *   #159 (https://informatics.gpcnetwork.org/trac/Project/ticket/159, GPC 
REDCap Service) created
     *   #174 (https://informatics.gpcnetwork.org/trac/Project/ticket/174, 
federated login for GPC data store) created
  8.  KeyGoalTracking: PCORNet DRN (CDM) query readiness
     *   milestone for popmednet tickets? 
(#154, https://informatics.gpcnetwork.org/trac/Project/ticket/154 ... 
#166, https://informatics.gpcnetwork.org/trac/Project/ticket/166)
     *   note longer-term goal #160 
(https://informatics.gpcnetwork.org/trac/Project/ticket/160), 
milestone:data-domains2 
(https://informatics.gpcnetwork.org/trac/Project/milestone/data-domains2)
  9.  age queries: ticket:67#comment:11 
(https://informatics.gpcnetwork.org/trac/Project/ticket/67#comment:11), Sep 2

--
Dan



RE: pattern of cohort characterization needs? RE: Empirical Data Dictionary

2014-10-06 Thread Dan Connolly
Nice.

For bonus points, update the obesity data elements ticket 
(#33, https://informatics.gpcnetwork.org/trac/Project/ticket/33) to note this 
pattern in general and the obesity aspects in particular. For sensitive 
material, put it in the KUMC REDCap project I recently invited you to:

  *   GPC Cohort Characterization Work 
(https://redcap.kumc.edu/redcap_v5.7.7/ProjectSetup/index.php?pid=3560)

and point to it from #33

--
Dan


From: gpc-dev-boun...@listserv.kumc.edu [gpc-dev-boun...@listserv.kumc.edu] on 
behalf of Alex Bokov [bo...@uthscsa.edu]
Sent: Monday, October 06, 2014 1:45 PM
To: gpc-dev@listserv.kumc.edu
Subject: Re: pattern of cohort characterization needs? RE: Empirical Data 
Dictionary

I don't claim this covers all cases, only the ones I can think of so far. If 
anyone can think of a cohort characterization question that cannot be answered 
by the below procedure, I am interested in learning about it.

On 10/03/2014 04:56 PM, Dan Connolly wrote:

most cohort characterization needs seem to
follow the same basic pattern

What pattern is that?
Preliminary Cohort Characterization
1. Elicit from the domain experts minimal criteria for membership of a patient 
in the cohort of interest (i.e. cast a wide net)
2. Elicit from the domain experts facts at the visit level, that are of 
interest about those patients
3. Pull down all available demographic data for that patient set
4. Left join the above to a column containing the total visit count for each 
patient broken up by year
5. For each fact from #2 join an additional column with the visit count for 
each patient. You now have one row for each year each patient is in the system, 
with a separate column for each static value for that patient, a column for the 
total number of visits they had that year, and an additional column for each 
subset of those visits your domain experts flagged as possibly interesting.
6a. For deliverables asking for the number of distinct patients meeting a 
certain criterion, COUNT all the visit counts grouping by every demographic 
variable of interest and year.
6b. For deliverables asking for the number of distinct visits meeting a certain 
criterion, SUM all the visit counts grouping by every demographic variable of 
interest and year.
6c. If you want totals over all years in the system, for visits just SUM up the 
years. For patients, SELECT DISTINCT patients, demographic variables, and 
indicator variables for whether the number of visits in each category is 0 or 1 
(omit years this time). Or, do #5 but omit year in the first place.
7. Filter OBSERVATION_FACT by membership of PATIENT_NUM in the patient-set from 
#1 and then do a count of visits and/or of patients for each CONCEPT_CD 
(filtered in a domain-appropriate manner on MODIFIER_CD).

6a and 6b tell you whether it's feasible to require that certain observations 
be present for each visit or each patient (i.e. that if you did that, your 
inclusion criteria would not be so strict that you'd end up with an 
insufficient sample size). They also give you an idea of your cohort's demographic makeup 
and how/if it has changed over time.

7 tells you what the most common facts are for this preliminary cohort, even if 
they were not singled out by the domain experts. In consultation with them, 
additional selection criteria might be drawn up.

Refinements of Cohort Characterization
1. Optionally tighten the membership criteria (e.g. in our initial 
characterization it looks like most patients who have one XYZ measurement on 
file have half a dozen of them, so might as well make that the floor) and 
optionally limit the time range (e.g. initial characterization indicates we 
have large samples available between 2010-2013, so let's use only those years 
to begin with).
2. Optionally revise the visit-level features of interest (e.g. A procedure 
hardly ever gets ordered? Omit it this time. A drug you weren't aware of turns 
out to be prescribed to 30% of the patients? Dedicate a new column to it.). 
This may be the place to put in complex temporal queries so you aren't grinding 
the server on a huge dataset needlessly.
3. Are some of the original demographic variables too sparse for this cohort, 
or not used at all? Optionally omit them.
4-7. As above.

Repeat as necessary (I expect one iteration to be enough in many cases) until 
the clinicians and informaticians converge on a patient-set and visit-set of 
adequate size and relevant to the clinical problem of interest.

You'll notice that there is variability from study to study in two places:
A. The 'WHERE' clause for selecting the patient-set.
B. The 'WHERE' clause in each variable column.
Everything else is looking like it could be factored out into a generic query 
or procedure.
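That factoring-out can be sketched. The fragment below is a hypothetical illustration only (the toy `visit` table, its rows, and the `characterize` helper are all invented; a real version would run against OBSERVATION_FACT and VISIT_DIMENSION), showing how the study-specific pieces reduce to predicate A for the patient set and one predicate B per column:

```python
import sqlite3

# Toy stand-in: one row per visit-level fact, with the visit's year.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE visit (patient_num INTEGER, start_year INTEGER,
                                   concept_cd TEXT)""")
con.executemany("INSERT INTO visit VALUES (?, ?, ?)", [
    (1, 2012, 'LAB:XYZ'), (1, 2012, 'MED:ABC'),
    (1, 2013, 'LAB:XYZ'), (2, 2012, 'MED:ABC'),
])

def characterize(cohort_where, columns):
    """cohort_where: predicate A selecting the patient set;
    columns: {name: predicate B} for each visit-level fact of interest.
    Returns one row per (patient, year) with total and per-column counts."""
    selects = ", ".join(
        f"SUM(CASE WHEN {pred} THEN 1 ELSE 0 END) AS {name}"
        for name, pred in columns.items())
    sql = (f"SELECT patient_num, start_year, COUNT(*) AS n_visits, {selects} "
           f"FROM visit WHERE patient_num IN "
           f"(SELECT patient_num FROM visit WHERE {cohort_where}) "
           f"GROUP BY patient_num, start_year "
           f"ORDER BY patient_num, start_year")
    return con.execute(sql).fetchall()

# Cohort: patients with at least one XYZ lab; one extra column counting them.
rows = characterize("concept_cd = 'LAB:XYZ'",
                    {"xyz_visits": "concept_cd = 'LAB:XYZ'"})
print(rows)  # [(1, 2012, 2, 1), (1, 2013, 1, 1)]
```

Everything outside the two predicates is boilerplate that stays the same from study to study, which is the point of the proposed generic query.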


Re: Example (Re: Empirical Data Dictionary)

2014-10-06 Thread Alex Bokov
First of all, I feel like I'm treading into sensitive territory here, in 
that when I point out problems it can be misinterpreted as 
criticizing other people's work. So let me just re-affirm:


We are all geeks here. It's us against the bugs. When we engage in 
debate, both sides win by finding the optimal answer and it doesn't 
matter whose answer it turns out to be. Both sides lose by talking past 
each other and taking things personally. I admire your work Dan, and the 
team you put together, and y'all are a role model to us and hopefully to 
other sites. I am grateful to you guys for sharing the Heron ETL code. 
Adapting it has saved us YEARS of  effort, and it was and is far better 
than anything I can even imagine writing from scratch (and I tried, 
before I came to understand the Zen of Heron).


This is me trying, in a small way, to give something back. Unfortunately 
that something is verbose and probably annoying to read. Sorry. I spent 
most of the day trying to file the rough edges off this and probably 
didn't get most of them. Anyway, the point is, I mean it constructively.


On 10/06/2014 01:20 PM, Dan Connolly wrote:

I started trying out the script last week.

1st bit of feedback: It was taking a lot longer than the estimated 20 
minutes. Just building the 1st intermediate table took longer than that.
Sorry. Revised script in progress. I'm guessing the big resource hog was 
OUTPUT_CON_MOD? In the revised script, I filter the result-set on `WHERE 
N_VISITS > 10` and that cuts down its size by about 50%.
I guess my high order feedback is: which end is up? What's the goal 
here? I was expecting something related to cohort characterization, 
but if it is related, I don't understand how. Perhaps it's 
clear/obvious to others.
We can't do cohort characterization or any useful queries at all if we 
don't understand which modifiers matter and which don't, which values 
can occur and which cannot, which type of data are always used and which 
aren't, which of our assumptions about each field in OBSERVATION_FACT 
are valid and which aren't. And, how much of this differs from site to site.


The parent post to this one was my attempt to clarify what I was trying 
to accomplish. If I can get OUTPUT_CON_MOD from each site (abridged, to 
drop the <10 occurrence lines) I can munge it all into this one sheet, 
and report back to the list these answers for all the sites obtained by 
inspection of this sheet.

I'm interested to know more about the trial-and-error steps.
Something like this: this seems like a query that will get the data we're 
looking for. Wait, how come we have duplicate entries? Oh, I see, you 
can have multiple encounters per calendar day. Work around it. Wait, how 
come we have duplicate entries? Oh, I see, a procedure order with '@' in 
its MODIFIER_CD field will always be accompanied by a second identical 
procedure order with a MODIFIER_CD that specifies whether it is 
Inpatient or Outpatient. Maybe a bug, maybe a feature, but I guess I 
better work around it in my queries. Wait, how come we have duplicate 
entries? Oh, I see diagnoses have a similar rule governing MODIFIER_CD 
usage but more complicated. Wait, why do some diagnosis codes look 
different from others? Oh, I see, the DX_ID ones are more specific than 
the ICD9 ones, and you have to do some text parsing to retrieve ICD9 
codes from them. Wait, how come we STILL have duplicate entries... 
...screw this, we'll be at this forever and still not notice the more 
subtle/obscure stuff. Let's just count every combination of non-numeric, 
non-free-text values (this was before realizing just how variable 
TVAL_CHAR and UNITS_CD are) and in one shot see what is a common case, 
what is a rare case, and what is a nearly impossible case. And now, 
let's see if it's even remotely similar to what's going on at other sites.
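Roughly, the "count every combination" step amounts to something like the following. This is an invented miniature (sample rows, codes, and the domain-extraction rule are stand-ins; the real script's output table is OUTPUT_CON_MOD):

```python
import sqlite3

# Miniature OBSERVATION_FACT with invented rows: two diagnosis facts with
# different modifiers, and three lab facts with differing VALTYPE_CDs.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE observation_fact (
    concept_cd TEXT, modifier_cd TEXT, valtype_cd TEXT)""")
con.executemany("INSERT INTO observation_fact VALUES (?, ?, ?)", [
    ('ICD9:250.00', 'DiagObs:PRIMARY', '@'),
    ('ICD9:250.00', 'DiagObs:MEDICAL_HX', '@'),
    ('COMPONENT_ID:123', '@', 'N'),
    ('COMPONENT_ID:123', '@', 'N'),
    ('COMPONENT_ID:123', '@', 'T'),
])

# Take the domain to be the prefix before ':' and count each
# (domain, modifier, valtype) combination, so one pass shows which cases
# are common, which are rare, and which never occur.
rows = con.execute("""
    SELECT substr(concept_cd, 1, instr(concept_cd, ':') - 1) AS domain,
           modifier_cd, valtype_cd, COUNT(*) AS n
    FROM observation_fact
    GROUP BY domain, modifier_cd, valtype_cd
    ORDER BY n DESC, domain""").fetchall()
print(rows)
```

Comparing such tables across sites is then just a matter of appending a site column and eyeballing (or diffing) the combinations.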
I've seen pretty good cohort characterization progress w.r.t. breast 
cancer and ALS based on HERON data builder results (which are 
analogous to i2b2 PDO results as seen in the timeline plug-in). I 
don't understand why that approach isn't suitable.
I haven't had a chance to look at the ALS script they posted yet, I need 
to do that, I may learn some useful stuff. Are you saying there is now a 
way to get PDOs or data builder results into a denormalized tabular 
format that can be analyzed in SAS or R? Because that was the main 
problem I encountered before meeting with success just using SQL. I 
decided that I'd fall back on R only after I hit a problem SQL couldn't 
handle, and so far this hasn't happened.


Also, much as I like R, there is a lot more SQL expertise in the GPC 
than R expertise, and everyone already has some type of SQL database 
because I2B2 requires it. So a suitably dialect-independent SQL script 
is easier to deploy than convincing everyone to add R to their toolchain.

The script comment continues:

* To create a sort of empirical data dictionary we can all refer
to in
  constructing 

RE: Example (Re: Empirical Data Dictionary)

2014-10-06 Thread Dan Connolly
What you've written is entirely responsive to my questions. As I say, I'm 
largely ignorant of this whole field (empirical data validation), so I 
appreciate being educated by way of verbose specifics.

I think I have a few substantive questions in response, but I want to read over 
what you wrote a couple more times.

--
Dan


From: gpc-dev-boun...@listserv.kumc.edu [gpc-dev-boun...@listserv.kumc.edu] on 
behalf of Alex Bokov [bo...@uthscsa.edu]
Sent: Monday, October 06, 2014 5:05 PM
To: gpc-dev@listserv.kumc.edu
Subject: Re: Example (Re: Empirical Data Dictionary)

First of all, I feel like I'm treading into sensitive territory here, in that 
when I point out problems it can be misinterpreted as criticizing other 
people's work. So let me just re-affirm:

We are all geeks here. It's us against the bugs. When we engage in debate, both 
sides win by finding the optimal answer and it doesn't matter whose answer it 
turns out to be.  ...



R, SQL, i2b2, and governance RE: Example (Re: Empirical Data Dictionary)

2014-10-06 Thread Dan Connolly
(Please excuse the awkward top-posting format; I'm stuck with Microsoft 
Outlook.)

Perhaps we're converging... the new Data Builder code 
(https://informatics.gpcnetwork.org/trac/Project/ticket/134#comment:4) 
delivers an sqlite3 file, so you can continue to use SQL to analyze it; and if 
you like python or Java better than R for post-SQL work, that's fine too.

But note that each Data Builder result is based on an i2b2 patient set that 
came from an audited i2b2 query. We don't have governance to let investigators 
run arbitrary SQL queries on our whole clinical data warehouse and we don't 
plan to (neither KUMC HERON nor GPC). For the 3 initial cohorts, we can get 
away with ad-hoc one-off work, but for GPC work in general, we plan to use 
i2b2 to do as much of the querying as we can.
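Since a Data Builder result is an ordinary sqlite3 file, exploring it needs no special tooling. A sketch (the table name is invented, and an in-memory database stands in for the downloaded file):

```python
import sqlite3

# An in-memory database with one invented table stands in for a downloaded
# Data Builder file; in practice this would be
# sqlite3.connect("builder_result.db") with whatever path the download used.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE observation_fact (patient_num INTEGER, concept_cd TEXT)")

# sqlite_master lists whatever tables the file actually contains, so this is
# a reasonable first query against any unfamiliar result file.
tables = [name for (name,) in con.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]
print(tables)  # ['observation_fact']
```

From there the same SQL-first workflow discussed above applies, with R, Python, or SAS only brought in for whatever SQL can't express.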


--
Dan



From: gpc-dev-boun...@listserv.kumc.edu [gpc-dev-boun...@listserv.kumc.edu] on 
behalf of Alex Bokov [bo...@uthscsa.edu]
Sent: Monday, October 06, 2014 5:05 PM
To: gpc-dev@listserv.kumc.edu
Subject: Re: Example (Re: Empirical Data Dictionary)

[...]
I've seen pretty good cohort characterization progress w.r.t. breast cancer and 
ALS based on HERON data builder results (which are analogous to i2b2 PDO 
results as seen in the timeline plug-in). I don't understand why that approach 
isn't suitable.
I haven't had a chance to look at the ALS script they posted yet, I need to do 
that, I may learn some useful stuff. Are you saying there is now a way to get 
PDOs or data builder results into a denormalized tabular format that can be 
analyzed in SAS or R? Because that was the main problem I encountered before 
meeting with success just using SQL. I decided that I'd fall back on R only 
after I hit a problem SQL couldn't handle, and so far this hasn't happened.

Also, much as I like R, there is a lot more SQL expertise in the GPC than R 
expertise, and everyone already has some type of SQL database because I2B2 
requires it. So a suitably dialect-independent SQL script is easier to deploy 
than convincing everyone to add R to their toolchain.