Example (Re: Empirical Data Dictionary)
Kind thanks to KUMC and the sites that volunteered to test. With help from Wisconsin and MCW so far, I have a lot of revisions to add to the original script that make the output smaller and the syntax less Oracle-specific, and that eliminate or scrub certain fields. If you're not one of the test sites, you might want to hold out for the revised script that I'll send out after having had time to digest the initial reports. In response to some very valid concerns raised, the raw count tables will not be released without explicit consent from the originating sites. Instead, I propose to append to the enclosed table (currently representing UTHSCSA data) and send that out, along with other summary metrics that people might suggest as useful to data standardization and quality control. Is everyone okay with that? I'll have analogous summaries of the other three tables, but those are still in preparation.

Here is the information I was able to obtain from running the scripts I sent earlier on our own data in San Antonio. In 33 lines we get a wealth of hints about how to make queries work as intended at our site, and things we need to improve. Part of this was done by aggregation of OUTPUT_CON_MOD within SQL, but part of it was just manual tinkering with autofilters on a spreadsheet and describing what I saw. Here is what each column means and why one should care:

* *site*: Site for which this row of data was generated.
* *domain*: A consistent substring shared by a group of related CONCEPT_CDs (concept codes). These should all be listed in the YOURMETADATA.SCHEMES table, but I2B2 has no way of enforcing this rule, so instead I got these from the CONCEPT_CDs that are actually there.
* *tval_char*: Usually this field is supposed to contain categorical values (ventral/dorsal/lateral, left/right, high/middle/low/below-threshold, etc.) but it often ends up with a lot of free-text values dumped into it.
In San Antonio data, COMPONENT_ID facts use it as a free-text dump, FLOW_MEAS_ID facts use it as a time-stamp (but the time portions are all 0's), and MEDICATION_ID facts use it to indicate dose ranges (I still need to look into whether these co-occur with NVAL_NUM fields, and if so, how often the values fail to match). _Recommendations:_ This field might be useful for chronologically ordering multi-day FLOW_MEAS_ID observations and quality-checking MEDICATION_ID observations. Further study of lab-value-specific usages of TVAL_CHAR may be needed before lab values that use this field can be relied on either as raw data or as filtering criteria. Certain values in this field can indicate a numeric fact, but VALTYPE_CD is a more reliable indicator.

* *modifier_cd*: Listed here are the observed rules for what (if any) MODIFIER_CDs can occur in observations from each domain. Currently the only domains in San Antonio data that use modifiers are diagnoses (ICD9 and DX_ID), allergies, medications, procedure orders, and family history. Labs also use modifiers, but these are rarely useful since they are hard-coded to be medians and last observations regardless of the size, time-span, or intended analysis of the lab value. _Recommendations:_ For queries on diagnoses, be mindful of whether you want only medical history, only active diagnoses, both, or just primary diagnoses [which in turn raises the question of how to reliably identify those]. Either way, to avoid duplicate counts, be sure to filter on MODIFIER_CD. For queries on a specific allergic reaction, use MODIFIER_CD. For queries on an allergen regardless of reaction type, leave MODIFIER_CD out and use the appropriate SELECT DISTINCT ... GROUP BY ... subquery. For medications, the Inpatient/Outpatient/Historical/PRN/Other set of modifiers can co-exist with each other and with the Dose/Dose_MAR ones on the same drug during the same encounter.
Pulling only facts with Dose or Dose_MAR modifiers /might/ be enough to prevent redundancies, but this is not yet certain. For procedure orders, filter on MODIFIER_CD = '@' if you want all procedures, or MODIFIER_CD = 'PROCORDERS:XX' if you want just procedures of type XX, but do not ignore these modifiers or you will get redundant values. For family history, always use modifiers and treat the combination of modifier and concept code as a distinct column; or, if you don't care about the type of family history, just filter on MODIFIER_CD != 'FamRelation:18' and then use the appropriate SELECT DISTINCT ... GROUP BY ... subquery (or, for confirmed negative family histories, MODIFIER_CD = 'FamRelation:18').

* *valtype_cd*: Only lab results (COMPONENT_ID), flowsheets (FLOW_MEAS_ID), meds (MEDICATION_ID), vitals (PAT_ENC), PACK_PER_DAY, and TOBACCO_USED_YEARS use this field, but the latter two always have the value set to 'N'. _Recommendations_: can be used to
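The deduplication advice above (filter on MODIFIER_CD when the modifier matters, or collapse with a SELECT DISTINCT subquery when it doesn't) can be sketched as follows. This is a minimal illustration using SQLite via Python rather than the Oracle-based script; the concept and modifier codes ('ALLERGEN:PCN', 'REACTION:...') are made up for the example, not taken from any site's data.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE observation_fact (
  patient_num INTEGER, encounter_num INTEGER,
  concept_cd TEXT, modifier_cd TEXT);
-- the same allergen fact is recorded once per reaction modifier
INSERT INTO observation_fact VALUES
  (1, 10, 'ALLERGEN:PCN', 'REACTION:HIVES'),
  (1, 10, 'ALLERGEN:PCN', 'REACTION:RASH'),
  (2, 20, 'ALLERGEN:PCN', 'REACTION:HIVES');
""")

# A naive count double-counts patient 1, because the allergen
# appears once for each reaction modifier.
naive = con.execute(
    "SELECT COUNT(*) FROM observation_fact "
    "WHERE concept_cd = 'ALLERGEN:PCN'"
).fetchone()[0]

# Collapsing to distinct patients first avoids the duplicates
# when the reaction type doesn't matter for the question asked.
dedup = con.execute("""
    SELECT COUNT(*) FROM (
      SELECT DISTINCT patient_num FROM observation_fact
      WHERE concept_cd = 'ALLERGEN:PCN')
""").fetchone()[0]

print(naive, dedup)  # 3 2
```

The same inner SELECT DISTINCT pattern applies to the family history case, substituting the FamRelation modifier filter.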
RE: Example (Re: Empirical Data Dictionary)
I started trying out the script last week. 1st bit of feedback: it was taking a lot longer than the estimated 20 minutes. Just building the 1st intermediate table took longer than that. I have since lost my session/context (had to reboot my desktop... sigh...)

I guess my high order feedback is: which end is up? What's the goal here? I was expecting something related to cohort characterization, but if it is related, I don't understand how. Perhaps it's clear/obvious to others. The script comment begins: "I originally started out writing a cohort-characterization script that could be adapted to the needs of many different research projects just by altering the cohort selection criteria (since most cohort characterization needs seem to follow the same basic pattern). However, from UTHSCSA's own trial-and-error and from discussions on the mailing lists it has become clear that for multi-site counts to get non-misleading results we need to have a very detailed characterization of each site's I2B2 semantics within its CRC (data repository) cell." I'm interested to know more about the trial-and-error steps. I've seen pretty good cohort characterization progress w.r.t. breast cancer and ALS based on HERON data builder results (which are analogous to i2b2 PDO results as seen in the timeline plug-in). I don't understand why that approach isn't suitable. The script comment continues: "* To create a sort of empirical data dictionary we can all refer to in constructing federated queries that will run correctly at GPC sites as they *currently* are, so that cohort characterization can proceed in parallel with data standardization." How is this empirical data dictionary different from babel? We've been proceeding in parallel, no? We do manual terminology alignment in some cases (e.g. breast cancer procedures for #119) while we work out the details of automated terminology alignment (#160).
"* To help those working on data standardization see what else needs to be done and where it needs to be done the most. * To make certain data quality issues easier to spot." Data quality issues such as... what? We certainly have a long list of things to be done for data standardization. Can you give an example of how this script helps with determining where it needs to be done most? About the items below... For domain, why not the first part of the concept path? I wonder how free text got into tval_char; the HERON ETL scripts are fairly conservative about avoiding free text. Structuring a list of recommendations around modifier_cd and such seems odd to me. -- Dan

From: gpc-dev-boun...@listserv.kumc.edu [gpc-dev-boun...@listserv.kumc.edu] on behalf of Alex Bokov [bo...@uthscsa.edu] Sent: Monday, October 06, 2014 12:57 PM To: gpc-dev@listserv.kumc.edu Subject: Example (Re: Empirical Data Dictionary) [...]
RE: Updated milestone report from todays GPC-DEV call; 4Public
Please copy gpc-dev when you submit it, John. -- Dan From: gpc-dev-boun...@listserv.kumc.edu [gpc-dev-boun...@listserv.kumc.edu] on behalf of John Steinmetz [jsteinm...@kumc.edu] Sent: Tuesday, September 30, 2014 1:24 PM To: 'Campbell, James R'; gpc-dev@listserv.kumc.edu Subject: RE: Updated milestone report from todays GPC-DEV call; 4Public Fantastic. Thanks to all. I will get it submitted to PCORI. John. From: Campbell, James R [mailto:campb...@unmc.edu] Sent: Tuesday, September 30, 2014 1:08 PM To: gpc-dev@listserv.kumc.edu; John Steinmetz Subject: Updated milestone report from todays GPC-DEV call; 4Public All I believe I made all the stated changes to personnel...please check me out John use this copy for preparing your report Jim The information in this e-mail may be privileged and confidential, intended only for the use of the addressee(s) above. Any unauthorized use or disclosure of this information is prohibited. If you have received this e-mail by mistake, please delete it and immediately contact the sender. ___ Gpc-dev mailing list Gpc-dev@listserv.kumc.edu http://listserv.kumc.edu/mailman/listinfo/gpc-dev
RE: Updated milestone report from todays GPC-DEV call; 4Public
Here is the link https://pcornet.centraldesktop.com/c4gpc/file/34887855/ to the deliverable on CDT. It is one document that contains 2.1, 2.5, and 2.7. 2.3 was previously submitted, and can be found on CDT at this link https://pcornet.centraldesktop.com/c4gpc/file/32487317/. John.

From: Dan Connolly Sent: Monday, October 06, 2014 1:24 PM To: John Steinmetz; 'Campbell, James R'; gpc-dev@listserv.kumc.edu Subject: RE: Updated milestone report from todays GPC-DEV call; 4Public [...]
Re: [gpc-informatics] #158: usable view of LOINC lab terms
#158: usable view of LOINC lab terms
  Reporter:  rwaitman     |  Owner:      budh0007
  Type:      enhancement  |  Status:     assigned
  Priority:  major        |  Milestone:  data-domains2
  Component: data-stds    |  Resolution:
  Keywords:               |  Blocked By:
  Blocking:  68           |
Changes (by dconnolly):
* milestone: => data-domains2
Ticket URL: http://informatics.gpcnetwork.org/trac/Project/ticket/158#comment:3
gpc-informatics http://informatics.gpcnetwork.org/ Greater Plains Network - Informatics
gpc-dev agenda 7 Oct
I don't expect we'll get through all of this, so come prepared with input on the order... Tues 11am CT:

1. Convene, take roll, review records and plan next meeting
 * Meeting ID and access code: 686-845-717 https://global.gotomeeting.com/meeting/join/686845717; call +1 (267) 507-0008
 * meeting notes (#12 https://informatics.gpcnetwork.org/trac/Project/ticket/12): 30 Sep notes OK? today's scribe: UTHSCSA
 * roll: all 10 DevTeams https://informatics.gpcnetwork.org/trac/Project/wiki/DevTeams represented? comments on the agenda? KUMC, CMH, UIOWA, WISC, MCW, MCRF, UMN, UNMC, UTHSCSA, UTSW
 * anything to discuss on usable view of LOINC lab terms #158 https://informatics.gpcnetwork.org/trac/Project/ticket/158?
 * new/closed/reopened tickets: quite a few
 * next meeting: 14 Oct. scribe volunteer?
2. HackathonTwo? https://informatics.gpcnetwork.org/trac/Project/wiki/HackathonTwo (cf. HackathonOne https://informatics.gpcnetwork.org/trac/Project/wiki/HackathonOne)
3. data builder (#87 https://informatics.gpcnetwork.org/trac/Project/ticket/87)
 * demo for Henderson @ uky.edu
 * sync up with George, Phillip on deployment (#87 etc.)
 * discuss connections to cohort characterization (MethodsCore https://informatics.gpcnetwork.org/trac/Project/wiki/MethodsCore)
4. Empirical Data Dictionary http://listserv.kumc.edu/pipermail/gpc-dev/2014q4/000567.html Alex Bokov (re #132 https://informatics.gpcnetwork.org/trac/Project/ticket/132, #106 https://informatics.gpcnetwork.org/trac/Project/ticket/106)
5. #173 https://informatics.gpcnetwork.org/trac/Project/ticket/173 (REDCap for ALS survey at each GPC site) poll
6. KeyGoalTracking https://informatics.gpcnetwork.org/trac/Project/wiki/KeyGoalTracking: Breast Cancer Cohort Selection needs Tumor Registry in i2b2 by Nov 15 milestone:bc-survey-cohort-def https://informatics.gpcnetwork.org/trac/Project/milestone/bc-survey-cohort-def
 * #168 https://informatics.gpcnetwork.org/trac/Project/ticket/168, #169 https://informatics.gpcnetwork.org/trac/Project/ticket/169, #170 https://informatics.gpcnetwork.org/trac/Project/ticket/170, #171 https://informatics.gpcnetwork.org/trac/Project/ticket/171 tumor registry counts not available
 * #167 https://informatics.gpcnetwork.org/trac/Project/ticket/167 (Breast Cancer Cohort Selection Criteria) created by bchrischilles
7. KeyGoalTracking: ALS Survey in Dec 2014 milestone:survey-redcap-als https://informatics.gpcnetwork.org/trac/Project/milestone/survey-redcap-als
 * #159 https://informatics.gpcnetwork.org/trac/Project/ticket/159 (GPC REDCap Service) created
 * #174 https://informatics.gpcnetwork.org/trac/Project/ticket/174 (federated login for GPC data store) created
8. KeyGoalTracking: PCORNet DRN (CDM) query readiness
 * milestone for popmednet tickets? (#154 https://informatics.gpcnetwork.org/trac/Project/ticket/154 ... #166 https://informatics.gpcnetwork.org/trac/Project/ticket/166)
 * note longer-term goal #160 https://informatics.gpcnetwork.org/trac/Project/ticket/160, milestone:data-domains2 https://informatics.gpcnetwork.org/trac/Project/milestone/data-domains2
9. age queries ticket:67#comment:11 https://informatics.gpcnetwork.org/trac/Project/ticket/67#comment:11 Sep 2

-- Dan
RE: pattern of cohort characterization needs? RE: Empirical Data Dictionary
Nice. For bonus points, update the obesity data elements ticket (#33 https://informatics.gpcnetwork.org/trac/Project/ticket/33) to note this pattern in general and the obesity aspects in particular. For sensitive material, put it in the KUMC REDCap project I recently invited you to: * GPC Cohort Characterization Work https://redcap.kumc.edu/redcap_v5.7.7/ProjectSetup/index.php?pid=3560 and point to it from #33 -- Dan

From: gpc-dev-boun...@listserv.kumc.edu [gpc-dev-boun...@listserv.kumc.edu] on behalf of Alex Bokov [bo...@uthscsa.edu] Sent: Monday, October 06, 2014 1:45 PM To: gpc-dev@listserv.kumc.edu Subject: Re: pattern of cohort characterization needs? RE: Empirical Data Dictionary

I don't claim this covers all cases, only the ones I can think of so far. If anyone can think of a cohort characterization question that cannot be answered by the below procedure, I am interested in learning about it. On 10/03/2014 04:56 PM, Dan Connolly wrote: "most cohort characterization needs seem to follow the same basic pattern" What pattern is that?

Preliminary Cohort Characterization
1. Elicit from the domain experts minimal criteria for membership of a patient in the cohort of interest (i.e. cast a wide net).
2. Elicit from the domain experts facts at the visit level that are of interest about those patients.
3. Pull down all available demographic data for that patient set.
4. Left join the above to a column containing the total visit count for each patient, broken up by year.
5. For each fact from #2, join an additional column with the visit count for each patient. You now have one row for each year each patient is in the system, with a separate column for each static value for that patient, a column for the total number of visits they had that year, and an additional column for each subset of those visits your domain experts flagged as possibly interesting.
6a.
For deliverables asking for the number of distinct patients meeting a certain criterion, COUNT all the visit counts grouping by every demographic variable of interest and year.
6b. For deliverables asking for the number of distinct visits meeting a certain criterion, SUM all the visit counts grouping by every demographic variable of interest and year.
6c. If you want totals over all years in the system, for visits just SUM up the years. For patients, SELECT DISTINCT patients, demographic variables, and indicator variables for whether the number of visits in each category is 0 or 1 (omit years this time). Or, do #5 but omit year in the first place.
7. Filter OBSERVATION_FACT by membership of PATIENT_NUM in the patient-set from #1 and then do a count of visits and/or of patients for each CONCEPT_CD (filtered in a domain-appropriate manner on MODIFIER_CD).

6a and 6b tell you whether it's feasible to require that certain observations be present for each visit or each patient (i.e. that if you did that, your inclusion criteria would not be so strict that you'd end up with an insufficient sample size). They also give you an idea of your cohort's demographic makeup and how/if it has changed over time. 7 tells you what the most common facts are for this preliminary cohort, even if they were not singled out by the domain experts. In consultation with them, additional selection criteria might be drawn.

Refinements of Cohort Characterization
1. Optionally tighten the membership criteria (e.g. in our initial characterization it looks like most patients who have one XYZ measurement on file have half a dozen of them, so might as well make that the floor) and optionally limit the time range (e.g. initial characterization indicates we have large samples available between 2010-2013, so let's use only those years to begin with).
2. Optionally revise the visit-level features of interest (e.g. a procedure hardly ever gets ordered? Omit it this time.
A drug you weren't aware of turns out to be prescribed to 30% of the patients? Dedicate a new column to it.) This may be the place to put in complex temporal queries so you aren't grinding the server on a huge dataset needlessly.
3. Are some of the original demographic variables too sparse for this cohort, or not used at all? Optionally omit them.
4-7. As above. Repeat as necessary (I expect one iteration to be enough in many cases) until the clinicians and informaticians converge on a patient-set and visit-set that are of adequate size and relevant to the clinical problem of interest.

You'll notice that there is variability from study to study in two places: A. The 'WHERE' clause for selecting the patient-set. B. The 'WHERE' clause in each variable column. Everything else looks like it could be factored out into a generic query or procedure.
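Steps 4-6 of the preliminary characterization above can be sketched as one grouped left join. The sketch below uses SQLite via Python; the table layout loosely follows the i2b2 star schema, but the start_year column (in real i2b2 you would derive the year from the visit start_date), the concept code 'LAB:XYZ', and all values are invented for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE visit_dimension (
  encounter_num INTEGER, patient_num INTEGER, start_year INTEGER);
CREATE TABLE observation_fact (
  encounter_num INTEGER, patient_num INTEGER, concept_cd TEXT);
INSERT INTO visit_dimension VALUES
  (1, 100, 2012), (2, 100, 2012), (3, 100, 2013), (4, 200, 2012);
INSERT INTO observation_fact VALUES
  (1, 100, 'LAB:XYZ'), (4, 200, 'LAB:XYZ');
""")

# One row per patient per year: total visit count, plus a column
# counting the subset of visits carrying the fact of interest.
rows = con.execute("""
  SELECT v.patient_num, v.start_year,
         COUNT(DISTINCT v.encounter_num) AS n_visits,
         COUNT(DISTINCT f.encounter_num) AS n_xyz_visits
  FROM visit_dimension v
  LEFT JOIN observation_fact f
    ON f.encounter_num = v.encounter_num AND f.concept_cd = 'LAB:XYZ'
  GROUP BY v.patient_num, v.start_year
  ORDER BY v.patient_num, v.start_year
""").fetchall()

for r in rows:
    print(r)
# (100, 2012, 2, 1)
# (100, 2013, 1, 0)
# (200, 2012, 1, 1)
```

Steps 6a/6b then reduce this table further: COUNT rows per demographic group for distinct patients, or SUM n_visits per group for distinct visits.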
Re: Example (Re: Empirical Data Dictionary)
First of all, I feel like I'm treading into sensitive territory here, in that where I point out problems it can be misinterpreted as criticizing other people's work. So let me just re-affirm: we are all geeks here. It's us against the bugs. When we engage in debate, both sides win by finding the optimal answer, and it doesn't matter whose answer it turns out to be. Both sides lose by talking past each other and taking things personally. I admire your work, Dan, and the team you put together, and y'all are a role model to us and hopefully to other sites. I am grateful to you guys for sharing the Heron ETL code. Adapting it has saved us YEARS of effort, and it was and is far better than anything I can even imagine writing from scratch (and I tried, before I came to understand the Zen of Heron). This is me trying, in a small way, to give something back. Unfortunately that something is verbose and probably annoying to read. Sorry. I spent most of the day trying to file the rough edges off this and probably didn't get most of them. Anyway, the point is, I mean it constructively.

On 10/06/2014 01:20 PM, Dan Connolly wrote: "I started trying out the script last week. 1st bit of feedback: It was taking a lot longer than the estimated 20 minutes. Just building the 1st intermediate table took longer than that." Sorry. Revised script in progress. I'm guessing the big resource hog was OUTPUT_CON_MOD? In the revised script, I filter the result-set on `WHERE N_VISITS > 10` and that cuts down its size by about 50%.

"I guess my high order feedback is: which end is up? What's the goal here? I was expecting something related to cohort characterization, but if it is related, I don't understand how. Perhaps it's clear/obvious to others."
We can't do cohort characterization or any useful queries at all if we don't understand which modifiers matter and which don't, which values can occur and which cannot, which types of data are always used and which aren't, which of our assumptions about each field in OBSERVATION_FACT are valid and which aren't. And, how much of this differs from site to site. The parent post to this one was my attempt to clarify what I was trying to accomplish. If I can get OUTPUT_CON_MOD from each site (abridged, to drop the lines with fewer than 10 occurrences) I can munge it all into this one sheet, and report back to the list these answers for all the sites, obtained by inspection of this sheet.

"I'm interested to know more about the trial-and-error steps." Something like this: this seems like a query that will get the data we're looking for. Wait, how come we have duplicate entries? Oh, I see, you can have multiple encounters per calendar day. Work around it. Wait, how come we have duplicate entries? Oh, I see, a procedure order with '@' in its MODIFIER_CD field will always be accompanied by a second identical procedure order with a MODIFIER_CD that specifies whether it is Inpatient or Outpatient. Maybe a bug, maybe a feature, but I guess I better work around it in my queries. Wait, how come we have duplicate entries? Oh, I see, diagnoses have a similar rule governing MODIFIER_CD usage, but more complicated. Wait, why do some diagnosis codes look different from others? Oh, I see, the DX_ID ones are more specific than the ICD9 ones, and you have to do some text parsing to retrieve ICD9 codes from them. Wait, how come we STILL have duplicate entries... ...screw this, we'll be at this forever and still not notice the more subtle/obscure stuff. Let's just count every combination of non-numeric, non-free-text values (this was before realizing just how variable TVAL_CHAR and UNITS_CD are) and in one shot see what is a common case, what is a rare case, and what is a nearly impossible case.
And now, let's see if it's even remotely similar to what's going on at other sites.

"I've seen pretty good cohort characterization progress w.r.t. breast cancer and ALS based on HERON data builder results (which are analogous to i2b2 PDO results as seen in the timeline plug-in). I don't understand why that approach isn't suitable." I haven't had a chance to look at the ALS script they posted yet; I need to do that, I may learn some useful stuff. Are you saying there is now a way to get PDOs or data builder results into a denormalized tabular format that can be analyzed in SAS or R? Because that was the main problem I encountered before meeting with success just using SQL. I decided that I'd fall back on R only after I hit a problem SQL couldn't handle, and so far this hasn't happened. Also, much as I like R, there is a lot more SQL expertise in the GPC than R expertise, and everyone already has some type of SQL database because I2B2 requires it. So a suitably dialect-independent SQL script is easier to deploy than convincing everyone to add R to their toolchain. The script comment continues: * To create a sort of empirical data dictionary we can all refer to in constructing
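The "count every combination of non-numeric, non-free-text values" idea amounts to a single GROUP BY over OBSERVATION_FACT. A minimal sketch (SQLite via Python; the domain is derived here simply as the prefix before ':', and the codes and data are made-up examples in the spirit of OUTPUT_CON_MOD, not its actual definition):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE observation_fact (
  patient_num INTEGER, encounter_num INTEGER,
  concept_cd TEXT, modifier_cd TEXT, valtype_cd TEXT);
INSERT INTO observation_fact VALUES
  (1, 10, 'ICD9:250.00', '@', '@'),
  (1, 10, 'ICD9:250.00', 'DiagObs:PROBLEM_LIST', '@'),
  (2, 20, 'COMPONENT_ID:3', '@', 'N'),
  (3, 30, 'COMPONENT_ID:7', '@', 'T');
""")

# Derive "domain" as the prefix before ':' and count how often each
# (domain, modifier_cd, valtype_cd) combination actually occurs,
# by raw fact and by distinct visit.
rows = con.execute("""
  SELECT substr(concept_cd, 1, instr(concept_cd, ':') - 1) AS domain,
         modifier_cd, valtype_cd,
         COUNT(*) AS n_facts,
         COUNT(DISTINCT encounter_num) AS n_visits
  FROM observation_fact
  GROUP BY domain, modifier_cd, valtype_cd
  ORDER BY domain, modifier_cd, valtype_cd
""").fetchall()

for r in rows:
    print(r)
```

Inspecting the resulting combinations (rather than chasing duplicates one query at a time) is what surfaces common, rare, and nearly impossible cases in one shot.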
RE: Example (Re: Empirical Data Dictionary)
What you've written is entirely responsive to my questions. As I say, I'm largely ignorant of this whole field (empirical data validation), so I appreciate being educated by way of verbose specifics. I think I have a few substantive questions in response, but I want to read over what you wrote a couple more times. -- Dan

From: gpc-dev-boun...@listserv.kumc.edu [gpc-dev-boun...@listserv.kumc.edu] on behalf of Alex Bokov [bo...@uthscsa.edu] Sent: Monday, October 06, 2014 5:05 PM To: gpc-dev@listserv.kumc.edu Subject: Re: Example (Re: Empirical Data Dictionary) First of all, I feel like I'm treading into sensitive territory here, in that where I point out problems it can be misinterpreted as criticizing other people's work. So let me just re-affirm: We are all geeks here. It's us against the bugs. When we engage in debate, both sides win by finding the optimal answer and it doesn't matter whose answer it turns out to be. ...
R, SQL, i2b2, and governance RE: Example (Re: Empirical Data Dictionary)
(Please excuse the awkward top-posting format; I'm stuck with Microsoft Outlook.) Perhaps we're converging... the new Data Builder code https://informatics.gpcnetwork.org/trac/Project/ticket/134#comment:4 delivers an sqlite3 file, so you can continue to use SQL to analyze it; and if you like Python or Java better than R for post-SQL work, that's fine too. But note that each Data Builder result is based on an i2b2 patient set that came from an audited i2b2 query. We don't have governance to let investigators run arbitrary SQL queries on our whole clinical data warehouse, and we don't plan to (neither KUMC HERON nor GPC). For the 3 initial cohorts, we can get away with ad-hoc one-off work, but for GPC work in general, we plan to use i2b2 to do as much of the querying as we can. -- Dan

From: gpc-dev-boun...@listserv.kumc.edu [gpc-dev-boun...@listserv.kumc.edu] on behalf of Alex Bokov [bo...@uthscsa.edu] Sent: Monday, October 06, 2014 5:05 PM To: gpc-dev@listserv.kumc.edu Subject: Re: Example (Re: Empirical Data Dictionary) [...] I've seen pretty good cohort characterization progress w.r.t. breast cancer and ALS based on HERON data builder results (which are analogous to i2b2 PDO results as seen in the timeline plug-in). I don't understand why that approach isn't suitable. I haven't had a chance to look at the ALS script they posted yet, I need to do that, I may learn some useful stuff. Are you saying there is now a way to get PDOs or data builder results into a denormalized tabular format that can be analyzed in SAS or R? Because that was the main problem I encountered before meeting with success just using SQL. I decided that I'd fall back on R only after I hit a problem SQL couldn't handle, and so far this hasn't happened. Also, much as I like R, there is a lot more SQL expertise in the GPC than R expertise, and everyone already has some type of SQL database because I2B2 requires it.
So a suitably dialect-independent SQL script is easier to deploy than convincing everyone to add R to their toolchain.