WISC does have seq num 380 (central). It appears that the sites are split on whether they have values for 380 vs 560. As long as you have values for one of these two, then you're consistent with what everyone else has sent. FYI, gpc_seq_no in the datamart is populated by both of the two variables; it takes 380 if that is present in the datamart, otherwise it takes 560.
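As a rough illustration of the fallback rule described above, a minimal Python sketch might look like the following. The function name, the record keys (n0380, n0560), the site labels, and the use of None for an absent value are assumptions made for illustration; the actual datamart logic is not shown in this thread.

```python
from typing import Optional


def derive_gpc_seq_no(seq_central: Optional[str],
                      seq_hospital: Optional[str]) -> Optional[str]:
    """Prefer NAACCR item 380 (central); fall back to item 560 (hospital).

    Assumption: an absent value is represented as None.
    """
    if seq_central is not None:
        return seq_central
    return seq_hospital


# Hypothetical records: one site reports only 380, another only 560.
records = [
    {"site": "SITE_A", "n0380": "00", "n0560": None},
    {"site": "SITE_B", "n0380": None, "n0560": "01"},
    {"site": "SITE_C", "n0380": None, "n0560": None},
]

derived = [derive_gpc_seq_no(r["n0380"], r["n0560"]) for r in records]
```

Under this rule, a site that supplies either of the two items still ends up with a populated gpc_seq_no; only a record with neither value stays empty.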
From: Debbie Yoshihara [mailto:dlyos...@wisc.edu]
Sent: Monday, February 22, 2016 11:26 AM
To: McDowell, Bradley D; gpc-dev@listserv.kumc.edu
Subject: RE: data issues

Brad,

Ok, thanks for that clarification. I'm just trying to figure out if we need to pull additional values from the tumor registry feed. It looks like we have to figure out the 560 SEQNO HOSPITAL values.

Thanks.
--- Debbie

From: McDowell, Bradley D [mailto:bradley-mcdow...@uiowa.edu]
Sent: Monday, February 22, 2016 11:22 AM
To: Debbie Yoshihara; gpc-dev@listserv.kumc.edu
Subject: RE: data issues

Sorry, now that I think of it, I believe "tr" means "tumor registry". So I'll bet vital_tr was copied over from NAACCR 1760. gpc_vital would be the variable that takes values from gpc_vital_tr and/or gpc_vital_ehr.

From: McDowell, Bradley D
Sent: Monday, February 22, 2016 11:14 AM
To: 'Debbie Yoshihara'; gpc-dev@listserv.kumc.edu
Subject: RE: data issues

I think gpc_vital_tr uses values from both gpc_vital_ehr and NAACCR 1760. If one of the original values isn't present in the dataset, then the other value is used to create vital_tr. That's my best guess. All of the GPC variables were created by Vince to create the datamart. There may be documentation at KUMC to confirm whether this is true.

From: Debbie Yoshihara [mailto:dlyos...@wisc.edu]
Sent: Monday, February 22, 2016 11:05 AM
To: McDowell, Bradley D; gpc-dev@listserv.kumc.edu
Subject: RE: data issues

Hi Brad,

Ok, I'm confused, what is the difference between gpc_vital and gpc_vital_ehr then?
--- Debbie

From: McDowell, Bradley D [mailto:bradley-mcdow...@uiowa.edu]
Sent: Monday, February 22, 2016 10:59 AM
To: Debbie Yoshihara; gpc-dev@listserv.kumc.edu
Subject: RE: data issues

Hi Debbie,

My understanding is that gpc_vital_ehr provides the vital status from the electronic health record, whereas the value from NAACCR 1760 comes from the tumor registry. So we do have vital status from WISC; it's just that one of these two (redundant) variables wasn't populated.

This should probably be part of a larger conversation: It might not make sense to have the EHR version of this variable for these kinds of GPC data sets. Sometimes two versions of vital status are included in non-GPC data sets because one source might be more recently updated. I get the impression that these two variables might be perfectly redundant for some GPC sites; for at least one site, the NAACCR version was calculated for the purpose of this data set using the EHR. If that's a common approach, then perhaps we should just drop the EHR version in favor of the NAACCR-standardized version.

Hope this helps...
Brad

From: Debbie Yoshihara [mailto:dlyos...@wisc.edu]
Sent: Monday, February 22, 2016 10:37 AM
To: McDowell, Bradley D; gpc-dev@listserv.kumc.edu
Subject: RE: data issues

Hi Bradley,

I wanted some clarification about gpc_vital_ehr. The NAACCR codes NAACCR|1760:0 and NAACCR|1760:1 appear to hold these values and WISC has these values, so I'm not sure why it's showing 100% missing for WISC.
--- Debbie Yoshihara

From: gpc-dev-boun...@listserv.kumc.edu [mailto:gpc-dev-boun...@listserv.kumc.edu] On Behalf Of McDowell, Bradley D
Sent: Monday, February 01, 2016 11:01 AM
To: gpc-dev@listserv.kumc.edu
Subject: FW: data issues

Dan asked me to forward this message to this group:

From: McDowell, Bradley D
Sent: Tuesday, January 26, 2016 11:00 AM
To: Dan Connolly (dconno...@kumc.edu)
Cc: Chrischilles, Elizabeth A; Gryzlak, Brian M
Subject: data issues

Hi Dan,

Betsy asked me to provide a list of the issues that have been uncovered so far with respect to the oncology registry data. The hope is that we can establish tickets for each problem. There are some problems that are easy to describe, and some that are not so easy.

Some of the easy ones:

* Missing MCW patient (https://informatics.gpcnetwork.org/trac/Project/ticket/453)
* Duplicate records (indicated with equivalent sequence number for same Study ID), some with updated surgery, class of case variables
* Some duplicate records have dx dates that appear to have been copied from last contact date (dx date not the same across duplicates)
* UMN switched | and : for NAACCR variables (not "*.Descriptor" variables)

For the duplicated records, I have put together a spreadsheet that nicely illustrates the problem, and I'm happy to share that. We'll have to transfer it via redcap or some other secure means since it contains patient level data.

Regarding the not so easy issues:

* One problem concerns inconsistencies in coded values. For example, gpc_language has four different values for "English". In general, UIOWA is not using the same descriptor values as other sites, and that accounts for most of these. It is not the only offender, however. MCW uses a different convention for seer_site_breast (as does UIOWA) and Race descriptors are different for UIOWA and WISC. These inconsistencies have percolated through to the derived GPC variables. I am writing a mapping program to handle this with the registry data we have received so far. I'm certainly willing to share what I have if it would help you.
* Another big problem concerns missing values. I have attached a report that provides the percentage of missing values, organized by site and variable. This illustrates, for example, that UIOWA has no data for the Race 5 variable (i.e., 100% of the values equal "NA"; this does NOT reflect cases where a value is assigned for the NAACCR code for 'missing'). It also illustrates some other things that we have discussed; for example, sites that reported data for central sequence number did not report data for hospital sequence number (and vice versa).
   o UIOWA and MCRF appear to have the biggest problems with missing data.
* We also need to figure out why so many patients in our database do not appear to have tumors diagnosed between 01JAN2013 and 01MAY2014.

(General observation: You'll notice that each of the NAACCR concepts corresponds to two variables (e.g., N0670_Surg_Prim_Site and N0670_Surg_Prim_Site_D). Vince and I settled on this arrangement for the datamart. Since then, though, I've come to believe that the redundancy makes the database difficult to use. Perhaps we could keep that in mind for future data cuts.)

I'm very happy to work on these problems with you. Would you like to schedule a phone call to plan out how to approach these issues?

Thanks,
Brad

------------------------------------------------------------
Bradley D. McDowell, Ph.D.
Director, Population Research Core
Holden Comprehensive Cancer Center
5240 MERF | The University of Iowa | Iowa City, IA | 52242
Office: 319-384-1768
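For reference, the kind of missing-value report described in the original message (percentage of "NA" values by site and variable) could be computed along the following lines. This is only a sketch under stated assumptions: the row layout, the function name, the site labels, and the variable names are hypothetical, and the attached report itself is not reproduced in this thread. Following the message, only the literal "NA" counts as missing; a coded value that means "unknown" is treated as present.

```python
from collections import defaultdict

MISSING = "NA"  # per the message, only "NA" is counted as missing


def percent_missing(rows, site_key="site"):
    """Return {(site, variable): percent of values equal to "NA"}."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for row in rows:
        site = row[site_key]
        for var, value in row.items():
            if var == site_key:
                continue
            totals[(site, var)] += 1
            if value == MISSING:
                missing[(site, var)] += 1
    return {key: 100.0 * missing[key] / totals[key] for key in totals}


# Hypothetical rows; "99" stands in for a coded "unknown" value, which is
# deliberately not counted as missing.
rows = [
    {"site": "SITE_A", "race5": "NA", "seq_central": "00"},
    {"site": "SITE_A", "race5": "NA", "seq_central": "NA"},
    {"site": "SITE_B", "race5": "99", "seq_central": "01"},
]

report = percent_missing(rows)
```

Keying the result by (site, variable) mirrors the "organized by site and variable" layout mentioned in the message and makes it easy to spot variables that are 100% missing at a single site.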
_______________________________________________ Gpc-dev mailing list Gpc-dev@listserv.kumc.edu http://listserv.kumc.edu/mailman/listinfo/gpc-dev