Duplicates or synonyms in NAACCR ICD_O_MORPH?
In QA-ing our NAACCR data, we found some apparent duplicates in the NAACCR ontology as produced by the heron code. Duplicates being defined as two metadata records with the same c_fullname (not synonyms). These appear to be caused by minor differences in spelling and punctuation in ICD_O_MORPH, like tumor vs. tumour. For example, this query: SELECT * FROM I2B2_DEV_ETL..ICD_O_MORPH icdo where icdo.CONCEPT_NAME like '%8010%' order by concept_cd; yields records with concept_name of '8010/0 Epithelial tumour, benign' and '8010/0 Epithelial tumor, benign' with the only difference being the English spelling of tumour. There are other minor differences like '8010/6 Carcinoma, metastatic NOS' '8010/6 Carcinoma, metastatic, NOS' /* extra comma */ This in turn was caused by slight differences between MORPH2 and MORPH3, aka ICD-0-2 and ICD-0-3. So what if anything did you folks do with this? Essentially they're synonyms (if I understand I2B2 synonyms correctly). Are they useful as such? Or did you just wind up nuking the extras on more or less random criteria (like getting rid of all the tumour entries or some such)? Patrick Lenon HIMC Informatics Specialist 608 890 5671 ___ Gpc-dev mailing list Gpc-dev@listserv.kumc.edu http://listserv.kumc.edu/mailman/listinfo/gpc-dev
NAACCR 0521 Morphology - extra values
On checking fact table concepts vs. concept dimension for UW NAACCR, we have ~200 fact records (out of 60,000) whose concepts don't match up. In other words, their 521-Morphology code value (actually 522 - Histology plus 523-Behavior) doesn't match any value in the ICDO_O_MORPH table set up in the naaccr_concepts_load.sql script. Examples are 98153, 80412, 80722. The most likely explanation is data errors of some kind. So first, have others experienced this sort of thing with field 521? Are there other possible explanations? Second, assuming this data is defective, is there a protocol or principle to follow with such records? i.e., gently correct them vs. nuke them on sight? Patrick Lenon HIMC Informatics Specialist 608 890 5671 ___ Gpc-dev mailing list Gpc-dev@listserv.kumc.edu http://listserv.kumc.edu/mailman/listinfo/gpc-dev
NAACCR - Free text fields
In trying to ensure all my Tumor Registry fact table items have corresponding concept codes in the Concept Dimension, I found NAACCR item 310 (Text-Usual Occupation) which has a format of Free text. As you'd expect there are no entries in the NAACCR metadata for that. However, following the Heron fact load code, I created a whole bunch of facts with concept codes like NAACCR|310:(n)TH GRADE TEACHER - (small Wisconsin town) SCHOOL DISTRICT To my semi-trained eye this looks like it would be pretty useless to I2B2. So I'm wondering what other sites do in similar situations. Possibilities that have occurred to me already are: 1) Exclude all free text format fields from the fact load. 2) Leave them in, hoping for codification someday Is there any reason NOT to exclude free text fields? Or some criteria to include some and exclude others? Patrick Lenon HIMC Informatics Specialist 608 890 5671 ___ Gpc-dev mailing list Gpc-dev@listserv.kumc.edu http://listserv.kumc.edu/mailman/listinfo/gpc-dev
RE: NAACCR - Free text fields
OK, I got the answer to my question (“Do we exclude free text fields?”), which is “Yes.” Thx. Now, your questions: Our basic ETL to get UW NAACCR data into the database is in Informatica, and mostly resolves differences between UW field names and official NAACCR field names. I also do the de-identification step in Informatica. From then on I largely follow the naaccr_txform script. I used your code to include/exclude NAACCR sections and items based on numerous criteria, but I don’t believe I see item format of “free text” as one of the criteria. So my fields included 310 and 320, which I can easily fix. I did vary somewhat in an attempt to build in some flexibility. Results are essentially the same as your system, so this may be largely academic, but you asked, so: Where I varied was an attempt to set up my NAACCR metadata in such a way that I wouldn’t have to hard code which NAACCR sections and items were included in the EAV. To do so I added a logical “ONTOLOGY” field to the section and item tables like so (apologies for formatting): DROP TABLE naaccr_tblSection ; CREATE TABLE naaccr_tblSection ( Section_ID integer not null primary key , Sectionvarchar(46) , SectionPathName varchar(212) , SECTION_ONTOLOGY varchar(1) , SectionBaseCode varchar(46) ); DROP TABLE naaccr_tblItem ; CREATE TABLE naaccr_tblItem ( ItemIDinteger not null primary key, ItemNbr integer, ItemName varchar(512), AllowValue varchar(204), ItemPathNamevarchar(212), FieldLength integer, SectionID integer, ItemFormatvarchar(125), Item_Ontology varchar(1) , ItemBaseCode varchar(212) ); DROP TABLE naaccr_tblCode ; CREATE TABLE naaccr_tblCode ( CodeID integer not null primary key, ItemID integer, CodeNbr varchar(212), CodePathvarchar(212), CodeDcrpvarchar(198) ); /* To see what NAACCR items are included in our extract, run: */ create or replace view vw_i2b2_naaccr_active_items as (select s.section_id, s.section, i.itemid, i.itemnbr, i.itemname, i.itempathname, i.ITEMFORMAT from naaccr_tblitem i join naaccr_tblsection s on i.SECTIONID = s.SECTION_ID where i.ITEM_ONTOLOGY = 'Y' AND s.SECTION_ONTOLOGY = 'Y') ; After I load the NAACCR tables, I set the initial value for the ONTOLOGY fields according to the criteria in naaccr_txform.sql. So when creating my extract_eav equivalent, instead of the hard coded section selection “and ns.SectionID in ( 1 -- Cancer Identification , 2 – Demographic , etc…” I select sections and items like so: “ from naaccr_tblitem i, naaccr_tblsection s where i.sectionID = s.section_ID and s.section_id is not null and s.section_ontology = 'Y' and i.item_ontology = 'Y' and i.FIELDLENGTH is not null and i.ITEMBASECODE is not null -- another way of excluding a field/item order by s.section_id, i.ITEMNBR “ So, the main advantage is I can select a full section while excluding selected items in the section. From: Dan Connolly [mailto:dconno...@kumc.edu] Sent: Monday, March 09, 2015 9:25 AM To: Lenon Patrick; gpc-dev@listserv.kumc.edu Subject: RE: NAACCR - Free text fields Perhaps you could back up and explain your overall approach to the NAACCR ETL? What code are you using? Where can I look at it? Perhaps you've discussed this before, but I don't see any pointers to context in this message. Did you try the HERON code? If not, why not? If so, what happened when you tried? As to this specific question, it's documented on the TumorRegistryhttps://informatics.kumc.edu/work/wiki/TumorRegistry page: We reviewed the data we get by section to eliminate potentially sensitive data, including free-text; the sections with a -- below are not loaded into HERON: followed the relevant code excerpt from source:heron_load/naaccr_txform.sql#L67https://informatics.kumc.edu/work/browser/heron_load/naaccr_txform.sql#L67https://informatics.kumc.edu/work/export/HEAD/heron_load/naaccr_txform.sql#L67 -- Dan From: gpc-dev-boun...@listserv.kumc.edumailto:gpc-dev-boun...@listserv.kumc.edu [gpc-dev-boun...@listserv.kumc.edu] on behalf of Lenon Patrick [ple...@uwhealth.org] Sent: Monday, March 09, 2015 9:17 AM To: gpc-dev@listserv.kumc.edumailto:gpc-dev@listserv.kumc.edu Subject: NAACCR - Free text fields In trying to ensure all my Tumor Registry fact table items have corresponding concept codes in the Concept Dimension, I found NAACCR item 310 (Text-Usual Occupation) which has a format of “Free text.” As you’d expect there are no entries in the NAACCR metadata
RE: NAACCR Ontology - Accession and Seq No?
I don't know if there's an actual bug in your code, or if it's just a part of the Heron code I couldn't use as is. Your tumor_item_value is a view while I had to do mine as a table (Netezza insisted), for instance, so you may have something in the view definition that was added later. There could also be something in the definition of the NAACCR fact table (naaccr.extract_eav) that tumor_item_value draws on. At any rate, I've already excluded those fields from my ontology. Thx for the response. From: Dan Connolly [mailto:dconno...@kumc.edu] Sent: Thursday, February 05, 2015 4:30 PM To: Lenon Patrick; gpc-dev@listserv.kumc.edu Subject: RE: NAACCR Ontology - Accession and Seq No? No, the accession number should definitely not be in the observation_fact table. That would be a bad bug. I'm trying to reproduce it here... -- Dan From: gpc-dev-boun...@listserv.kumc.edumailto:gpc-dev-boun...@listserv.kumc.edu [gpc-dev-boun...@listserv.kumc.edu] on behalf of Lenon Patrick [ple...@uwhealth.org] Sent: Thursday, February 05, 2015 10:42 AM To: gpc-dev@listserv.kumc.edumailto:gpc-dev@listserv.kumc.edu Subject: NAACCR Ontology - Accession and Seq No? Hello again all, In building NAACCR concepts via my modified versions of KUMC's naaccr_txform.sql and naaccr_concepts_load.sql, I'm producing concepts and facts for each value of both Accession No. and Hospital Sequence No. This seems unnecessary, since I can't see accession number being part of a researcher's query, not to mention possible de-identification issues. So, my questions are: 1) Is this actually the correct outcome? 2) Assuming it isn't, did I miss a step where these two fields (or their section, 6 Hospital-Specific) are filtered out of the fact and concept builds? I'd expect that to be in the definition of the big flat view aka tumor_item_value. Awaiting your input eagerly. Patrick Lenon HIMC Informatics Specialist 608 890 5671 ___ Gpc-dev mailing list Gpc-dev@listserv.kumc.edu http://listserv.kumc.edu/mailman/listinfo/gpc-dev
NAACCR Ontology - Accession and Seq No?
Hello again all, In building NAACCR concepts via my modified versions of KUMC's naaccr_txform.sql and naaccr_concepts_load.sql, I'm producing concepts and facts for each value of both Accession No. and Hospital Sequence No. This seems unnecessary, since I can't see accession number being part of a researcher's query, not to mention possible de-identification issues. So, my questions are: 1) Is this actually the correct outcome? 2) Assuming it isn't, did I miss a step where these two fields (or their section, 6 Hospital-Specific) are filtered out of the fact and concept builds? I'd expect that to be in the definition of the big flat view aka tumor_item_value. Awaiting your input eagerly. Patrick Lenon HIMC Informatics Specialist 608 890 5671 ___ Gpc-dev mailing list Gpc-dev@listserv.kumc.edu http://listserv.kumc.edu/mailman/listinfo/gpc-dev
RE: NAACCR Encounter type?
Thanks, I think you answered my followup as well. VERY efficient. ;) From: Dan Connolly [mailto:dconno...@kumc.edu] Sent: Friday, January 30, 2015 9:44 AM To: Lenon Patrick; gpc-dev@listserv.kumc.edu Subject: RE: NAACCR Encounter type? We discussed encounters a bit at HackathonTwo; the only thing I remember is that I'm supposed to split ticket #155. More on that in due course. Meanwhile... It's by design (limitation) that NAACR encounters don't show up in the Encounters portion of the i2b2 tree, at least the KUMC i2b2 tree. And yes, they're unique; i.e. we don't (yet) support same-encounter queries between the tumour registry and anything else; we're considering per-patient-day encounter support that should result in a bit of integration. (for KUMC folks: #333 and nearby tickets tagged encounter-mapping) Encounter type support is weak in HERON ETL. The only encounter types you'll see from HERON ETL come from Epic, and then only for in-patient, I think. In the visit_dimension generated by HERON ETL, the encounter types are all null. I hope other (Hubert, ...) take the lead on this and produce something we can integrate into HERON ETL. -- Dan From: gpc-dev-boun...@listserv.kumc.edumailto:gpc-dev-boun...@listserv.kumc.edu [gpc-dev-boun...@listserv.kumc.edu] on behalf of Lenon Patrick [ple...@uwhealth.org] Sent: Friday, January 30, 2015 9:12 AM To: gpc-dev@listserv.kumc.edumailto:gpc-dev@listserv.kumc.edu Subject: NAACCR Encounter type? Hey all, I'm close to loading our NAACCR data into our I2B2 instance, which involves generating encounter records (per the tumor_reg_visits table referenced in KUMC's naaccr_facts_load.sql). These encounters don't line up with any existing encounters, they're all unique. My question is, should these NAACCR encounters be represented in the Encounters portion of the I2B2 tree? I checked Babel and spot-checked a few setups and didn't see any reference to NAACCR under Encounter Type, which is where I'd expect to find it. Has this been considered/discussed/rejected? Your thoughts are welcome. Patrick Lenon HIMC Informatics Specialist 608 890 5671 ___ Gpc-dev mailing list Gpc-dev@listserv.kumc.edu http://listserv.kumc.edu/mailman/listinfo/gpc-dev
RE: HackathonTwo follow-up action items
Regarding the #44 Portable Heron notes, one situation I’ve dealt with is references to a table which I do not have. From naaccr_concepts_load.sql, for instance: (select * from BlueHeronData.source_master@deid where source_cd like 'tumor_registry@%') I don’t have a “source_master” table, and can’t begin to guess what “*” might comprise. (OK, I know one field is called “source_cd”.) In other cases I’ve been able to infer what’s missing, or what our local name for a given table is, but this one stopped me cold. For a database script to be portable, there need to be definitions somewhere of the tables and fields referenced. This could be in the form of a data dictionary, or it could be simply output of data definition language (DDL) from the tables referenced. Oftimes the first script run does all the DDL up front, with data manipulation language (DML) in a separate script that can be run iteratively. Apologies if I missed something obvious. Peril of being the “new kid”… p.s. If I did miss it,kindly point it out using small words and simple gestures. ;) From: gpc-dev-boun...@listserv.kumc.edu [mailto:gpc-dev-boun...@listserv.kumc.edu] On Behalf Of Dan Connolly Sent: Tuesday, January 27, 2015 9:46 AM To: gpc-dev@listserv.kumc.edu Subject: HackathonTwo follow-up action items I'm going over the meeting noteshttps://informatics.gpcnetwork.org/trac/Project/wiki/HackathonTwo#record (thanks, Laurel!) looking especially for follow-up items. I added comments that some of them aren't quite clear (as well as presentation materials to find, etc.) This is a summary of follow-up action items that I found so far: Names in section headings indicate follow-up item ownership. Day 1 - Thu, Jan 22https://docs.google.com/document/d/13dA_ml1GSIhZ7fs-fWle5dPU95UlFQEoS6HJezSgKyQ/edit#heading=h.w3yrm6qir16q Introductions, Opening Remarks - GPC Phase 2 LOIhttps://docs.google.com/document/d/13dA_ml1GSIhZ7fs-fWle5dPU95UlFQEoS6HJezSgKyQ/edit#heading=h.46n62mizfpxm AM Session 1 - GPC CDM ETL (Campbell, Graham?)https://docs.google.com/document/d/13dA_ml1GSIhZ7fs-fWle5dPU95UlFQEoS6HJezSgKyQ/edit#heading=h.1ahl9lh7w5ei AM Session 1 - Heron Code Sharinghttps://docs.google.com/document/d/13dA_ml1GSIhZ7fs-fWle5dPU95UlFQEoS6HJezSgKyQ/edit#heading=h.zdilnib5ekxc identified i2b2 (Mosa@MU)https://docs.google.com/document/d/13dA_ml1GSIhZ7fs-fWle5dPU95UlFQEoS6HJezSgKyQ/edit#heading=h.523suzwa3zeu AM Session 2 - Breast Cancer Survey Finder FIle (Kowalski, UMN, UIOWA, MCRF, UTSW, McMahon)https://docs.google.com/document/d/13dA_ml1GSIhZ7fs-fWle5dPU95UlFQEoS6HJezSgKyQ/edit#heading=h.nf8258clqxww AM Session 2 - Obesity, BMI percentile (?)https://docs.google.com/document/d/13dA_ml1GSIhZ7fs-fWle5dPU95UlFQEoS6HJezSgKyQ/edit#heading=h.okvaig1ygap4 AM Session 2 - Terminology Mapping Strategieshttps://docs.google.com/document/d/13dA_ml1GSIhZ7fs-fWle5dPU95UlFQEoS6HJezSgKyQ/edit#heading=h.c9scujdqyl8e PM Session 1 - Terminologies (Reeder, Campbell, Connolly)https://docs.google.com/document/d/13dA_ml1GSIhZ7fs-fWle5dPU95UlFQEoS6HJezSgKyQ/edit#heading=h.bl4ohqy5t9xr PM Session 2 - Data Quality (?)https://docs.google.com/document/d/13dA_ml1GSIhZ7fs-fWle5dPU95UlFQEoS6HJezSgKyQ/edit#heading=h.v6pigqrdquys PM Session 2 - Encountershttps://docs.google.com/document/d/13dA_ml1GSIhZ7fs-fWle5dPU95UlFQEoS6HJezSgKyQ/edit#heading=h.qhq7fmqwj9xi PM Session 2 - Text Deidentification (Jacquie @ MCRF)https://docs.google.com/document/d/13dA_ml1GSIhZ7fs-fWle5dPU95UlFQEoS6HJezSgKyQ/edit#heading=h.mhfn3n5l9hx DAY 2https://docs.google.com/document/d/13dA_ml1GSIhZ7fs-fWle5dPU95UlFQEoS6HJezSgKyQ/edit#heading=h.4te681ghpj2u AM Session 1 - Usable LOINC Lab Hierarchy - (Apathy)https://docs.google.com/document/d/13dA_ml1GSIhZ7fs-fWle5dPU95UlFQEoS6HJezSgKyQ/edit#heading=h.5yzy9wlfcb19 AM Session 1 - NLP/Text Notes Code Sharinghttps://docs.google.com/document/d/13dA_ml1GSIhZ7fs-fWle5dPU95UlFQEoS6HJezSgKyQ/edit#heading=h.ydcola9j7ok AM Session 2 - Federated login (Mish)https://docs.google.com/document/d/13dA_ml1GSIhZ7fs-fWle5dPU95UlFQEoS6HJezSgKyQ/edit#heading=h.wnkejbhu1tmr AM Session 2 - Building Analytic Datasetshttps://docs.google.com/document/d/13dA_ml1GSIhZ7fs-fWle5dPU95UlFQEoS6HJezSgKyQ/edit#heading=h.u7ngvpz55k2i Using heron_extract to reshape data for use in REDCaphttps://docs.google.com/document/d/13dA_ml1GSIhZ7fs-fWle5dPU95UlFQEoS6HJezSgKyQ/edit#heading=h.nvkllaqj6uky Analyzable Datahttps://docs.google.com/document/d/13dA_ml1GSIhZ7fs-fWle5dPU95UlFQEoS6HJezSgKyQ/edit#heading=h.tfawvpu2vbot Data Analyzer User Interfacehttps://docs.google.com/document/d/13dA_ml1GSIhZ7fs-fWle5dPU95UlFQEoS6HJezSgKyQ/edit#heading=h.dpnvwd4jzxgm PM Session 1 - EMR Integrationhttps://docs.google.com/document/d/13dA_ml1GSIhZ7fs-fWle5dPU95UlFQEoS6HJezSgKyQ/edit#heading=h.xy7jgds0qjnw PM Session 2https://docs.google.com/document/d/13dA_ml1GSIhZ7fs-fWle5dPU95UlFQEoS6HJezSgKyQ/edit#heading=h.q0osz6mvtv3c -- Dan
RE: code sharing an HERON ETL documentation (was: HackathonTwo follow-up action items)
I don’t see a source_master table here, so I’ll have to check that with Keith when he gets back. I was able to finesse it for the short term, anyway. I did succeed in finding and replicating the “normal_concept” view also referenced in naaccr_concepts_load. Thanks for the feedback. I’ll try to spend a bit more time on multisitedev. I should mention that the link for “internal onboarding/training notes” doesn’t work. I think we discussed this earlier. https://bmi-work.kumc.edu/work/ticket/1247#comment:16 From: Dan Connolly [mailto:dconno...@kumc.edu] Sent: Thursday, January 29, 2015 2:24 PM To: Lenon Patrick; gpc-dev@listserv.kumc.edu Subject: RE: code sharing an HERON ETL documentation (was: HackathonTwo follow-up action items) I don't doubt that there are lots of undocumented aspects of HERON ETL, but the issue here seems more like the fact that 20,000 lines of code is a lot to get your head around, especially when the number of contributors has been in the single digits for a long time, so we haven't put a lot of emphasis on the new developer experience. MultiSiteDevhttps://informatics.gpcnetwork.org/trac/Project/wiki/MultiSiteDev is an attempt to address this issue. I wouldn't mind devoting more meeting time to this sort of thing. In particular: source_master is part of i2b2: edu.harvard.i2b2.data grep -ri source_master . ./Release_1-7/NewInstall/Crcdata/scripts/crc_create_uploader_oracle.sql:-- Table: SOURCE_MASTER ./Release_1-7/NewInstall/Crcdata/scripts/crc_create_uploader_oracle.sql:CREATE TABLE SOURCE_MASTER ( ... So I'd be surprised if you really don't have such a table. The fact that HERON ETL relies on such tables is documented in the module header of heron_create.pyhttps://informatics.kumc.edu/work/browser/heron_load/heron_create.py: Database initialization scripts from i2b2 sources are used for this process:: options = _option # un-hide for testing options.i2b2_source 'mock_i2b2_source' The relevant task* is create_deid_schema starting on line 193https://informatics.kumc.edu/work/browser/heron_load/heron_create.py#L193: @task def create_deid_schemas(options): '''Create schemas for de-identified datamart. Note Well: Any existing schema is destroyed. .. todo:: consider checking that there's no valuable data It calls _create_datamart, where we see crc_create_uploader_oracle.sql mentioned by name. * Paver tasks and dependencieshttps://informatics.gpcnetwork.org/trac/Project/wiki/MultiSiteDev#Tasksanddependencies are discussed in MultiSiteDevhttps://informatics.gpcnetwork.org/trac/Project/wiki/MultiSiteDev. -- Dan From: Lenon Patrick [ple...@uwhealth.org] Sent: Thursday, January 29, 2015 1:44 PM To: Dan Connolly; gpc-dev@listserv.kumc.edumailto:gpc-dev@listserv.kumc.edu Subject: RE: HackathonTwo follow-up action items Regarding the #44 Portable Heron notes, one situation I’ve dealt with is references to a table which I do not have. From naaccr_concepts_load.sql, for instance: (select * from BlueHeronData.source_master@deidmailto:BlueHeronData.source_master@deid where source_cd like 'tumor_registry@%') I don’t have a “source_master” table, and can’t begin to guess what “*” might comprise. (OK, I know one field is called “source_cd”.) In other cases I’ve been able to infer what’s missing, or what our local name for a given table is, but this one stopped me cold. For a database script to be portable, there need to be definitions somewhere of the tables and fields referenced. This could be in the form of a data dictionary, or it could be simply output of data definition language (DDL) from the tables referenced. Oftimes the first script run does all the DDL up front, with data manipulation language (DML) in a separate script that can be run iteratively. Apologies if I missed something obvious. Peril of being the “new kid”… p.s. If I did miss it,kindly point it out using small words and simple gestures. ;) From: gpc-dev-boun...@listserv.kumc.edumailto:gpc-dev-boun...@listserv.kumc.edu [mailto:gpc-dev-boun...@listserv.kumc.edu] On Behalf Of Dan Connolly Sent: Tuesday, January 27, 2015 9:46 AM To: gpc-dev@listserv.kumc.edumailto:gpc-dev@listserv.kumc.edu Subject: HackathonTwo follow-up action items I'm going over the meeting noteshttps://informatics.gpcnetwork.org/trac/Project/wiki/HackathonTwo#record (thanks, Laurel!) looking especially for follow-up items. I added comments that some of them aren't quite clear (as well as presentation materials to find, etc.) This is a summary of follow-up action items that I found so far: Names in section headings indicate follow-up item ownership. Day 1 - Thu, Jan 22https://docs.google.com/document/d/13dA_ml1GSIhZ7fs-fWle5dPU95UlFQEoS6HJezSgKyQ/edit#heading=h.w3yrm6qir16q Introductions, Opening Remarks - GPC Phase 2 LOIhttps://docs.google.com/document/d/13dA_ml1GSIhZ7fs-fWle5dPU95UlFQEoS6HJezSgKyQ/edit#heading=h.46n62mizfpxm
RE: [gpc-informatics] #44: portable HERON ETL for NAACCR
Thx again guys, very helpful, many blind alleys avoided. ;) -Original Message- From: Dan Connolly [mailto:dconno...@kumc.edu] Sent: Monday, January 05, 2015 3:28 PM To: Lenon Patrick; 'gpc-dev@listserv.kumc.edu' Cc: Nathan Graham; Mish Thomas F Subject: RE: [gpc-informatics] #44: portable HERON ETL for NAACCR 1) I don't remember any waiting period, but it was 2 years ago, so who knows. 2) yes 3) No; ICD-10 isn't relevant to NAACCR ETL. -- Dan From: Lenon Patrick [ple...@uwhealth.org] Sent: Monday, January 05, 2015 2:17 PM To: 'gpc-dev@listserv.kumc.edu'; Dan Connolly Cc: Nathan Graham; Lenon Patrick; Mish Thomas F Subject: RE: [gpc-informatics] #44: portable HERON ETL for NAACCR Thx, some additional questions: 1) I have not heard back from the WHO person, and my membership seems to be pending or something, so no downloads show as available for now. Did you have to go through some waiting period like this? (Maybe I mistakenly got in the commercial license queue instead of the research license queue?) 2) In your naaccr_concepts_load.sql, you refer to tables who.topo, who.morph2, and who.morph3. Do those correspond to the ICD-O-3_CSV-metadata.zip and ICD-O-2_CSV.zip you refer to below? 3) Did you not download ICD-10? Thanks as always for your assistance. -Original Message- From: GPC Informatics [mailto:d...@madmode.com] Sent: Monday, January 05, 2015 12:59 PM To: dconno...@kumc.edu; Lenon Patrick Cc: ngra...@kumc.edu Subject: Re: [gpc-informatics] #44: portable HERON ETL for NAACCR #44: portable HERON ETL for NAACCR --+--- Reporter: dconnolly | Owner: lenonpat Type: enhancement | Status: assigned Priority: major | Milestone: bc-survey-cohort-def Component: etl-dev | Resolution: Keywords: breast-cancer-cohort | Blocked By: Blocking: 119 | --+--- Changes (by dconnolly): * owner: dconnolly = lenonpat Comment: Patrick, I found my (30 Jan 2013) notes on downloading materials from WHO... The pointer in my notes is http://www.who.int/classifications/icd/adaptations/oncology/en/index.html (along with http://apps.who.int/classifications/apps/icd/ClassificationDownloadNR/license.htm ) and I recorded the md5sums of what I downloaded - `b088c4e4bd2d685c9dd04e3b3c14c98b` ICD-O-3_CSV-metadata.zip - `1308ce6f4ef93c67137154cc6a723fc6` ICD-O-2_CSV.zip -- Ticket URL: http://informatics.gpcnetwork.org/trac/Project/ticket/44#comment:8 gpc-informatics http://informatics.gpcnetwork.org/ Greater Plains Network - Informatics ___ Gpc-dev mailing list Gpc-dev@listserv.kumc.edu http://listserv.kumc.edu/mailman/listinfo/gpc-dev
RE: NAACCR metadata, latest go-round
Great, thanks, especially for the bit about the shortcuts. That would have been tough to suss out on my own. From: Dan Connolly [mailto:dconno...@kumc.edu] Sent: Monday, November 17, 2014 5:43 PM To: Lenon Patrick; 'Gpc-dev@listserv.kumc.edu' Subject: RE: NAACCR metadata, latest go-round 1) I remember the data being a little messy... the code only loads the sections that I was confident had no PHI (aside from the MRN, which gets deidentified later). I eyeballed all the data in one of the views. What a pain. But I haven't found any way around it. As to how the code works... I suppose you're reading it correctly. I haven't looked at it in quite a while. If you feel like contributing a design sketch to add to the top of naaccr_concepts_load.sql, I can see that it gets code reviewed and integrated. (see for example the Medication dispense facts comment in epic_meds_transform.sql). Patches welcome, as they say. 2) My recollection is that getting the WHO files is pretty painless (at least compared with, say, UMLS); anything we would set up to reduce duplicated effort would be at least as much hassle... especially since we'd have to set it up. 3) The code to build the ontology is designed to be run each time the data is loaded. (No, I don't recall discussing this in gpc-dev). 4) If you got naaccr_shortcuts.csv from version control on elephant, you can follow your nose thru renamed shortcut concepts and reworked staging concept hierarchy (#2112https://bmi-work.kumc.edu/work/ticket/2112) to KUMC ticket #2112https://informatics.kumc.edu/work/ticket/2112 to the milford releasehttps://informatics.kumc.edu/work/milestone/heron-milford-update to... darn; there should be a link to the relevant blog article: * HERON Milford simplifies the Cancer Cases folderhttps://informatics.kumc.edu/work/blog/heron-milford-update A new folder, Cancer Cases (Abridged), debuts with the Milford release. This folder contains the frequently searched concepts from the Cancer Cases folder. Advanced searchers can still search the entire tumor registry in the unabridged folder. Oops; that should be in the list of blog articles on TumorRegistryhttps://informatics.kumc.edu/work/wiki/TumorRegistry. -- Dan From: gpc-dev-boun...@listserv.kumc.edumailto:gpc-dev-boun...@listserv.kumc.edu [gpc-dev-boun...@listserv.kumc.edu] on behalf of Lenon Patrick [ple...@uwhealth.org] Sent: Monday, November 17, 2014 4:32 PM To: 'Gpc-dev@listserv.kumc.edu' Subject: NAACCR metadata, latest go-round Hi all, hoping some of you are in a good state of mind to share your experience, brilliance, shattering good looks... enough flattery? Anyway, my first attempt at building a NAACCR ontology was disappointing in that the NAACCR tables were not as helpful as I expected. Fundamental problem: A lot of junk in the column where the code value normally resides. So in that column I have found (besides codes): * Code ranges (e.g. 1-100, 110-12) * References to outside sources (WHO in particular) * The word BLANK, presumably to indicate the field is optional? * .. * * * Comments, sometimes with HTML markups Looking at the heron naaccr_txform and naaccr_concepts_load scripts, the ultimate NAACCR ontology consists of all unique base/concept codes found in the imported NAACCR data file (now in table tumor_reg_codes) merged with some fields from either the NAACCR code table (naaccr.t_code) or one of the external tables (e.g., WHO.TOPO). So, questions arise: 1) First, is the above description reasonably accurate? Please point out any glaring errors. I did leave out some detail like special cases. 2) I looked at WHO's site, and began applying for access to their tables as listed in heron\heron_staging\tumor_reg\icd_o_meta.py . However, before I continue, will every site have to do this individually? Tom Mish is of the opinion YES. But, does anyone have a legal/kosher/ethical way for us to not duplicate this effort? 3) If I'm correct that only codes that appear in the Registry data will be loaded into the ontology, well, is that OK? This fixes the problems with code ranges and non-code values, and the ontology tree is effectively pre-trimmed. But what are the implications for future loads of new data? Has this discussion already happened? 4) Bonus question: What the heck is naaccr_shortcuts.csv? It looks very useful, I have no idea what for. Thanks in advance for any input all of y'all provide. Patrick Lenon HIMC Informatics Specialist 608 890 5671 ___ Gpc-dev mailing list Gpc-dev@listserv.kumc.edu http://listserv.kumc.edu/mailman/listinfo/gpc-dev
NAACCR metadata, latest go-round
Hi all, hoping some of you are in a good state of mind to share your experience, brilliance, shattering good looks... enough flattery? Anyway, my first attempt at building a NAACCR ontology was disappointing in that the NAACCR tables were not as helpful as I expected. Fundamental problem: A lot of junk in the column where the code value normally resides. So in that column I have found (besides codes): * Code ranges (e.g. 1-100, 110-12) * References to outside sources (WHO in particular) * The word BLANK, presumably to indicate the field is optional? * .. * * * Comments, sometimes with HTML markups Looking at the heron naaccr_txform and naaccr_concepts_load scripts, the ultimate NAACCR ontology consists of all unique base/concept codes found in the imported NAACCR data file (now in table tumor_reg_codes) merged with some fields from either the NAACCR code table (naaccr.t_code) or one of the external tables (e.g., WHO.TOPO). So, questions arise: 1) First, is the above description reasonably accurate? Please point out any glaring errors. I did leave out some detail like special cases. 2) I looked at WHO's site, and began applying for access to their tables as listed in heron\heron_staging\tumor_reg\icd_o_meta.py . However, before I continue, will every site have to do this individually? Tom Mish is of the opinion YES. But, does anyone have a legal/kosher/ethical way for us to not duplicate this effort? 3) If I'm correct that only codes that appear in the Registry data will be loaded into the ontology, well, is that OK? This fixes the problems with code ranges and non-code values, and the ontology tree is effectively pre-trimmed. But what are the implications for future loads of new data? Has this discussion already happened? 4) Bonus question: What the heck is naaccr_shortcuts.csv? It looks very useful, I have no idea what for. Thanks in advance for any input all of y'all provide. Patrick Lenon HIMC Informatics Specialist 608 890 5671 ___ Gpc-dev mailing list Gpc-dev@listserv.kumc.edu http://listserv.kumc.edu/mailman/listinfo/gpc-dev
RE: Questions on KUMC's NAACCR_txform.sql
I'm back to this now and I'm still in a fog re: naaccr.extract_eav. I can't find any definition of it in the KUMC before it pops up in naaccr_txform.sql. Where could I find the DDL that defines this view? From: Munns, Michael B [mailto:mike.mu...@unmc.edu] Sent: Monday, October 20, 2014 3:55 PM To: Lenon Patrick; 'Nathan Graham'; Dan Connolly; 'Gpc-dev@listserv.kumc.edu' Cc: Mish Thomas F Subject: RE: Questions on KUMC's NAACCR_txform.sql The naaccr_extract_eav is a view that is in the script to create the extract table, naaccr_extract.sql. Michael Munns Database Analyst 402-559-3821 From: gpc-dev-boun...@listserv.kumc.edumailto:gpc-dev-boun...@listserv.kumc.edu [mailto:gpc-dev-boun...@listserv.kumc.edu] On Behalf Of Lenon Patrick Sent: Monday, October 20, 2014 1:05 PM To: 'Nathan Graham'; Dan Connolly; 'Gpc-dev@listserv.kumc.edu' Cc: Mish Thomas F Subject: RE: Questions on KUMC's NAACCR_txform.sql Thx for all the info, Nate. I'll take these to the gpc-dev list. Dan, below there are a few questions about NAACCR/HERON that I couldn't puzzle out looking at your load/transform scripts, specifically: https://informatics.kumc.edu/work/browser/heron_load/naaccr_txform.sql Would you mind looking at those when convenient? Thanks. I also have some questions about the CDM for NAACCR up on Babel, but I'll put them on a new thread. From: Nathan Graham [mailto:ngra...@kumc.edu] Sent: Monday, October 20, 2014 12:45 PM To: Lenon Patrick; Dan Connolly Cc: 'Kowalski, George'; Mish Thomas F Subject: RE: Questions on KUMC's NAACCR_txform.sql Dan, could you help with the questions below? Patrick, Dan Connolly is probably the most knowledgeable about NAACCR/HERON. Also, perhaps we could move this conversation to the gpc-dev list? Others might benefit from the conversation as well. From: Lenon Patrick [mailto:ple...@uwhealth.org] Sent: Friday, October 17, 2014 11:48 AM To: Nathan Graham Cc: 'Kowalski, George'; Mish Thomas F Subject: Questions on KUMC's NAACCR_txform.sql Hi Nathan, I have a pretty good grasp of your transform proc now but I have a few nagging questions, please reply when convenient: 1) You reference tables naaccr.t_item and naaccr.t_section. I suspect they are just normalized tables created from the NAACCR data dictionary (see attached) with a separate table for the section types. If so, did you generate the Section_ID field yourself? It's the only one I don't see in NAACCR. 2) The tumor_item_value (aka TIV) view is based on table naaccr_extract_eav. The view tumor_reg_facts then merges TIV with naaccr.extract. So, what's naaccr_extract_eav? My guess is that it's a clone of naaccr.extract that contains properly transformed and formatted fact values. If so (or if I'm even close), where does naaccr_extract_eav get loaded up and transformed? If you can help me with these couple of items I'll be way ahead on this. Thanks for leading the way. Patrick Lenon HIMC Informatics Specialist 608 890 5671 The information in this e-mail may be privileged and confidential, intended only for the use of the addressee(s) above. Any unauthorized use or disclosure of this information is prohibited. If you have received this e-mail by mistake, please delete it and immediately contact the sender. ___ Gpc-dev mailing list Gpc-dev@listserv.kumc.edu http://listserv.kumc.edu/mailman/listinfo/gpc-dev
RE: Questions on KUMC's NAACCR_txform.sql
I must have been covering up completely confused by acting content. ;) I'll talk to Tom Mish about the MultiSiteDev thing. Thanks for the assistance. From: Dan Connolly [mailto:dconno...@kumc.edu] Sent: Monday, November 03, 2014 1:24 PM To: Lenon Patrick; 'Munns, Michael B'; 'Gpc-dev@listserv.kumc.edu' Cc: Mish Thomas F Subject: RE: Questions on KUMC's NAACCR_txform.sql I was a little surprised when it looked like you were content with the earlier answers; as far as I know, you don't participate in the MultiSiteDevhttps://informatics.gpcnetwork.org/trac/Project/wiki/MultiSiteDev process, which gives you access to code such as naaccr_extract.sql. I just added attachment:naaccr_staging.ziphttps://informatics.gpcnetwork.org/trac/Project/attachment/ticket/44/naaccr_staging.zip to the portable HERON ETL for NAACCR ticket (#44https://informatics.gpcnetwork.org/trac/Project/ticket/44). It has a Makefile that you can use to download the spec and generate naaccr_extract.sql from that spec. Unfortunately, the license terms for redistributing naaccr_extract.sql itself are unclear. -- Dan From: gpc-dev-boun...@listserv.kumc.edumailto:gpc-dev-boun...@listserv.kumc.edu [gpc-dev-boun...@listserv.kumc.edu] on behalf of Lenon Patrick [ple...@uwhealth.org] Sent: Monday, November 03, 2014 11:28 AM To: 'Munns, Michael B'; 'Gpc-dev@listserv.kumc.edu' Cc: Mish Thomas F Subject: RE: Questions on KUMC's NAACCR_txform.sql I'm back to this now and I'm still in a fog re: naaccr.extract_eav. I can't find any definition of it in the KUMC before it pops up in naaccr_txform.sql. Where could I find the DDL that defines this view? From: Munns, Michael B [mailto:mike.mu...@unmc.edu] Sent: Monday, October 20, 2014 3:55 PM To: Lenon Patrick; 'Nathan Graham'; Dan Connolly; 'Gpc-dev@listserv.kumc.edu' Cc: Mish Thomas F Subject: RE: Questions on KUMC's NAACCR_txform.sql The naaccr_extract_eav is a view that is in the script to create the extract table, naaccr_extract.sql. Michael Munns Database Analyst 402-559-3821 From: gpc-dev-boun...@listserv.kumc.edumailto:gpc-dev-boun...@listserv.kumc.edu [mailto:gpc-dev-boun...@listserv.kumc.edu] On Behalf Of Lenon Patrick Sent: Monday, October 20, 2014 1:05 PM To: 'Nathan Graham'; Dan Connolly; 'Gpc-dev@listserv.kumc.edu' Cc: Mish Thomas F Subject: RE: Questions on KUMC's NAACCR_txform.sql Thx for all the info, Nate. I'll take these to the gpc-dev list. Dan, below there are a few questions about NAACCR/HERON that I couldn't puzzle out looking at your load/transform scripts, specifically: https://informatics.kumc.edu/work/browser/heron_load/naaccr_txform.sql Would you mind looking at those when convenient? Thanks. I also have some questions about the CDM for NAACCR up on Babel, but I'll put them on a new thread. From: Nathan Graham [mailto:ngra...@kumc.edu] Sent: Monday, October 20, 2014 12:45 PM To: Lenon Patrick; Dan Connolly Cc: 'Kowalski, George'; Mish Thomas F Subject: RE: Questions on KUMC's NAACCR_txform.sql Dan, could you help with the questions below? Patrick, Dan Connolly is probably the most knowledgeable about NAACCR/HERON. Also, perhaps we could move this conversation to the gpc-dev list? Others might benefit from the conversation as well. From: Lenon Patrick [mailto:ple...@uwhealth.org] Sent: Friday, October 17, 2014 11:48 AM To: Nathan Graham Cc: 'Kowalski, George'; Mish Thomas F Subject: Questions on KUMC's NAACCR_txform.sql Hi Nathan, I have a pretty good grasp of your transform proc now but I have a few nagging questions, please reply when convenient: 1) You reference tables naaccr.t_item and naaccr.t_section. I suspect they are just normalized tables created from the NAACCR data dictionary (see attached) with a separate table for the section types. If so, did you generate the Section_ID field yourself? It's the only one I don't see in NAACCR. 2) The tumor_item_value (aka TIV) view is based on table naaccr_extract_eav. The view tumor_reg_facts then merges TIV with naaccr.extract. So, what's naaccr_extract_eav? My guess is that it's a clone of naaccr.extract that contains properly transformed and formatted fact values. If so (or if I'm even close), where does naaccr_extract_eav get loaded up and transformed? If you can help me with these couple of items I'll be way ahead on this. Thanks for leading the way. Patrick Lenon HIMC Informatics Specialist 608 890 5671 The information in this e-mail may be privileged and confidential, intended only for the use of the addressee(s) above. Any unauthorized use or disclosure of this information is prohibited. If you have received this e-mail by mistake, please delete it and immediately contact the sender. ___ Gpc-dev mailing list Gpc-dev@listserv.kumc.edu http://listserv.kumc.edu/mailman/listinfo/gpc-dev
RE: Questions on KUMC's NAACCR_txform.sql
Excellent, thanks. From: Munns, Michael B [mailto:mike.mu...@unmc.edu] Sent: Monday, October 20, 2014 3:55 PM To: Lenon Patrick; 'Nathan Graham'; Dan Connolly; 'Gpc-dev@listserv.kumc.edu' Cc: Mish Thomas F Subject: RE: Questions on KUMC's NAACCR_txform.sql The naaccr_extract_eav is a view that is in the script to create the extract table, naaccr_extract.sql. Michael Munns Database Analyst 402-559-3821 From: gpc-dev-boun...@listserv.kumc.edumailto:gpc-dev-boun...@listserv.kumc.edu [mailto:gpc-dev-boun...@listserv.kumc.edu] On Behalf Of Lenon Patrick Sent: Monday, October 20, 2014 1:05 PM To: 'Nathan Graham'; Dan Connolly; 'Gpc-dev@listserv.kumc.edu' Cc: Mish Thomas F Subject: RE: Questions on KUMC's NAACCR_txform.sql Thx for all the info, Nate. I'll take these to the gpc-dev list. Dan, below there are a few questions about NAACCR/HERON that I couldn't puzzle out looking at your load/transform scripts, specifically: https://informatics.kumc.edu/work/browser/heron_load/naaccr_txform.sql Would you mind looking at those when convenient? Thanks. I also have some questions about the CDM for NAACCR up on Babel, but I'll put them on a new thread. From: Nathan Graham [mailto:ngra...@kumc.edu] Sent: Monday, October 20, 2014 12:45 PM To: Lenon Patrick; Dan Connolly Cc: 'Kowalski, George'; Mish Thomas F Subject: RE: Questions on KUMC's NAACCR_txform.sql Dan, could you help with the questions below? Patrick, Dan Connolly is probably the most knowledgeable about NAACCR/HERON. Also, perhaps we could move this conversation to the gpc-dev list? Others might benefit from the conversation as well. From: Lenon Patrick [mailto:ple...@uwhealth.org] Sent: Friday, October 17, 2014 11:48 AM To: Nathan Graham Cc: 'Kowalski, George'; Mish Thomas F Subject: Questions on KUMC's NAACCR_txform.sql Hi Nathan, I have a pretty good grasp of your transform proc now but I have a few nagging questions, please reply when convenient: 1) You reference tables naaccr.t_item and naaccr.t_section. I suspect they are just normalized tables created from the NAACCR data dictionary (see attached) with a separate table for the section types. If so, did you generate the Section_ID field yourself? It's the only one I don't see in NAACCR. 2) The tumor_item_value (aka TIV) view is based on table naaccr_extract_eav. The view tumor_reg_facts then merges TIV with naaccr.extract. So, what's naaccr_extract_eav? My guess is that it's a clone of naaccr.extract that contains properly transformed and formatted fact values. If so (or if I'm even close), where does naaccr_extract_eav get loaded up and transformed? If you can help me with these couple of items I'll be way ahead on this. Thanks for leading the way. Patrick Lenon HIMC Informatics Specialist 608 890 5671 The information in this e-mail may be privileged and confidential, intended only for the use of the addressee(s) above. Any unauthorized use or disclosure of this information is prohibited. If you have received this e-mail by mistake, please delete it and immediately contact the sender. ___ Gpc-dev mailing list Gpc-dev@listserv.kumc.edu http://listserv.kumc.edu/mailman/listinfo/gpc-dev