OK, I got the answer to my question (“Do we exclude free text fields?”), which is “Yes.” Thx.
Now, your questions: Our basic ETL to get UW NAACCR data into the database is in Informatica, and mostly resolves differences between UW field names and official NAACCR field names. I also do the de-identification step in Informatica. From then on I largely follow the naaccr_txform script. I used your code to include/exclude NAACCR sections and items based on numerous criteria, but I don’t believe I see item format of “free text” as one of the criteria. So my fields included 310 and 320, which I can easily fix. I did vary somewhat in an attempt to build in some flexibility. Results are essentially the same as your system, so this may be largely academic, but you asked, so: Where I varied was an attempt to set up my NAACCR metadata in such a way that I wouldn’t have to hard code which NAACCR sections and items were included in the EAV. To do so I added a logical “ONTOLOGY” field to the section and item tables like so (apologies for formatting): DROP TABLE naaccr_tblSection ; CREATE TABLE naaccr_tblSection ( Section_ID integer not null primary key , Section varchar(46) , SectionPathName varchar(212) , SECTION_ONTOLOGY varchar(1) , SectionBaseCode varchar(46) ); DROP TABLE naaccr_tblItem ; CREATE TABLE naaccr_tblItem ( ItemID integer not null primary key, ItemNbr integer, ItemName varchar(512), AllowValue varchar(204), ItemPathName varchar(212), FieldLength integer, SectionID integer, ItemFormat varchar(125), Item_Ontology varchar(1) , ItemBaseCode varchar(212) ); DROP TABLE naaccr_tblCode ; CREATE TABLE naaccr_tblCode ( CodeID integer not null primary key, ItemID integer, CodeNbr varchar(212), CodePath varchar(212), CodeDcrp varchar(198) ); /* To see what NAACCR items are included in our extract, run: */ create or replace view vw_i2b2_naaccr_active_items as (select s.section_id, s.section, i.itemid, i.itemnbr, i.itemname, i.itempathname, i.ITEMFORMAT from naaccr_tblitem i join naaccr_tblsection s on i.SECTIONID = s.SECTION_ID where i.ITEM_ONTOLOGY = 'Y' AND s.SECTION_ONTOLOGY = 'Y') ; After I load the NAACCR tables, I set the initial value for the ONTOLOGY fields according to the criteria in naaccr_txform.sql. So when creating my extract_eav equivalent, instead of the hard coded section selection “and ns.SectionID in ( 1 -- Cancer Identification , 2 – Demographic , etc…” I select sections and items like so: “ from naaccr_tblitem i, naaccr_tblsection s where i.sectionID = s.section_ID and s.section_id is not null and s.section_ontology = 'Y' and i.item_ontology = 'Y' and i.FIELDLENGTH is not null and i.ITEMBASECODE is not null -- another way of excluding a field/item order by s.section_id, i.ITEMNBR “ So, the main advantage is I can select a full section while excluding selected items in the section. From: Dan Connolly [mailto:dconno...@kumc.edu] Sent: Monday, March 09, 2015 9:25 AM To: Lenon Patrick; gpc-dev@listserv.kumc.edu Subject: RE: NAACCR - Free text fields Perhaps you could back up and explain your overall approach to the NAACCR ETL? What code are you using? Where can I look at it? Perhaps you've discussed this before, but I don't see any pointers to context in this message. Did you try the HERON code? If not, why not? If so, what happened when you tried? As to this specific question, it's documented on the TumorRegistry<https://informatics.kumc.edu/work/wiki/TumorRegistry> page: We reviewed the data we get by section to eliminate potentially sensitive data, including free-text; the sections with a -- below are not loaded into HERON: followed the relevant code excerpt from source:heron_load/naaccr_txform.sql#L67<https://informatics.kumc.edu/work/browser/heron_load/naaccr_txform.sql#L67><https://informatics.kumc.edu/work/export/HEAD/heron_load/naaccr_txform.sql#L67> -- Dan ________________________________ From: gpc-dev-boun...@listserv.kumc.edu<mailto:gpc-dev-boun...@listserv.kumc.edu> [gpc-dev-boun...@listserv.kumc.edu] on behalf of Lenon Patrick [ple...@uwhealth.org] Sent: Monday, March 09, 2015 9:17 AM To: gpc-dev@listserv.kumc.edu<mailto:gpc-dev@listserv.kumc.edu> Subject: NAACCR - Free text fields In trying to ensure all my Tumor Registry fact table items have corresponding concept codes in the Concept Dimension, I found NAACCR item 310 (Text-Usual Occupation) which has a format of “Free text.” As you’d expect there are no entries in the NAACCR metadata for that. However, following the Heron fact load code, I created a whole bunch of facts with concept codes like “NAACCR|310:(n)TH GRADE TEACHER – (small Wisconsin town) SCHOOL DISTRICT” To my semi-trained eye this looks like it would be pretty useless to I2B2. So I’m wondering what other sites do in similar situations. Possibilities that have occurred to me already are: 1) Exclude all “free text” format fields from the fact load. 2) Leave them in, hoping for codification someday Is there any reason NOT to exclude free text fields? Or some criteria to include some and exclude others? Patrick Lenon HIMC Informatics Specialist 608 890 5671
_______________________________________________ Gpc-dev mailing list Gpc-dev@listserv.kumc.edu http://listserv.kumc.edu/mailman/listinfo/gpc-dev