OK, I got the answer to my question (“Do we exclude free text fields?”), which 
is “Yes.”  Thx.

Now, your questions:  Our basic ETL to get UW NAACCR data into the database is 
in Informatica, and mostly resolves differences between UW field names and 
official NAACCR field names.  I also do the de-identification step in 
Informatica.  From then on I largely follow the naaccr_txform script.  I used 
your code to include/exclude NAACCR sections and items based on numerous 
criteria, but I don’t believe I see item format of “free text” as one of the 
criteria.  So my fields included 310 and 320, which I can easily fix.

I did vary somewhat in an attempt to build in some flexibility.  Results are 
essentially the same as your system, so this may be largely academic, but you 
asked, so:

Where I varied was an attempt to set up my NAACCR metadata in such a way that I 
wouldn’t have to hard code which NAACCR sections and items were included in the 
EAV.   To do so I added a logical “ONTOLOGY” field to the section and item 
tables like so (apologies for formatting):

DROP TABLE naaccr_tblSection ;
CREATE TABLE naaccr_tblSection
(
    Section_ID integer not null primary key
    , Section    varchar(46)
                , SectionPathName         varchar(212)
                , SECTION_ONTOLOGY varchar(1)
                , SectionBaseCode varchar(46)
);


DROP TABLE naaccr_tblItem ;
CREATE TABLE naaccr_tblItem
(
    ItemID                    integer not null primary key,
    ItemNbr                   integer,
    ItemName                  varchar(512),
   AllowValue                                                        
varchar(204),
   ItemPathName                                    varchar(212),
    FieldLength               integer,
    SectionID                 integer,
ItemFormat                    varchar(125),
                Item_Ontology             varchar(1)
                , ItemBaseCode                                                
varchar(212)
);


DROP TABLE naaccr_tblCode ;
CREATE TABLE naaccr_tblCode
(
    CodeID      integer not null primary key,
    ItemID      integer,
                CodeNbr                              varchar(212),
                CodePath            varchar(212),
    CodeDcrp    varchar(198)
);


  /* To see what NAACCR items are included in our extract, run:  */
  create or replace view vw_i2b2_naaccr_active_items as
(select s.section_id, s.section, i.itemid, i.itemnbr, i.itemname, 
i.itempathname, i.ITEMFORMAT
from naaccr_tblitem i
  join naaccr_tblsection s
  on i.SECTIONID = s.SECTION_ID
  where i.ITEM_ONTOLOGY = 'Y'
  AND s.SECTION_ONTOLOGY = 'Y') ;

After I load the NAACCR tables, I set the initial value for the ONTOLOGY fields 
according to the criteria in naaccr_txform.sql.

So when creating my extract_eav equivalent, instead of the hard coded section 
selection

“and ns.SectionID in (
  1 -- Cancer Identification
, 2 – Demographic
, etc…”

I select sections and items like so:

“ from
 naaccr_tblitem i,
 naaccr_tblsection s
where i.sectionID = s.section_ID
  and s.section_id is not null
  and s.section_ontology = 'Y'
  and i.item_ontology = 'Y'
  and i.FIELDLENGTH is not null
  and i.ITEMBASECODE is not null   -- another way of excluding a field/item
  order by s.section_id, i.ITEMNBR “

So, the main advantage is I can select a full section while excluding selected 
items in the section.


From: Dan Connolly [mailto:dconno...@kumc.edu]
Sent: Monday, March 09, 2015 9:25 AM
To: Lenon Patrick; gpc-dev@listserv.kumc.edu
Subject: RE: NAACCR - Free text fields

Perhaps you could back up and explain your overall approach to the NAACCR ETL? 
What code are you using? Where can I look at it? Perhaps you've discussed this 
before, but I don't see any pointers to context in this message. Did you try 
the HERON code? If not, why not? If so, what happened when you tried?

As to this specific question, it's documented on the 
TumorRegistry<https://informatics.kumc.edu/work/wiki/TumorRegistry> page:
We reviewed the data we get by section to eliminate potentially sensitive data, 
including free-text; the sections with a -- below are not loaded into HERON:

followed the relevant code excerpt from 
source:heron_load/naaccr_txform.sql#L67<https://informatics.kumc.edu/work/browser/heron_load/naaccr_txform.sql#L67>​<https://informatics.kumc.edu/work/export/HEAD/heron_load/naaccr_txform.sql#L67>

--
Dan
________________________________
From: 
gpc-dev-boun...@listserv.kumc.edu<mailto:gpc-dev-boun...@listserv.kumc.edu> 
[gpc-dev-boun...@listserv.kumc.edu] on behalf of Lenon Patrick 
[ple...@uwhealth.org]
Sent: Monday, March 09, 2015 9:17 AM
To: gpc-dev@listserv.kumc.edu<mailto:gpc-dev@listserv.kumc.edu>
Subject: NAACCR - Free text fields
In trying to ensure all my Tumor Registry fact table items have corresponding 
concept codes in the Concept Dimension, I found NAACCR item 310 (Text-Usual 
Occupation) which has a format of “Free text.”  As you’d expect there are no 
entries in the NAACCR metadata for that.  However, following the Heron fact 
load code, I created a whole bunch of facts with concept codes like 
“NAACCR|310:(n)TH GRADE TEACHER – (small Wisconsin town) SCHOOL DISTRICT”

To my semi-trained eye this looks like it would be pretty useless to I2B2.  So 
I’m wondering what other sites do in similar situations.  Possibilities that 
have occurred to me already are:

1)      Exclude all “free text” format fields from the fact load.

2)      Leave them in, hoping for codification someday

Is there any reason NOT to exclude free text fields?  Or some criteria to 
include some and exclude others?


Patrick Lenon
HIMC Informatics Specialist
608 890 5671

_______________________________________________
Gpc-dev mailing list
Gpc-dev@listserv.kumc.edu
http://listserv.kumc.edu/mailman/listinfo/gpc-dev

Reply via email to