Duplicates or synonyms in NAACCR ICD_O_MORPH?

2015-05-04 Thread Lenon Patrick
In QA-ing our NAACCR data, we found some apparent duplicates in the NAACCR 
ontology as produced by the heron code.  Duplicates being defined as two 
metadata records with the same c_fullname (not synonyms).

These appear to be caused by minor differences in spelling and punctuation in 
ICD_O_MORPH, like tumor vs. tumour.

For example, this query:

SELECT *
  FROM I2B2_DEV_ETL..ICD_O_MORPH icdo
  where icdo.CONCEPT_NAME like '%8010%'
  order by concept_cd;

yields records with concept_name of '8010/0 Epithelial tumour, benign' and 
'8010/0 Epithelial tumor, benign' with the only difference being the English 
spelling of tumour.

There are other minor differences like
'8010/6 Carcinoma, metastatic NOS'
'8010/6 Carcinoma, metastatic, NOS'  /* extra comma */

This in turn was caused by slight differences between MORPH2 and MORPH3, aka 
ICD-0-2 and ICD-0-3.

So what if anything did you folks do with this?  Essentially they're synonyms 
(if I understand I2B2 synonyms correctly).  Are they useful as such?  Or did 
you just wind up nuking the extras on more or less random criteria (like 
getting rid of all the tumour entries or some such)?




Patrick Lenon
HIMC Informatics Specialist
608 890 5671

___
Gpc-dev mailing list
Gpc-dev@listserv.kumc.edu
http://listserv.kumc.edu/mailman/listinfo/gpc-dev


NAACCR 0521 Morphology - extra values

2015-03-10 Thread Lenon Patrick
On checking fact table concepts vs. concept dimension for UW NAACCR, we have 
~200 fact records (out of 60,000) whose concepts don't match up.  In other 
words, their 521-Morphology code value (actually 522 - Histology plus 
523-Behavior) doesn't match any value in the ICDO_O_MORPH table set up in the 
naaccr_concepts_load.sql script.  Examples are 98153, 80412, 80722.  The most 
likely explanation is data errors of some kind.

So first, have others experienced this sort of thing with field 521?  Are there 
other possible explanations?

Second, assuming this data is defective, is there a protocol or principle to 
follow with such records?  i.e., gently correct them vs. nuke them on sight?


Patrick Lenon
HIMC Informatics Specialist
608 890 5671

___
Gpc-dev mailing list
Gpc-dev@listserv.kumc.edu
http://listserv.kumc.edu/mailman/listinfo/gpc-dev


NAACCR - Free text fields

2015-03-09 Thread Lenon Patrick
In trying to ensure all my Tumor Registry fact table items have corresponding 
concept codes in the Concept Dimension, I found NAACCR item 310 (Text-Usual 
Occupation) which has a format of Free text.  As you'd expect there are no 
entries in the NAACCR metadata for that.  However, following the Heron fact 
load code, I created a whole bunch of facts with concept codes like 
NAACCR|310:(n)TH GRADE TEACHER - (small Wisconsin town) SCHOOL DISTRICT

To my semi-trained eye this looks like it would be pretty useless to I2B2.  So 
I'm wondering what other sites do in similar situations.  Possibilities that 
have occurred to me already are:

1)  Exclude all free text format fields from the fact load.

2)  Leave them in, hoping for codification someday

Is there any reason NOT to exclude free text fields?  Or some criteria to 
include some and exclude others?


Patrick Lenon
HIMC Informatics Specialist
608 890 5671

___
Gpc-dev mailing list
Gpc-dev@listserv.kumc.edu
http://listserv.kumc.edu/mailman/listinfo/gpc-dev


RE: NAACCR - Free text fields

2015-03-09 Thread Lenon Patrick
OK, I got the answer to my question (“Do we exclude free text fields?”), which 
is “Yes.”  Thx.

Now, your questions:  Our basic ETL to get UW NAACCR data into the database is 
in Informatica, and mostly resolves differences between UW field names and 
official NAACCR field names.  I also do the de-identification step in 
Informatica.  From then on I largely follow the naaccr_txform script.  I used 
your code to include/exclude NAACCR sections and items based on numerous 
criteria, but I don’t believe I see item format of “free text” as one of the 
criteria.  So my fields included 310 and 320, which I can easily fix.

I did vary somewhat in an attempt to build in some flexibility.  Results are 
essentially the same as your system, so this may be largely academic, but you 
asked, so:

Where I varied was an attempt to set up my NAACCR metadata in such a way that I 
wouldn’t have to hard code which NAACCR sections and items were included in the 
EAV.   To do so I added a logical “ONTOLOGY” field to the section and item 
tables like so (apologies for formatting):

DROP TABLE naaccr_tblSection ;
CREATE TABLE naaccr_tblSection
(
Section_ID integer not null primary key
, Sectionvarchar(46)
, SectionPathName varchar(212)
, SECTION_ONTOLOGY varchar(1)
, SectionBaseCode varchar(46)
);


DROP TABLE naaccr_tblItem ;
CREATE TABLE naaccr_tblItem
(
ItemIDinteger not null primary key,
ItemNbr   integer,
ItemName  varchar(512),
   AllowValue
varchar(204),
   ItemPathNamevarchar(212),
FieldLength   integer,
SectionID integer,
ItemFormatvarchar(125),
Item_Ontology varchar(1)
, ItemBaseCode
varchar(212)
);


DROP TABLE naaccr_tblCode ;
CREATE TABLE naaccr_tblCode
(
CodeID  integer not null primary key,
ItemID  integer,
CodeNbr  varchar(212),
CodePathvarchar(212),
CodeDcrpvarchar(198)
);


  /* To see what NAACCR items are included in our extract, run:  */
  create or replace view vw_i2b2_naaccr_active_items as
(select s.section_id, s.section, i.itemid, i.itemnbr, i.itemname, 
i.itempathname, i.ITEMFORMAT
from naaccr_tblitem i
  join naaccr_tblsection s
  on i.SECTIONID = s.SECTION_ID
  where i.ITEM_ONTOLOGY = 'Y'
  AND s.SECTION_ONTOLOGY = 'Y') ;

After I load the NAACCR tables, I set the initial value for the ONTOLOGY fields 
according to the criteria in naaccr_txform.sql.

So when creating my extract_eav equivalent, instead of the hard coded section 
selection

“and ns.SectionID in (
  1 -- Cancer Identification
, 2 – Demographic
, etc…”

I select sections and items like so:

“ from
 naaccr_tblitem i,
 naaccr_tblsection s
where i.sectionID = s.section_ID
  and s.section_id is not null
  and s.section_ontology = 'Y'
  and i.item_ontology = 'Y'
  and i.FIELDLENGTH is not null
  and i.ITEMBASECODE is not null   -- another way of excluding a field/item
  order by s.section_id, i.ITEMNBR “

So, the main advantage is I can select a full section while excluding selected 
items in the section.


From: Dan Connolly [mailto:dconno...@kumc.edu]
Sent: Monday, March 09, 2015 9:25 AM
To: Lenon Patrick; gpc-dev@listserv.kumc.edu
Subject: RE: NAACCR - Free text fields

Perhaps you could back up and explain your overall approach to the NAACCR ETL? 
What code are you using? Where can I look at it? Perhaps you've discussed this 
before, but I don't see any pointers to context in this message. Did you try 
the HERON code? If not, why not? If so, what happened when you tried?

As to this specific question, it's documented on the 
TumorRegistryhttps://informatics.kumc.edu/work/wiki/TumorRegistry page:
We reviewed the data we get by section to eliminate potentially sensitive data, 
including free-text; the sections with a -- below are not loaded into HERON:

followed the relevant code excerpt from 
source:heron_load/naaccr_txform.sql#L67https://informatics.kumc.edu/work/browser/heron_load/naaccr_txform.sql#L67​https://informatics.kumc.edu/work/export/HEAD/heron_load/naaccr_txform.sql#L67

--
Dan

From: 
gpc-dev-boun...@listserv.kumc.edumailto:gpc-dev-boun...@listserv.kumc.edu 
[gpc-dev-boun...@listserv.kumc.edu] on behalf of Lenon Patrick 
[ple...@uwhealth.org]
Sent: Monday, March 09, 2015 9:17 AM
To: gpc-dev@listserv.kumc.edumailto:gpc-dev@listserv.kumc.edu
Subject: NAACCR - Free text fields
In trying to ensure all my Tumor Registry fact table items have corresponding 
concept codes in the Concept Dimension, I found NAACCR item 310 (Text-Usual 
Occupation) which has a format of “Free text.”  As you’d expect there are no 
entries in the NAACCR metadata

RE: NAACCR Ontology - Accession and Seq No?

2015-02-05 Thread Lenon Patrick
I don't know if there's an actual bug in your code, or if it's just a part of 
the Heron code I couldn't use as is.  Your tumor_item_value is a view while I 
had to do mine as a table (Netezza insisted), for instance, so you may have 
something in the view definition that was added later.  There could also be 
something in the definition of the NAACCR fact table (naaccr.extract_eav) that 
tumor_item_value draws on.

At any rate, I've already excluded those fields from my ontology.  Thx for the 
response.

From: Dan Connolly [mailto:dconno...@kumc.edu]
Sent: Thursday, February 05, 2015 4:30 PM
To: Lenon Patrick; gpc-dev@listserv.kumc.edu
Subject: RE: NAACCR Ontology - Accession and Seq No?

No, the accession number should definitely not be in the observation_fact 
table. That would be a bad bug.

I'm trying to reproduce it here...

--
Dan

From: 
gpc-dev-boun...@listserv.kumc.edumailto:gpc-dev-boun...@listserv.kumc.edu 
[gpc-dev-boun...@listserv.kumc.edu] on behalf of Lenon Patrick 
[ple...@uwhealth.org]
Sent: Thursday, February 05, 2015 10:42 AM
To: gpc-dev@listserv.kumc.edumailto:gpc-dev@listserv.kumc.edu
Subject: NAACCR Ontology - Accession and Seq No?
Hello again all,
In building  NAACCR concepts via my modified versions of KUMC's 
naaccr_txform.sql and naaccr_concepts_load.sql, I'm producing concepts and 
facts for each  value of both Accession No. and Hospital Sequence No.  This 
seems unnecessary, since I can't see accession number being part of a 
researcher's query, not to mention possible de-identification issues.

So, my questions are:

1)  Is this actually the correct outcome?

2)  Assuming it isn't, did I miss a step where these two fields (or their 
section, 6 Hospital-Specific) are filtered out of the fact and concept builds?  
I'd expect that to be in the definition of the big flat view aka 
tumor_item_value.

Awaiting your input eagerly.


Patrick Lenon
HIMC Informatics Specialist
608 890 5671

___
Gpc-dev mailing list
Gpc-dev@listserv.kumc.edu
http://listserv.kumc.edu/mailman/listinfo/gpc-dev


NAACCR Ontology - Accession and Seq No?

2015-02-05 Thread Lenon Patrick
Hello again all,
In building  NAACCR concepts via my modified versions of KUMC's 
naaccr_txform.sql and naaccr_concepts_load.sql, I'm producing concepts and 
facts for each  value of both Accession No. and Hospital Sequence No.  This 
seems unnecessary, since I can't see accession number being part of a 
researcher's query, not to mention possible de-identification issues.

So, my questions are:

1)  Is this actually the correct outcome?

2)  Assuming it isn't, did I miss a step where these two fields (or their 
section, 6 Hospital-Specific) are filtered out of the fact and concept builds?  
I'd expect that to be in the definition of the big flat view aka 
tumor_item_value.

Awaiting your input eagerly.


Patrick Lenon
HIMC Informatics Specialist
608 890 5671

___
Gpc-dev mailing list
Gpc-dev@listserv.kumc.edu
http://listserv.kumc.edu/mailman/listinfo/gpc-dev


RE: NAACCR Encounter type?

2015-01-30 Thread Lenon Patrick
Thanks, I think you answered my followup as well.  VERY efficient.  ;)


From: Dan Connolly [mailto:dconno...@kumc.edu]
Sent: Friday, January 30, 2015 9:44 AM
To: Lenon Patrick; gpc-dev@listserv.kumc.edu
Subject: RE: NAACCR Encounter type?

We discussed encounters a bit at HackathonTwo; the only thing I remember is 
that I'm supposed to split ticket #155. More on that in due course. Meanwhile...

It's by design (limitation) that NAACR encounters don't show up in the 
Encounters portion of the  i2b2 tree, at least the KUMC i2b2 tree.

And yes, they're unique; i.e. we don't (yet) support same-encounter queries 
between the tumour registry and anything else; we're considering 
per-patient-day encounter support that should result in a bit of integration. 
(for KUMC folks: #333 and nearby tickets tagged encounter-mapping)

Encounter type support is weak in HERON ETL. The only encounter types you'll 
see from HERON ETL come from Epic, and then only for in-patient, I think. In 
the visit_dimension generated by HERON ETL, the encounter types are all null. I 
hope other (Hubert, ...) take the lead on this and produce something we can 
integrate into HERON ETL.

--
Dan

From: 
gpc-dev-boun...@listserv.kumc.edumailto:gpc-dev-boun...@listserv.kumc.edu 
[gpc-dev-boun...@listserv.kumc.edu] on behalf of Lenon Patrick 
[ple...@uwhealth.org]
Sent: Friday, January 30, 2015 9:12 AM
To: gpc-dev@listserv.kumc.edumailto:gpc-dev@listserv.kumc.edu
Subject: NAACCR Encounter type?
Hey all,

I'm close to loading our NAACCR data into our I2B2 instance, which involves 
generating encounter records (per the tumor_reg_visits table referenced in 
KUMC's naaccr_facts_load.sql).  These encounters don't line up with any 
existing encounters, they're all unique.

My question is, should these NAACCR encounters be represented in the 
Encounters portion of the I2B2 tree?  I checked Babel and spot-checked a few 
setups and didn't see any reference to NAACCR under Encounter Type, which is 
where I'd expect to find it.

Has this been considered/discussed/rejected?  Your thoughts are welcome.



Patrick Lenon
HIMC Informatics Specialist
608 890 5671

___
Gpc-dev mailing list
Gpc-dev@listserv.kumc.edu
http://listserv.kumc.edu/mailman/listinfo/gpc-dev


RE: HackathonTwo follow-up action items

2015-01-29 Thread Lenon Patrick
Regarding the #44 Portable Heron notes, one situation I’ve dealt with is 
references to a table which I do not have.  From naaccr_concepts_load.sql, for 
instance:

(select * from BlueHeronData.source_master@deid
   where source_cd like 'tumor_registry@%')

I don’t have a “source_master” table, and can’t begin to guess what “*” might 
comprise.  (OK, I know one field is called “source_cd”.) In other cases I’ve 
been able to infer what’s missing, or what our local name for a given table is, 
but this one stopped me cold.

For a database script to be portable, there need to be definitions somewhere of 
the tables and fields referenced.  This could be in the form of a data 
dictionary, or it could be simply output of data definition language (DDL) from 
the tables referenced.  Oftimes the first script run does all the DDL up front, 
with data manipulation language (DML) in a separate script that can be run 
iteratively.

Apologies if I missed something obvious.  Peril of being the “new kid”…

p.s. If I did miss it,kindly point it out using small words and simple 
gestures.  ;)



From: gpc-dev-boun...@listserv.kumc.edu 
[mailto:gpc-dev-boun...@listserv.kumc.edu] On Behalf Of Dan Connolly
Sent: Tuesday, January 27, 2015 9:46 AM
To: gpc-dev@listserv.kumc.edu
Subject: HackathonTwo follow-up action items

I'm going over the meeting 
noteshttps://informatics.gpcnetwork.org/trac/Project/wiki/HackathonTwo#record 
(thanks, Laurel!) looking especially for follow-up items. I added comments that 
some of them aren't quite clear (as well as presentation materials to find, 
etc.)

This is a summary of follow-up action items that I found so far:

Names in section headings indicate follow-up item ownership.



Day 1 - Thu, Jan 
22https://docs.google.com/document/d/13dA_ml1GSIhZ7fs-fWle5dPU95UlFQEoS6HJezSgKyQ/edit#heading=h.w3yrm6qir16q

Introductions, Opening Remarks - GPC Phase 2 
LOIhttps://docs.google.com/document/d/13dA_ml1GSIhZ7fs-fWle5dPU95UlFQEoS6HJezSgKyQ/edit#heading=h.46n62mizfpxm

AM Session 1 - GPC CDM ETL (Campbell, 
Graham?)https://docs.google.com/document/d/13dA_ml1GSIhZ7fs-fWle5dPU95UlFQEoS6HJezSgKyQ/edit#heading=h.1ahl9lh7w5ei

AM Session 1 - Heron Code 
Sharinghttps://docs.google.com/document/d/13dA_ml1GSIhZ7fs-fWle5dPU95UlFQEoS6HJezSgKyQ/edit#heading=h.zdilnib5ekxc

identified i2b2 
(Mosa@MU)https://docs.google.com/document/d/13dA_ml1GSIhZ7fs-fWle5dPU95UlFQEoS6HJezSgKyQ/edit#heading=h.523suzwa3zeu

AM Session 2 - Breast Cancer Survey Finder FIle (Kowalski, UMN, UIOWA, MCRF, 
UTSW, 
McMahon)https://docs.google.com/document/d/13dA_ml1GSIhZ7fs-fWle5dPU95UlFQEoS6HJezSgKyQ/edit#heading=h.nf8258clqxww

AM Session 2 - Obesity, BMI percentile 
(?)https://docs.google.com/document/d/13dA_ml1GSIhZ7fs-fWle5dPU95UlFQEoS6HJezSgKyQ/edit#heading=h.okvaig1ygap4

AM Session 2 - Terminology Mapping 
Strategieshttps://docs.google.com/document/d/13dA_ml1GSIhZ7fs-fWle5dPU95UlFQEoS6HJezSgKyQ/edit#heading=h.c9scujdqyl8e

PM Session 1 - Terminologies (Reeder, Campbell, 
Connolly)https://docs.google.com/document/d/13dA_ml1GSIhZ7fs-fWle5dPU95UlFQEoS6HJezSgKyQ/edit#heading=h.bl4ohqy5t9xr

PM Session 2 - Data Quality 
(?)https://docs.google.com/document/d/13dA_ml1GSIhZ7fs-fWle5dPU95UlFQEoS6HJezSgKyQ/edit#heading=h.v6pigqrdquys

PM Session 2 - 
Encountershttps://docs.google.com/document/d/13dA_ml1GSIhZ7fs-fWle5dPU95UlFQEoS6HJezSgKyQ/edit#heading=h.qhq7fmqwj9xi

PM Session 2 - Text Deidentification (Jacquie @ 
MCRF)https://docs.google.com/document/d/13dA_ml1GSIhZ7fs-fWle5dPU95UlFQEoS6HJezSgKyQ/edit#heading=h.mhfn3n5l9hx

DAY 
2https://docs.google.com/document/d/13dA_ml1GSIhZ7fs-fWle5dPU95UlFQEoS6HJezSgKyQ/edit#heading=h.4te681ghpj2u

AM Session 1 - Usable LOINC Lab Hierarchy - 
(Apathy)https://docs.google.com/document/d/13dA_ml1GSIhZ7fs-fWle5dPU95UlFQEoS6HJezSgKyQ/edit#heading=h.5yzy9wlfcb19

AM Session 1 - NLP/Text Notes Code 
Sharinghttps://docs.google.com/document/d/13dA_ml1GSIhZ7fs-fWle5dPU95UlFQEoS6HJezSgKyQ/edit#heading=h.ydcola9j7ok

AM Session 2 - Federated login 
(Mish)https://docs.google.com/document/d/13dA_ml1GSIhZ7fs-fWle5dPU95UlFQEoS6HJezSgKyQ/edit#heading=h.wnkejbhu1tmr

AM Session 2 - Building Analytic 
Datasetshttps://docs.google.com/document/d/13dA_ml1GSIhZ7fs-fWle5dPU95UlFQEoS6HJezSgKyQ/edit#heading=h.u7ngvpz55k2i

Using heron_extract to reshape data for use in 
REDCaphttps://docs.google.com/document/d/13dA_ml1GSIhZ7fs-fWle5dPU95UlFQEoS6HJezSgKyQ/edit#heading=h.nvkllaqj6uky

Analyzable 
Datahttps://docs.google.com/document/d/13dA_ml1GSIhZ7fs-fWle5dPU95UlFQEoS6HJezSgKyQ/edit#heading=h.tfawvpu2vbot

Data Analyzer User 
Interfacehttps://docs.google.com/document/d/13dA_ml1GSIhZ7fs-fWle5dPU95UlFQEoS6HJezSgKyQ/edit#heading=h.dpnvwd4jzxgm

PM Session 1 - EMR 
Integrationhttps://docs.google.com/document/d/13dA_ml1GSIhZ7fs-fWle5dPU95UlFQEoS6HJezSgKyQ/edit#heading=h.xy7jgds0qjnw
PM Session 
2https://docs.google.com/document/d/13dA_ml1GSIhZ7fs-fWle5dPU95UlFQEoS6HJezSgKyQ/edit#heading=h.q0osz6mvtv3c
--
Dan

RE: code sharing an HERON ETL documentation (was: HackathonTwo follow-up action items)

2015-01-29 Thread Lenon Patrick
I don’t see a source_master table here, so I’ll have to check that with Keith 
when he gets back.  I was able to finesse it for the short term, anyway.

I did succeed in finding and replicating the “normal_concept”  view also 
referenced in naaccr_concepts_load.

Thanks for the feedback.  I’ll try to spend a bit more time on multisitedev.  I 
should mention that the link for “internal onboarding/training notes” doesn’t 
work.  I think we discussed this earlier.  
https://bmi-work.kumc.edu/work/ticket/1247#comment:16


From: Dan Connolly [mailto:dconno...@kumc.edu]
Sent: Thursday, January 29, 2015 2:24 PM
To: Lenon Patrick; gpc-dev@listserv.kumc.edu
Subject: RE: code sharing an HERON ETL documentation (was: HackathonTwo 
follow-up action items)

I don't doubt that there are lots of undocumented aspects of HERON ETL, but the 
issue here seems more like the fact that 20,000 lines of code is a lot to get 
your head around, especially when the number of contributors has been in the 
single digits for a long time, so we haven't put a lot of emphasis on the new 
developer experience. 
MultiSiteDevhttps://informatics.gpcnetwork.org/trac/Project/wiki/MultiSiteDev 
is an attempt to address this issue. I wouldn't mind devoting more meeting time 
to this sort of thing.

In particular:

source_master is part of i2b2:

edu.harvard.i2b2.data grep -ri source_master .
./Release_1-7/NewInstall/Crcdata/scripts/crc_create_uploader_oracle.sql:-- 
Table: SOURCE_MASTER
./Release_1-7/NewInstall/Crcdata/scripts/crc_create_uploader_oracle.sql:CREATE 
TABLE SOURCE_MASTER (
...

So I'd be surprised if you really don't have such a table.

The fact that HERON ETL relies on such tables is documented in the module 
header of 
heron_create.pyhttps://informatics.kumc.edu/work/browser/heron_load/heron_create.py:
Database initialization scripts from i2b2 sources are used for this process::

options = _option  # un-hide for testing
options.i2b2_source
   'mock_i2b2_source'
The relevant task* is create_deid_schema starting on line 
193https://informatics.kumc.edu/work/browser/heron_load/heron_create.py#L193:
@task
def create_deid_schemas(options):
'''Create schemas for de-identified datamart.

Note Well: Any existing schema is destroyed.
.. todo:: consider checking  that there's no valuable data
It calls _create_datamart, where we see crc_create_uploader_oracle.sql 
mentioned by name.


* Paver tasks and 
dependencieshttps://informatics.gpcnetwork.org/trac/Project/wiki/MultiSiteDev#Tasksanddependencies
 are discussed in 
MultiSiteDevhttps://informatics.gpcnetwork.org/trac/Project/wiki/MultiSiteDev.

--
Dan

From: Lenon Patrick [ple...@uwhealth.org]
Sent: Thursday, January 29, 2015 1:44 PM
To: Dan Connolly; gpc-dev@listserv.kumc.edumailto:gpc-dev@listserv.kumc.edu
Subject: RE: HackathonTwo follow-up action items
Regarding the #44 Portable Heron notes, one situation I’ve dealt with is 
references to a table which I do not have.  From naaccr_concepts_load.sql, for 
instance:

(select * from 
BlueHeronData.source_master@deidmailto:BlueHeronData.source_master@deid
   where source_cd like 'tumor_registry@%')

I don’t have a “source_master” table, and can’t begin to guess what “*” might 
comprise.  (OK, I know one field is called “source_cd”.) In other cases I’ve 
been able to infer what’s missing, or what our local name for a given table is, 
but this one stopped me cold.

For a database script to be portable, there need to be definitions somewhere of 
the tables and fields referenced.  This could be in the form of a data 
dictionary, or it could be simply output of data definition language (DDL) from 
the tables referenced.  Oftimes the first script run does all the DDL up front, 
with data manipulation language (DML) in a separate script that can be run 
iteratively.

Apologies if I missed something obvious.  Peril of being the “new kid”…

p.s. If I did miss it,kindly point it out using small words and simple 
gestures.  ;)



From: 
gpc-dev-boun...@listserv.kumc.edumailto:gpc-dev-boun...@listserv.kumc.edu 
[mailto:gpc-dev-boun...@listserv.kumc.edu] On Behalf Of Dan Connolly
Sent: Tuesday, January 27, 2015 9:46 AM
To: gpc-dev@listserv.kumc.edumailto:gpc-dev@listserv.kumc.edu
Subject: HackathonTwo follow-up action items

I'm going over the meeting 
noteshttps://informatics.gpcnetwork.org/trac/Project/wiki/HackathonTwo#record 
(thanks, Laurel!) looking especially for follow-up items. I added comments that 
some of them aren't quite clear (as well as presentation materials to find, 
etc.)

This is a summary of follow-up action items that I found so far:

Names in section headings indicate follow-up item ownership.



Day 1 - Thu, Jan 
22https://docs.google.com/document/d/13dA_ml1GSIhZ7fs-fWle5dPU95UlFQEoS6HJezSgKyQ/edit#heading=h.w3yrm6qir16q

Introductions, Opening Remarks - GPC Phase 2 
LOIhttps://docs.google.com/document/d/13dA_ml1GSIhZ7fs-fWle5dPU95UlFQEoS6HJezSgKyQ/edit#heading=h.46n62mizfpxm

RE: [gpc-informatics] #44: portable HERON ETL for NAACCR

2015-01-05 Thread Lenon Patrick
Thx again guys, very helpful, many blind alleys avoided.  ;)

-Original Message-
From: Dan Connolly [mailto:dconno...@kumc.edu] 
Sent: Monday, January 05, 2015 3:28 PM
To: Lenon Patrick; 'gpc-dev@listserv.kumc.edu'
Cc: Nathan Graham; Mish Thomas F
Subject: RE: [gpc-informatics] #44: portable HERON ETL for NAACCR

1) I don't remember any waiting period, but it was 2 years ago, so who knows.

2) yes

3) No; ICD-10 isn't relevant to NAACCR ETL.

-- 
Dan


From: Lenon Patrick [ple...@uwhealth.org]
Sent: Monday, January 05, 2015 2:17 PM
To: 'gpc-dev@listserv.kumc.edu'; Dan Connolly
Cc: Nathan Graham; Lenon Patrick; Mish Thomas F
Subject: RE: [gpc-informatics] #44: portable HERON ETL for NAACCR

Thx, some additional questions:

1) I have not heard back from the WHO person, and my membership seems to be 
pending or something, so no downloads show as available for now.  Did you have 
to go through some waiting period like this?   (Maybe I mistakenly got in the 
commercial license queue instead of the research license queue?)

2) In your naaccr_concepts_load.sql, you refer to tables who.topo, who.morph2, 
and who.morph3.  Do those correspond to the ICD-O-3_CSV-metadata.zip and 
ICD-O-2_CSV.zip you refer to below?

3) Did you not download ICD-10?

Thanks as always for your assistance.


-Original Message-
From: GPC Informatics [mailto:d...@madmode.com]
Sent: Monday, January 05, 2015 12:59 PM
To: dconno...@kumc.edu; Lenon Patrick
Cc: ngra...@kumc.edu
Subject: Re: [gpc-informatics] #44: portable HERON ETL for NAACCR

#44: portable HERON ETL for NAACCR
--+---
 Reporter:  dconnolly |   Owner:  lenonpat
 Type:  enhancement   |  Status:  assigned
 Priority:  major |   Milestone:  bc-survey-cohort-def
Component:  etl-dev   |  Resolution:
 Keywords:  breast-cancer-cohort  |  Blocked By:
 Blocking:  119   |
--+---
Changes (by dconnolly):

 * owner:  dconnolly = lenonpat


Comment:

 Patrick,

 I found my (30 Jan 2013) notes on downloading materials from WHO...

 The pointer in my notes is
 http://www.who.int/classifications/icd/adaptations/oncology/en/index.html
 (along with
 
http://apps.who.int/classifications/apps/icd/ClassificationDownloadNR/license.htm
 )

 and I recorded the md5sums of what I downloaded

 - `b088c4e4bd2d685c9dd04e3b3c14c98b` ICD-O-3_CSV-metadata.zip
 - `1308ce6f4ef93c67137154cc6a723fc6` ICD-O-2_CSV.zip

--
Ticket URL: http://informatics.gpcnetwork.org/trac/Project/ticket/44#comment:8
gpc-informatics http://informatics.gpcnetwork.org/
Greater Plains Network - Informatics
___
Gpc-dev mailing list
Gpc-dev@listserv.kumc.edu
http://listserv.kumc.edu/mailman/listinfo/gpc-dev


RE: NAACCR metadata, latest go-round

2014-11-18 Thread Lenon Patrick
Great, thanks, especially for the bit about the shortcuts.  That would have 
been tough to suss out on my own.


From: Dan Connolly [mailto:dconno...@kumc.edu]
Sent: Monday, November 17, 2014 5:43 PM
To: Lenon Patrick; 'Gpc-dev@listserv.kumc.edu'
Subject: RE: NAACCR metadata, latest go-round

1) I remember the data being a little messy... the code only loads the sections 
that I was confident had no PHI (aside from the MRN, which gets deidentified 
later). I eyeballed all the data in one of the views. What a pain. But I 
haven't found any way around it.

As to how the code works... I suppose you're reading it correctly. I haven't 
looked at it in quite a while. If you feel like contributing a design sketch to 
add to the top of naaccr_concepts_load.sql, I can see that it gets code 
reviewed and integrated. (see for example the Medication dispense facts 
comment in epic_meds_transform.sql). Patches welcome, as they say.

2) My recollection is that getting the WHO files is pretty painless (at least 
compared with, say, UMLS); anything we would set up to reduce duplicated effort 
would be at least as much hassle... especially since we'd have to set it up.

3) The code to build the ontology is designed to be run each time the data is 
loaded. (No, I don't recall discussing this in gpc-dev).

4) If you got naaccr_shortcuts.csv from version control on elephant, you can 
follow your nose thru renamed shortcut concepts and reworked staging concept 
hierarchy (#2112https://bmi-work.kumc.edu/work/ticket/2112) to KUMC ticket 
#2112https://informatics.kumc.edu/work/ticket/2112 to the milford 
releasehttps://informatics.kumc.edu/work/milestone/heron-milford-update to... 
darn; there should be a link to the relevant blog article:

  *   HERON Milford simplifies the Cancer Cases 
folderhttps://informatics.kumc.edu/work/blog/heron-milford-update

A new folder, Cancer Cases (Abridged), debuts with the Milford release. This 
folder contains the frequently searched concepts from the Cancer Cases folder. 
Advanced searchers can still search the entire tumor registry in the unabridged 
folder.
Oops; that should be in the list of blog articles on 
TumorRegistryhttps://informatics.kumc.edu/work/wiki/TumorRegistry.

--
Dan

From: 
gpc-dev-boun...@listserv.kumc.edumailto:gpc-dev-boun...@listserv.kumc.edu 
[gpc-dev-boun...@listserv.kumc.edu] on behalf of Lenon Patrick 
[ple...@uwhealth.org]
Sent: Monday, November 17, 2014 4:32 PM
To: 'Gpc-dev@listserv.kumc.edu'
Subject: NAACCR metadata, latest go-round
Hi all, hoping some of you are in a good state of mind to share your 
experience, brilliance, shattering good looks... enough flattery?

Anyway, my first attempt at building a NAACCR ontology was disappointing in 
that the NAACCR tables were not as helpful as I expected.  Fundamental problem: 
A lot of junk in the column where the code value normally resides.  So in that 
column I have found (besides codes):

* Code ranges (e.g. 1-100, 110-12)

* References to outside sources (WHO in particular)

* The word BLANK, presumably to indicate the field is optional?

* ..

* *

* Comments, sometimes with HTML markups


Looking at the heron naaccr_txform and naaccr_concepts_load scripts, the 
ultimate NAACCR ontology consists of all unique base/concept codes found in the 
imported NAACCR data file (now in table tumor_reg_codes) merged with some 
fields from either the NAACCR code table (naaccr.t_code) or one of the external 
tables (e.g., WHO.TOPO).

So, questions arise:

1)  First, is the above description reasonably accurate?  Please point out 
any glaring errors.  I did leave out some detail like special cases.

2)  I looked at WHO's site, and began applying for access to their tables 
as listed in heron\heron_staging\tumor_reg\icd_o_meta.py .  However, before I 
continue, will every site have to do this individually?  Tom Mish is of the 
opinion YES.  But, does anyone have a legal/kosher/ethical way for us to not 
duplicate this effort?

3)  If I'm correct that only codes that appear in the Registry data will be 
loaded into the ontology, well, is that OK?  This fixes the problems with code 
ranges and non-code values, and the ontology tree is effectively pre-trimmed.  
But what are the implications for future loads of new data?  Has this 
discussion already happened?

4)  Bonus question:  What the heck is naaccr_shortcuts.csv?  It looks very 
useful, I have no idea what for.


Thanks in advance for any input all of y'all provide.


Patrick Lenon
HIMC Informatics Specialist
608 890 5671

___
Gpc-dev mailing list
Gpc-dev@listserv.kumc.edu
http://listserv.kumc.edu/mailman/listinfo/gpc-dev


NAACCR metadata, latest go-round

2014-11-17 Thread Lenon Patrick
Hi all, hoping some of you are in a good state of mind to share your 
experience, brilliance, shattering good looks... enough flattery?

Anyway, my first attempt at building a NAACCR ontology was disappointing in 
that the NAACCR tables were not as helpful as I expected.  Fundamental problem: 
A lot of junk in the column where the code value normally resides.  So in that 
column I have found (besides codes):

* Code ranges (e.g. 1-100, 110-12)

* References to outside sources (WHO in particular)

* The word BLANK, presumably to indicate the field is optional?

* ..

* *

* Comments, sometimes with HTML markups


Looking at the heron naaccr_txform and naaccr_concepts_load scripts, the 
ultimate NAACCR ontology consists of all unique base/concept codes found in the 
imported NAACCR data file (now in table tumor_reg_codes) merged with some 
fields from either the NAACCR code table (naaccr.t_code) or one of the external 
tables (e.g., WHO.TOPO).

So, questions arise:

1)  First, is the above description reasonably accurate?  Please point out 
any glaring errors.  I did leave out some detail like special cases.

2)  I looked at WHO's site, and began applying for access to their tables 
as listed in heron\heron_staging\tumor_reg\icd_o_meta.py .  However, before I 
continue, will every site have to do this individually?  Tom Mish is of the 
opinion YES.  But, does anyone have a legal/kosher/ethical way for us to not 
duplicate this effort?

3)  If I'm correct that only codes that appear in the Registry data will be 
loaded into the ontology, well, is that OK?  This fixes the problems with code 
ranges and non-code values, and the ontology tree is effectively pre-trimmed.  
But what are the implications for future loads of new data?  Has this 
discussion already happened?

4)  Bonus question:  What the heck is naaccr_shortcuts.csv?  It looks very 
useful, I have no idea what for.


Thanks in advance for any input all of y'all provide.


Patrick Lenon
HIMC Informatics Specialist
608 890 5671

___
Gpc-dev mailing list
Gpc-dev@listserv.kumc.edu
http://listserv.kumc.edu/mailman/listinfo/gpc-dev


RE: Questions on KUMC's NAACCR_txform.sql

2014-11-03 Thread Lenon Patrick
I'm back to this now and I'm still in a fog re:  naaccr.extract_eav.  I can't 
find any definition of it in the KUMC before it pops up in naaccr_txform.sql.  
Where could I find the DDL that defines this view?


From: Munns, Michael B [mailto:mike.mu...@unmc.edu]
Sent: Monday, October 20, 2014 3:55 PM
To: Lenon Patrick; 'Nathan Graham'; Dan Connolly; 'Gpc-dev@listserv.kumc.edu'
Cc: Mish Thomas F
Subject: RE: Questions on KUMC's NAACCR_txform.sql

The naaccr_extract_eav is a view that is in the script to create the extract 
table, naaccr_extract.sql.

Michael Munns
Database Analyst
402-559-3821

From: 
gpc-dev-boun...@listserv.kumc.edumailto:gpc-dev-boun...@listserv.kumc.edu 
[mailto:gpc-dev-boun...@listserv.kumc.edu] On Behalf Of Lenon Patrick
Sent: Monday, October 20, 2014 1:05 PM
To: 'Nathan Graham'; Dan Connolly; 'Gpc-dev@listserv.kumc.edu'
Cc: Mish Thomas F
Subject: RE: Questions on KUMC's NAACCR_txform.sql

Thx for all the info, Nate.  I'll take these to the gpc-dev list.

Dan, below there are a few questions about NAACCR/HERON that I couldn't puzzle 
out looking at your load/transform scripts, specifically: 
https://informatics.kumc.edu/work/browser/heron_load/naaccr_txform.sql

Would you mind looking at those when convenient?  Thanks.

I also have some questions about the CDM for NAACCR up on Babel, but I'll put 
them on a new thread.

From: Nathan Graham [mailto:ngra...@kumc.edu]
Sent: Monday, October 20, 2014 12:45 PM
To: Lenon Patrick; Dan Connolly
Cc: 'Kowalski, George'; Mish Thomas F
Subject: RE: Questions on KUMC's NAACCR_txform.sql

Dan, could you help with the questions below?

Patrick,

Dan Connolly is probably the most knowledgeable about NAACCR/HERON.  Also, 
perhaps we could move this conversation to the gpc-dev list?  Others might 
benefit from the conversation as well.



From: Lenon Patrick [mailto:ple...@uwhealth.org]
Sent: Friday, October 17, 2014 11:48 AM
To: Nathan Graham
Cc: 'Kowalski, George'; Mish Thomas F
Subject: Questions on KUMC's NAACCR_txform.sql

Hi Nathan,
I have a pretty good grasp of your transform proc now but I have a few nagging 
questions, please reply when convenient:


1)  You reference tables naaccr.t_item and naaccr.t_section.  I suspect 
they are just normalized tables created from the NAACCR data dictionary (see 
attached) with a separate table for the section types.  If so, did you generate 
the Section_ID field yourself?  It's the only one I don't see in NAACCR.



2)  The tumor_item_value (aka TIV) view is based on table 
naaccr_extract_eav.  The view tumor_reg_facts then merges TIV with 
naaccr.extract.  So, what's naaccr_extract_eav?  My guess is that it's a clone 
of naaccr.extract that contains properly transformed and formatted fact values. 
 If so (or if I'm even close), where does naaccr_extract_eav get loaded up and 
transformed?

If you can help me with these couple of items I'll be way ahead on this.  
Thanks for leading the way.


Patrick Lenon
HIMC Informatics Specialist
608 890 5671


The information in this e-mail may be privileged and confidential, intended 
only for the use of the addressee(s) above. Any unauthorized use or disclosure 
of this information is prohibited. If you have received this e-mail by mistake, 
please delete it and immediately contact the sender.
___
Gpc-dev mailing list
Gpc-dev@listserv.kumc.edu
http://listserv.kumc.edu/mailman/listinfo/gpc-dev


RE: Questions on KUMC's NAACCR_txform.sql

2014-11-03 Thread Lenon Patrick
I must have been covering up completely confused by acting content.  ;)

I'll talk to Tom Mish about the MultiSiteDev thing.  Thanks for the assistance.



From: Dan Connolly [mailto:dconno...@kumc.edu]
Sent: Monday, November 03, 2014 1:24 PM
To: Lenon Patrick; 'Munns, Michael B'; 'Gpc-dev@listserv.kumc.edu'
Cc: Mish Thomas F
Subject: RE: Questions on KUMC's NAACCR_txform.sql

I was a little surprised when it looked like you were content with the earlier 
answers; as far as I know, you don't participate in the 
MultiSiteDevhttps://informatics.gpcnetwork.org/trac/Project/wiki/MultiSiteDev 
process, which gives you access to code such as naaccr_extract.sql.

I just added 
attachment:naaccr_staging.ziphttps://informatics.gpcnetwork.org/trac/Project/attachment/ticket/44/naaccr_staging.zip
 to the portable HERON ETL for NAACCR ticket 
(#44https://informatics.gpcnetwork.org/trac/Project/ticket/44). It has a 
Makefile that you can use to download the spec and generate naaccr_extract.sql 
from that spec.

Unfortunately, the license terms for redistributing naaccr_extract.sql itself 
are unclear.

--
Dan

From: 
gpc-dev-boun...@listserv.kumc.edumailto:gpc-dev-boun...@listserv.kumc.edu 
[gpc-dev-boun...@listserv.kumc.edu] on behalf of Lenon Patrick 
[ple...@uwhealth.org]
Sent: Monday, November 03, 2014 11:28 AM
To: 'Munns, Michael B'; 'Gpc-dev@listserv.kumc.edu'
Cc: Mish Thomas F
Subject: RE: Questions on KUMC's NAACCR_txform.sql
I'm back to this now and I'm still in a fog re:  naaccr.extract_eav.  I can't 
find any definition of it in the KUMC before it pops up in naaccr_txform.sql.  
Where could I find the DDL that defines this view?


From: Munns, Michael B [mailto:mike.mu...@unmc.edu]
Sent: Monday, October 20, 2014 3:55 PM
To: Lenon Patrick; 'Nathan Graham'; Dan Connolly; 'Gpc-dev@listserv.kumc.edu'
Cc: Mish Thomas F
Subject: RE: Questions on KUMC's NAACCR_txform.sql

The naaccr_extract_eav is a view that is in the script to create the extract 
table, naaccr_extract.sql.

Michael Munns
Database Analyst
402-559-3821

From: 
gpc-dev-boun...@listserv.kumc.edumailto:gpc-dev-boun...@listserv.kumc.edu 
[mailto:gpc-dev-boun...@listserv.kumc.edu] On Behalf Of Lenon Patrick
Sent: Monday, October 20, 2014 1:05 PM
To: 'Nathan Graham'; Dan Connolly; 'Gpc-dev@listserv.kumc.edu'
Cc: Mish Thomas F
Subject: RE: Questions on KUMC's NAACCR_txform.sql

Thx for all the info, Nate.  I'll take these to the gpc-dev list.

Dan, below there are a few questions about NAACCR/HERON that I couldn't puzzle 
out looking at your load/transform scripts, specifically: 
https://informatics.kumc.edu/work/browser/heron_load/naaccr_txform.sql

Would you mind looking at those when convenient?  Thanks.

I also have some questions about the CDM for NAACCR up on Babel, but I'll put 
them on a new thread.

From: Nathan Graham [mailto:ngra...@kumc.edu]
Sent: Monday, October 20, 2014 12:45 PM
To: Lenon Patrick; Dan Connolly
Cc: 'Kowalski, George'; Mish Thomas F
Subject: RE: Questions on KUMC's NAACCR_txform.sql

Dan, could you help with the questions below?

Patrick,

Dan Connolly is probably the most knowledgeable about NAACCR/HERON.  Also, 
perhaps we could move this conversation to the gpc-dev list?  Others might 
benefit from the conversation as well.



From: Lenon Patrick [mailto:ple...@uwhealth.org]
Sent: Friday, October 17, 2014 11:48 AM
To: Nathan Graham
Cc: 'Kowalski, George'; Mish Thomas F
Subject: Questions on KUMC's NAACCR_txform.sql

Hi Nathan,
I have a pretty good grasp of your transform proc now but I have a few nagging 
questions, please reply when convenient:


1)  You reference tables naaccr.t_item and naaccr.t_section.  I suspect 
they are just normalized tables created from the NAACCR data dictionary (see 
attached) with a separate table for the section types.  If so, did you generate 
the Section_ID field yourself?  It's the only one I don't see in NAACCR.



2)  The tumor_item_value (aka TIV) view is based on table 
naaccr_extract_eav.  The view tumor_reg_facts then merges TIV with 
naaccr.extract.  So, what's naaccr_extract_eav?  My guess is that it's a clone 
of naaccr.extract that contains properly transformed and formatted fact values. 
 If so (or if I'm even close), where does naaccr_extract_eav get loaded up and 
transformed?

If you can help me with these couple of items I'll be way ahead on this.  
Thanks for leading the way.


Patrick Lenon
HIMC Informatics Specialist
608 890 5671


The information in this e-mail may be privileged and confidential, intended 
only for the use of the addressee(s) above. Any unauthorized use or disclosure 
of this information is prohibited. If you have received this e-mail by mistake, 
please delete it and immediately contact the sender.
___
Gpc-dev mailing list
Gpc-dev@listserv.kumc.edu
http://listserv.kumc.edu/mailman/listinfo/gpc-dev


RE: Questions on KUMC's NAACCR_txform.sql

2014-10-20 Thread Lenon Patrick
Excellent, thanks.

From: Munns, Michael B [mailto:mike.mu...@unmc.edu]
Sent: Monday, October 20, 2014 3:55 PM
To: Lenon Patrick; 'Nathan Graham'; Dan Connolly; 'Gpc-dev@listserv.kumc.edu'
Cc: Mish Thomas F
Subject: RE: Questions on KUMC's NAACCR_txform.sql

The naaccr_extract_eav is a view that is in the script to create the extract 
table, naaccr_extract.sql.

Michael Munns
Database Analyst
402-559-3821

From: 
gpc-dev-boun...@listserv.kumc.edumailto:gpc-dev-boun...@listserv.kumc.edu 
[mailto:gpc-dev-boun...@listserv.kumc.edu] On Behalf Of Lenon Patrick
Sent: Monday, October 20, 2014 1:05 PM
To: 'Nathan Graham'; Dan Connolly; 'Gpc-dev@listserv.kumc.edu'
Cc: Mish Thomas F
Subject: RE: Questions on KUMC's NAACCR_txform.sql

Thx for all the info, Nate.  I'll take these to the gpc-dev list.

Dan, below there are a few questions about NAACCR/HERON that I couldn't puzzle 
out looking at your load/transform scripts, specifically: 
https://informatics.kumc.edu/work/browser/heron_load/naaccr_txform.sql

Would you mind looking at those when convenient?  Thanks.

I also have some questions about the CDM for NAACCR up on Babel, but I'll put 
them on a new thread.

From: Nathan Graham [mailto:ngra...@kumc.edu]
Sent: Monday, October 20, 2014 12:45 PM
To: Lenon Patrick; Dan Connolly
Cc: 'Kowalski, George'; Mish Thomas F
Subject: RE: Questions on KUMC's NAACCR_txform.sql

Dan, could you help with the questions below?

Patrick,

Dan Connolly is probably the most knowledgeable about NAACCR/HERON.  Also, 
perhaps we could move this conversation to the gpc-dev list?  Others might 
benefit from the conversation as well.



From: Lenon Patrick [mailto:ple...@uwhealth.org]
Sent: Friday, October 17, 2014 11:48 AM
To: Nathan Graham
Cc: 'Kowalski, George'; Mish Thomas F
Subject: Questions on KUMC's NAACCR_txform.sql

Hi Nathan,
I have a pretty good grasp of your transform proc now but I have a few nagging 
questions, please reply when convenient:


1)  You reference tables naaccr.t_item and naaccr.t_section.  I suspect 
they are just normalized tables created from the NAACCR data dictionary (see 
attached) with a separate table for the section types.  If so, did you generate 
the Section_ID field yourself?  It's the only one I don't see in NAACCR.



2)  The tumor_item_value (aka TIV) view is based on table 
naaccr_extract_eav.  The view tumor_reg_facts then merges TIV with 
naaccr.extract.  So, what's naaccr_extract_eav?  My guess is that it's a clone 
of naaccr.extract that contains properly transformed and formatted fact values. 
 If so (or if I'm even close), where does naaccr_extract_eav get loaded up and 
transformed?

If you can help me with these couple of items I'll be way ahead on this.  
Thanks for leading the way.


Patrick Lenon
HIMC Informatics Specialist
608 890 5671


The information in this e-mail may be privileged and confidential, intended 
only for the use of the addressee(s) above. Any unauthorized use or disclosure 
of this information is prohibited. If you have received this e-mail by mistake, 
please delete it and immediately contact the sender.
___
Gpc-dev mailing list
Gpc-dev@listserv.kumc.edu
http://listserv.kumc.edu/mailman/listinfo/gpc-dev