#56: shareable synthetic test data sets: Epic clarity, i2b2, NAACCR, ...
--------------------------+----------------------------
 Reporter:  dconnolly     |       Owner:  bos
     Type:  enhancement   |      Status:  assigned
 Priority:  minor         |   Milestone:  data-domains3
Component:  data-sharing  |  Resolution:
 Keywords:                |  Blocked By:
 Blocking:                |
--------------------------+----------------------------

Comment (by dconnolly):

 While testing NAACCR ETL refinements for #258 and related work, I'm
 reminded that we don't have much test data. I'd like to pursue the
 idea of characterizing existing data and synthesizing new data based on
 those characteristics:

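   - For nominal data, calculate the frequency of each value and use
     those frequencies as weights when picking random values
   - For numeric measures, assume a normal distribution (i.e. record the
     mean and standard deviation, then sample from that distribution)
   - For dates, treat them as numeric measures by subtracting the date of
     diagnosis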
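
 A minimal sketch of the three steps above, in Python with only the
 standard library. All function names here are hypothetical and not part
 of any existing GPC code. It assumes each column is available as a plain
 list, that numeric columns have at least two values, and that date
 offsets are measured in whole days:

```python
import random
import statistics
from collections import Counter
from datetime import date, timedelta


def characterize_nominal(values):
    """Return (categories, counts) observed in a nominal column."""
    counts = Counter(values)
    categories = sorted(counts)
    return categories, [counts[c] for c in categories]


def synthesize_nominal(profile, n, rng):
    """Pick n random values, weighted by the observed counts."""
    categories, weights = profile
    return rng.choices(categories, weights=weights, k=n)


def characterize_numeric(values):
    """Return (mean, sample standard deviation); needs >= 2 values."""
    return statistics.mean(values), statistics.stdev(values)


def synthesize_numeric(profile, n, rng):
    """Draw n values from a normal distribution with the given profile."""
    mean, stdev = profile
    return [rng.gauss(mean, stdev) for _ in range(n)]


def characterize_dates(dates, diagnosis_dates):
    """Profile dates as day offsets from each record's diagnosis date."""
    offsets = [(d - dx).days for d, dx in zip(dates, diagnosis_dates)]
    return characterize_numeric(offsets)


def synthesize_dates(profile, diagnosis_dates, rng):
    """Add a normally distributed day offset to each diagnosis date."""
    mean, stdev = profile
    return [dx + timedelta(days=round(rng.gauss(mean, stdev)))
            for dx in diagnosis_dates]
```

 Only the profiles (counts, means, standard deviations) need to leave the
 source system; whether those summaries are themselves safe to share is a
 separate question that this sketch does not address.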

 For bonus points:
   - use primary site, stage, age, and sex to influence the probabilities
     of other data
   - for text, use trigrams and learn about hidden Markov models
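
 A rough sketch of the two bonus ideas, again in standard-library Python
 with hypothetical names. The first pair of functions keeps separate
 frequency counts for each combination of conditioning fields (for
 example primary site and sex). The second pair is a plain Markov chain
 over character trigrams, which is simpler than a hidden Markov model and
 is only a stand-in for one. It assumes records are dicts, that the
 conditioning key was seen in the source data, that at least one text was
 supplied, and that "^" and "$" do not occur in the text:

```python
import random
from collections import Counter, defaultdict


def characterize_conditional(rows, given, target):
    """Count target values separately per combination of `given` fields."""
    table = defaultdict(Counter)
    for row in rows:
        key = tuple(row[g] for g in given)
        table[key][row[target]] += 1
    return table


def synthesize_conditional(table, key, rng):
    """Pick a target value using the counts observed for `key`."""
    counts = table[key]  # assumption: key was observed in the source data
    values = sorted(counts)
    return rng.choices(values, weights=[counts[v] for v in values], k=1)[0]


def characterize_text(texts):
    """Map each pair of adjacent characters to counts of the next one."""
    model = defaultdict(Counter)
    for text in texts:
        padded = "^^" + text + "$"  # "^" marks start, "$" marks end
        for i in range(len(padded) - 2):
            model[padded[i:i + 2]][padded[i + 2]] += 1
    return model


def synthesize_text(model, rng, max_len=200):
    """Generate text one character at a time from the trigram counts."""
    out = []
    state = "^^"
    while len(out) < max_len:
        counts = model[state]
        chars = sorted(counts)
        nxt = rng.choices(chars, weights=[counts[c] for c in chars], k=1)[0]
        if nxt == "$":
            break
        out.append(nxt)
        state = state[1] + nxt
    return "".join(out)
```

 With few source records, either approach can reproduce source values
 nearly verbatim, so the output would need review before sharing.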

--
Ticket URL: <http://informatics.gpcnetwork.org/trac/Project/ticket/56#comment:28>
gpc-informatics <http://informatics.gpcnetwork.org/>
Greater Plains Network - Informatics
_______________________________________________
Gpc-dev mailing list
Gpc-dev@listserv.kumc.edu
http://listserv.kumc.edu/mailman/listinfo/gpc-dev
