#56: shareable synthetic test data sets: Epic clarity, i2b2, NAACCR, ...
 Reporter:  dconnolly     |       Owner:  bos
     Type:  enhancement   |      Status:  assigned
 Priority:  minor         |   Milestone:  data-domains3
Component:  data-sharing  |  Resolution:
 Keywords:                |  Blocked By:
 Blocking:                |

 While testing NAACCR ETL refinements for #258 and related work, I'm
 reminded that we don't have much test data. I'm interested in pursuing the
 idea of characterizing existing data and synthesizing data based on those

   - For nominal data, calculate frequencies and use the frequencies to
     pick random values
   - For numeric measures, assume a normal distribution
   - For dates, treat them as numeric measures by subtracting date of
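
 A minimal sketch of the three ideas above, using only the Python standard
 library. The function names are hypothetical, and the `reference` date
 passed to synthesize_dates is an assumption made for illustration only:
 the caller decides which date is subtracted.

```python
import random
import statistics
from collections import Counter
from datetime import date, timedelta


def synthesize_nominal(observed, n, rng):
    """Draw n values with probabilities proportional to observed frequencies."""
    counts = Counter(observed)
    values = list(counts)
    weights = [counts[v] for v in values]
    return rng.choices(values, weights=weights, k=n)


def synthesize_numeric(observed, n, rng):
    """Fit a normal distribution (sample mean and stdev) and draw n values."""
    mu = statistics.mean(observed)
    sigma = statistics.stdev(observed)  # needs at least two observations
    return [rng.gauss(mu, sigma) for _ in range(n)]


def synthesize_dates(observed, reference, n, rng):
    """Turn dates into day offsets from `reference` (an assumed, caller-chosen
    date), synthesize offsets as a numeric measure, and convert back."""
    offsets = [(d - reference).days for d in observed]
    return [reference + timedelta(days=round(x))
            for x in synthesize_numeric(offsets, n, rng)]


# Example with made-up inputs; a seeded generator keeps runs repeatable.
rng = random.Random(56)
sexes = synthesize_nominal(["F", "F", "M", "F"], 5, rng)
ages = synthesize_numeric([61.0, 70.5, 55.2, 66.8], 5, rng)
dates = synthesize_dates(
    [date(2010, 3, 1), date(2011, 7, 15), date(2012, 1, 9)],
    date(2010, 1, 1), 5, rng)
```

 Note that the normal assumption can produce implausible values (negative
 ages, for example), so a real generator would probably need to clip or
 reject draws.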

 for bonus points:
   - use primary site, stage, age, sex to influence probabilities of other
   - for text, use trigrams and learn about hidden Markov models
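
 A rough sketch of the two bonus ideas, again standard-library Python with
 hypothetical function and field names. Two assumptions: the conditioning
 is done with plain per-group frequency tables, and "trigrams" is read as
 word trigrams. The text generator is an ordinary second-order Markov
 chain over words, not a hidden Markov model.

```python
import random
from collections import Counter, defaultdict


def build_conditional_counts(records, given, target):
    """Count `target` values separately for each combination of `given` fields."""
    table = defaultdict(Counter)
    for rec in records:
        key = tuple(rec[f] for f in given)
        table[key][rec[target]] += 1
    return dict(table)


def sample_conditional(table, key, rng):
    """Pick a target value using the frequencies observed for `key`.
    Raises KeyError if that combination never occurred in the input."""
    counts = table[key]
    values = list(counts)
    return rng.choices(values, weights=[counts[v] for v in values], k=1)[0]


def build_trigram_model(texts):
    """Map each pair of consecutive words to the words seen right after it."""
    model = defaultdict(list)
    for text in texts:
        words = text.split()
        for a, b, c in zip(words, words[1:], words[2:]):
            model[(a, b)].append(c)
    return dict(model)


def generate_text(model, max_words, rng):
    """Start from a random word pair and extend it until there is no known
    continuation or max_words is reached (output has at least two words)."""
    pair = rng.choice(sorted(model))
    out = list(pair)
    while len(out) < max_words and pair in model:
        nxt = rng.choice(model[pair])
        out.append(nxt)
        pair = (pair[1], nxt)
    return " ".join(out)
```

 With sparse data, many combinations of conditioning fields will have few
 or no observations, so some fallback to the unconditional frequencies
 would likely be needed.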

