#56: shareable synthetic test data sets: Epic clarity, i2b2, NAACCR, ...
--------------------------+----------------------------
 Reporter:  dconnolly     |       Owner:  bos
     Type:  enhancement   |      Status:  assigned
 Priority:  minor         |   Milestone:  data-domains3
Component:  data-sharing  |  Resolution:
 Keywords:                |  Blocked By:
 Blocking:                |
--------------------------+----------------------------

Comment (by dconnolly):

 While testing NAACCR ETL refinements for #258 and related work, I'm
 reminded that we don't have much test data.

 I'm interested in pursuing the idea of characterizing existing data and
 synthesizing data based on those characteristics:

  - For nominal data, calculate frequencies and use them to pick random
    values.
  - For numeric measures, assume a normal distribution.
  - For dates, treat them as numeric measures by subtracting the date of
    diagnosis.

 For bonus points:

  - Use primary site, stage, age, and sex to influence the probabilities
    of other data.
  - For text, use trigrams and learn about hidden Markov models.

--
Ticket URL: <http://informatics.gpcnetwork.org/trac/Project/ticket/56#comment:28>
gpc-informatics <http://informatics.gpcnetwork.org/>
Greater Plains Network - Informatics
_______________________________________________
Gpc-dev mailing list
Gpc-dev@listserv.kumc.edu
http://listserv.kumc.edu/mailman/listinfo/gpc-dev
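
The first three ideas in the comment (frequency-weighted picks for nominal data, a normal distribution for numeric measures, and dates handled as day offsets from the date of diagnosis) could be sketched roughly as follows. This is only an illustration using the Python standard library: the function names and the sample values are hypothetical and are not taken from the ticket or from any GPC code, and the normal-distribution assumption is the one stated in the comment, not a claim that it fits real registry data.

```python
import random
import statistics
from collections import Counter
from datetime import date, timedelta


def fit_nominal(values):
    """Characterize a nominal column as (categories, observed counts)."""
    counts = Counter(values)
    categories = list(counts)
    weights = [counts[c] for c in categories]
    return categories, weights


def sample_nominal(fit, rng):
    """Pick one category at random, weighted by its observed frequency."""
    categories, weights = fit
    return rng.choices(categories, weights=weights, k=1)[0]


def fit_numeric(values):
    """Characterize a numeric column by its mean and sample standard deviation."""
    return statistics.mean(values), statistics.stdev(values)


def sample_numeric(fit, rng):
    """Draw one value, assuming the column is normally distributed."""
    mean, stdev = fit
    return rng.gauss(mean, stdev)


def fit_date_offsets(dates, diagnosis_dates):
    """Treat dates as numeric: days elapsed since each record's diagnosis date."""
    offsets = [(d - dx).days for d, dx in zip(dates, diagnosis_dates)]
    return fit_numeric(offsets)


def sample_date(fit, diagnosis_date, rng):
    """Draw a day offset and add it back to a (synthetic) diagnosis date."""
    return diagnosis_date + timedelta(days=round(sample_numeric(fit, rng)))


if __name__ == "__main__":
    rng = random.Random(56)  # fixed seed so a shared data set is reproducible

    # Hypothetical source values, for illustration only.
    sex_fit = fit_nominal(["1", "2", "2", "2"])
    age_fit = fit_numeric([54, 61, 67, 72])
    followup_fit = fit_date_offsets(
        [date(2012, 3, 1), date(2012, 9, 15)],
        [date(2012, 1, 10), date(2012, 6, 1)],
    )

    for _ in range(3):
        print(
            sample_nominal(sex_fit, rng),
            round(sample_numeric(age_fit, rng)),
            sample_date(followup_fit, date(2013, 1, 1), rng),
        )
```

Each column is sampled independently here, so correlations between fields are lost; conditioning on primary site, stage, age, and sex (the "bonus points" items) would need the frequencies to be computed per group rather than per column.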