dml is quite amazing. I was wondering whether there is a user-friendly (more human-readable) version of the output from Univar-Stats.dml. I ran Univar-Stats.dml on my data set, which contains 7 variables: two continuous, one categorical. The output is a csv file on HDFS that looks like this:

    1 1 10.0
    2 1 123.0
    2 7 469.0
    3 1 122.0
    3 7 419.0
    4 1 34.852512104922082
    4 7 0.40786451178676335
    5 1 613.6600902369631
    5 7 1.5322171660886
    6 1 25.566777079580508
    6 7 5.54382044429201915
    7 1 0.219263232610989764
    7 7 12.14558700418414E-4
    8 1 0.5323447433694138
    8 7 1.23151883029726626
    9 1 0.28352047550156284
    9 7 23.25049533659206
    10 1 -0.5348573740280274
    10 7 2023.294658877635
    11 1 2.874872545380876E-4
    11 7 1.874872545380876E-4
    12 1 6.0017749742760714085
    12 7 0.00237749742760714085
    13 1 12.0
    14 1 30.56066514110724
    15 2 4.0
    ---- truncated (numbers randomly modified)

According to the documentation on http://apache.github.io/incubator-systemml/algorithms-descriptive-statistics.html#univariate-statistics , the first column of the matrix represents statistics type (minimum, mean, etc.), the second column represents variable ID and the last column gives the statistics value.

While the documentation is very clear and the results are consistent with outputs of other software like R, I found the format a bit inconvenient since I have to refer to the reference Table (table 1 in aforementioned link) to understand the summary statistics. I understand that the pure numeric matrix format is easy to use as machine input for future steps. An additional table that is more human readable would be nice since the main purpose of uni-variate statistics is often exploratory data analysis and a clear summary is essential.

Suggestions to consider in the readable summary if there's not already one:

1. Order the rows according to variables (column 2) instead of statistics type (column 1), so that summary statistics of the same variable are grouped together.
2. Use actual statistics labels ("min", "mean", "skewness" etc) instead of IDs (1, 2, etc).
3. Use actual predictor labels ("age", "gender", etc) instead of IDs (1, 2, etc).
4. Use level labels for categorical predictors ("male", "female", etc) instead of IDs (1, 2, etc).
5. Add counts of cases in each level for categorical variable in addition to modes. This gives the distribution information of the variable.
6. If the amount of data in the summary is manageable perhaps automatically pull the output of Univar-Stats.dml from HDFS to local machine and display the readable version on terminal?

So the output could look like:

    age     min           10
    age     max           123
    age     range         113
    age     mean          60
    ...
    gender  female.count  1000
    gender  male.count    2000
    gender  mode          male
    ...

or even a table format like in R:

         age            gender
    min    10      female 1000
    max    123     male   2000
    range  113     mode   male
    mean   60      ...
    ...

Thanks much,
Ethan Xu
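
P.S. To make suggestions 1 to 3 concrete, here is a rough Python sketch of the kind of post-processing I have in mind. It is only an illustration, not something that ships with Univar-Stats.dml: the function name `readable_summary`, the `var_names` argument, and the label table are my own assumptions. In particular, the statistic labels are my reading of Table 1 in the documentation and should be checked against it. The sketch assumes the output has already been copied from HDFS to a local string, with one "statistic-ID variable-ID value" triple per line.

```python
from collections import defaultdict

# Assumed labels for the statistic IDs in column 1 (my reading of Table 1 in
# the Univar-Stats documentation; verify before relying on them).
STAT_LABELS = {
    1: "min", 2: "max", 3: "range", 4: "mean", 5: "variance",
    6: "std.dev", 7: "std.err.mean", 8: "coeff.of.variation",
    9: "skewness", 10: "kurtosis", 11: "std.err.skewness",
    12: "std.err.kurtosis", 13: "median", 14: "interquartile.mean",
    15: "num.categories", 16: "mode", 17: "num.modes",
}


def readable_summary(triples_text, var_names=None):
    """Turn 'stat_id var_id value' lines into (variable, statistic, value) rows.

    Rows are grouped by variable (column 2), then ordered by statistic ID.
    var_names is a hypothetical {variable ID: label} mapping supplied by the user.
    """
    var_names = var_names or {}
    by_var = defaultdict(dict)
    for line in triples_text.splitlines():
        # Accept either space-separated or comma-separated triples.
        parts = line.replace(",", " ").split()
        if len(parts) != 3:
            continue  # skip blank lines and anything that is not a triple
        try:
            stat_id, var_id, value = int(parts[0]), int(parts[1]), float(parts[2])
        except ValueError:
            continue  # skip lines that are not numeric
        by_var[var_id][stat_id] = value

    rows = []
    for var_id in sorted(by_var):
        name = var_names.get(var_id, f"var{var_id}")
        for stat_id in sorted(by_var[var_id]):
            label = STAT_LABELS.get(stat_id, f"stat{stat_id}")
            rows.append((name, label, by_var[var_id][stat_id]))
    return rows


if __name__ == "__main__":
    sample = "1 1 10.0\n2 1 123.0\n2 7 469.0\n15 2 4.0"
    for name, label, value in readable_summary(sample, {1: "age", 2: "gender"}):
        print(f"{name:<10}{label:<22}{value:g}")
```

Level labels and per-level counts (suggestions 4 and 5) are not covered here, since as far as I can tell the counts are not part of the output shown above.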
