DML is quite amazing. I was wondering if there is a more user-friendly 
(human-readable) version of the output from Univar-Stats.dml? I ran 
Univar-Stats.dml on my data set that contains 7 variables: two continuous, 
one categorical. The output is a CSV file on HDFS that looks like this:

1 1 10.0
2 1 123.0
2 7 469.0
3 1 122.0
3 7 419.0
4 1 34.852512104922082
4 7 0.40786451178676335
5 1 613.6600902369631
5 7 1.5322171660886
6 1 25.566777079580508
6 7 5.54382044429201915
7 1 0.219263232610989764
7 7 12.14558700418414E-4
8 1 0.5323447433694138
8 7 1.23151883029726626
9 1 0.28352047550156284
9 7 23.25049533659206
10 1 -0.5348573740280274
10 7 2023.294658877635
11 1 2.874872545380876E-4
11 7 1.874872545380876E-4
12 1 6.0017749742760714085
12 7 0.00237749742760714085
13 1 12.0
14 1 30.56066514110724
15 2 4.0
---- truncated (numbers randomly modified)

According to the documentation at 
http://apache.github.io/incubator-systemml/algorithms-descriptive-statistics.html#univariate-statistics
the first column of the matrix gives the statistic type (minimum, mean, 
etc.), the second column gives the variable ID, and the last column gives 
the value of the statistic. 
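
To make the layout concrete, here is a minimal Python sketch (not part of SystemML) that reads such triplets into a dictionary keyed by (statistic ID, variable ID). The function name `parse_triplets` is hypothetical, and accepting either whitespace or commas as the separator is an assumption, since the file is described as csv but the sample above is space-separated.

```python
import re

def parse_triplets(text):
    """Parse 'stat_id var_id value' lines into {(stat_id, var_id): value}."""
    cells = {}
    for line in text.splitlines():
        # Assumption: fields are separated by whitespace and/or commas.
        fields = re.split(r"[,\s]+", line.strip())
        if len(fields) != 3:
            continue  # skip blank lines and anything that is not a triplet
        stat_id, var_id, value = fields
        cells[(int(stat_id), int(var_id))] = float(value)
    return cells

sample = """1 1 10.0
2 1 123.0
2 7 469.0
13 1 12.0
15 2 4.0"""
cells = parse_triplets(sample)
print(cells[(2, 7)])  # statistic 2 of variable 7
```

A dictionary keyed by the ID pair keeps the sparse layout: statistics that do not apply to a variable simply have no entry.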

While the documentation is very clear and the results are consistent with 
the output of other software such as R, I found the format a bit 
inconvenient, since I have to look up the reference table (Table 1 at the 
aforementioned link) to understand the summary statistics. 

I understand that the purely numeric matrix format is easy to use as machine 
input for later steps. An additional table that is more human readable 
would be nice, since the main purpose of univariate statistics is often 
exploratory data analysis, where a clear summary is essential. 

Suggestions to consider in the readable summary if there's not already 
one:
1. Order the rows according to variables (column 2) instead of statistics 
type (column 1), so that summary statistics of the same variable are 
grouped together.
2. Use actual statistics labels ("min", "mean", "skewness" etc) instead of 
IDs (1, 2, etc).
3. Use actual predictor labels ("age", "gender", etc) instead of IDs (1,2, 
etc).
4. Use level labels for categorical predictors ("male", "female", etc) 
instead of IDs (1,2, etc).
5. Add counts of cases in each level for each categorical variable, in 
addition to the mode. This shows the distribution of the variable.
6. If the amount of data in the summary is manageable, perhaps 
automatically pull the output of Univar-Stats.dml from HDFS to the local 
machine and display the readable version on the terminal? 
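
Suggestions 1 to 3 could be sketched roughly as follows in Python. This is only an illustration, not existing SystemML functionality. The function `readable_summary` is hypothetical, the statistic labels for IDs 1 to 4 are an assumption that should be checked against Table 1 of the documentation, and the variable label "age" is a placeholder the user would supply.

```python
def readable_summary(cells, stat_labels, var_labels):
    """Return 'variable statistic value' lines, grouped by variable."""
    lines = []
    # Sort by variable ID first, then by statistic ID (suggestion 1).
    for stat_id, var_id in sorted(cells, key=lambda k: (k[1], k[0])):
        # Fall back to generic names when no label is known.
        var = var_labels.get(var_id, f"var{var_id}")
        stat = stat_labels.get(stat_id, f"stat{stat_id}")
        lines.append(f"{var} {stat} {cells[(stat_id, var_id)]:g}")
    return lines

# Assumed labels: verify against Table 1 before relying on them.
STAT_LABELS = {1: "min", 2: "max", 3: "range", 4: "mean"}
# Hypothetical variable labels supplied by the user.
VAR_LABELS = {1: "age"}

# Illustrative triplets keyed by (statistic ID, variable ID).
cells = {(1, 1): 10.0, (2, 1): 123.0, (2, 7): 469.0, (3, 1): 113.0}
summary = readable_summary(cells, STAT_LABELS, VAR_LABELS)
print("\n".join(summary))
```

Passing the labels in as plain dictionaries keeps the numeric output untouched, so the readable table would be an addition rather than a replacement.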

So the output could look like:

age min 10
age max 123
age range 113
age mean 60
...
gender female.count 1000
gender male.count 2000
gender mode male
...

or even a table format like in R:

age            gender
min    10      female 1000
max    123     male   2000
range  113     mode   male
mean   60      ...
...

Thanks much, 

Ethan Xu
