[
https://issues.apache.org/jira/browse/MADLIB-1167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16292006#comment-16292006
]
Jingyi Mei commented on MADLIB-1167:
------------------------------------
For the confidence interval, I propose using a z-score of 1.96 to give a 95%
confidence interval, since, as [~iyerr3] mentioned in the comment above, we can
safely assume a normal sampling distribution. For the output format, I propose
showing the range as (a, b) instead of (X ± zs/√n), because other output
columns already report s and n. Any thoughts?
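
To make the proposal concrete, here is a minimal sketch of the (a, b) computation, assuming the interval is built from the mean, variance and row count already reported by the summary output. The function name and the sample numbers are hypothetical and are not taken from the MADlib code base.

```python
import math

# z-score for a two-sided 95% interval under a normal sampling distribution
Z_95 = 1.96


def mean_confidence_interval(mean, variance, n, z=Z_95):
    """Return (a, b) = (mean - z*s/sqrt(n), mean + z*s/sqrt(n)).

    Hypothetical helper: `variance` is the sample variance (s squared) and
    `n` is the number of non-missing rows.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if variance < 0:
        raise ValueError("variance must be non-negative")
    half_width = z * math.sqrt(variance) / math.sqrt(n)
    return (mean - half_width, mean + half_width)


# Made-up inputs for illustration only: mean 100, s = 20, n = 25.
a, b = mean_confidence_interval(mean=100.0, variance=400.0, n=25)
print((a, b))
```

Reporting (a, b) directly saves the reader from recombining the mean, s and n columns by hand.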
> Summary - add more statistics
> -----------------------------
>
> Key: MADLIB-1167
> URL: https://issues.apache.org/jira/browse/MADLIB-1167
> Project: Apache MADlib
> Issue Type: Improvement
> Components: Module: Descriptive Statistics
> Reporter: Frank McQuillan
> Assignee: Jingyi Mei
> Fix For: v2.0
>
>
> In the summary function
> http://madlib.apache.org/docs/latest/group__grp__summary.html
> add additional statistics:
> 1) % positive values
> 2) % negative values
> 3) % zero values
> 4) confidence intervals (95% ?) on mean
> * does this make sense, since need to assume a distribution for the data
> which we probably cannot infer?
> 5) Also please check why min and max are being reported for non-numeric cols.
> Is this a bug?
> {code}
> madlib=# SELECT * FROM houses_summary where target_column='zipcode';
> -[ RECORD 1 ]--------+----------------
> group_by |
> group_by_value |
> target_column | zipcode
> column_number | 8
> data_type | text
> row_count | 15
> distinct_values | 2
> missing_values | 0
> blank_values | 0
> fraction_missing | 0
> fraction_blank | 0
> mean |
> variance |
> min | 6
> max | 6
> first_quartile |
> median |
> third_quartile |
> most_frequent_values | {94301y,84301x}
> mfv_frequencies | {10,5}
> {code}
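
For items 1) to 3) of the description, the following is a rough sketch of how the three fractions could be computed for a numeric column. It assumes that missing values (None here) are excluded from the denominator; the issue does not specify this, and the function name is hypothetical.

```python
def sign_fractions(values):
    """Return the fractions of positive, negative and zero values.

    Assumption: None entries are treated as missing and are excluded
    from the denominator.
    """
    nums = [v for v in values if v is not None]
    n = len(nums)
    if n == 0:
        # No non-missing values, so the fractions are undefined.
        return {"positive": None, "negative": None, "zero": None}
    positive = sum(1 for v in nums if v > 0)
    negative = sum(1 for v in nums if v < 0)
    zero = n - positive - negative
    return {
        "positive": positive / n,
        "negative": negative / n,
        "zero": zero / n,
    }


# Made-up column values for illustration only.
print(sign_fractions([1, -1, 0, 2, None]))
```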