[ https://issues.apache.org/jira/browse/MADLIB-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16767483#comment-16767483 ]
Frank McQuillan edited comment on MADLIB-1301 at 2/13/19 7:21 PM:
------------------------------------------------------------------
One idea is to do something like we do in
http://madlib.apache.org/docs/latest/group__grp__summary.html
with the parameter
{code}
n_cols_per_run (optional)
INTEGER, default: 15. The number of columns to collect summary statistics in
one pass of the data. This parameter determines the number of passes through
the data. For example, with a total of 40 columns to summarize and
'n_cols_per_run = 15', there will be 3 passes through the data, with each pass
summarizing a maximum of 15 columns.
Note
This parameter should be used with caution. Increasing this parameter could
decrease the total run time (if number of passes decreases), but will increase
the memory consumption during each run. Since PostgreSQL limits the memory
available for a single aggregate run, this increased memory consumption could
result in an out-of-memory termination error.
{code}
That is, limit the number of groups processed per pass over the data. The
default could be "all", as it is now, and the user could then reduce it if
there are memory issues.
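A rough sketch of the batching idea in Python, for illustration only. The name
n_groups_per_run is hypothetical (it is not an existing MADlib parameter), and
both helper functions are made up for this example. It only shows how the
number of passes follows from the batch size, and how "all" (modelled here as
None) collapses to a single pass.

```python
from math import ceil
from typing import Iterator, List, Optional, Sequence, TypeVar

T = TypeVar("T")


def num_passes(n_items: int, n_per_run: int) -> int:
    """Number of passes over the data needed to cover n_items."""
    if n_per_run <= 0:
        raise ValueError("n_per_run must be positive")
    return ceil(n_items / n_per_run)


def batches(items: Sequence[T],
            n_groups_per_run: Optional[int] = None) -> Iterator[List[T]]:
    """Yield consecutive batches of group keys, one batch per pass.

    n_groups_per_run is a hypothetical parameter; None stands for "all",
    i.e. every group is processed in a single pass.
    """
    if n_groups_per_run is None:
        n_groups_per_run = max(len(items), 1)
    if n_groups_per_run <= 0:
        raise ValueError("n_groups_per_run must be positive")
    for start in range(0, len(items), n_groups_per_run):
        yield list(items[start:start + n_groups_per_run])
```

With 40 columns and 15 per run, num_passes gives the 3 passes from the
n_cols_per_run documentation. Applied to groups, 930 groups with a hypothetical
n_groups_per_run of 100 would mean 10 passes, each holding aggregate state for
at most 100 groups.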
> Improve correlation and covariance memory usage with large number of groups
> ---------------------------------------------------------------------------
>
> Key: MADLIB-1301
> URL: https://issues.apache.org/jira/browse/MADLIB-1301
> Project: Apache MADlib
> Issue Type: Improvement
> Components: Module: Descriptive Statistics
> Reporter: Frank McQuillan
> Priority: Major
> Fix For: v2.0
>
>
> When correlation and covariance are run with a large number of groups
> (hundreds), they can run out of memory. Increasing statement_mem helps, but
> this JIRA is to investigate and improve memory usage with large numbers of
> groups.
> Sample findings on correlation for a 300K input data set:
> || #groups || statement mem 186M || statement mem 200M || statement mem 500M || statement mem 1000M ||
> | 6 | Success | Success | Success | - |
> | 127 | Success | Success | - | - |
> | 930 | Fail | Fail | Success | - |
> | 1213 | Fail | Fail | Success | - |
> | 4852 | Fail | Fail | Fail | Fail |