[jira] [Comment Edited] (MADLIB-1301) Improve correlation and covariance memory usage with large number of groups

Frank McQuillan (JIRA) Wed, 13 Feb 2019 11:38:55 -0800


    [ 
https://issues.apache.org/jira/browse/MADLIB-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16767483#comment-16767483
 ]


Frank McQuillan edited comment on MADLIB-1301 at 2/13/19 7:21 PM:
------------------------------------------------------------------

One idea is to do something like what we do in the summary function
http://madlib.apache.org/docs/latest/group__grp__summary.html

with the parameter

{code}
n_cols_per_run (optional)

INTEGER, default: 15. The number of columns to collect summary statistics in 
one pass of the data. This parameter determines the number of passes through 
the data. For example, with a total of 40 columns to summarize and 
'n_cols_per_run = 15', there will be 3 passes through the data, with each pass 
summarizing a maximum of 15 columns.

Note
This parameter should be used with caution. Increasing this parameter could 
decrease the total run time (if number of passes decreases), but will increase 
the memory consumption during each run. Since PostgreSQL limits the memory 
available for a single aggregate run, this increased memory consumption could 
result in an out-of-memory termination error.
{code}

That is, limit the number of groups processed per pass over the data.  The 
default could be "all", as it is now, and the user could then reduce it if 
there are memory issues.
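
To make the idea concrete, here is a rough sketch of the batching logic, not an existing MADlib API. The parameter name n_groups_per_run and both helper functions are hypothetical, chosen only to mirror n_cols_per_run. The pass count is the ceiling of (number of items / items per run), which gives the 3 passes for 40 columns quoted above.

```python
from typing import Iterator, List, Optional, Sequence, TypeVar

T = TypeVar("T")


def num_passes(n_items: int, per_run: int) -> int:
    """Passes over the data when at most `per_run` items are handled per pass."""
    if per_run <= 0:
        raise ValueError("per_run must be positive")
    # Integer ceiling division: ceil(n_items / per_run).
    return (n_items + per_run - 1) // per_run


def group_batches(
    groups: Sequence[T], n_groups_per_run: Optional[int] = None
) -> Iterator[List[T]]:
    """Yield consecutive batches of group keys, one batch per pass.

    `n_groups_per_run` is a hypothetical parameter. None stands for "all",
    i.e. a single pass over every group (the current behavior).
    """
    if n_groups_per_run is None:
        n_groups_per_run = max(len(groups), 1)
    if n_groups_per_run <= 0:
        raise ValueError("n_groups_per_run must be positive")
    for start in range(0, len(groups), n_groups_per_run):
        yield list(groups[start:start + n_groups_per_run])


# Example from the summary docs: 40 columns at 15 per run.
print(num_passes(40, 15))

# Hypothetical use for this issue: 4852 groups, 500 groups per pass.
group_keys = list(range(4852))
print(sum(1 for _ in group_batches(group_keys, 500)))
```

Each batch would then drive one run of the correlation or covariance aggregate restricted to those groups, so peak memory would depend on the batch size rather than on the total number of groups, at the cost of extra passes over the data.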


> Improve correlation and covariance memory usage with large number of groups
> ---------------------------------------------------------------------------
>
>                 Key: MADLIB-1301
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1301
>             Project: Apache MADlib
>          Issue Type: Improvement
>          Components: Module: Descriptive Statistics
>            Reporter: Frank McQuillan
>            Priority: Major
>             Fix For: v2.0
>
>
> When correlation and covariance are run with a large number of groups 
> (hundreds), they can run out of memory.  Increasing statement_mem helps, but 
> this JIRA is to investigate and improve memory usage with large numbers of 
> groups.
> Sample findings on correlation for a 300K input data set:
> || #groups || statement mem 186M || statement mem 200M || statement mem 500M || statement mem 1000M ||
> | 6 | Success | Success | Success | - |
> | 127 | Success | Success | - | - |
> | 930 | Fail | Fail | Success | - |
> | 1213 | Fail | Fail | Success | - |
> | 4852 | Fail | Fail | Fail | Fail |



