[jira] [Updated] (MADLIB-1257) PostgreSQL crashed during random forest training

Frank McQuillan (JIRA) Tue, 24 Jul 2018 12:12:20 -0700


     [ https://issues.apache.org/jira/browse/MADLIB-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Frank McQuillan updated MADLIB-1257:
------------------------------------
    Description: 
User reported bug:

I ran into a problem when training grouped data with random forest (300 
features). A small data set was fine (e.g., 56K instances in 56 groups), but 
training failed for 240K instances in 250 groups. Postgres forcibly 
disconnected the session after showing the messages below in verbose mode:

{code:sql}
NOTICE:  view "__madlib_temp_60124179_1532371657_7130296__" will be a temporary view
NOTICE:  sql_create_empty_result_table:

            CREATE TABLE analysis.dx_rf_train_output_1 (
                gid         integer,
                sample_id   integer,
                tree        madlib.bytea8);

NOTICE:  sql_refresh_training_pois_cnt:

                            TRUNCATE TABLE __madlib_temp_91155016_1532371657_5660955__ CASCADE;
                            INSERT INTO __madlib_temp_91155016_1532371657_5660955__
                            SELECT
                                *,
                                madlib.poisson_random(1) AS poisson_count
                            FROM
                            (
                                SELECT
                                    *,
                                    0.::double precision AS __madlib_temp_14328459_1532371657_7318497__
                                FROM analysis.dxpredict_svec
                            ) subq
                            WHERE __madlib_temp_14328459_1532371657_7318497__ < 1

NOTICE:
                        src_cnt: 158360,
                        oob_cnt: 92418,
                        dup_cnt: 250617.

NOTICE:  Started tree building for all groups
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

PostgreSQL did not capture a detailed log even after I increased log_statement to "all":
2018-07-23 14:47:50.229 EDT [1090] LOG:  server process (PID 1980) was terminated by signal 11: Segmentation fault
2018-07-23 14:47:50.229 EDT [1090] DETAIL:  Failed process was running: SELECT madlib.forest_train('analysis.dxpredict_svec',
                                   'analysis.dx_rf_train_output_1',
                                   'rowid',
                                   'positive',
                                   '*',
                                   'rowid,positive,case_icd',
                                   'case_icd',
                                   30::integer,
                                   30::integer,
                                   TRUE::boolean,
                                   1::integer,
                                   10::integer,
                                   3::integer,
                                   1::integer,
                                   10::integer,
                                   NULL,
                                   TRUE
                                   );
2018-07-23 14:47:50.229 EDT [1090] LOG:  terminating any other active server processes
2018-07-23 14:47:50.229 EDT [1401] WARNING:  terminating connection because of crash of another server process
{code}

Another observation: it also crashed with 84 groups and 73K instances. In this 
scenario, I should have plenty of memory and disk. 

It also seems that as the number of groups increases, training uses a lot of 
temporary disk space once the data exceeds a certain number of groups.
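
One way to sanity-check the src_cnt, oob_cnt, and dup_cnt values in the NOTICE 
above is to compare them with what per-row Poisson(1) bootstrap counts would 
produce. The sketch below assumes (this is a reading of the log, not something 
the report confirms) that src_cnt is the number of rows with a nonzero 
poisson_count, oob_cnt is the number of rows with a zero count, and dup_cnt is 
the sum of the counts. The helper names are hypothetical, and the code is 
plain Python rather than MADlib.

```python
import math
import random


def poisson_sample(lam, rng):
    """Draw one Poisson(lam) variate using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def bootstrap_counts(n_rows, rng):
    """Simulate Poisson(1) bootstrap counts and summarise them.

    Returns (src_cnt, oob_cnt, dup_cnt) under the ASSUMED meaning:
    rows drawn at least once, rows never drawn, and total draws.
    """
    counts = [poisson_sample(1.0, rng) for _ in range(n_rows)]
    src_cnt = sum(1 for c in counts if c > 0)
    oob_cnt = n_rows - src_cnt
    dup_cnt = sum(counts)
    return src_cnt, oob_cnt, dup_cnt


# Values copied from the NOTICE in the report.
LOG_SRC, LOG_OOB, LOG_DUP = 158360, 92418, 250617

# Assumption: every input row is either in-bag or out-of-bag.
n_rows = LOG_SRC + LOG_OOB
log_oob_fraction = LOG_OOB / n_rows
expected_oob_fraction = math.exp(-1)  # P(count == 0) for Poisson(1)

sim_src, sim_oob, sim_dup = bootstrap_counts(n_rows, random.Random(0))

print("logged OOB fraction:   ", round(log_oob_fraction, 4))
print("expected OOB fraction: ", round(expected_oob_fraction, 4))
print("simulated (src, oob, dup):", sim_src, sim_oob, sim_dup)
```

Under that reading, the logged out-of-bag share is about 0.3685, close to 
exp(-1) (about 0.3679), and dup_cnt is close to the total row count. That 
would suggest the sampling step completed normally and the crash is specific 
to the tree-building step that follows.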


> PostgreSQL crashed during random forest training
> ------------------------------------------------
>
>                 Key: MADLIB-1257
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1257
>             Project: Apache MADlib
>          Issue Type: Bug
>          Components: Module: Random Forest
>            Reporter: Rahul Iyer
>            Priority: Major
>             Fix For: v2.0
>
>         Attachments: train_data.gz



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)