[jira] [Commented] (MADLIB-1257) PostgreSQL crashed during random forest training

Frank McQuillan (JIRA) Sat, 28 Jul 2018 04:28:30 -0700


    [ 
https://issues.apache.org/jira/browse/MADLIB-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16560725#comment-16560725
 ]


Frank McQuillan commented on MADLIB-1257:
-----------------------------------------

A similar problem occurred with decision tree (using the same data set).


dmesg reported the following error:

[ 4289.020198] postmaster[1840]: segfault at 0 ip 00007f17cd5f4ea3 sp 00007ffdf867dd50 error 4 in libmadlib.so[7f17cd2ec000+64a000]
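
As a rough aid for triage, the dmesg line quoted above can be turned into an offset inside libmadlib.so by subtracting the mapping base from the instruction pointer. The Python sketch below is not part of MADlib; `parse_segfault` is a hypothetical helper name, and it assumes the x86-64 Linux kernel message format seen in this report. A fault address of 0 suggests a NULL pointer dereference.

```python
import re

# dmesg line from this report, rejoined onto a single line.
DMESG_LINE = (
    "[ 4289.020198] postmaster[1840]: segfault at 0 ip 00007f17cd5f4ea3 "
    "sp 00007ffdf867dd50 error 4 in libmadlib.so[7f17cd2ec000+64a000]"
)

# Assumed layout: "segfault at ADDR ip IP sp SP error CODE in LIB[BASE+SIZE]"
PATTERN = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) "
    r"sp (?P<sp>[0-9a-f]+) error (?P<err>\d+) "
    r"in (?P<lib>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def parse_segfault(line):
    """Extract the faulting library and the code offset within it."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError("line does not look like a kernel segfault message")

    ip = int(match.group("ip"), 16)
    base = int(match.group("base"), 16)
    size = int(match.group("size"), 16)

    # Offset of the faulting instruction relative to the start of the mapping.
    offset = ip - base
    if not 0 <= offset < size:
        raise ValueError("instruction pointer is outside the reported mapping")

    return {
        "library": match.group("lib"),
        "fault_address": int(match.group("addr"), 16),
        # On x86, error code 4 means a user-mode read of a non-present page.
        "error_code": int(match.group("err")),
        "offset": offset,
    }


info = parse_segfault(DMESG_LINE)
print(info["library"], hex(info["offset"]))
```

The resulting offset could then be passed to `addr2line -e` against a libmadlib.so built with debug symbols, assuming it is the same build as the one on the crashing server.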

> PostgreSQL crashed during random forest training
> ------------------------------------------------
>
>                 Key: MADLIB-1257
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1257
>             Project: Apache MADlib
>          Issue Type: Bug
>          Components: Module: Random Forest
>            Reporter: Rahul Iyer
>            Priority: Major
>             Fix For: v2.0
>
>         Attachments: train_data.gz
>
>
> User reported bug:
> I ran into a problem when training grouped data with random forest (300 
> features). A small data set worked (e.g., 56K instances in 56 groups), but 
> training failed for 240K instances in 250 groups. Postgres dropped the session 
> after showing the following messages in verbose mode:
> {code:sql}
> NOTICE:  view "__madlib_temp_60124179_1532371657_7130296__" will be a 
> temporary view
> NOTICE:  sql_create_empty_result_table:
>             CREATE TABLE analysis.dx_rf_train_output_1 (
>                 gid         integer,
>                 sample_id   integer,
>                 tree        madlib.bytea8);
> NOTICE:  sql_refresh_training_pois_cnt:
>                             TRUNCATE TABLE 
> __madlib_temp_91155016_1532371657_5660955__ CASCADE;
>                             INSERT INTO 
> __madlib_temp_91155016_1532371657_5660955__
>                             SELECT
>                                 *,
>                                 madlib.poisson_random(1) AS poisson_count
>                             FROM
>                             (
>                                 SELECT
>                                     *,
>                                     0.::double precision AS 
> __madlib_temp_14328459_1532371657_7318497__
>                                 FROM analysis.dxpredict_svec
>                             ) subq
>                             WHERE __madlib_temp_14328459_1532371657_7318497__ 
> < 1
> NOTICE:
>                         src_cnt: 158360,
>                         oob_cnt: 92418,
>                         dup_cnt: 250617.
> NOTICE:  Started tree building for all groups
> server closed the connection unexpectedly
>         This probably means the server terminated abnormally
>         before or while processing the request.
> The connection to the server was lost. Attempting reset: Failed.
> PostgreSQL did not capture a detailed log even after I set 
> log_statement to "all":
> 2018-07-23 14:47:50.229 EDT [1090] LOG:  server process (PID 1980) was 
> terminated by signal 11: Segmentation fault
> 2018-07-23 14:47:50.229 EDT [1090] DETAIL:  Failed process was running: 
> SELECT madlib.forest_train('analysis.dxpredict_svec',
>                                    'analysis.dx_rf_train_output_1',
>                                    'rowid',
>                                    'positive',
>                                    '*',
>                                    'rowid,positive,case_icd',
>                                    'case_icd',
>                                    30::integer,
>                                    30::integer,
>                                    TRUE::boolean,
>                                    1::integer,
>                                    10::integer,
>                                    3::integer,
>                                    1::integer,
>                                    10::integer,
>                                    NULL,
>                                    TRUE
>                                    );
> 2018-07-23 14:47:50.229 EDT [1090] LOG:  terminating any other active server 
> processes
> 2018-07-23 14:47:50.229 EDT [1401] WARNING:  terminating connection because 
> of crash of another server process
> {code}
> Another observation: it also crashed with 84 groups and 73K instances. In this 
> scenario, I should have plenty of memory and disk. 
> It also seems that, as the number of groups increases, a lot of temporary 
> disk space is used once the data exceeds a certain number of groups.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
