[ https://issues.apache.org/jira/browse/MADLIB-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16560725#comment-16560725 ]
Frank McQuillan commented on MADLIB-1257:
-----------------------------------------

A similar problem occurred with decision tree (on the same data set). dmesg reported this error:

"[ 4289.020198] postmaster[1840]: segfault at 0 ip 00007f17cd5f4ea3 sp 00007ffdf867dd50 error 4 in libmadlib.so[7f17cd2ec000+64a000]"

> PostgreSQL crashed during random forest training
> ------------------------------------------------
>
>                 Key: MADLIB-1257
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1257
>             Project: Apache MADlib
>          Issue Type: Bug
>          Components: Module: Random Forest
>            Reporter: Rahul Iyer
>            Priority: Major
>             Fix For: v2.0
>
>         Attachments: train_data.gz
>
>
> User reported bug:
> I ran into a problem when training random forest on grouped data (300
> features). Small data was fine (e.g., 56K instances in 56 groups), but training
> failed for 240K instances in 250 groups. Postgres dropped the session
> after showing the message below in verbose mode:
> {code:sql}
> NOTICE: view "__madlib_temp_60124179_1532371657_7130296__" will be a
> temporary view
> NOTICE: sql_create_empty_result_table:
>     CREATE TABLE analysis.dx_rf_train_output_1 (
>         gid integer,
>         sample_id integer,
>         tree madlib.bytea8);
> NOTICE: sql_refresh_training_pois_cnt:
>     TRUNCATE TABLE
>         __madlib_temp_91155016_1532371657_5660955__ CASCADE;
>     INSERT INTO
>         __madlib_temp_91155016_1532371657_5660955__
>     SELECT
>         *,
>         madlib.poisson_random(1) AS poisson_count
>     FROM
>     (
>         SELECT
>             *,
>             0.::double precision AS
>             __madlib_temp_14328459_1532371657_7318497__
>         FROM analysis.dxpredict_svec
>     ) subq
>     WHERE __madlib_temp_14328459_1532371657_7318497__
>         < 1
> NOTICE:
>     src_cnt: 158360,
>     oob_cnt: 92418,
>     dup_cnt: 250617.
> NOTICE: Started tree building for all groups
> server closed the connection unexpectedly
>     This probably means the server terminated abnormally
>     before or while processing the request.
> The connection to the server was lost. Attempting reset: Failed.
> PostgreSQL did not capture a detailed log even after I set
> log_statement to "all":
> 2018-07-23 14:47:50.229 EDT [1090] LOG: server process (PID 1980) was
> terminated by signal 11: Segmentation fault
> 2018-07-23 14:47:50.229 EDT [1090] DETAIL: Failed process was running:
>     SELECT madlib.forest_train('analysis.dxpredict_svec',
>         'analysis.dx_rf_train_output_1',
>         'rowid',
>         'positive',
>         '*',
>         'rowid,positive,case_icd',
>         'case_icd',
>         30::integer,
>         30::integer,
>         TRUE::boolean,
>         1::integer,
>         10::integer,
>         3::integer,
>         1::integer,
>         10::integer,
>         NULL,
>         TRUE
>     );
> 2018-07-23 14:47:50.229 EDT [1090] LOG: terminating any other active server
> processes
> 2018-07-23 14:47:50.229 EDT [1401] WARNING: terminating connection because
> of crash of another server process
> {code}
> Another observation: it also crashed with 84 groups and 73K instances. In that
> scenario there should be plenty of memory and disk available.
> It also seems that, as the number of groups increases, a lot of temporary
> disk space is used once the data exceeds a certain number of groups.
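
For reference, the dmesg segfault line quoted in the comment can be decoded mechanically. The following is a minimal sketch, not part of the original report: the function name `decode_segfault` and the regular expression are illustrative assumptions, and the interpretation of the error code assumes an x86 kernel. It computes the offset of the faulting instruction relative to the start of the `libmadlib.so` mapping and splits the page-fault error code into its low three bits.

```python
import re

# dmesg line copied from the comment (timestamp omitted).
LINE = ("postmaster[1840]: segfault at 0 ip 00007f17cd5f4ea3 "
        "sp 00007ffdf867dd50 error 4 in libmadlib.so[7f17cd2ec000+64a000]")

# Illustrative pattern for the kernel's segfault message; all numbers are hex.
PATTERN = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) "
    r"sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+) "
    r"in (?P<lib>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_segfault(line):
    """Hypothetical helper: extract the fields of a segfault line."""
    m = PATTERN.search(line)
    if m is None:
        raise ValueError("not a recognizable segfault line")
    ip = int(m["ip"], 16)
    base = int(m["base"], 16)
    err = int(m["err"], 16)
    return {
        "library": m["lib"],
        "fault_address": int(m["addr"], 16),  # address that was accessed
        "offset": ip - base,                  # instruction offset in the mapping
        "page_present": bool(err & 1),        # bit 0: page was present
        "write": bool(err & 2),               # bit 1: access was a write
        "user_mode": bool(err & 4),           # bit 2: fault came from user mode
    }


info = decode_segfault(LINE)
print(info["library"], hex(info["offset"]))  # libmadlib.so 0x308ea3
```

Read this way, "segfault at 0" with "error 4" is a user-mode read of address 0, which is consistent with a NULL pointer dereference inside libmadlib.so. If the library was built with debug symbols, the offset can usually be passed to `addr2line -e <path to libmadlib.so> 0x308ea3` to locate the source line; whether the mapping offset equals the address addr2line expects depends on how the library is mapped, so treat the result as a starting point.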