[ https://issues.apache.org/jira/browse/MADLIB-1185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16284304#comment-16284304 ]
Nikhil edited comment on MADLIB-1185 at 12/11/17 5:48 PM:
----------------------------------------------------------

*+RCA+*

The exception comes from this code in PGException_proto.hpp:
{code}
class PGException : public std::runtime_error {
public:
    explicit PGException()
      : std::runtime_error("The backend raised an exception.") { }

    // FIXME: Do something useful with inErrorData
    PGException(ErrorData* /* inErrorData */)
      : std::runtime_error("The backend raised an exception.") { }
};
{code}

The root cause of the problem lies in the type_info constructor in the following files: viterbi.cpp, lda.cpp, svd.cpp, matrix_ops.cpp and arima.cpp. All of these files define a type_info struct like this:
{code}
typedef struct __type_info{
    Oid oid;
    int16_t len;
    bool byval;
    char align;

    __type_info(Oid oid):oid(oid)
    {
        madlib_get_typlenbyvalalign(oid, &len, &byval, &align);
    }
} type_info;

static type_info FLOAT8TI(FLOAT8OID);
{code}

madlib_get_typlenbyvalalign is a MADlib wrapper over the postgres function get_typlenbyvalalign. The wrapper catches the exception and does not print the actual error coming from postgres, so we had to replace all calls to madlib_get_typlenbyvalalign with get_typlenbyvalalign to see it. After that, we saw the following exception:
{code}
ERROR: invalid cache ID: 74
CONTEXT: parallel worker
{code}

get_typlenbyvalalign calls SearchSysCache1, and the constructor uses it to assign values to the struct members len, byval and align. The problem is that when you open a psql session and call any C MADlib UDF for the first time, postgres calls dlopen on libmadlib.so. This runs all the type_info constructors during dlopen (because of this code {code}static type_info FLOAT8TI(FLOAT8OID);{code}), which in turn call SearchSysCache1. It is not recommended to call SearchSysCache1 during init.
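As a rough illustration of what a type_info constructor without a catalog lookup could look like, here is a minimal standalone sketch. It is not the actual MADlib patch. The Oid typedef and OID constants are stand-ins so the snippet compiles without postgres headers, the INT4OID case is included only for illustration, and byval = true for float8 assumes a 64-bit build where FLOAT8PASSBYVAL is enabled.

{code}
#include <cstdint>
#include <stdexcept>

// Stand-ins for PostgreSQL definitions so this sketch compiles on its own.
// Real code would take them from postgres.h and catalog/pg_type.h.
typedef unsigned int Oid;
static const Oid INT4OID = 23;
static const Oid FLOAT8OID = 701;

struct type_info {
    Oid oid;
    int16_t len;
    bool byval;
    char align;

    // No get_typlenbyvalalign / SearchSysCache1 call here, so this
    // constructor is safe to run while the library is being loaded.
    explicit type_info(Oid oid) : oid(oid), len(0), byval(false), align('\0') {
        switch (oid) {
        case INT4OID:
            len = 4; byval = true; align = 'i';
            break;
        case FLOAT8OID:
            // Assumption: 64-bit build, float8 passed by value.
            len = 8; byval = true; align = 'd';
            break;
        default:
            // Only OIDs listed above are supported by this sketch.
            throw std::invalid_argument("type_info: unsupported type OID");
        }
    }
};

// Static instance, constructed at library load time as in the original code.
static type_info FLOAT8TI(FLOAT8OID);
{code}

The trade-off is that the hardcoded values have to match the target platform and the pg_type catalog, which is why the assumptions are called out in the comments.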
Here is a relevant postgres thread about calling SearchSysCache1 during library initialization: https://www.postgresql.org/message-id/96420364a3d055172776752a1de80714%40smtp.hushmail.com

Hardcoding all the type_info struct members inside the constructor fixes the problem.

> Postgres 10 support for MADlib with large tables
> ------------------------------------------------
>
>                 Key: MADLIB-1185
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1185
>             Project: Apache MADlib
>          Issue Type: Bug
>          Components: DB Abstraction Layer
>            Reporter: Nikhil
>             Fix For: v1.13
>
>
> Running MADlib on postgres10 with a large dataset (98000 rows with a double
> array column) causes the database to crash.
> Repro Steps
> {code}
> 1. create table foo (id integer, x double precision[], y integer);
> 2. Insert at least 1 million rows like these
>   id   |            x            | y
> -------+-------------------------+---
>  97440 | {1,0.2,0,1,0,1,0,0,0,0} | 1
> 3. Now running any C madlib UDF followed by a count(*) of foo will cause the
> database to crash
> select madlib.poisson_random(1); select count(*) from foo;
> or
> select madlib.svec_plus('{1}:{5}', '{1}:{4}'); select count(*) from foo;
> {code}