[ 
https://issues.apache.org/jira/browse/MADLIB-1185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16284304#comment-16284304
 ] 

Nikhil edited comment on MADLIB-1185 at 12/11/17 5:48 PM:
----------------------------------------------------------

*+RCA+*

The exception is thrown by this code in PGException_proto.hpp:
{code}
class PGException : public std::runtime_error {
public:
    explicit 
    PGException()
      : std::runtime_error("The backend raised an exception.") { }
    
    // FIXME: Do something useful with inErrorData
    PGException(ErrorData* /* inErrorData */)
      : std::runtime_error("The backend raised an exception.") {  }
};
{code}

The root cause of the problem lies in the type_info constructor in the 
following files: viterbi.cpp, lda.cpp, svd.cpp, matrix_ops.cpp and arima.cpp.

All of these files define a type_info struct like this:
{code}
typedef struct __type_info{
    Oid oid;
    int16_t len;
    bool    byval;
    char    align;

    __type_info(Oid oid):oid(oid)
    {
        madlib_get_typlenbyvalalign(oid, &len, &byval, &align);
    }
} type_info;

static type_info FLOAT8TI(FLOAT8OID);
{code}

madlib_get_typlenbyvalalign is a MADlib wrapper around the PostgreSQL function 
get_typlenbyvalalign. The wrapper catches the exception and does not print the 
actual error coming from PostgreSQL. To see the actual error, we had to replace 
all calls to madlib_get_typlenbyvalalign with direct calls to 
get_typlenbyvalalign. After that, we saw the following error:
{code}
  ERROR:  invalid cache ID: 74
  CONTEXT:  parallel worker
{code}
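To illustrate how the original message gets lost, here is a minimal sketch that contrasts an exception class that ignores the error record (as PGException does today) with one that keeps its message. FakeErrorData, GenericBackendError and DetailedBackendError are hypothetical names invented for this example. They are not MADlib or PostgreSQL APIs, and the real ErrorData struct has many more fields.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the backend's error record (assumption:
// only the message text matters for this illustration).
struct FakeErrorData {
    const char* message;
};

// Mirrors the behaviour described above: the error record is ignored,
// so every failure produces the same generic text.
class GenericBackendError : public std::runtime_error {
public:
    explicit GenericBackendError(const FakeErrorData* /* ignored */)
      : std::runtime_error("The backend raised an exception.") { }
};

// Variant that appends the backend's message when one is available.
class DetailedBackendError : public std::runtime_error {
public:
    explicit DetailedBackendError(const FakeErrorData* err)
      : std::runtime_error(describe(err)) { }

private:
    static std::string describe(const FakeErrorData* err) {
        std::string text = "The backend raised an exception";
        if (err != nullptr && err->message != nullptr) {
            text += ": ";
            text += err->message;
        } else {
            text += ".";
        }
        return text;
    }
};

// Usage check: the same error record yields two different messages.
void demoErrorMessages() {
    FakeErrorData err = { "invalid cache ID: 74" };
    assert(std::string(GenericBackendError(&err).what())
           == "The backend raised an exception.");
    assert(std::string(DetailedBackendError(&err).what())
           == "The backend raised an exception: invalid cache ID: 74");
}
```

With something along these lines, the "invalid cache ID" text would have been visible without swapping out the wrapper calls.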

The type_info constructor calls get_typlenbyvalalign to assign values to the 
struct members len, byval and align, and get_typlenbyvalalign in turn calls 
SearchSysCache1.

The problem is that when you open a psql session and call any C MADlib UDF 
for the first time, PostgreSQL calls dlopen on libmadlib.so. Because of 
file-scope definitions such as {code}static 
type_info FLOAT8TI(FLOAT8OID);{code} dlopen runs all the type_info 
constructors, which in turn call SearchSysCache1. Calling SearchSysCache1 
during library initialization is not recommended. Here is a relevant 
PostgreSQL thread about it: 

https://www.postgresql.org/message-id/96420364a3d055172776752a1de80714%40smtp.hushmail.com
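The load-time behaviour can be reproduced in plain C++. In this sketch, fake_catalog_lookup is a hypothetical stand-in for a catalog call such as SearchSysCache1 and only counts how often it is invoked. The function-local static is shown purely for contrast. It is not the fix applied for this issue.

```cpp
#include <cassert>

// Hypothetical stand-in for a catalog lookup; it only counts calls.
int g_lookup_calls = 0;

int fake_catalog_lookup(int oid) {
    ++g_lookup_calls;
    return oid;
}

struct eager_info {
    int value;
    explicit eager_info(int oid) : value(fake_catalog_lookup(oid)) { }
};

// File-scope object: its constructor runs while the program or shared
// library is being loaded, before any of its functions is called.
static eager_info EAGER(701);

// Function-local static: the constructor runs on the first call only.
const eager_info& lazy_info() {
    static eager_info info(701);
    return info;
}

// Usage check: EAGER has already triggered a lookup before this runs,
// and repeated calls to lazy_info() add at most one more.
void demoLazyLookup() {
    const int before = g_lookup_calls;
    assert(before >= 1);
    lazy_info();
    lazy_info();
    assert(g_lookup_calls <= before + 1);
}
```

The counter is already 1 when main starts, which is the same effect that makes dlopen on libmadlib.so reach SearchSysCache1.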

Hardcoding all the type_info struct members inside the constructor, instead of 
looking them up through get_typlenbyvalalign, fixes the problem.
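A minimal sketch of what such hardcoding could look like follows. It is not the actual patch. The Oid typedef and the OID constants are stand-ins so the example compiles without PostgreSQL headers, hardcoded_type_info is a hypothetical name, and kFloat8ByVal assumes a build where float8 is passed by value (real code would use FLOAT8PASSBYVAL). Which types the affected files actually need is not stated here, so float8 and int4 are only examples.

```cpp
#include <cassert>
#include <cstdint>

// Stand-ins for definitions that normally come from PostgreSQL headers.
typedef unsigned int Oid;
static const Oid FLOAT8OID = 701;
static const Oid INT4OID = 23;

// Assumption: float8 is passed by value, as on typical 64-bit builds.
static const bool kFloat8ByVal = true;

struct hardcoded_type_info {
    Oid          oid;
    std::int16_t len;
    bool         byval;
    char         align;

    // No catalog access here, so this is safe to run at load time.
    explicit hardcoded_type_info(Oid type_oid)
      : oid(type_oid), len(0), byval(false), align('\0') {
        switch (type_oid) {
        case FLOAT8OID:
            len = 8; byval = kFloat8ByVal; align = 'd';
            break;
        case INT4OID:
            len = 4; byval = true; align = 'i';
            break;
        default:
            // Unknown type: members stay zeroed in this sketch.
            break;
        }
    }
};

// Still a file-scope object, but its constructor touches no syscache.
static const hardcoded_type_info FLOAT8TI(FLOAT8OID);

// Usage check for the float8 entry.
void demoHardcodedTypeInfo() {
    assert(FLOAT8TI.len == 8);
    assert(FLOAT8TI.align == 'd');
}
```

The trade-off is that the values must be kept in sync with the catalog by hand, which is acceptable for built-in types whose length and alignment are fixed.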





> Postgres 10 support for MADlib with large tables
> ------------------------------------------------
>
>                 Key: MADLIB-1185
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1185
>             Project: Apache MADlib
>          Issue Type: Bug
>          Components: DB Abstraction Layer
>            Reporter: Nikhil
>             Fix For: v1.13
>
>
> Running MADlib on PostgreSQL 10 with a large dataset (98000 rows with a double 
> precision array column) causes the database to crash.
> Repro Steps
> {code}
> 1. create table foo (id integer, x double precision[], y integer);
> 2. Insert at least 1 million rows like these
>   id   |            x            | y
> -------+-------------------------+---
>  97440 | {1,0.2,0,1,0,1,0,0,0,0} | 1
> 3. Now running any C madlib UDF followed by a count(*) of foo will cause the 
> database to crash
> select madlib.poisson_random(1); select count(*) from foo;
> or
> select madlib.svec_plus('{1}:{5}', '{1}:{4}'); select count(*) from foo;
> {code}


