The stck that Dave sees has been showing up on official builds once in a while. There are 2 threads doing hdfs access and one of them seems to be accessing resources that the other has destructed. It used to show up often during hdfsReads but I had made a fix so it can never happen during the active reading phase. This needs to be investigated by looking at the code ot see if there is a window where this can happen during the destructor. Perhaps we need to let each thread finish it's work or something. But from a quick look at the ExLobGlobals destructor, it seems to be doing that today. So I am not sure...haven't had the chance to spend time on it yet but will do so in a few days. Thanks Sandhya
-----Original Message----- From: Roberta Marton [mailto:[email protected]] Sent: Thursday, April 6, 2017 9:31 AM To: [email protected] Subject: RE: Hang in hive/TEST007? This test failed for me in a slightly different way but maybe they are related. The hive constructor for NATable is not initializing the variable (prototype_) so once in a while the call to the destructor cores. I noticed that you were trying to call the destructor for a different object. If the NATable destructor actually did memory deletes for prototype_ on an invalid address perhaps it caused some kind of memory corruption. Later, when then destructor for ExLobGlobals is called, memory was invalid. Roberta -----Original Message----- From: Dave Birdsall [mailto:[email protected]] Sent: Thursday, April 6, 2017 8:50 AM To: [email protected] Subject: Hang in hive/TEST007? Hi, I was running Trafodion regressions on a workstation overnight and came in this morning to find them hung in test hive/TEST007, at the following statement: >>insert into trafodion.sch007.vhive1 values (3,4); --- 1 row(s) inserted. >>select * from thive1; A B ----------- ----------- 1 2 3 4 --- 2 row(s) selected. Notice the "--- 2 rows(s) selected" and then no sqlci prompt. Just stuck there. Checking dtmci, there were no transactions active or outstanding. I put the sqlci process into gdb, and looked at the stack trace: (gdb) bt #0 0x00007f1e9be5b2ad in pthread_join () from /lib64/libpthread.so.0 #1 0x00007f1e9e2c8c5b in ExLobGlobals::~ExLobGlobals (this=0x3c4e640, __in_chrg=<value optimized out>) at ../exp/ExpLOBaccess.cpp:3176 #2 0x00007f1e9e2c717f in ExLobsOper (lobName=0x7f1e9eb666d7 "dummy", handleIn= 0x0, handleInLen=0, hdfsServer=0x0, hdfsPort=0, handleOut=0x0, handleOutLen=@0x7ffc34b26334, descNumIn=0, descNumOut=@0x7ffc34b26340, retOperLen=@0x7ffc34b26340, requestTagIn=0, requestTagOut=@0x7ffc34b26340, requestStatus=@0x7ffc34b26348, cliError=@0x7ffc34b26338, lobStorageLocation=0x0, storage=Lob_HDFS_File, source=0x0, sourceLen=0, cursorBytes=0, cursorId=0x0, operation=Lob_Cleanup, subOperation=Lob_None, waited=1, globPtr=@0x7f1ea49a7de0, transId=0, blackBox=0x7f1ea4963ea0, blackBoxLen=0, lobMaxSize=0, lobMaxChunkMemSize=0, lobGCLimit=0, bufferSize=0, replication=0, blockSize=0, openType=0) at ../exp/ExpLOBaccess.cpp:2670 #3 0x00007f1e9e2cfdcf in ExpLOBinterfaceCleanup (exLobGlob=@0x7f1ea49a7de0, lobHeap=0x7f1ea4963ea0) at ../exp/ExpLOBinterface.cpp:201 #4 0x00007f1ea01480cf in ExHdfsScanTcb::freeResources (this=0x7f1ea49a7cd8) at ../executor/ExHdfsScan.cpp:283 #5 0x00007f1ea0147b21 in ExHdfsScanTcb::~ExHdfsScanTcb (this=0x7f1ea49a7cd8, __in_chrg=<value optimized out>) at ../executor/ExHdfsScan.cpp:224 #6 0x00007f1ea0147c50 in ExHdfsScanTcb::~ExHdfsScanTcb (this=0x7f1ea49a7cd8, __in_chrg=<value optimized out>) at ../executor/ExHdfsScan.cpp:225 #7 0x00007f1e9fefcce5 in ex_globals::cleanupTcbs (this=0x7f1ea4983b48) ---Type <return> to continue, or q <return> to quit--- at ../executor/ex_globals.cpp:199 #8 0x00007f1e9fefcac1 in ex_globals::deleteMe (this=0x7f1ea4983b48, fatalError=0) at ../executor/ex_globals.cpp:143 #9 0x00007f1e9fed9600 in ExExeStmtGlobals::deleteMe (this=0x7f1ea4983b48, fatalError=0) at ../executor/ex_exe_stmt_globals.cpp:321 #10 0x00007f1e9feda06d in ExMasterStmtGlobals::deleteMe (this=0x7f1ea4983b48, fatalError=0) at ../executor/ex_exe_stmt_globals.cpp:672 #11 0x00007f1e9ff41f42 in ex_root_tcb::deallocAndDelete (this=0x7f1ea49a8660, glob=0x7f1ea4983b48, fragTable=0x7f1ea49a7980) at ../executor/ex_root.cpp:2470 #12 0x00007f1ea1922345 in CliStatement::releaseTcbs (this=0x7f1ea4963de0, closeAllOpens=0) at ../cli/Statement.cpp:4372 #13 0x00007f1ea1922506 in CliStatement::dealloc (this=0x7f1ea4963de0, closeAllOpens=0) at ../cli/Statement.cpp:4420 #14 0x00007f1ea1917caa in CliStatement::~CliStatement (this=0x7f1ea4963de0, __in_chrg=<value optimized out>) at ../cli/Statement.cpp:575 #15 0x00007f1ea19187bc in CliStatement::~CliStatement (this=0x7f1ea4963de0, __in_chrg=<value optimized out>) at ../cli/Statement.cpp:745 #16 0x00007f1ea18cad3e in ContextCli::deallocStmt (this=0x7f1ea49c2120, statement_id=0x3a14d70, deallocStaticStmt=0) at ../cli/Context.cpp:2348 #17 0x00007f1ea1890c89 in SQLCLI_DeallocStmt (cliGlobals=0x296bfe0, statement_id=0x3a14d70) at ../cli/Cli.cpp:1660 #18 0x00007f1ea1935309 in SQL_EXEC_DeallocStmt (statement_id=0x3a14d70) ---Type <return> to continue, or q <return> to quit--- at ../cli/CliExtern.cpp:1832 #19 0x00007f1ea42622c9 in SqlCmd::deallocate (sqlci_env=0x2954a10, prep_stmt= 0x3bd67b0) at ../sqlci/SqlCmd.cpp:2854 #20 0x00007f1ea426283c in DML::process (this=0x489e460, sqlci_env=0x2954a10) at ../sqlci/SqlCmd.cpp:2988 #21 0x00007f1ea424300c in Obey::process (this=0x623ca00, sqlci_env=0x2954a10) at ../sqlci/Obey.cpp:267 #22 0x00007f1ea424300c in Obey::process (this=0x2939090, sqlci_env=0x2954a10) at ../sqlci/Obey.cpp:267 #23 0x00007f1ea424bb43 in SqlciEnv::run (this=0x2954a10, in_filename= 0x7ffc34b291f3 "TEST007.tmp", input_string=0x0) at ../sqlci/SqlciEnv.cpp:720 #24 0x000000000040272b in main (argc=2, argv=0x7ffc34b26f38) at ../bin/SqlciMain.cpp:329 (gdb) So, we are waiting for some other thread to complete which never completes. Looking at the threads in the process I see: (gdb) info thread 37 Thread 0x7f1e93799700 (LWP 9140) 0x00007f1ea1c964fd in accept () from /lib64/libc.so.6 36 Thread 0x7f1e92d98700 (LWP 9141) 0x00007f1e9be5e63c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 35 Thread 0x7f1e9046c700 (LWP 9142) 0x00007f1ea1be071d in sigtimedwait () from /lib64/libc.so.6 34 Thread 0x7f1e8f96a700 (LWP 9157) 0x00007f1ea1c95f33 in epoll_wait () from /lib64/libc.so.6 33 Thread 0x7f1e8b5e7700 (LWP 9168) 0x00007f1e9be5e63c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 32 Thread 0x7f1e8b4e6700 (LWP 9169) 0x00007f1e9be5e63c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 31 Thread 0x7f1e8b3e5700 (LWP 9170) 0x00007f1e9be5e63c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 30 Thread 0x7f1e8b2e4700 (LWP 9171) 0x00007f1e9be5e63c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 29 Thread 0x7f1e8b1e3700 (LWP 9172) 0x00007f1e9be5e63c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 28 Thread 0x7f1e8b0e2700 (LWP 9173) 0x00007f1e9be5e63c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 27 Thread 0x7f1e8afe1700 (LWP 9174) 0x00007f1e9be5e63c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 26 Thread 0x7f1e8aee0700 (LWP 9175) 0x00007f1e9be5e63c in pthread_cond_wait@@---Type <return> to continue, or q <return> to quit--- GLIBC_2.3.2 () from /lib64/libpthread.so.0 25 Thread 0x7f1e87824700 (LWP 9177) 0x00007f1e9be5ea0e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 24 Thread 0x7f1e87723700 (LWP 9180) 0x00007f1e9be5e63c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 23 Thread 0x7f1e87622700 (LWP 9182) 0x00007f1e9be5e63c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 22 Thread 0x7f1e81690700 (LWP 9183) 0x00007f1e9be609b0 in sem_wait () from /lib64/libpthread.so.0 21 Thread 0x7f1e8158f700 (LWP 9184) 0x00007f1e9be5e63c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 20 Thread 0x7f1e8148e700 (LWP 9193) 0x00007f1e9be5e63c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 19 Thread 0x7f1e8138d700 (LWP 9195) 0x00007f1e9be5e63c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 18 Thread 0x7f1e8128c700 (LWP 9196) 0x00007f1e9be5ea0e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 17 Thread 0x7f1e7e3ba700 (LWP 9247) 0x00007f1ea1c95f33 in epoll_wait () from /lib64/libc.so.6 16 Thread 0x7f1e7e0b8700 (LWP 9248) 0x00007f1e9be5e63c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 15 Thread 0x7f1e7dfb7700 (LWP 9301) 0x00007f1ea1c8c113 in poll () from /lib64/libc.so.6 ---Type <return> to continue, or q <return> to quit--- 14 Thread 0x7f1e7a712700 (LWP 10456) 0x00007f1e9be5ea0e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 13 Thread 0x7f1e7a20f700 (LWP 10528) 0x00007f1e9be5e63c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 12 Thread 0x7f1e7a10e700 (LWP 10531) 0x00007f1e9be5ea0e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 11 Thread 0x7f1e7a00d700 (LWP 10532) 0x00007f1e9be5e63c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 10 Thread 0x7f1e79f0c700 (LWP 10722) 0x00007f1e9be5e63c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 9 Thread 0x7f1e79e0b700 (LWP 10723) 0x00007f1e9be5ea0e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 8 Thread 0x7f1e79d0a700 (LWP 10724) 0x00007f1e9be5e63c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 7 Thread 0x7f1e7ad18700 (LWP 14325) 0x00007f1ea1c95f33 in epoll_wait () from /lib64/libc.so.6 6 Thread 0x7f1e7b01b700 (LWP 14326) 0x00007f1e9be5e63c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 5 Thread 0x7f1e7b21d700 (LWP 14328) 0x00007f1e9be5ea0e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 4 Thread 0x7f1e7b31e700 (LWP 14329) 0x00007f1e9be5ea0e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 3 Thread 0x7f1e7ba26700 (LWP 14336) 0x00007f1e9be5ea0e in pthread_cond_timedw---Type <return> to continue, or q <return> to quit--- ait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 2 Thread 0x7f1e7cea0700 (LWP 16913) 0x00007f1e9be612e4 in __lll_lock_wait () from /lib64/libpthread.so.0 * 1 Thread 0x7f1ea4aa42c0 (LWP 9113) 0x00007f1e9be5b2ad in pthread_join () from /lib64/libpthread.so.0 (gdb) Everything is waiting. I'm not sure what thread is interesting to look at but thread 2 is different. Looking at that one: (gdb) thread 2 [Switching to thread 2 (Thread 0x7f1e7cea0700 (LWP 16913))]#0 0x00007f1e9be612e4 in __lll_lock_wait () from /lib64/libpthread.so.0 (gdb) bt #0 0x00007f1e9be612e4 in __lll_lock_wait () from /lib64/libpthread.so.0 #1 0x00007f1e9be5c588 in _L_lock_854 () from /lib64/libpthread.so.0 #2 0x00007f1e9be5c457 in pthread_mutex_lock () from /lib64/libpthread.so.0 #3 0x00007f1e9e2c913c in ExLobLock::lock (this=0x3c52480) at ../exp/ExpLOBaccess.cpp:3281 #4 0x00007f1e9e2c80e6 in ExLobGlobals::performRequest (this=0x3c4e640, request=0x3c53a90) at ../exp/ExpLOBaccess.cpp:2963 #5 0x00007f1e9e2c96fb in ExLobGlobals::doWorkInThread (this=0x3c4e640) at ../exp/ExpLOBaccess.cpp:3441 #6 0x00007f1e9e2c8fc9 in workerThreadMain (arg=0x3c4e640) at ../exp/ExpLOBaccess.cpp:3235 #7 0x00007f1e9be5aa51 in start_thread () from /lib64/libpthread.so.0 #8 0x00007f1ea1c9593d in clone () from /lib64/libc.so.6 At this point I killed the process and let the tests continue. Is this a known bug? Thanks, Dave
