I logged this issue a few days ago; is this OK? https://issues.apache.org/jira/browse/TRAFODION-1492
Will try with one user and let you know.

On Fri, Sep 18, 2015 at 7:05 AM, Suresh Subbiah <[email protected]> wrote:

> Hi
>
> How many virtual users are being used? If it is more than one, could we please try the case with 1 user first?
>
> When the crash happens next time, could we please try
>
> sqps | grep esp | wc -l
>
> If this number is large, we know a lot of esp processes are being started, which could consume memory. If this is the case, please insert this row into the defaults table from sqlci and then restart dcs (dcsstop followed by dcsstart):
>
> insert into "_MD_".defaults values('ATTEMPT_ESP_PARALLELISM', 'OFF', 'hammerdb testing') ;
> exit ;
>
> I will work on having the udr process create a JVM with a smaller initial heap size. If you have time and would like to do so, a JIRA you file will be helpful. Or I can file the JIRA and work on it. It will not take long to make this change.
>
> Thanks
> Suresh
>
> PS I found this command on Stack Overflow to determine the initialHeapSize we get by default in this env:
>
> java -XX:+PrintFlagsFinal -version | grep HeapSize
>
> http://stackoverflow.com/questions/4667483/how-is-the-default-java-heap-size-determined
>
> On Thu, Sep 17, 2015 at 10:32 AM, Radu Marias <[email protected]> wrote:
>
> > Did the steps mentioned above to ensure that the trafodion processes are free of the JAVA installation mixup. Also changed things so that hdp, trafodion and hammerdb use the same JDK from */usr/jdk64/jdk1.7.0_67*:
> >
> > # java -version
> > java version "1.7.0_67"
> > Java(TM) SE Runtime Environment (build 1.7.0_67-b01)
> > Java HotSpot(TM) 64-Bit Server VM (build 24.65-b04, mixed mode)
> >
> > # echo $JAVA_HOME
> > /usr/jdk64/jdk1.7.0_67
> >
> > But when running hammerdb I got a crash again on 2 nodes. I noticed that for about one minute before the crash I was getting errors from *java -version*, and about 30 seconds after the crash java -version worked again.
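(On the PS above about default heap sizes: HotSpot derives them from physical memory; on server-class machines the initial heap is typically about 1/64 of RAM and the maximum about 1/4, though the exact ergonomics vary by JVM version. A rough sketch of that rule of thumb, with the fractions and the 8 GiB node size as assumptions, not measurements from this cluster:)

```python
def default_heap_sizes(phys_bytes):
    # Rough HotSpot server-class ergonomics (JVM-version dependent):
    # initial heap ~ phys/64, max heap ~ phys/4.
    return phys_bytes // 64, phys_bytes // 4

if __name__ == "__main__":
    gib = 1024 ** 3
    init, mx = default_heap_sizes(8 * gib)  # assume an 8 GiB node for illustration
    print("InitialHeapSize ~%d MiB, MaxHeapSize ~%d MiB" % (init // 2**20, mx // 2**20))
```

The real values on a given node are what `java -XX:+PrintFlagsFinal -version | grep HeapSize` reports; the sketch just shows why many tdm_udrserv processes each trying to reserve a quarter-of-RAM heap can exhaust memory.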
> > So these issues might be related. I haven't yet found the cause of the java -version issue or how to fix it.
> >
> > # java -version
> > Error occurred during initialization of VM
> > Could not reserve enough space for object heap
> > Error: Could not create the Java Virtual Machine.
> > Error: A fatal exception has occurred. Program will exit
> >
> > # file core.5813
> > core.5813: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from 'tdm_udrserv SQMON1.1 00000 00000 005813 $Z0004R3 188.138.61.175:48357 00004 000'
> >
> > #0 0x00007f6920ba0625 in raise () from /lib64/libc.so.6
> > #1 0x00007f6920ba1e05 in abort () from /lib64/libc.so.6
> > #2 0x0000000000424369 in comTFDS (msg1=0x43c070 "Trafodion UDR Server Internal Error", msg2=<value optimized out>, msg3=0x7fff119787f0 "Source file information unavailable", msg4=0x7fff11977ff0 "User routine being processed : TRAFODION.TPCC.NEWORDER, Routine Type : Stored Procedure, Language Type : JAVA, Error occurred outside the user routine code", msg5=0x43ddc3 "", dialOut=<value optimized out>, writeToSeaLog=1) at ../udrserv/UdrFFDC.cpp:191
> > #3 0x00000000004245d7 in makeTFDSCall (msg=0x7f692324b310 "The Java virtual machine aborted", file=<value optimized out>, line=<value optimized out>, dialOut=1, writeToSeaLog=1) at ../udrserv/UdrFFDC.cpp:219
> > #4 0x00007f69232316b8 in LmJavaHooks::abortHookJVM () at ../langman/LmJavaHooks.cpp:54
> > #5 0x00007f69229cbbc6 in ParallelScavengeHeap::initialize() () from /usr/jdk64/jdk1.7.0_67/jre/lib/amd64/server/libjvm.so
> > #6 0x00007f6922afedba in Universe::initialize_heap() () from /usr/jdk64/jdk1.7.0_67/jre/lib/amd64/server/libjvm.so
> > #7 0x00007f6922afff89 in universe_init() () from /usr/jdk64/jdk1.7.0_67/jre/lib/amd64/server/libjvm.so
> > #8 0x00007f692273d9f5 in init_globals() () from /usr/jdk64/jdk1.7.0_67/jre/lib/amd64/server/libjvm.so
> > #9 0x00007f6922ae78ed in Threads::create_vm(JavaVMInitArgs*, bool*) () from /usr/jdk64/jdk1.7.0_67/jre/lib/amd64/server/libjvm.so
> > #10 0x00007f69227c5a34 in JNI_CreateJavaVM () from /usr/jdk64/jdk1.7.0_67/jre/lib/amd64/server/libjvm.so
> > #11 0x00007f692322de51 in LmLanguageManagerJava::initialize (this=<value optimized out>, result=<value optimized out>, maxLMJava=<value optimized out>, userOptions=0x7f69239ba418, diagsArea=<value optimized out>) at ../langman/LmLangManagerJava.cpp:379
> > #12 0x00007f692322f564 in LmLanguageManagerJava::LmLanguageManagerJava (this=0x7f69239bec38, result=@0x7fff1197e19c, commandLineMode=<value optimized out>, maxLMJava=1, userOptions=0x7f69239ba418, diagsArea=0x7f6923991780) at ../langman/LmLangManagerJava.cpp:155
> > #13 0x0000000000425619 in UdrGlobals::getOrCreateJavaLM (this=0x7f69239ba040, result=@0x7fff1197e19c, diags=<value optimized out>) at ../udrserv/udrglobals.cpp:322
> > #14 0x0000000000427328 in processALoadMessage (UdrGlob=0x7f69239ba040, msgStream=..., request=..., env=<value optimized out>) at ../udrserv/udrload.cpp:163
> > #15 0x000000000042fbfd in processARequest (UdrGlob=0x7f69239ba040, msgStream=..., env=...) at ../udrserv/udrserv.cpp:660
> > #16 0x000000000043269c in runServer (argc=2, argv=0x7fff1197e528) at ../udrserv/udrserv.cpp:520
> > #17 0x000000000043294e in main (argc=2, argv=0x7fff1197e528) at ../udrserv/udrserv.cpp:356
> >
> > On Wed, Sep 16, 2015 at 6:03 PM, Suresh Subbiah <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > I have added a wiki page that describes how to get a stack trace from a core file. The page could do with some improvements on finding the core file and maybe even doing more than getting the stack trace. For now it should make our troubleshooting cycle faster if the stack trace is included in the initial message itself.
> > > https://cwiki.apache.org/confluence/display/TRAFODION/Obtain+stack+trace+from+a+core+file
> > >
> > > In this case, the last node does not seem to have gdb, so I could not see the trace there. I moved the core file to the first node, but then the trace looks like this. I assume this is because I moved the core file to a different node. I think Selva's suggestion is good to try. We may have had a few tdm_udrserv processes from before the time the java change was made.
> > >
> > > $ gdb tdm_udrserv core.49256
> > > #0 0x00007fe187a674fe in __longjmp () from /lib64/libc.so.6
> > > #1 0x8857780a58ff2155 in ?? ()
> > > Cannot access memory at address 0x8857780a58ff2155
> > >
> > > The back trace we saw yesterday, when a udrserv process exited because the JVM could not be started, is used on the wiki page instead of this one. If you have time, a JIRA on this unexpected udrserv exit will also be valuable for the Trafodion team.
> > >
> > > Thanks
> > > Suresh
> > >
> > > On Wed, Sep 16, 2015 at 8:39 AM, Selva Govindarajan <[email protected]> wrote:
> > >
> > > > Thanks for creating the JIRA Trafodion-1492. The error is similar to scenario 2. The process tdm_udrserv dumped core. We will look into the core file. In the meantime, can you please do the following:
> > > >
> > > > Bring the Trafodion instance down
> > > > echo $MY_SQROOT -- shows the Trafodion installation directory
> > > > Remove $MY_SQROOT/etc/ms.env from all nodes
> > > >
> > > > Start a new terminal session so that the new Java settings are in place
> > > > Login as the Trafodion user
> > > > cd <trafodion_installation_directory>
> > > > . ./sqenv.sh (skip this if it is done automatically upon logon)
> > > > sqgen
> > > >
> > > > Exit and start a new terminal session
> > > > Restart the Trafodion instance and check if you are seeing the issue with tdm_udrserv again. In your earlier message we wanted to ensure that the Trafodion processes are free of the JAVA installation mixup. We suspect that can cause the tdm_udrserv process to dump core.
> > > >
> > > > Selva
> > > >
> > > > -----Original Message-----
> > > > From: Radu Marias [mailto:[email protected]]
> > > > Sent: Wednesday, September 16, 2015 5:40 AM
> > > > To: dev <[email protected]>
> > > > Subject: Re: odbc and/or hammerdb logs
> > > >
> > > > I'm seeing this in the hammerdb logs; I assume it is due to the crash and some processes being stopped:
> > > >
> > > > Error in Virtual User 1: [Trafodion ODBC Driver][Trafodion Database] SQL ERROR:*** ERROR[2034] $Z0106BZ:16: Operating system error 201 while communicating with server process $Z010LPE:23. [2015-09-16 12:35:33] [Trafodion ODBC Driver][Trafodion Database] SQL ERROR:*** ERROR[8904] SQL did not receive a reply from MXUDR, possibly caused by internal errors when executing user-defined routines. [2015-09-16 12:35:33]
> > > >
> > > > $ sqcheck
> > > > Checking if processes are up.
> > > > Checking attempt: 1; user specified max: 2. Execution time in seconds: 0.
> > > >
> > > > The SQ environment is up!
> > > >
> > > > Process    Configured  Actual  Down
> > > > -------    ----------  ------  ----
> > > > DTM        5           5
> > > > RMS        10          10
> > > > MXOSRVR    20          20
> > > >
> > > > On Wed, Sep 16, 2015 at 3:28 PM, Radu Marias <[email protected]> wrote:
> > > >
> > > > > I've restarted hdp and trafodion, and now I managed to create the schema and stored procedures from hammerdb. But I'm getting failures, and trafodion dumps core again while running virtual users.
> > > > > For some of the users I sometimes see in the hammerdb logs:
> > > > >
> > > > > Vuser 5:Failed to execute payment
> > > > > Vuser 5:Failed to execute stock level
> > > > > Vuser 5:Failed to execute new order
> > > > >
> > > > > Core files are on our last node; feel free to examine them. The files were dumped while getting the hammerdb errors:
> > > > >
> > > > > *core.49256*
> > > > > *core.48633*
> > > > > *core.49290*
> > > > >
> > > > > On Wed, Sep 16, 2015 at 3:24 PM, Radu Marias <[email protected]> wrote:
> > > > >
> > > > > > *Scenario 1:*
> > > > > >
> > > > > > I've created this issue: https://issues.apache.org/jira/browse/TRAFODION-1492
> > > > > > I think another fix was made related to *Committed_AS* in *sql/cli/memmonitor.cpp*.
> > > > > >
> > > > > > This is a response from Narendra in a previous thread where the issue was fixed so that Trafodion would start:
> > > > > >
> > > > > > > *I updated the code: sql/cli/memmonitor.cpp, so that if /proc/meminfo does not have the ‘Committed_AS’ entry, it will ignore it. Built it and put the binary: libcli.so on the veracity box (in the $MY_SQROOT/export/lib64 directory – on all the nodes). Restarted the env and ‘sqlci’ worked fine. Was able to ‘initialize trafodion’ and create a table.*
> > > > > >
> > > > > > *Scenario 2:*
> > > > > >
> > > > > > The *java -version* problem I recall we had only on the other cluster with CentOS 7; I didn't see it on this one with CentOS 6.7. But a change I made these days on the latter one was installing Oracle *jdk 1.7.0_79* as the default, which is where *JAVA_HOME* points to. Before that, some nodes had *open-jdk* as the default and others didn't have a default at all, only the one installed by path by *ambari* in */usr/jdk64/jdk1.7.0_67*, which was not linked to JAVA_HOME or the *java* command by *alternatives*.
> > > > > >
> > > > > > *Failures in HammerDB:*
> > > > > >
> > > > > > Attached is the *trafodion.dtm.log* from a node on which I see a lot of lines like these; I assume this is the *transaction conflict* that you mentioned. I see these lines on 4 out of 5 nodes:
> > > > > >
> > > > > > 2015-09-14 12:21:49,413 INFO dtm.HBaseTxClient: useForgotten is true
> > > > > > 2015-09-14 12:21:49,414 INFO dtm.HBaseTxClient: forceForgotten is false
> > > > > > 2015-09-14 12:21:49,446 INFO dtm.TmAuditTlog: forceControlPoint is false
> > > > > > 2015-09-14 12:21:49,446 INFO dtm.TmAuditTlog: useAutoFlush is false
> > > > > > 2015-09-14 12:21:49,447 INFO dtm.TmAuditTlog: ageCommitted is false
> > > > > > 2015-09-14 12:21:49,447 INFO dtm.TmAuditTlog: disableBlockCache is false
> > > > > > 2015-09-14 12:21:52,229 INFO dtm.HBaseAuditControlPoint: disableBlockCache is false
> > > > > > 2015-09-14 12:21:52,233 INFO dtm.HBaseAuditControlPoint: useAutoFlush is false
> > > > > > 2015-09-14 12:42:57,346 INFO dtm.HBaseTxClient: Exit RET_HASCONFLICT prepareCommit, txid: 17179989222
> > > > > > 2015-09-14 12:43:46,102 INFO dtm.HBaseTxClient: Exit RET_HASCONFLICT prepareCommit, txid: 17179989277
> > > > > > 2015-09-14 12:44:11,598 INFO dtm.HBaseTxClient: Exit RET_HASCONFLICT prepareCommit, txid: 17179989309
> > > > > >
> > > > > > What does *transaction conflict* mean in this case?
> > > > > >
> > > > > > On Wed, Sep 16, 2015 at 2:43 AM, Selva Govindarajan <[email protected]> wrote:
> > > > > >
> > > > > > > Hi Radu,
> > > > > > >
> > > > > > > Thanks for using Trafodion.
> > > > > > > With help from Suresh, we looked at the core files in your cluster. We believe that there are two scenarios causing the Trafodion processes to dump core.
> > > > > > >
> > > > > > > Scenario 1:
> > > > > > > Core dumped by tdm_arkesp processes. The Trafodion engine has assumed the entity /proc/meminfo/Committed_AS is available in all flavors of Linux. The absence of this entity is not handled correctly by the Trafodion tdm_arkesp process, and hence it dumped core. Please file a JIRA using this link: https://issues.apache.org/jira/secure/CreateIssue!default.jspa and choose "Apache Trafodion" as the project to report a bug against.
> > > > > > >
> > > > > > > Scenario 2:
> > > > > > > Core dumped by tdm_udrserv processes. From our analysis, this problem happened when the process attempted to create the JVM instance programmatically. A few days earlier, we observed a similar issue in your cluster when the java -version command was attempted. But java -version or $JAVA_HOME/bin/java -version works fine now. Was there any change made to the cluster recently to avoid the problem with the java -version command?
> > > > > > >
> > > > > > > Could you please delete all the core files in the sql/scripts directory and issue the command to invoke the SPJ, and check if it still dumps core. We can look at the core file if it happens again. Your solution for the java -version problem would also be helpful.
> > > > > > >
> > > > > > > For the failures with HammerDB, can you please send us the exact error message returned by the Trafodion engine to the application. This might help us narrow down the cause. You can also look at $MY_SQROOT/logs/trafodion.dtm.log to check if any transaction conflict is causing this error.
> > > > > > >
> > > > > > > Selva
> > > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Radu Marias [mailto:[email protected]]
> > > > > > > Sent: Tuesday, September 15, 2015 9:09 AM
> > > > > > > To: dev <[email protected]>
> > > > > > > Subject: Re: odbc and/or hammerdb logs
> > > > > > >
> > > > > > > Also noticed there are several core.* files from today in */home/trafodion/trafodion-20150828_0830/sql/scripts*. If needed, please provide a gmail address so I can share them via gdrive.
> > > > > > >
> > > > > > > On Tue, Sep 15, 2015 at 6:29 PM, Radu Marias <[email protected]> wrote:
> > > > > > >
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > I'm running HammerDB over trafodion, and when running virtual users I sometimes get errors like this in the hammerdb logs:
> > > > > > > >
> > > > > > > > *Vuser 1:Failed to execute payment*
> > > > > > > > *Vuser 1:Failed to execute new order*
> > > > > > > >
> > > > > > > > I'm using unixODBC and I tried to add these lines in */etc/odbc.ini*, but the trace file is not created:
> > > > > > > >
> > > > > > > > *[ODBC]*
> > > > > > > > *Trace = 1*
> > > > > > > > *TraceFile = /var/log/odbc_tracefile.log*
> > > > > > > >
> > > > > > > > Also tried with *Trace = yes* and *Trace = on*; I've found multiple references for both.
> > > > > > > >
> > > > > > > > How can I see more logs to debug the issue? Can I enable logs for all queries in trafodion?
> > > > > > > >
> > > > > > > > --
> > > > > > > > And in the end, it's not the years in your life that count. It's the life in your years.
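(The Scenario 1 behavior boils down to parsing /proc/meminfo defensively. A minimal sketch of the idea; the helper name is illustrative, not the actual memmonitor.cpp code, and the point is simply that a missing Committed_AS line must yield "unavailable" rather than a crash:)

```python
def committed_as_kb(meminfo_text):
    # Return the Committed_AS value (in kB) from /proc/meminfo content,
    # or None when the kernel does not expose the field -- the case
    # that made tdm_arkesp dump core before the fix.
    for line in meminfo_text.splitlines():
        if line.startswith("Committed_AS:"):
            return int(line.split()[1])
    return None  # treat as "metric unavailable" instead of crashing

if __name__ == "__main__":
    sample = "MemTotal:  16332928 kB\nCommitted_AS:  1234567 kB\n"
    print(committed_as_kb(sample))
```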
--
And in the end, it's not the years in your life that count. It's the life in your years.
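(On the unixODBC tracing question earlier in the thread: one likely reason the trace file was never created is that unixODBC reads the global [ODBC] tracing section from odbcinst.ini rather than odbc.ini. A fragment along these lines, with paths illustrative, usually enables driver-manager tracing:)

```ini
; /etc/odbcinst.ini -- unixODBC reads global tracing options from here
[ODBC]
Trace     = Yes
TraceFile = /tmp/odbc_tracefile.log
```

The trace file path must be writable by the process making the ODBC calls, and tracing should be turned off again afterwards since it slows everything down.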

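(The stack-trace procedure from the wiki page above can be run non-interactively: gdb's --batch mode with -ex 'bt' prints the backtrace and exits. A small helper sketch; the binary and core file names are just the ones from this thread:)

```python
import shlex
import subprocess

def gdb_backtrace_cmd(binary, corefile):
    # Build a non-interactive gdb invocation that prints the stack
    # trace from a core file, as described on the Trafodion wiki page.
    return ["gdb", "--batch", "-ex", "bt", binary, corefile]

if __name__ == "__main__":
    cmd = gdb_backtrace_cmd("tdm_udrserv", "core.49256")
    print(" ".join(shlex.quote(c) for c in cmd))
    # To actually run it (requires gdb on the node):
    # subprocess.run(cmd, check=False)
```

Running this on the node where the core was produced avoids the garbage frames seen when a core file is moved to a node with different libraries.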