Would it be a big problem to apply a modulus of 65535 to the pid, to avoid this without moving to a hash table and taking the performance hit?

Eric

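To make the trade-off in this thread concrete, here is a minimal sketch of the three options (direct array index, modulus, hash table). It is written in Python for brevity, whereas the Trafodion code is C++. Every class name and the 65536-slot capacity are assumptions for illustration only; this is not the actual SqlStats implementation. One thing it shows is that a plain modulus lets two live pids land on the same slot (65536 % 65535 == 1), so each slot would also need to record which pid owns it.

```python
# Illustrative sketch only; names and sizes are hypothetical, not Trafodion's.

ARRAY_SLOTS = 65536  # assumed capacity; the thread only says "pids more than 65K" fail


class ArrayRegistry:
    """Fixed-size table indexed directly by pid: fast, but bounded."""

    def __init__(self, slots=ARRAY_SLOTS):
        self._slots = [None] * slots

    def add(self, pid, stats):
        # Without this check, a large pid would index past the end of the table.
        if not 0 <= pid < len(self._slots):
            raise ValueError(f"pid {pid} does not fit in {len(self._slots)} slots")
        self._slots[pid] = stats

    def get(self, pid):
        if not 0 <= pid < len(self._slots):
            return None
        return self._slots[pid]


class ModulusRegistry:
    """Same table, but indexed by pid % modulus; distinct pids can collide."""

    def __init__(self, modulus=65535):
        self._modulus = modulus
        self._slots = [None] * modulus

    def add(self, pid, stats):
        # Store the owning pid so a colliding lookup can be detected.
        self._slots[pid % self._modulus] = (pid, stats)

    def get(self, pid):
        entry = self._slots[pid % self._modulus]
        if entry is None or entry[0] != pid:
            return None
        return entry[1]


class HashRegistry:
    """Hash table keyed by pid: no upper bound on pid, no collisions visible to callers."""

    def __init__(self):
        self._by_pid = {}

    def add(self, pid, stats):
        self._by_pid[pid] = stats

    def get(self, pid):
        return self._by_pid.get(pid)


if __name__ == "__main__":
    registry = ModulusRegistry()
    registry.add(1, "stats for pid 1")
    registry.add(65536, "stats for pid 65536")  # 65536 % 65535 == 1: same slot
    print(registry.get(1))      # the entry for pid 1 has been overwritten
    print(registry.get(65536))
```

In this sketch the modulus variant silently loses the entry for pid 1 once pid 65536 registers, which is the kind of case a real fix would have to handle (for example by chaining or probing, at which point it is effectively a hash table).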
-----Original Message-----
From: Selva Govindarajan [mailto:[email protected]]
Sent: Tuesday, September 8, 2015 12:27 PM
To: [email protected]
Cc: Lijian (Q) <[email protected]>
Subject: RE: [Urgent Help] Trafodion Build Environment Problem

The whole Trafodion stack may not have been tested for pids greater than 65K. However, problems with pids greater than 65K will be observed first by the mxssmp or mxsscp processes, which dump core. These processes provide the capability to troubleshoot problems with query execution in the Trafodion infrastructure by providing real-time execution statistics. Every Trafodion SQL process is registered when it makes Trafodion SQL CLI calls and unregisters itself when it goes away. Internally, we use an array for this purpose, for performance reasons.

Selva

-----Original Message-----
From: Qifan Chen [mailto:[email protected]]
Sent: Tuesday, September 8, 2015 10:03 AM
To: dev <[email protected]>
Cc: Lijian (Q) <[email protected]>
Subject: Re: [Urgent Help] Trafodion Build Environment Problem

For pids larger than 65K, we probably can use a hash table.

Thanks --Qifan

On Tue, Sep 8, 2015 at 11:27 AM, Hans Zeller <[email protected]> wrote:

> Hi Nieyuanyuan,
>
> Some of us are also working on running Trafodion in a sandbox or on
> Apache objects. We hope to have documented steps on how to do that
> eventually. You mention you had to fix several things. If you have
> notes on what those are, would you share them?
>
> Thank you,
>
> Hans
>
> On Tue, Sep 8, 2015 at 9:19 AM, Amanda Moran <[email protected]> wrote:
>
> > Hi there-
> >
> > This is fixed in the latest version of the installer.
> >
> > Thanks.
> >
> > Sent from my iPhone
> >
> > > On Sep 8, 2015, at 9:07 AM, Dave Birdsall <[email protected]> wrote:
> > >
> > > Hi,
> > >
> > > I'm wondering if this should be reported as a problem? Perhaps Nieyuanyuan
> > > would like to open a JIRA about supporting higher PID numbers in Trafodion?
> > >
> > > Dave
> > >
> > > -----Original Message-----
> > > From: Narendra Goyal [mailto:[email protected]]
> > > Sent: Monday, September 7, 2015 7:04 PM
> > > To: [email protected]
> > > Cc: Lijian (Q) <[email protected]>
> > > Subject: RE: [Urgent Help] Trafodion Build Environment Problem
> > >
> > > Hi Nieyuanyuan,
> > >
> > > Could you please check the 'pid_max' settings:
> > > sysctl -q kernel.pid_max
> > > (or cat /proc/sys/kernel/pid_max)
> > >
> > > If the value is > 64K, I would recommend you set it to 64K, like so:
> > > sudo sysctl -w kernel.pid_max=65535
> > >
> > > You will have to restart Trafodion and other Hadoop/HBase processes:
> > > swstopall
> > > ckillall
> > > swstartall
> > > sqstart
> > >
> > > Just fyi, to check the list of Trafodion processes only, please run 'cstat'
> > > on your bash.
> > >
> > > Thanks,
> > > -Narendra
> > >
> > >
> > > -----Original Message-----
> > > From: Nieyuanyuan [mailto:[email protected]]
> > > Sent: Monday, September 7, 2015 6:40 PM
> > > To: [email protected]
> > > Cc: Lijian (Q) <[email protected]>
> > > Subject: [Urgent Help] Trafodion Build Environment Problem
> > >
> > > Dear Guys,
> > >
> > > I recently downloaded trafodion 1.1 from
> > > https://github.com/apache/incubator-trafodion/tree/stable/1.1, and followed
> > > the build guide from
> > > https://wiki.trafodion.org/wiki/index.php/Building_the_Software. After
> > > solving a lot of problems (no need to list all the details), I am able to
> > > run trafodion over a hadoop sandbox environment.
> > >
> > > But I have a serious problem: all Trafodion-related processes go down
> > > after several minutes (not sure how long), and only a few of them are
> > > left:
> > > [nieyy@redhat-72 ~]$ ps ux
> > > USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
> > > nieyy 76554 0.1 0.1 590988 139768 pts/6 Sl 19:14 0:04 /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.45.x86_64/bin/java -XX:OnOutOfMemoryError=kill -9 %p -Xmx128m
> > > nieyy 118833 0.7 0.3 1535452 420996 ? Sl 19:40 0:12 /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.45.x86_64/bin/java -Dproc_namenode -Xmx1000m -Djava.net.prefe
> > > nieyy 119085 0.6 0.2 1572688 367388 ? Sl 19:40 0:10 /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.45.x86_64/bin/java -Dproc_datanode -Xmx1000m -Djava.net.prefe
> > > nieyy 119320 0.4 0.2 1512656 340636 ? Sl 19:41 0:07 /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.45.x86_64/bin/java -Dproc_secondarynamenode -Xmx1000m -Djava.
> > > nieyy 119972 1.2 0.2 1708408 378536 pts/6 Sl 19:41 0:20 /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.45.x86_64/bin/java -Dproc_resourcemanager -Xmx1000m -Dhadoop.
> > > nieyy 120133 0.9 0.2 1616388 309976 ? Sl 19:41 0:16 /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.45.x86_64/bin/java -Dproc_nodemanager -Xmx1000m -Dhadoop.log.
> > > nieyy 120371 0.0 0.0 9824 1772 pts/6 S 19:41 0:00 /bin/sh ./bin/mysqld_safe --defaults-file=/home/nieyy/trafodion_build/incubator-trafodion-stable-1.
> > > nieyy 120594 0.0 0.0 452604 89908 pts/6 Sl 19:41 0:01 /home/nieyy/trafodion_build/incubator-trafodion-stable-1.1/core/sqf/sql/local_hadoop/mysql/bin/mysq
> > > nieyy 120789 0.0 0.0 9692 1736 pts/6 S 19:41 0:00 bash /home/nieyy/trafodion_build/incubator-trafodion-stable-1.1/core/sqf/sql/local_hadoop/hbase/bin
> > > nieyy 120806 2.0 0.3 1809048 509164 pts/6 Sl 19:41 0:34 /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.45.x86_64/bin/java -Dproc_master -XX:OnOutOfMemoryError=kill
> > > nieyy 122554 0.0 0.0 13624 1304 pts/6 S 19:41 0:00 mpirun -disable-auto-cleanup -demux select -env SQ_IC TCP -env MPI_ERROR_LEVEL 2 -env SQ_PIDMAP 1 -
> > > nieyy 122555 0.0 0.0 0 0 ? Zs 19:41 0:00 [hydra_pmi_proxy] <defunct>
> > > nieyy 122556 1.0 0.0 335212 36748 ? Ssl 19:41 0:17 /home/nieyy/trafodion_build/incubator-trafodion-stable-1.1/core/sqf/export/bin64d/monitor COLD
> > > nieyy 122557 0.8 0.0 335212 36768 ? Ssl 19:41 0:14 /home/nieyy/trafodion_build/incubator-trafodion-stable-1.1/core/sqf/export/bin64d/monitor COLD
> > > nieyy 123946 0.9 0.1 828072 223088 pts/6 Sl 19:42 0:14 /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.45.x86_64/bin/java -XX:OnOutOfMemoryError=kill -9 %p -Xmx128m
> > > nieyy 124044 1.0 0.1 629200 187180 pts/6 Sl 19:42 0:16 /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.45.x86_64/bin/java -XX:OnOutOfMemoryError=kill -9 %p -Xmx128m
> > >
> > > I then have to kill all processes and use swstartall and sqstart to
> > > reset the environment; however, the environment still goes down after a
> > > while, and I need to restart again.
> > >
> > > I found some cores under
> > > trafodion_build/incubator-trafodion-stable-1.1/core/sqf/sql/scripts; all
> > > cores were generated by mxssmp:
> > > [nieyy@redhat-72 scripts]$ ll core*
> > > ...
> > > -rw------- 1 nieyy nieyy 156008448 Sep 7 17:56 core.mxssmp.173357
> > > -rw------- 1 nieyy nieyy 145518592 Sep 7 17:56 core.mxssmp.173372
> > > -rw------- 1 nieyy nieyy 156008448 Sep 7 19:24 core.mxssmp.74146
> > > -rw------- 1 nieyy nieyy 145518592 Sep 7 19:24 core.mxssmp.74197
> > >
> > > I used gdb to track the stack:
> > > [nieyy@redhat-72 scripts]$ gdb /home/nieyy/trafodion_build/incubator-trafodion-stable-1.1/core/sql/lib/linux/64bit/debug/mxssmp ./core.mxssmp.141469
> > > ...
> > > (gdb) where
> > > #0 0x000000000044166c in ProcessStats::getHeap (this=0x2000) at ../runtimestats/SqlStats.h:271
> > > #1 0x000000000043990a in StatsGlobals::removeProcess (this=0x10000000, pid=65536, calledAtAdd=0) at ../runtimestats/SqlStats.cpp:276
> > > #2 0x0000000000439e05 in StatsGlobals::checkForDeadProcesses (this=0x10000000, myPid=141469) at ../runtimestats/SqlStats.cpp:382
> > > #3 0x00000000004440be in SsmpGlobals::work (this=0x7f062660c7e8) at ../runtimestats/ssmpipc.cpp:582
> > > #4 0x000000000042f06a in runServer (argc=1, argv=0x7fff5b0e5a48) at ../bin/ex_ssmp_main.cpp:259
> > > #5 0x000000000042eb12 in main (argc=1, argv=0x7fff5b0e5a48) at ../bin/ex_ssmp_main.cpp:127
> > >
> > > Then I searched via Google and found
> > > https://bugs.launchpad.net/trafodion/+bug/1368891, which looks similar,
> > > but it claims the bug was fixed in v0.9, and my version is 1.1.
> > >
> > > So, could you kindly help me solve this problem? I can't find any more
> > > useful information via Google.
> > >
> > > Thanks a lot.
> > >

--
Regards, --Qifan
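
As a companion to the pid_max check suggested earlier in the thread, a small Python sketch that reads /proc/sys/kernel/pid_max and reports whether it exceeds the 65535 value recommended above. The function names are made up for this example, and the script only prints advice; it does not change any kernel setting.

```python
from pathlib import Path

PID_MAX_PATH = Path("/proc/sys/kernel/pid_max")  # standard Linux procfs location
RECOMMENDED_PID_MAX = 65535  # value suggested in the thread, not a Trafodion constant


def read_pid_max(path=PID_MAX_PATH):
    """Return kernel.pid_max as an int, or None if it cannot be read or parsed."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def pid_max_advice(pid_max, limit=RECOMMENDED_PID_MAX):
    """Turn a pid_max value into a one-line human-readable recommendation."""
    if pid_max is None:
        return "could not read kernel.pid_max"
    if pid_max > limit:
        return (f"kernel.pid_max is {pid_max}; consider: "
                f"sudo sysctl -w kernel.pid_max={limit}")
    return f"kernel.pid_max is {pid_max}; within the {limit} limit"


if __name__ == "__main__":
    print(pid_max_advice(read_pid_max()))
```

A change made with sysctl -w applies only until the next reboot, and, as noted in the thread, Trafodion and the Hadoop/HBase processes need a restart afterwards.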
