The best approach would be to stay neutral and not restrict PIDs to a fixed range.

The argument is similar to the one about whether Trafodion should increase a time-out value for HBase scans. Our conclusion is that, to be a good citizen in the Hadoop ecosystem and to be able to handle mixed workloads, it is better not to touch that value.

Thanks --Qifan

On Tue, Sep 8, 2015 at 1:44 PM, Amanda Moran <[email protected]> wrote:

> What is the "too small" number? Also, what is the guidance if the number is
> too large (other than setting kernel.pid_max=65535 on all nodes)?
>
> Looks like we need two JIRAs created: one for the installer and one for
> Trafodion core.
>
> Thanks.
>
> On Tue, Sep 8, 2015 at 11:26 AM, Gunnar Tapper <[email protected]>
> wrote:
>
> > Hi,
> >
> > I am not sure this is a good idea since this might cause issues for the
> > overall configuration. For example, Cassandra recommends 999999 for
> > kernel.pid_max while Hawq wants at least 798720. IBM Big Insight wants
> > another number. Overriding their settings would make Trafodion a bad
> > citizen in a Hadoop stack.
> >
> > A better approach might be to check the current value, recommending an
> > increase if it is too small and providing guidance if it is too large.
> >
> > Thanks,
> >
> > Gunnar
> >
> > -----Original Message-----
> > From: Amanda Moran [mailto:[email protected]]
> > Sent: Tuesday, September 8, 2015 12:08 PM
> > To: dev <[email protected]>
> > Subject: Re: [Urgent Help] Trafodion Build Environment Problem
> >
> > Hi there All-
> >
> > Sorry if my first email was confusing. The "problem" itself is not fixed
> > by the installer; the installer just runs
> > sudo sysctl -w kernel.pid_max=65535 on all nodes.
> >
> > Thanks.
> >
> > On Tue, Sep 8, 2015 at 11:02 AM, Selva Govindarajan <
> > [email protected]> wrote:
> >
> > > Hi Amanda,
> > >
> > > I presume that the installer will flag this as a requirement for
> > > Trafodion to be installed. Will it abort the installation or will the
> > > installer fix the pid_max settings automatically?
> > >
> > > Selva
> > >
> > > -----Original Message-----
> > > From: Amanda Moran [mailto:[email protected]]
> > > Sent: Tuesday, September 8, 2015 9:20 AM
> > > To: [email protected]
> > > Cc: Lijian (Q) <[email protected]>
> > > Subject: Re: [Urgent Help] Trafodion Build Environment Problem
> > >
> > > Hi there-
> > >
> > > This is fixed in the latest version of the installer.
> > >
> > > Thanks.
> > >
> > > Sent from my iPhone
> > >
> > > > On Sep 8, 2015, at 9:07 AM, Dave Birdsall <[email protected]> wrote:
> > > >
> > > > Hi,
> > > >
> > > > I'm wondering if this should be reported as a problem? Perhaps
> > > > Nieyuanyuan would like to open a JIRA about supporting higher PID
> > > > numbers in Trafodion?
> > > >
> > > > Dave
> > > >
> > > > -----Original Message-----
> > > > From: Narendra Goyal [mailto:[email protected]]
> > > > Sent: Monday, September 7, 2015 7:04 PM
> > > > To: [email protected]
> > > > Cc: Lijian (Q) <[email protected]>
> > > > Subject: RE: [Urgent Help] Trafodion Build Environment Problem
> > > >
> > > > Hi Nieyuanyuan,
> > > >
> > > > Could you please check the 'pid_max' settings:
> > > >     sysctl -q kernel.pid_max
> > > >     (or cat /proc/sys/kernel/pid_max)
> > > >
> > > > If the value is > 64K, I would recommend you set it to 64K, like so:
> > > >     sudo sysctl -w kernel.pid_max=65535
> > > >
> > > > You will have to restart Trafodion and other Hadoop/HBase processes:
> > > >     swstopall
> > > >     ckillall
> > > >     swstartall
> > > >     sqstart
> > > >
> > > > Just FYI, to check the list of Trafodion processes only, please run
> > > > 'cstat' in your bash shell.
> > > >
> > > > Thanks,
> > > > -Narendra
> > > >
> > > >
> > > > -----Original Message-----
> > > > From: Nieyuanyuan [mailto:[email protected]]
> > > > Sent: Monday, September 7, 2015 6:40 PM
> > > > To: [email protected]
> > > > Cc: Lijian (Q) <[email protected]>
> > > > Subject: [Urgent Help] Trafodion Build Environment Problem
> > > >
> > > > Dear Guys,
> > > >
> > > > I recently downloaded Trafodion 1.1 from
> > > > https://github.com/apache/incubator-trafodion/tree/stable/1.1 and
> > > > followed the build guide at
> > > > https://wiki.trafodion.org/wiki/index.php/Building_the_Software.
> > > > After solving a lot of problems (no need to list all the details), I
> > > > am able to run Trafodion in a Hadoop sandbox environment.
> > > >
> > > > But I have a serious problem: all Trafodion-related processes go
> > > > down after several minutes (not sure how long), and only a few of
> > > > them are left:
> > > > [nieyy@redhat-72 ~]$ ps ux
> > > > USER     PID %CPU %MEM     VSZ    RSS TTY   STAT START TIME COMMAND
> > > > nieyy  76554  0.1  0.1  590988 139768 pts/6 Sl   19:14 0:04 /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.45.x86_64/bin/java -XX:OnOutOfMemoryError=kill -9 %p -Xmx128m
> > > > nieyy 118833  0.7  0.3 1535452 420996 ?     Sl   19:40 0:12 /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.45.x86_64/bin/java -Dproc_namenode -Xmx1000m -Djava.net.prefe
> > > > nieyy 119085  0.6  0.2 1572688 367388 ?     Sl   19:40 0:10 /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.45.x86_64/bin/java -Dproc_datanode -Xmx1000m -Djava.net.prefe
> > > > nieyy 119320  0.4  0.2 1512656 340636 ?     Sl   19:41 0:07 /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.45.x86_64/bin/java -Dproc_secondarynamenode -Xmx1000m -Djava.
> > > > nieyy 119972  1.2  0.2 1708408 378536 pts/6 Sl   19:41 0:20 /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.45.x86_64/bin/java -Dproc_resourcemanager -Xmx1000m -Dhadoop.
> > > > nieyy 120133  0.9  0.2 1616388 309976 ?     Sl   19:41 0:16 /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.45.x86_64/bin/java -Dproc_nodemanager -Xmx1000m -Dhadoop.log.
> > > > nieyy 120371  0.0  0.0    9824   1772 pts/6 S    19:41 0:00 /bin/sh ./bin/mysqld_safe --defaults-file=/home/nieyy/trafodion_build/incubator-trafodion-stable-1.
> > > > nieyy 120594  0.0  0.0  452604  89908 pts/6 Sl   19:41 0:01 /home/nieyy/trafodion_build/incubator-trafodion-stable-1.1/core/sqf/sql/local_hadoop/mysql/bin/mysq
> > > > nieyy 120789  0.0  0.0    9692   1736 pts/6 S    19:41 0:00 bash /home/nieyy/trafodion_build/incubator-trafodion-stable-1.1/core/sqf/sql/local_hadoop/hbase/bin
> > > > nieyy 120806  2.0  0.3 1809048 509164 pts/6 Sl   19:41 0:34 /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.45.x86_64/bin/java -Dproc_master -XX:OnOutOfMemoryError=kill
> > > > nieyy 122554  0.0  0.0   13624   1304 pts/6 S    19:41 0:00 mpirun -disable-auto-cleanup -demux select -env SQ_IC TCP -env MPI_ERROR_LEVEL 2 -env SQ_PIDMAP 1 -
> > > > nieyy 122555  0.0  0.0       0      0 ?     Zs   19:41 0:00 [hydra_pmi_proxy] <defunct>
> > > > nieyy 122556  1.0  0.0  335212  36748 ?     Ssl  19:41 0:17 /home/nieyy/trafodion_build/incubator-trafodion-stable-1.1/core/sqf/export/bin64d/monitor COLD
> > > > nieyy 122557  0.8  0.0  335212  36768 ?     Ssl  19:41 0:14 /home/nieyy/trafodion_build/incubator-trafodion-stable-1.1/core/sqf/export/bin64d/monitor COLD
> > > > nieyy 123946  0.9  0.1  828072 223088 pts/6 Sl   19:42 0:14 /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.45.x86_64/bin/java -XX:OnOutOfMemoryError=kill -9 %p -Xmx128m
> > > > nieyy 124044  1.0  0.1  629200 187180 pts/6 Sl   19:42 0:16 /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.45.x86_64/bin/java -XX:OnOutOfMemoryError=kill -9 %p -Xmx128m
> > > >
> > > > I then need to kill all processes and use swstartall and sqstart to
> > > > reset the environment. However, the environment still goes down
> > > > after a while, and I need to restart again.
> > > >
> > > > I found some core files under
> > > > trafodion_build/incubator-trafodion-stable-1.1/core/sqf/sql/scripts;
> > > > all of them were generated by mxssmp:
> > > > [nieyy@redhat-72 scripts]$ ll core*
> > > > ...
> > > > -rw------- 1 nieyy nieyy 156008448 Sep 7 17:56 core.mxssmp.173357
> > > > -rw------- 1 nieyy nieyy 145518592 Sep 7 17:56 core.mxssmp.173372
> > > > -rw------- 1 nieyy nieyy 156008448 Sep 7 19:24 core.mxssmp.74146
> > > > -rw------- 1 nieyy nieyy 145518592 Sep 7 19:24 core.mxssmp.74197
> > > >
> > > > I used gdb to inspect the stack:
> > > > [nieyy@redhat-72 scripts]$ gdb /home/nieyy/trafodion_build/incubator-trafodion-stable-1.1/core/sql/lib/linux/64bit/debug/mxssmp ./core.mxssmp.141469
> > > > ...
> > > > (gdb) where
> > > > #0 0x000000000044166c in ProcessStats::getHeap (this=0x2000) at ../runtimestats/SqlStats.h:271
> > > > #1 0x000000000043990a in StatsGlobals::removeProcess (this=0x10000000, pid=65536, calledAtAdd=0) at ../runtimestats/SqlStats.cpp:276
> > > > #2 0x0000000000439e05 in StatsGlobals::checkForDeadProcesses (this=0x10000000, myPid=141469) at ../runtimestats/SqlStats.cpp:382
> > > > #3 0x00000000004440be in SsmpGlobals::work (this=0x7f062660c7e8) at ../runtimestats/ssmpipc.cpp:582
> > > > #4 0x000000000042f06a in runServer (argc=1, argv=0x7fff5b0e5a48) at ../bin/ex_ssmp_main.cpp:259
> > > > #5 0x000000000042eb12 in main (argc=1, argv=0x7fff5b0e5a48) at ../bin/ex_ssmp_main.cpp:127
> > > >
> > > > Then I searched via Google and found a link,
> > > > https://bugs.launchpad.net/trafodion/+bug/1368891, which looks
> > > > similar. It claims the bug was fixed in v0.9, but my version
> > > > is 1.1.
> > > >
> > > > So, could you kindly help me solve this problem? I can't
> > > > find more useful information via Google.
> > > >
> > > > Thanks a lot.
> > > >
> > >
> >
> > --
> > Thanks,
> >
> > Amanda Moran
> >
>
> --
> Thanks,
>
> Amanda Moran
>

--
Regards, --Qifan
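
For reference, the "check and advise, but do not change" approach discussed in this thread could look roughly like the sketch below. It is only an illustration: the function name pid_max_advice is hypothetical, the default ceiling of 65535 is an assumption taken from the "set it to 64K" advice above, and no lower bound is checked because the thread never settles on a "too small" value.

```shell
#!/usr/bin/env bash
# Sketch only: report on a kernel.pid_max value without modifying it.
# ASSUMPTION: 65535 is used as the default ceiling, based on the advice
# quoted in this thread. The function name is hypothetical.

pid_max_advice() {
    local current="$1"
    local ceiling="${2:-65535}"

    # Reject anything that is not a plain non-negative integer.
    case "$current" in
        ''|*[!0-9]*)
            echo "error: '$current' is not a valid pid_max value" >&2
            return 2
            ;;
    esac

    # Warn, but leave the setting alone so other components are unaffected.
    if [ "$current" -gt "$ceiling" ]; then
        echo "warn: kernel.pid_max=$current exceeds $ceiling"
        return 1
    fi

    echo "ok: kernel.pid_max=$current"
    return 0
}
```

On a live node this might be called as pid_max_advice "$(cat /proc/sys/kernel/pid_max)". Because it only prints advice, the decision to change the value stays with whoever owns the overall cluster configuration.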
