Thanks for keeping us updated, Rural!
I'm still curious why changing net.core.somaxconn in the kernel helped here
if you didn't change ipc.server.listen.queue.size. Perhaps that property is
set in hdfs-site.xml or core-site.xml with a higher value?
cheers,
esteban.
--
Cloudera, Inc.
On Mon, Jul
I don't know. I think the other parameter is more important:
net.core.somaxconn=1024 (original 128)
net.ipv4.tcp_synack_retries=2 (original 5)
Since I found many connections in SYN_RECV status, my purpose in changing
these 2 parameters was:
net.ipv4.tcp_synack_retries: reduce the waiting time before half-open
connections are given up
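For reference, a minimal sketch of how such a change is typically applied
with sysctl (the values are the ones quoted above; the persistence
mechanism varies by distro):

    # apply at runtime
    sysctl -w net.core.somaxconn=1024
    sysctl -w net.ipv4.tcp_synack_retries=2
    # persist across reboots (assumes a sysctl.conf-based setup)
    echo 'net.core.somaxconn=1024' >> /etc/sysctl.conf
    echo 'net.ipv4.tcp_synack_retries=2' >> /etc/sysctl.conf
    sysctl -p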
Just to update my result: since HBASE-11277 was applied, I have not seen
any connection problem for a week. Before that, the problem occurred
almost every day.
Hello Rural,
That's interesting. Unless you have changed ipc.server.listen.queue.size in
the HBase Region Server (and other Hadoop daemons) to a value higher than
128, you might have worked around the issue by increasing the listen queue
(globally) for a service that doesn't explicitly set the queue length.
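One way to check what backlog a listener actually got (a sketch; assumes a
reasonably recent iproute2, and 60020 is the RS port from this thread):
for listening TCP sockets, ss reports the configured accept backlog in the
Send-Q column.

    # Send-Q on a LISTEN socket = its accept backlog; compare with the sysctl cap
    ss -ltn sport = :60020
    sysctl net.core.somaxconn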
No, I didn't touch ipc.server.listen.queue.size. Anyway, my change
mitigated the problem, as I stated in another thread: from the observation
over these 2 days since the action was taken, the frequency of the problem
has been reduced. The huge improvement is that even when the problem
happens, the RS
One additional piece of info: I ran 'netstat -an | grep 60020' when the
problem happened and saw many connections from remote hosts to local port
60020 in SYN_RECV state. Not sure if that indicates anything.
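A quick sketch for tallying connection states on that port from the same
netstat output (SYN_RECV means the three-way handshake never completed,
i.e. the problem is at the backlog/handshake stage, not in the
application):

    # count TCP states for port 60020
    netstat -an | grep ':60020' | awk '{print $6}' | sort | uniq -c | sort -rn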
For how long did you notice those connections? When you say many, do you
mean thousands? The problem with having too many connections in SYN_RECV is
that you could end up running out of file descriptors, which makes me
wonder what maximum number of open files you have configured for the RS
process (see all
The max number of files has already been set to 32768 for the user
running hbase/hadoop. I think there would be errors in the log if it were
a file-descriptor problem. The count of connections in SYN_RECV state is
about 100. I also checked the source of those connections and they are
from the hosts of
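For completeness, a sketch of verifying both the effective limit and the
live descriptor count for the RS process (finding the PID via pgrep is an
assumption; adjust to the actual setup):

    RS_PID=$(pgrep -f HRegionServer)         # assumes a single RS on this host
    grep 'open files' /proc/$RS_PID/limits   # effective soft/hard limit
    ls /proc/$RS_PID/fd | wc -l              # descriptors currently in use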
I got the dump of the problematic RS from the web UI:
http://pastebin.com/4hfhkDUw
output of top -H -p PID: http://pastebin.com/LtzkScYY
I also got the output of jstack, but I believe it's already in the dump
so I won't paste it again. This time the hang lasted about 20 minutes.
On 2014/7/9 12:48,
I noticed the blockSeek() call in HFileReaderV2.
Did you take only one dump during the 20-minute hang?
Cheers
On Jul 10, 2014, at 1:54 AM, Rural Hunter ruralhun...@gmail.com wrote:
> I got the dump of the problematic RS from the web UI: http://pastebin.com/4hfhkDUw
> output of top -H -p PID:
Yes, I can take more if needed when it happens next time.
On 2014/7/10 17:11, Ted Yu wrote:
> I noticed the blockSeek() call in HFileReaderV2.
> Did you take only one dump during the 20-minute hang?
> Cheers
Hi,
I'm using hbase-0.96.2. I saw that sometimes my region servers don't accept
connections from clients. This could last from 10 minutes to half an hour.
I was not able to connect to port 60020 even with telnet
when it happened. After a while, the problem disappeared and the region
Could you try with the -XX:+PrintGCApplicationStoppedTime VM parameter?
A hang on the VM side is not always caused by GC.
Thanks,
From: Rural Hunter [ruralhun...@gmail.com]
Sent: 2014-07-08 14:06
To: user@hbase.apache.org
Subject: Region server not accept
I checked the parameter and it seems to be another GC parameter that
prints the total stop-the-world time. So will it help to show whether the
hang was caused by GC or not?
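Despite its name, -XX:+PrintGCApplicationStoppedTime logs time the
application was stopped at any safepoint, not only GC, so long stopped
intervals with no matching GC entry would point away from GC. A sketch of
enabling it in conf/hbase-env.sh (the companion flags and log path are
assumptions, not something suggested in this thread):

    # conf/hbase-env.sh
    export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS \
      -XX:+PrintGCApplicationStoppedTime \
      -XX:+PrintGCApplicationConcurrentTime \
      -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
      -Xloggc:/var/log/hbase/rs-gc.log"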
On 2014/7/8 14:28, 谢良 wrote:
> Could you try with the -XX:+PrintGCApplicationStoppedTime VM parameter?
> A hang on the VM side is not always caused
Next time this happens, can you take a jstack of the region server and
pastebin it?
Thanks
On Jul 7, 2014, at 11:06 PM, Rural Hunter ruralhun...@gmail.com wrote:
> Hi,
> I'm using hbase-0.96.2. I saw that sometimes my region servers don't accept
> connections from clients. This could last from 10
OK, I will try to do that when it happens again. Thanks.
On 2014/7/8 17:06, Ted Yu wrote:
> Next time this happens, can you take a jstack of the region server and
> pastebin it?
> Thanks
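A sketch of capturing a few dumps during a hang (run as the user that owns
the RS process; the dump count and interval are arbitrary, but a few dumps
taken 30s apart show whether threads are actually moving):

    RS_PID=$(pgrep -f HRegionServer)
    for i in 1 2 3; do
      jstack -l $RS_PID > /tmp/rs-jstack.$i.txt   # -l adds lock/synchronizer info
      sleep 30
    done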
Hello Rural,
It doesn't seem to be a problem with the region server from what I can
tell. The RS is not showing any message in the logs about a long pause
(unless you have a non-standard log4j.properties file), and also if the RS
was in a very long pause due to GC or any other issue, then the master
No. I used the standard log4j file and there isn't any network problem
on the client side. I checked the web admin UI and the master still shows
the slave as working. It's just that the request count is very small (about
10 while others are several hundred). I SSHed to the slave server and I
can see the
Hi Rural,
That's interesting. Since you are passing
hbase.zookeeper.property.maxClientCnxns, does that mean ZK is managed by
HBase? If you experience the issue again, can you try to obtain a jstack
(as the user that started the hbase process, or try from the RS UI if
responsive: rs:port/dump) as
Hi Esteban,
Yes, I use the ZK managed by HBase. I will try to get the jstack and
other info when this happens again.
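A sketch of grabbing that debug dump over HTTP (60030 is the default RS
info port in 0.96, and rs-host is a placeholder; this is the same content
as the "Debug dump" link on the RS web UI):

    # works even when RPC port 60020 is wedged, as long as the info server responds
    curl -s http://rs-host:60030/dump > /tmp/rs-dump.txt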
On 2014/7/9 12:48, Esteban Gutierrez wrote:
> Hi Rural,
> That's interesting. Since you are passing
> hbase.zookeeper.property.maxClientCnxns, does that mean ZK is managed by
> HBase?