I don't know. I think the other parameter is more important:
net.core.somaxconn=1024 (original 128)
net.ipv4.tcp_synack_retries=2 (original 5)
Since I found many connections in SYN_RECV state, my purpose in
changing these 2 parameters is:
net.ipv4.tcp_synack_retries: Reduce the waiting time
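For reference, the two kernel settings discussed above would be persisted like this (a sketch of the change described in the thread; the values are the ones quoted above, and note the change is system-wide, not HBase-specific):

```
# /etc/sysctl.conf -- persist the two changes discussed above
net.core.somaxconn = 1024        # max accept-queue length per listening socket (was 128)
net.ipv4.tcp_synack_retries = 2  # retransmit SYN-ACK at most twice before giving up (was 5)
```

Running `sysctl -p` as root loads the file without a reboot.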
Thanks for keeping us updated Rural!
I'm still curious why changing net.core.somaxconn in the kernel helped here
if you didn't change ipc.server.listen.queue.size. Perhaps that property is
in hdfs-site.xml or core-site.xml with a higher value?
cheers,
esteban.
--
Cloudera, Inc.
On Mon, Jul 2
Just updating my result: since HBASE-11277 was applied, I have not seen
any connection problem for a week. Before, the connection problem
occurred almost every day.
No, I didn't touch ipc.server.listen.queue.size. Anyway, my change
mitigated the problem, as I stated in another thread: from the observation
over the 2 days after the action was taken, the frequency of the problem
has been reduced. The huge improvement is that even when the problem
happens, the RS ca
Hello Rural,
That's interesting. Unless you have changed ipc.server.listen.queue.size in
the HBase Region Server (and other Hadoop daemons) to a value higher than
128, you might have worked around the issue by increasing the listen queue
(globally) for a service that doesn't explicitly set the queue
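For reference, the Hadoop-side property mentioned above would be set like this (a sketch; the value 1024 is illustrative, chosen to match the raised somaxconn -- the kernel silently truncates any listen() backlog to net.core.somaxconn, so both the sysctl and the application-side queue have to be raised for the effective queue to grow):

```xml
<!-- core-site.xml (or hdfs-site.xml): raise the RPC server's listen backlog.
     Value is illustrative; it only takes effect up to net.core.somaxconn. -->
<property>
  <name>ipc.server.listen.queue.size</name>
  <value>1024</value>
</property>
```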
The max number of open files has already been set to 32768 for the user
running hbase/hadoop. I think there would be errors in the log if it were a
file-descriptor problem. The count of connections in SYN_RECV state is about
100. I also checked the source of those connections and they are from
the hosts of
For how long have you noticed those connections? When you say "many", do you
mean 1000s? The problem with having too many SYN_RECV connections is that you
could end up running out of file descriptors, which makes me wonder what is the
maximum number of open files that you have configured for the RS process
(see all
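The question above about the RS process's open-file limit can be answered on a live box with something like this (a sketch; assumes Linux /proc, and the process name `HRegionServer` used for the pid lookup is an assumption):

```shell
# Find the region server pid (the process name is an assumption in this sketch).
RS_PID=$(pgrep -f HRegionServer | head -1)

# The limit actually in force for the running process (this can differ from
# `ulimit -n` in your shell if the daemon was started by init or another user).
grep 'Max open files' "/proc/$RS_PID/limits"

# How many descriptors the process has open right now.
ls "/proc/$RS_PID/fd" | wc -l
```

Comparing the second number against the first shows how close the process is to exhausting its descriptors.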
One additional piece of info: I ran 'netstat -an | grep 60020' when the problem
happened, and I saw that many connections from remote hosts to local port 60020
were in state "SYN_RECV". Not sure if that indicates anything.
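A slightly more targeted version of that check (a sketch; it assumes the net-tools netstat column layout, with local address in column 4, foreign address in column 5, and state in column 6) counts the half-open connections per source host:

```shell
# List source hosts with connections to port 60020 stuck in SYN_RECV,
# most frequent first.
netstat -ant \
  | awk '$4 ~ /:60020$/ && $6 == "SYN_RECV" { split($5, a, ":"); print a[1] }' \
  | sort | uniq -c | sort -rn
```

If one or two hosts dominate the output, that points at misbehaving clients rather than a server-side backlog problem.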
Yes, I can take more if needed when it happens next time.
On 2014/7/10 17:11, Ted Yu wrote:
I noticed the blockSeek() call in HFileReaderV2.
Did you take only one dump during the 20-minute hang?
Cheers
I noticed the blockSeek() call in HFileReaderV2.
Did you take only one dump during the 20-minute hang?
Cheers
On Jul 10, 2014, at 1:54 AM, Rural Hunter wrote:
> I got the dump of the problematic rs from web ui: http://pastebin.com/4hfhkDUw
> output of "top -H -p ": http://pastebin.com/LtzkSc
I got the dump of the problematic rs from web ui:
http://pastebin.com/4hfhkDUw
output of "top -H -p ": http://pastebin.com/LtzkScYY
I also got the output of jstack but I believe it's already in the dump
so I do not paste it again. This time the hang lasted about 20 minutes.
On 2014/7/9 12:48, E
Hi Esteban,
Yes I use the ZK managed by hbase. I will try to get the jstack and
other info when this happens again.
On 2014/7/9 12:48, Esteban Gutierrez wrote:
Hi Rural,
That's interesting. Since you are passing
hbase.zookeeper.property.maxClientCnxns, does it mean that ZK is managed by
HBase? If
Hi Rural,
That's interesting. Since you are passing
hbase.zookeeper.property.maxClientCnxns, does it mean that ZK is managed by
HBase? If you experience the issue again, can you try to obtain a jstack
(as the user that started the hbase process, or try from the RS UI if
responsive: rs:port/dump) as T
No. I used the standard log4j file and there isn't any network problem
from the client. I checked the web admin UI and the master still takes
the slave as working. It's just that the request count is very small (about 10
while others are in the several hundreds). I SSHed into the slave server and I
can see the 600
Hello Rural,
It doesn't seem to be a problem in the region server from what I can
tell. The RS is not showing any message in the logs about a long pause
(unless you have a non-standard log4j.properties file), and also if the RS
was in a very long pause due to GC or any other issue, then the master s
OK, I will try to do that when it happens again. Thanks.
On 2014/7/8 17:06, Ted Yu wrote:
Next time this happens, can you take a jstack of the region server and
pastebin it?
Thanks
Next time this happens, can you take a jstack of the region server and
pastebin it?
Thanks
On Jul 7, 2014, at 11:06 PM, Rural Hunter wrote:
> Hi,
>
> I'm using hbase-0.96.2. I've seen that sometimes my region servers don't accept
> connections from clients. This can last from 10 minutes to half an hour
I checked the parameter and it seems to also be a GC parameter, one that
prints the total "stop the world" time. So will it help to get info about a
hang that "was not caused by GC"?
On 2014/7/8 14:28, 谢良 wrote:
> Could you try with the "-XX:+PrintGCApplicationStoppedTime" VM parameter?
> the hang from the VM side was not
Could you try with the "-XX:+PrintGCApplicationStoppedTime" VM parameter?
The hang from the VM side is not always caused by GC.
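For reference, the flag suggested above would be added to hbase-env.sh like this (a sketch; HBASE_REGIONSERVER_OPTS is the standard hbase-env.sh hook, and the flag reports time the application was stopped at any safepoint, not only GC pauses, which is what makes it useful for distinguishing GC from other stop-the-world events; the log path is illustrative):

```shell
# hbase-env.sh -- a sketch; appends to whatever RS options are already set.
export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS \
  -XX:+PrintGCApplicationStoppedTime \
  -XX:+PrintGCDateStamps \
  -Xloggc:/var/log/hbase/rs-gc.log"
```

After a restart, "Total time for which application threads were stopped" lines will appear in the GC log for every pause.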
Thanks,
From: Rural Hunter [ruralhun...@gmail.com]
Sent: 2014-07-08 14:06
To: user@hbase.apache.org
Subject: Region server
Hi,
I'm using hbase-0.96.2. I've seen that sometimes my region servers don't accept
connections from clients. This can last from 10 minutes to half an hour.
I was not able to connect to the 60020 port even with the telnet command
when it happened. After a while, the problem disappeared and the region
serve
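The telnet check described above can be scripted even where telnet isn't installed (a sketch; `rs-host` is a placeholder for an actual region server hostname, and bash's /dev/tcp redirection is assumed to be available):

```shell
# Probe a TCP port from a client host; prints one line either way.
probe_port() {
  local host=$1 port=$2
  if timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
    echo "open"
  else
    echo "closed or timed out"
  fi
}

# rs-host is a placeholder; 60020 is the RS RPC port from the thread.
probe_port rs-host 60020
```

Run in a loop from cron, this gives a timestamped record of exactly when the port stops accepting connections.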