Viral, >From what you described here I can conclude that either:
1. your data access pattern is random and uniform (no data locality at all) 2. or you trash block cache with scan operations with block cache enabled. or both. but the idea of treating local and remote blocks differently is very good. +1. Best regards, Vladimir Rodionov Principal Platform Engineer Carrier IQ, www.carrieriq.com e-mail: vrodio...@carrieriq.com ________________________________________ From: Viral Bajaria [viral.baja...@gmail.com] Sent: Monday, July 08, 2013 3:04 AM To: user@hbase.apache.org Subject: optimizing block cache requests + eviction Hi, TL;DR; Trying to make a case for making the block eviction strategy smart and to not evict remote blocks more frequently and make the requests more smarter. The question here comes after I debugged the issue that I was having with random region servers hitting high load averages. I initially thought the problem was hardware related i.e. bad disk or network since the wait I/O was too high but it was a combination of things. I figured with SCR (short circuit read) ON the datanode should almost never show high amount of block requests from the local regionservers. So my starting point for debugging was the datanode since it was doing a ton of I/O. The clienttrace logs helped me figure out which RS nodes were making block requests. I hacked up a script to report which blocks are being requested and how many times per minute. I found that some blocks were being requested 10+ times in a minute and over 2000 times in an hour from the same regionserver. This was causing the server to do 40+MB/s on reads alone. That was on the higher side, the average was closer to 100 or so per hour. Now why did I end up in such a situation. It happened due to the fact that I added servers to the cluster and rebalanced the cluster. At the same time I added some drives and also removed the offending server in my setup. This caused some of the data to not be co-located with the regionservers. Given that major_compaction was disabled and it would not have run for a while (atleast on some tables) these block requests would not go away. One of my regionservers was totally overwhelmed. I made the situation worse when I removed the server that was under heavy load with the assumption that it's a hardware problem with the box without doing a deep dive (doh!). Given that regionservers will be added in the future, I expect block locality to go down till major_compaction runs. Also nodes can go down and cause this problem. So I started thinking of probable solutions, but first some observations. *Observations/Comments* - The surprising part was the regionservers were trying to make so many requests for the same block in the same minute (let alone hour). Could this happen because the original request took a few seconds and so the regionserver re-requested ? I didn't see any fetch errors in the regionserver logs for blocks. - Even more strange; my heap size was at 11G and the time when this was happening, the used heap was at 2-4G. I would have expected the heap to grow higher than that since the blockCache should be using atleast 40% of the available heap space. - Another strange thing that I observed was, the block was being requested from the same datanode every single time. *Possible Solution/Changes* - Would it make sense to give remote blocks higher priority over the local blocks that can be read via SCR and not let them get evicted if there is a tie in which block to evict ? - Should we throttle the number of outgoing requests for a block ? I am not sure if my firewall caused some issue but I wouldn't expect multiple block fetch requests in the same minute. I did see a few RST packets getting dropped at the firewall but I wasn't able to trace the problem was due to this. - We have 3 replicas available, shouldn't we request from the other datanode if one might take a lot of time ? The amount of time it took to read a block went up when the box was under heavy load, yet the re-requests were going to the same one. Is this something that is available on the DFSClient and can we exploit it ? - Is it possible to migrate a region to a server which has higher number of blocks available for it ? We don't need to make this automatic, but we could provide a command that could be invoked manually to assign a region to a specific regionserver. Thoughts ? Thanks, Viral Confidentiality Notice: The information contained in this message, including any attachments hereto, may be confidential and is intended to be read only by the individual or entity to whom this message is addressed. If the reader of this message is not the intended recipient or an agent or designee of the intended recipient, please note that any review, use, disclosure or distribution of this message or its attachments, in any form, is strictly prohibited. If you have received this message in error, please immediately notify the sender and/or notificati...@carrieriq.com and delete or destroy any copy of this message and its attachments.