Re: Inconsistent scan performance with caching set to 1

2012-08-29 Thread Wayne
This is basically a read bug/performance problem. The execution path followed once the caching is used up is not consistent with the initial execution path/performance. Can anyone help shed light on this? Were there any changes in 0.94 that introduced this (we have not tested other versions)? Any
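
For anyone trying to reproduce this, a minimal sketch of a Thrift (Python) scan with caching pinned to 1, timing each fetch. The host and table names are hypothetical, the generated-module import path varies by build, and the trailing attributes argument assumes the 0.94 Thrift IDL:

    from thrift.transport import TSocket, TTransport
    from thrift.protocol import TBinaryProtocol
    from hbase import Hbase            # generated bindings; module path varies by build
    from hbase.ttypes import TScan
    import time

    transport = TTransport.TBufferedTransport(TSocket.TSocket('thrift-host', 9090))
    client = Hbase.Client(TBinaryProtocol.TBinaryProtocol(transport))
    transport.open()

    # caching=1 forces a server round trip per row, which is the code path
    # whose latency this thread reports as inconsistent over time
    scan = TScan(startRow='row000', stopRow='row999', caching=1)
    scanner = client.scannerOpenWithScan('mytable', scan, {})  # attributes arg is 0.94+
    while True:
        start = time.time()
        rows = client.scannerGetList(scanner, 1)
        if not rows:
            break
        print '%s fetched in %.1f ms' % (rows[0].row, (time.time() - start) * 1000)
    client.scannerClose(scanner)
    transport.close()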

Re: 0.92 Max Row Size

2012-01-23 Thread Wayne
21, 2012, at 5:34 AM, Wayne wav...@gmail.com wrote: Sorry, but it would be too hard for us to provide enough info in a Jira to accurately reproduce. Our read problem is through thrift and has everything to do with the row simply being too big to bring back in its entirety (13

Re: 0.92 Max Row Size

2012-01-21 Thread Wayne
wrote: On Fri, Jan 20, 2012 at 11:43 AM, Wayne wav...@gmail.com wrote: Does 0.92 support a significant increase in row size over 0.90.x? With 0.90.4 we have seen writes start choking at 30 million cols/row and reads start choking at 10 million cols/row. Can we assume these numbers will go

Re: pdflush 100% cpu

2012-01-19 Thread Wayne
It was the kernel...thanks for the help. We had considered it but could not really believe that a slight kernel version difference would cause this. Thanks On Wed, Jan 18, 2012 at 11:47 PM, Stack st...@duboce.net wrote: On Wed, Jan 18, 2012 at 8:40 AM, Wayne wav...@gmail.com wrote: We have set up

Row+Col Range Read/Scan

2011-08-10 Thread Wayne
As we load more and more data into HBase, the millions of columns are proving to be a challenge for us. We have some very wide rows and we are taking 12-15 seconds to read those rows. Since HBase does not expose a way to scan a range of columns within a row, we are really seeing some serious
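
A common workaround for this class of problem is to move the column qualifier into the row key, so one logical wide row becomes many narrow rows; because HBase sorts and scans by row key, a column range then becomes an ordinary row scan. A minimal sketch against the 0.90-era Thrift API, with hypothetical host, table, and qualifier names:

    from thrift.transport import TSocket, TTransport
    from thrift.protocol import TBinaryProtocol
    from hbase import Hbase   # generated Thrift bindings; module path varies by build

    def physical_key(entity_id, column_name):
        # logical (row, column) -> physical row key holding a single cell
        return '%s:%s' % (entity_id, column_name)

    transport = TTransport.TBufferedTransport(TSocket.TSocket('thrift-host', 9090))
    client = Hbase.Client(TBinaryProtocol.TBinaryProtocol(transport))
    transport.open()

    # read columns 'colA'..'colM' of entity '42' as a short row scan
    scanner = client.scannerOpenWithStop('mytable',
                                         physical_key('42', 'colA'),
                                         physical_key('42', 'colM~'),  # '~' sorts past the end
                                         ['cf:v'])
    while True:
        rows = client.scannerGetList(scanner, 1000)
        if not rows:
            break
        for r in rows:
            # assumes each logical cell was written under the fixed qualifier cf:v
            print r.row, r.columns['cf:v'].value
    client.scannerClose(scanner)
    transport.close()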

Re: Row+Col Range Read/Scan

2011-08-10 Thread Wayne
I think you are right in that Thrift is all we see and it is very limited. Comments in-line. On Wed, Aug 10, 2011 at 8:33 PM, Stack st...@duboce.net wrote: On Wed, Aug 10, 2011 at 2:39 AM, Wayne wav...@gmail.com wrote: As we load more and more data into HBase we are seeing the millions

Re: hbck -fix

2011-07-11 Thread Wayne
in this case. Thanks!! On Fri, Jul 8, 2011 at 6:12 PM, Stack st...@duboce.net wrote: Going by the below where hdfs reports 173 lost blocks, I think the only recourse is as you suggest in the below Wayne, a recovery mode that goes through and sees what is out there and rebuilds the meta based

Re: hbck -fix

2011-07-03 Thread Wayne
Wayne, Did you by chance have your NameNode configured to write the edit log to only one disk, and in this case only the root volume of the NameNode host? As I'm sure you are now aware, the NameNode's

Re: hbck -fix

2011-07-02 Thread Wayne
direct to fix what we need. Thanks. On Fri, Jul 1, 2011 at 5:03 PM, Stack st...@duboce.net wrote: On Fri, Jul 1, 2011 at 8:32 AM, Wayne wav...@gmail.com wrote: We are not in production so we have the luxury to start again, but the damage to our confidence is severe. Is there work going

Re: hbck -fix

2011-07-02 Thread Wayne
tried running check_meta.rb with --fix ? On Sat, Jul 2, 2011 at 9:19 AM, Wayne wav...@gmail.com wrote: We are running 0.90.3. We were testing the table export not realizing the data goes to the root drive and not HDFS. The export filled the master's root partition. The logger had issues

Re: hbck -fix

2011-07-02 Thread Wayne
but some of the .META. table entries were still there. Finally this afternoon we reformatted the entire cluster. Thanks. On Sat, Jul 2, 2011 at 5:25 PM, Stack st...@duboce.net wrote: On Sat, Jul 2, 2011 at 9:55 AM, Wayne wav...@gmail.com wrote: It just returns a ton of errors (import: command

hbck -fix

2011-07-01 Thread Wayne
We had some serious issues from the hmaster running out of space on the root partition. We were getting 'region server not found' errors on the client, which then turned into client errors that the servers have issues, etc. I ran the hbck command and found 14 inconsistencies. There were files in hdfs not used

Re: hbck -fix

2011-07-01 Thread Wayne
. On Fri, Jul 1, 2011 at 11:32 AM, Wayne wav...@gmail.com wrote: We had some serious issues from the hmaster running out of space on the root partition. We were getting 'region server not found' errors on the client, which then turned into client errors that the servers have issues, etc. I ran the hbck

responseTooLarge

2011-06-07 Thread Wayne
We are seeing responseTooLarge for: next... errors in our region server logs. If I understand correctly, is this caused by opening a scanner on rows that are too big to be returned? If the scan batch size is set to 1, this tells me these rows are too big to actually read. Is this correct? Why would

Re: responseTooLarge

2011-06-07 Thread Wayne
We also see a lot of responseTooLarge for: multi. Not sure what this is... On Tue, Jun 7, 2011 at 9:50 AM, Wayne wav...@gmail.com wrote: We are seeing responseTooLarge for: next... errors in our region server logs. If I understand correctly is this caused by opening a scanner and the rows

Re: mslab enabled jvm crash

2011-06-06 Thread Wayne
I had a 25 sec CMF this morning...looks like bulk inserts are required, along with possibly weekly/daily scheduled rolling restarts. Do most production clusters run rolling restarts on a regular basis to give the JVM a fresh start? Thanks. On Thu, Jun 2, 2011 at 1:56 PM, Wayne wav

Node Monitoring

2011-06-06 Thread Wayne
Are there any recommended methods/scripts to monitor nodes via Nagios? It would be best to have a simple Nagios call to check hadoop, hbase, and thrift separately and alarm if one of them is awol (and not have the script cause damage, as I have read can happen with thrift). For example our friendly CMF issues
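
In that spirit, a minimal Nagios-style probe that only opens and closes a TCP connection to each port, so it cannot trip the thrift-damage problem mentioned above. The ports are the usual defaults of this era (DataNode 50010, RegionServer 60020, Thrift 9090) and are assumptions to adjust per deployment:

    #!/usr/bin/env python
    # Nagios plugin convention: exit 0 = OK, 2 = CRITICAL.
    import socket, sys

    SERVICES = {'datanode': 50010, 'regionserver': 60020, 'thrift': 9090}

    def check(host):
        down = []
        for name, port in SERVICES.items():
            s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            s.settimeout(5)
            try:
                s.connect((host, port))
            except socket.error:
                down.append('%s:%d' % (name, port))
            finally:
                s.close()
        if down:
            print 'CRITICAL - not listening: %s' % ', '.join(down)
            return 2
        print 'OK - all services listening'
        return 0

    if __name__ == '__main__':
        sys.exit(check(sys.argv[1] if len(sys.argv) > 1 else 'localhost'))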

Re: mslab enabled jvm crash

2011-06-02 Thread Wayne
I have finally been able to spend enough time to digest/test all recommendations and get this under control. I wanted to thank Stack, Jack Levin, and Ted Dunning for their input. Basically our memory was being pushed to the limit and the JVM does not like/cannot handle this. We are successfully

Re: mslab enabled jvm crash

2011-06-02 Thread Wayne
Our storefile index was pushing 3g. We used the hfile tool to see that we had very large keys (50-70 bytes) and small values (5-7 bytes). Jack pointed me to a great Jira about this: https://issues.apache.org/jira/browse/HBASE-3551 . We HAD to increase from the default and we picked 256k to reduce
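
The block-size arithmetic behind that choice can be sanity-checked with a rough sketch; the cell count and per-entry overheads below are illustrative assumptions (only the 50-70 byte keys and 5-7 byte values come from this thread):

    # one block-index entry (the key plus ~16 bytes of offsets) per block
    def index_gb(cells, key=60, value=6, kv_overhead=24, block=64 * 1024):
        kv = key + value + kv_overhead
        blocks = cells * kv // block
        return blocks * (key + 16) / 1e9

    for bs in (64 * 1024, 256 * 1024):
        print '%3dk block -> ~%.1f GB index' % (bs // 1024, index_gb(3 * 10**10, block=bs))

With tiny values almost every stored byte is key, so quadrupling the block size cuts the index roughly 4x, which is the effect HBASE-3551 describes.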

Re: mslab enabled jvm crash

2011-06-02 Thread Wayne
as regions are compacted. We have enabled the memstore MSLAB option. Not sure it is relevant but we have a 5G region size and 256m memstore flush size. Thanks. On Thu, Jun 2, 2011 at 11:48 AM, Stack st...@duboce.net wrote: Thanks for writing back to the list Wayne. Hopefully this message hits

Re: mslab enabled jvm crash

2011-05-26 Thread Wayne
load for long enough to gain confidence. Thanks. On Thu, May 26, 2011 at 3:43 AM, Jack Levin magn...@gmail.com wrote: Wayne, I think you are hitting fragmentation, how often do you flush? Can you share memstore flush graphs? Here is ours: http://img851.yfrog.com/img851/9814

Re: mslab enabled jvm crash

2011-05-26 Thread Wayne
logs. http://pastebin.com/kJyJHgQc Thanks. On Thu, May 26, 2011 at 9:08 AM, Wayne wav...@gmail.com wrote: Attached is our memstore size graph...not sure it will make it to the post. Ours is definitely not as graceful as yours. You can see where we restarted last 16 hours ago. We have not had

Re: mslab enabled jvm crash

2011-05-26 Thread Wayne
: On Thu, May 26, 2011 at 9:00 AM, Wayne wav...@gmail.com wrote: Looking more closely I can see that we are still getting Concurrent Mode Failures on some of the nodes but they are only lasting for 10s so the nodes don't go away. Is this considered normal? With CMSInitiatingOccupancyFraction=65

Re: mslab enabled jvm crash

2011-05-26 Thread Wayne
On Thu, May 26, 2011 at 1:41 PM, Stack st...@duboce.net wrote: On Thu, May 26, 2011 at 6:08 AM, Wayne wav...@gmail.com wrote: I think our problem is the load pattern. Since we use a very controlled q based method to do work our Python code is relentless in terms of keeping the pressure up

Re: mslab enabled jvm crash

2011-05-25 Thread Wayne
] : 7621159K->2503625K(8111744K), 63.3195660 secs] 7798327K->2503625K(8360960K), [CMS Perm : 20128K->20106K(33580K)] icms_dc=100, 63.8965450 secs] [Times: user=69.50 sys=0.01, real=63.89 secs] On Mon, May 23, 2011 at 12:04 PM, Stack st...@duboce.net wrote: On Mon, May 23, 2011 at 8:42 AM, Wayne wav

Re: mslab enabled jvm crash

2011-05-25 Thread Wayne
http://pastebin.com/ca13aMRu http://pastebin.com/9KfRZFBW On Wed, May 25, 2011 at 1:42 PM, Todd Lipcon t...@cloudera.com wrote: Hi Wayne, Looks like your RAM might be oversubscribed. Could you paste your hbase-site.xml and hbase-env.sh files? Also looks like you have some strange GC settings

Re: mslab enabled jvm crash

2011-05-25 Thread Wayne
, 2011 at 11:08 AM, Wayne wav...@gmail.com wrote: I tried to turn off all special JVM settings we have tried in the past. Below are link to the requested configs. I will try to find more logs for the full GC. We just made the switch and on this node it has only occurred once in the scope

Re: mslab enabled jvm crash

2011-05-25 Thread Wayne
as normal. Cassandra had the exact same problems for us (plus a lot of other issues), and we all know what is common between the two. On Wed, May 25, 2011 at 2:39 PM, Ted Dunning tdunn...@maprtech.com wrote: Wayne, It should be recognized that your experiences are a bit out of the norm here

Re: mslab enabled jvm crash

2011-05-25 Thread Wayne
with writes, nothing we do seems to stop it from getting paused very, very frequently. I will look into the zookeeper log location; never looked at those... Thanks for the help. On Wed, May 25, 2011 at 3:23 PM, Stack st...@duboce.net wrote: On Wed, May 25, 2011 at 11:08 AM, Wayne wav...@gmail.com

Re: mslab enabled jvm crash

2011-05-25 Thread Wayne
updates? On Wed, May 25, 2011 at 2:44 PM, Wayne wav...@gmail.com wrote: What are your write levels? We are pushing 30-40k writes/sec/node on 10 nodes for 24-36-48-72 hours straight. We have only 4 writers per node so we are hardly overwhelming the nodes. Disk utilization runs at 10-20%, load

Re: mslab enabled jvm crash

2011-05-25 Thread Wayne
square shop and you would be a square one in my round one. On Wed, May 25, 2011 at 5:55 PM, Wayne wav...@gmail.com wrote: We are not a Java shop, and do not want to become one. I think to push the limits and do well with hadoop+hdfs you have to buy into Java and have deep skills there. We

mslab enabled jvm crash

2011-05-23 Thread Wayne
I have switched to using the mslab enabled java setting to try to avoid GC causing nodes to go awol, but it almost appears to be worse. Below is the latest problem, with the JVM apparently actually crashing. I am using 0.90.1 with an 8GB heap. Is there a recommended JVM and recommended settings to

memstore flush blocking write pause

2011-05-23 Thread Wayne
In order to reduce the total number of regions we have up'd the max region size to 5g. This has kept us below 100 regions per node, but the side effect is pauses occurring every 1-2 min under heavy writes to a single region. We see the 'too many store files delaying flush up to 90sec' warning every

Re: mslab enabled jvm crash

2011-05-23 Thread Wayne
...@maprtech.com wrote: Do you have the same problem with a more recent JVM? On Mon, May 23, 2011 at 4:52 AM, Wayne wav...@gmail.com wrote: I have switched to using the mslab enabled java setting to try to avoid GC causing nodes to go awol but it almost appears to be worse. Below is the latest problem

Re: mslab enabled jvm crash

2011-05-23 Thread Wayne
data nodes, but you know what they say about assumptions... On Mon, 23 May 2011 07:33:05 -0700, tdunn...@maprtech.com wrote: Do you have the same problem with a more recent JVM? On Mon, May 23, 2011 at 4:52 AM, Wayne

Re: mslab enabled jvm crash

2011-05-23 Thread Wayne
: u17 was released a year and a half ago. Latest is u25 (we run u24). What kind of 'crash' are you seeing? What is your OS? St.Ack On Mon, May 23, 2011 at 8:19 AM, Wayne wav...@gmail.com wrote: Zookeeper is not on the same nodes...and yes we could up it to 120 seconds, but then we are back

Re: memstore flush blocking write pause

2011-05-23 Thread Wayne
is a region server log snippet for this occurring 2x in less than a 2 minute period. http://pastebin.com/CxAQSXTt On Mon, May 23, 2011 at 11:33 AM, Stack st...@duboce.net wrote: On Mon, May 23, 2011 at 6:40 AM, Wayne wav...@gmail.com wrote: In order to reduce the total number of regions we

Re: Max Table Count

2011-05-19 Thread Wayne
tables as you like. I do not believe there is a cost to having more tables. St.Ack On Wed, May 18, 2011 at 5:54 AM, Wayne wav...@gmail.com wrote: How many tables can a cluster realistically handle, or how many tables/node can be supported? I am looking for a realistic idea of whether a 10 node

Max Table Count

2011-05-18 Thread Wayne
How many tables can a cluster realistically handle or how many tables/node can be supported? I am looking for a realistic idea of whether a 10 node cluster can support 100 or even 500 tables. I realize it is recommended to have a few tables at most (and to use the row key to add everything to one

Re: Cluster Size/Node Density

2011-02-19 Thread Wayne
What JVM is recommended for the new memstore allocator? We switched from u23 back to u17, which helped a lot. Is this optimized for a specific JVM or does it not matter? On Fri, Feb 18, 2011 at 5:46 PM, Todd Lipcon t...@cloudera.com wrote: On Fri, Feb 18, 2011 at 12:10 PM, Jean-Daniel Cryans

Re: Cluster Size/Node Density

2011-02-18 Thread Wayne
, 2011 at 2:15 AM, M. C. Srivas mcsri...@gmail.com wrote: I was reading this thread with interest. Here's my $.02 On Fri, Dec 17, 2010 at 12:29 PM, Wayne wav...@gmail.com wrote: Sorry, I am sure my questions were far too broad to answer. Let me *try* to ask more specific questions. Assuming

Re: Most useful metrics?

2011-02-09 Thread Wayne
Compaction Queue size usually explains a lot. That, along with load and disk utilization, is what I use the most. I am definitely interested in what others use, especially metrics that give an early indication of problems. Thanks. On Wed, Feb 9, 2011 at 1:42 PM, Tim Sell trs...@gmail.com wrote: What do

Re: .oldlogs Cleanup

2011-02-03 Thread Wayne
it be that your region servers are creating them faster than that? In any case, it's safe to delete them but not the folder itself. Also please open a jira and assign it to me. J-D On Jan 29, 2011 5:22 PM, Wayne wav...@gmail.com wrote: The current log folder in hdfs (.logs) seems to always

Region Balancing

2011-02-02 Thread Wayne
I know there were some changes in .90 in terms of how region balancing occurs. Is there a resource somewhere that describes the options for the configuration? Per Jonathan Gray's recommendation we are trying to keep our region count down to 100 per region server (we are up to 5gb region size).

Re: Region Balancing

2011-02-02 Thread Wayne
2, 2011 at 7:51 PM, Wayne wav...@gmail.com wrote: I know there were some changes in .90 in terms of how region balancing occurs. Is there a resource somewhere that describes the options for the configuration? Per Jonathan Gray's recommendation we are trying to keep our region count down

Re: Region Balancing

2011-02-02 Thread Wayne
hbase.master.startup.retainassign=false works like a charm. After a restart all tables are scattered across all region servers. Thanks! On Wed, Feb 2, 2011 at 4:06 PM, Stack st...@duboce.net wrote: On Wed, Feb 2, 2011 at 8:41 PM, Wayne wav...@gmail.com wrote: The regions counts are the same

Open Scanner Latency

2011-01-31 Thread Wayne
After doing many tests (10k serialized scans) we see that on average opening the scanner takes 2/3 of the read time if the read is fresh (scannerOpenWithStop=~35ms, scannerGetList=~10ms). The second time around (1 minute later) we assume the region cache is hot and the open scanner is much faster
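
A sketch of the measurement loop described above, using the same two Thrift calls (0.90-era signatures; host and table names are hypothetical):

    from thrift.transport import TSocket, TTransport
    from thrift.protocol import TBinaryProtocol
    from hbase import Hbase          # generated bindings; module path varies by build
    import time

    transport = TTransport.TBufferedTransport(TSocket.TSocket('thrift-host', 9090))
    client = Hbase.Client(TBinaryProtocol.TBinaryProtocol(transport))
    transport.open()

    t0 = time.time()
    scanner = client.scannerOpenWithStop('mytable', 'row0', 'row9', [])
    t1 = time.time()
    rows = client.scannerGetList(scanner, 100)
    t2 = time.time()
    client.scannerClose(scanner)
    transport.close()

    print 'open: %.1f ms, getList(%d rows): %.1f ms' % (
        (t1 - t0) * 1000, len(rows), (t2 - t1) * 1000)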

Re: Open Scanner Latency

2011-01-31 Thread Wayne
vs cold you are seeing below. -ryan On Mon, Jan 31, 2011 at 1:38 PM, Wayne wav...@gmail.com wrote: After doing many tests (10k serialized scans) we see that on average opening the scanner takes 2/3 of the read time if the read is fresh (scannerOpenWithStop=~35ms, scannerGetList=~10ms

Re: Open Scanner Latency

2011-01-31 Thread Wayne
On Mon, Jan 31, 2011 at 4:54 PM, Stack st...@duboce.net wrote: On Mon, Jan 31, 2011 at 1:38 PM, Wayne wav...@gmail.com wrote: After doing many tests (10k serialized scans) we see that on average opening the scanner takes 2/3 of the read time if the read is fresh (scannerOpenWithStop=~35ms

Re: Open Scanner Latency

2011-01-31 Thread Wayne
in an LRU manner, and things would get slow again. Does this make sense to you? On Mon, Jan 31, 2011 at 1:50 PM, Wayne wav...@gmail.com wrote: We have heavy writes always going on so there is always memory pressure. If the open scanner reads the first block maybe that explains the 8ms

Re: Open Scanner Latency

2011-01-31 Thread Wayne
the first block it needs. This is done during the 'openScanner' call, and would explain the latency you are seeing in openScanner. -ryan On Mon, Jan 31, 2011 at 2:17 PM, Wayne wav...@gmail.com wrote: I assume BLOCKCACHE = 'false' would turn this off? We have turned off cache on all

Re: .oldlogs Cleanup

2011-01-29 Thread Wayne
The current log folder in hdfs (.logs) seems to always keep to 32 log files max per region server or the last hour. It is the .oldlogs folder that is growing crazy large. I added the setting for hbase.master.logcleaner.ttl and switched it from 7 days to 2 days and restarted yesterday and no

.oldlogs Cleanup

2011-01-28 Thread Wayne
How is the .oldlogs folder cleaned up? Our cluster's disk usage kept going up, and looking closely I realized that 91% of the space was going to .oldlogs that do not appear to be archived. This adds up to 12.5TB with rf=3 in the 4 days we have been up with .90. How can this be configured to be cleaned
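
For scale, the figures quoted here pencil out to roughly a terabyte of WALs per day:

    raw_bytes = 12.5e12    # space reported under .oldlogs
    rf = 3                 # HDFS replication factor
    days = 4               # uptime since the upgrade
    print '~%.1f TB of WALs per day' % (raw_bytes / rf / days / 1e12)   # ~1.0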

SocketTimeoutException caused by GC?

2011-01-27 Thread Wayne
We have got .90 up and running well, but again after 24 hours of loading a node went down. Under it all I assume it is a GC issue, but the GC logging rolls every 60 minutes so I can never see logs from 5 hours ago (working on getting Scribe up to solve that). Most of our issues are a node being

Re: Cluster Wide Pauses

2011-01-27 Thread Wayne
during balancing and splits. Wayne, have you confirmed in your RegionServer logs that the pauses are associated with splits or region movement, and that you are not seeing the blocking store files issue? JG

Re: SocketTimeoutException caused by GC?

2011-01-27 Thread Wayne
running hbase 0.20.6, I found none. Both use cdh3b2 hadoop. On Thu, Jan 27, 2011 at 6:48 AM, Wayne wav...@gmail.com wrote: We have got .90 up and running well, but again after 24 hours of loading a node went down. Under it all I assume it is a GC issue, but the GC logging rolls every 60

Re: SocketTimeoutException caused by GC?

2011-01-27 Thread Wayne
: On Thu, Jan 27, 2011 at 6:48 AM, Wayne wav...@gmail.com wrote: We have got .90 up and running well, but again after 24 hours of loading a node went down. Under it all I assume it is a GC issue, but the GC logging rolls every 60 minutes so I can never see logs from 5 hours ago (working

.90 Upgrade Problems

2011-01-24 Thread Wayne
We tried to upgrade to .90 and got 2x the nodes listed and saw none of our old regions showing up in the counts. We assumed the upgrade was not easy, so we just re-formatted the HDFS thinking it would fix everything, and still see the same problem. Any suggestions? The duplicate region servers listed

Re: .90 Upgrade Problems

2011-01-24 Thread Wayne
Is reverse DNS a requirement with .90? It was not with .89.xxx On Mon, Jan 24, 2011 at 3:17 PM, Wayne wav...@gmail.com wrote: We tried to upgrade to .90 and got 2x the nodes listed and saw none of our old regions showing up in the counts. We assumed the upgrade was not easy so we just re

Re: .90 Upgrade Problems

2011-01-24 Thread Wayne
PM, Wayne wav...@gmail.com wrote: Is reverse dns a requirement with .90? It was not with .89.xxx On Mon, Jan 24, 2011 at 3:17 PM, Wayne wav...@gmail.com wrote: We tried to upgrade to .90 and got 2x the nodes listed and saw none of our old regions showing up in the counts. We

Re: How to delete a table manually

2011-01-21 Thread Wayne
seemed to have any effect)... Thanks. On Fri, Jan 21, 2011 at 1:47 PM, Stack st...@duboce.net wrote: On Fri, Jan 21, 2011 at 4:51 AM, Wayne wav...@gmail.com wrote: After several hours I have figured out how to get the Disable command to work and how to delete manually, but in the process

Re: How to delete a table manually

2011-01-21 Thread Wayne
What is the difference between .90 and .90_master_rewrite? Thanks. On Fri, Jan 21, 2011 at 2:29 PM, Lars George lars.geo...@gmail.com wrote: Hi Wayne, 0.90.0 is out. Get it while it's hot from the HBase home page. Lars On Jan 21, 2011, at 20:22, Wayne wav...@gmail.com wrote: I

How to delete a table manually

2011-01-20 Thread Wayne
I need to delete some tables and I am not sure of the best way to do it. The shell does not work. The disable command says it ran OK, but every time I run drop or truncate I get an exception that says the table is not disabled. The UI shows it as disabled but truncate/drop still do not work. I have
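
The same disable-then-drop sequence can also be scripted through Thrift when the shell misbehaves; a hedged sketch (method names are from the 0.90-era Thrift IDL; host and table names hypothetical):

    from thrift.transport import TSocket, TTransport
    from thrift.protocol import TBinaryProtocol
    from hbase import Hbase          # generated bindings; module path varies by build
    import time

    transport = TTransport.TBufferedTransport(TSocket.TSocket('thrift-host', 9090))
    client = Hbase.Client(TBinaryProtocol.TBinaryProtocol(transport))
    transport.open()

    client.disableTable('doomed_table')
    # disabling is asynchronous; poll until it sticks before dropping
    while client.isTableEnabled('doomed_table'):
        time.sleep(1)
    client.deleteTable('doomed_table')
    transport.close()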

Re: Recommended Node Size Limits

2011-01-15 Thread Wayne
Not everyone is looking for a distributed memcache. Many of us are looking for a database that scales up and out, and for that there is only one choice. HBase does auto partitioning with regions; this is the genius of the original bigtable design. Regions are logical units small enough to be fast

Re: Cluster Wide Pauses

2011-01-14 Thread Wayne
We have not found any smoking gun here. Most likely these are region splits on a quickly growing/hot region that all clients get caught waiting for. On Thu, Jan 13, 2011 at 7:49 AM, Wayne wav...@gmail.com wrote: Thank you for the lead! We will definitely look closer at the OS logs. On Thu

Re: Cluster Wide Pauses

2011-01-13 Thread Wayne
Thank you for the lead! We will definitely look closer at the OS logs. On Thu, Jan 13, 2011 at 6:59 AM, Tatsuya Kawano tatsuya6...@gmail.com wrote: Hi Wayne, We are seeing some TCP Resets on all nodes at the same time, and sometimes quite a lot of them. Have you checked this article

Re: Cluster Wide Pauses

2011-01-12 Thread Wayne
loads. We see RS activity drop around memstore flushes, compactions, and especially splits. Friso On 11 Jan 2011, at 23:57, Wayne wrote: What is shared across all nodes that could stop everything? Originally I suspected the node with the .META. table and GC pauses but could never find

Re: Cluster Wide Pauses

2011-01-12 Thread Wayne
Added: https://issues.apache.org/jira/browse/HBASE-3438. On Wed, Jan 12, 2011 at 11:40 AM, Wayne wav...@gmail.com wrote: We are using 0.89.20100924, r1001068. We are seeing it during heavy write load (which is all the time), but yesterday we had read load as well as write load and saw

Re: Cluster Wide Pauses

2011-01-12 Thread Wayne
, St.Ack On Wed, Jan 12, 2011 at 9:03 AM, Wayne wav...@gmail.com wrote: Added: https://issues.apache.org/jira/browse/HBASE-3438. On Wed, Jan 12, 2011 at 11:40 AM, Wayne wav...@gmail.com wrote: We are using 0.89.20100924, r1001068 We are seeing it during heavy write load (which

Re: CPU Wait Problems

2011-01-11 Thread Wayne
haven't seen the same problems after down rev'ing to jdk1.6u16. -brent On Mon, Jan 10, 2011 at 12:59 PM, Wayne wav...@gmail.com wrote: We had a node last night go awol and got stuck in permanent 50% CPU wait time. The node also steadily shot up the load to 400 before we saw it and had

Re: CPU Wait Problems

2011-01-11 Thread Wayne
this also with evil disk controllers on the edge of dying. On Tue, Jan 11, 2011 at 12:10 PM, Wayne wav...@gmail.com wrote: Thanks a lot for the heads up on this. We have only seen this once, but if we start seeing it more we will definitely try to go back to a previous version. We

Cluster Wide Pauses

2011-01-11 Thread Wayne
We have very frequent cluster wide pauses that stop all reads and writes for seconds. We are constantly loading data to this cluster of 10 nodes. These pauses can happen as frequently as every minute but sometimes are not seen for 15+ minutes. Basically watching the Region server list with

HDFS Errors Deleting Blocks

2011-01-10 Thread Wayne
We are seeing a lot of warnings and errors in the HDFS logs (examples below). I am looking for any help or recommendations anyone can provide. It almost looks like compactions/splits occur and then errors occur while looking for the old data. Could this be true? Are these warnings and errors normal?

Re: Breaking down an HBase read through thrift

2011-01-10 Thread Wayne
Thank you for the help. Below are a few more questions. On Mon, Jan 10, 2011 at 1:41 PM, Jean-Daniel Cryans jdcry...@apache.org wrote: Inline. *Region/Meta Cache* Often times the region list is not hot and thrift has to talk to the meta table. We have 6k+ regions and growing quickly

Re: HDFS Errors Deleting Blocks

2011-01-10 Thread Wayne
you build a history. Is hbase trying to use files it's already deleted elsewhere? St.Ack On Mon, Jan 10, 2011 at 9:20 AM, Wayne wav...@gmail.com wrote: We are seeing a lot of warnings and errors in the HDFS logs (examples below). I am looking for any help or recommendations anyone can

CPU Wait Problems

2011-01-10 Thread Wayne
We had a node go awol last night and get stuck in permanent 50% CPU wait time. The node's load also steadily shot up to 400 before we saw it and had to hard reboot. Besides that, all other ganglia metrics flat-lined. Is this some sort of bizarre kernel problem? We are using xfs with std settings.

Re: HDFS Errors Deleting Blocks

2011-01-10 Thread Wayne
of hbase are you on? Thanks Wayne, St.Ack On Mon, Jan 10, 2011 at 11:34 AM, Wayne wav...@gmail.com wrote: Yes it appears blocks are being referenced that have been deleted hours before. The logs below are based on tracing that history. On Mon, Jan 10, 2011 at 2:31 PM, Stack st

Breaking down an HBase read through thrift

2011-01-08 Thread Wayne
I am trying to understand exactly what an HBase read is doing through Thrift (python) so that we can know what to change to improve our performance (read latency). We have turned off all cache to make testing consistent. *Region/Meta Cache* Often times the region list is not hot and thrift has
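
A sketch of the kind of instrumented read this breakdown implies, timing the same Thrift call cold and then warm (hypothetical host, table, row, and column names):

    from thrift.transport import TSocket, TTransport
    from thrift.protocol import TBinaryProtocol
    from hbase import Hbase      # generated bindings; module path varies by build
    import time

    transport = TTransport.TBufferedTransport(TSocket.TSocket('thrift-host', 9090))
    client = Hbase.Client(TBinaryProtocol.TBinaryProtocol(transport))
    transport.open()

    # the first call pays the region lookup plus disk seek;
    # the repeat shows how much of that a hot cache hides
    for attempt in ('cold', 'warm'):
        t0 = time.time()
        rows = client.getRowWithColumns('mytable', 'row123', ['cf:col1'])
        print '%s read: %.1f ms' % (attempt, (time.time() - t0) * 1000)
    transport.close()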

Re: Node Shutdown Problems

2011-01-07 Thread Wayne
. On Fri, Jan 7, 2011 at 1:09 AM, Stack st...@duboce.net wrote: On Thu, Jan 6, 2011 at 5:46 AM, Wayne wav...@gmail.com wrote: I had another node go down last night. No load at the time, it just seems it had issues and shut itself down. Any help would be greatly appreciated. Why would

regionserver.Store: Not in set org.apache.hadoop.hbase.regionserver.StoreScanner

2011-01-07 Thread Wayne
I see the message below as often as every few minutes. It appears to occur after compaction begins. Is this normal? Is it an indication of bigger issues? This is after having upped our xceivers. WARN org.apache.hadoop.hbase.regionserver.Store: Not in

Node Shutdown Problems

2011-01-06 Thread Wayne
I had another node go down last night. No load at the time; it just seems it had issues and shut itself down. Any help would be greatly appreciated. Why would the file system go away? Is this a hadoop problem, a hardware problem, or ?? Here is a sampling of the logs. Please let me know what

Re: JVM OOM

2011-01-05 Thread Wayne
, 2011 at 12:13 PM, Wayne wav...@gmail.com wrote: It was carrying ~9k writes/sec and has been for the last 24+ hours. There are 500+ regions on that node. I could not find the heap dump (location?) but we do have some errant big rows that have crashed before. When we query those big rows thrift

Re: JVM OOM

2011-01-05 Thread Wayne
of the hprof is usually where the program was launched from (check $HBASE_HOME dir). St.Ack On Wed, Jan 5, 2011 at 11:24 AM, Wayne wav...@gmail.com wrote: Pretty sure this is compaction. The same node OOME again along with another node after starting compaction. Like cass* .6 I guess hbase can

Python + TBinaryProtocolAccelerated

2011-01-04 Thread Wayne
Having worked with the other java/thrift based nosql solution, we have been using the Thrift accelerated protocol and it works great. It is very fast and we have seen 3-4x performance improvement on some read operations (wide rows). We have never seen this advertised or referenced with any hbase
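
In the Python client the switch is one line, assuming the thrift package's C extension (thrift.protocol.fastbinary) compiled on install; a minimal sketch:

    from thrift.transport import TSocket, TTransport
    from thrift.protocol import TBinaryProtocol
    from hbase import Hbase   # generated bindings; module path varies by build

    transport = TTransport.TBufferedTransport(TSocket.TSocket('thrift-host', 9090))

    # Plain protocol: pure-Python (de)serialization.
    # protocol = TBinaryProtocol.TBinaryProtocol(transport)

    # Accelerated protocol: same wire format, C-extension decoding.
    protocol = TBinaryProtocol.TBinaryProtocolAccelerated(transport)

    client = Hbase.Client(protocol)
    transport.open()

Because the wire format is identical, no server-side change is involved; if the C extension is missing, thrift quietly falls back to the pure-Python path and the speedup disappears.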

CMF NodeIsDeadException

2011-01-03 Thread Wayne
After heavily loading a 10 node cluster for 3-4 days I got a concurrent mode failure of 53 seconds followed by a NodeIsDeadException, which caused the node to be shut down. Is there a timeout that can be increased so this does not occur in the future? From my experience with Cassandra Concurrent

Re: CMF NodeIsDeadException

2011-01-03 Thread Wayne
, Wayne wav...@gmail.com wrote: Any help or suggestions would be appreciated. ParNew was getting large and taking too long (>100ms), so I will try to limit the size with the suggestion from the performance tuning page (-XX:NewSize=6m -XX:MaxNewSize=6m). The CMS concurrent mode failure

Re: CMF NodeIsDeadException

2011-01-03 Thread Wayne
-XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8 -XX:NewRatio=3 -XX:MaxTenuringThreshold=1 On Mon, Jan 3, 2011 at 5:05 PM, Stack st...@duboce.net wrote: On Mon, Jan 3, 2011 at 12:50 PM, Wayne wav...@gmail.com wrote: We have an 8GB heap. What should newsize be? I

.META. Table

2011-01-03 Thread Wayne
We are finding that the node responsible for the .META. table goes into GC storms, causing the entire cluster to go AWOL until it recovers. Isn't the master supposed to serve up the .META. table? Is it possible to pin this table somewhere that only handles this? Our master server and

Re: Read/Write Performance

2011-01-02 Thread Wayne
on). BTW, are you using CMS on an 8GB heap JVM and experiencing a 4 min pause? That sounds like a lot. On Thu, Dec 30, 2010 at 1:51 PM, Wayne wav...@gmail.com wrote: Lesson learned...restart thrift servers *after* restarting hadoop+hbase. On Thu, Dec 30, 2010 at 3:39 PM, Wayne wav

Re: Read/Write Performance

2010-12-30 Thread Wayne
) memstore.flush.size = 268435456 (256MB = 4x default) hregion.max.filesize = 1073741824 (1GB = 4x default) *Table* alter 'xxx', METHOD => 'table_att', DEFERRED_LOG_FLUSH => true On Wed, Dec 29, 2010 at 12:55 AM, Stack st...@duboce.net wrote: On Mon, Dec 27, 2010 at 11:47 AM, Wayne wav...@gmail.com
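
The write path behind those numbers reduces to batched Thrift mutations; a minimal sketch of one writer process (0.90-era IDL; host, table, and column names are hypothetical), batching rows to cut round trips:

    from thrift.transport import TSocket, TTransport
    from thrift.protocol import TBinaryProtocol
    from hbase import Hbase   # generated bindings; module path varies by build
    from hbase.ttypes import Mutation, BatchMutation

    transport = TTransport.TBufferedTransport(TSocket.TSocket('thrift-host', 9090))
    client = Hbase.Client(TBinaryProtocol.TBinaryProtocol(transport))
    transport.open()

    batch = [BatchMutation(row='row%08d' % i,
                           mutations=[Mutation(column='cf:col1', value='v%d' % i)])
             for i in range(1000)]
    client.mutateRows('mytable', batch)   # one round trip for 1000 rows
    transport.close()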

Re: Read/Write Performance

2010-12-30 Thread Wayne
and configuration. How many concurrent writer processes are you running? Thanks, Michael On Thu, Dec 30, 2010 at 8:51 AM, Wayne wav...@gmail.com wrote: We finally got our cluster up and running and write performance looks very good. We are getting sustained 8-10k writes/sec/node on a 10 node

Re: Read/Write Performance

2010-12-30 Thread Wayne
Lesson learned...restart thrift servers *after* restarting hadoop+hbase. On Thu, Dec 30, 2010 at 3:39 PM, Wayne wav...@gmail.com wrote: We have restarted with lzop compression, and now I am seeing some really long and frequent stop the world pauses of the entire cluster. The load requests

Re: Read/Write Performance

2010-12-27 Thread Wayne
, Dec 27, 2010 at 1:49 PM, Stack st...@duboce.net wrote: On Fri, Dec 24, 2010 at 5:09 AM, Wayne wav...@gmail.com wrote: We are in the process of evaluating hbase in an effort to switch from a different nosql solution. Performance is of course an important part of our evaluation. We are a python

Re: Read/Write Performance

2010-12-26 Thread Wayne
On Fri, Dec 24, 2010 at 5:09 AM, Wayne wav...@gmail.com wrote: We are in the process of evaluating hbase in an effort to switch from a different nosql solution. Performance is of course an important part of our evaluation. We are a python shop and we are very worried that we can not get any

Re: Cluster Size/Node Density

2010-12-20 Thread Wayne
will use HDFS pread instead of seek/read. For this application, you absolutely must be using pread. Good luck. I'm interested in seeing how you can get HBase to perform; we are here to help if you have any issues. JG

Cluster Size/Node Density

2010-12-17 Thread Wayne
much data is really too much on an hbase data node? Any help or advice would be greatly appreciated. Thanks Wayne

Re: Cluster Size/Node Density

2010-12-17 Thread Wayne
PM, Jean-Daniel Cryans jdcry...@apache.org wrote: Hi Wayne, This question has such a large scope but is applicable to such a tiny subset of workloads (eg yours) that fielding all the questions in detail would probably end up just wasting everyone's cycles. So first I'd like to clear up some

Re: Cluster Size/Node Density

2010-12-17 Thread Wayne
amount of activity in this area (optimizing HDFS for random reads) and lots of good ideas. HDFS-347 would probably help tremendously for this kind of high random read rate, bypassing the DN completely. JG

Re: hbase evaluation questions

2010-07-15 Thread Wayne
into the 100s of tables we might very well set up totally different clusters to handle different groups of customers. Thanks. On Thu, Jul 15, 2010 at 11:47 PM, Gary Helmling ghelml...@gmail.com wrote: On Wed, Jul 14, 2010 at 1:25 AM, Wayne wav...@gmail.com wrote: 1) How can hbase be configured