Regarding the test:

 - Try to keep only one HBaseAdmin and one HTablePool, and always reuse the same conf between tests. Creating a new HBA or HTP creates a new HBaseConfiguration and thus a new connection. Use methods like setUpBeforeClass. Another option is to close the connection once you are done with those classes, and then in tearDown close the first one that you created in setUp. Right now I can count 25 connections being created in this test (I know it sucks, it's a regression in 0.90).

 - The fact that you are creating new HTablePools in do* means you are re-creating new HTables for almost every request you are doing, and that's a pretty expensive operation. Again, keeping only a single instance will help a lot.

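To make the connection counting above concrete, here is a minimal sketch in plain Java. It does not use the HBase client at all: `Conf`, `Connections` and `Client` are hypothetical stand-ins for HBaseConfiguration, the per-configuration connection cache, and HBaseAdmin/HTablePool, and the only behaviour assumed is the one described above (a new configuration instance means a new connection). The count of 25 is taken from the test under discussion.

```java
import java.util.HashSet;
import java.util.Set;

public class SharedFixtureSketch {

    /** Hypothetical stand-in for HBaseConfiguration (compared by identity). */
    static final class Conf {}

    /** Hypothetical stand-in for a connection cache keyed by configuration instance. */
    static final class Connections {
        private final Set<Conf> live = new HashSet<Conf>();
        void open(Conf conf)  { live.add(conf); }
        void close(Conf conf) { live.remove(conf); }
        int count()           { return live.size(); }
    }

    /** Hypothetical stand-in for HBaseAdmin or HTablePool. */
    static final class Client {
        private final Connections connections;
        private final Conf conf;
        private int requestsServed;

        Client(Connections connections, Conf conf) {
            this.connections = connections;
            this.conf = conf;
            connections.open(conf); // constructing a client opens a connection for its conf
        }
        void request() { requestsServed++; }
        void close()   { connections.close(conf); }
    }

    /** Anti-pattern: a fresh conf and client for every request, never closed. */
    public static int connectionsWhenRecreating(int requests) {
        Connections connections = new Connections();
        for (int i = 0; i < requests; i++) {
            new Client(connections, new Conf()).request();
        }
        return connections.count();
    }

    /** Suggested pattern: one conf and one client, created once (as in setUpBeforeClass). */
    public static int connectionsWhenSharing(int requests) {
        Connections connections = new Connections();
        Client shared = new Client(connections, new Conf());
        for (int i = 0; i < requests; i++) {
            shared.request();
        }
        return connections.count();
    }

    /** Alternative: still one client per request, but each is closed when done. */
    public static int connectionsWhenClosing(int requests) {
        Connections connections = new Connections();
        for (int i = 0; i < requests; i++) {
            Client client = new Client(connections, new Conf());
            client.request();
            client.close();
        }
        return connections.count();
    }

    public static void main(String[] args) {
        System.out.println(connectionsWhenRecreating(25)); // prints 25
        System.out.println(connectionsWhenSharing(25));    // prints 1
        System.out.println(connectionsWhenClosing(25));    // prints 0
    }
}
```

In a real JUnit test the shared conf, admin and pool would probably live in static fields initialised in setUpBeforeClass; the sketch only models why that keeps the connection count at one.
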
That's the most obvious stuff I saw.

J-D

On Wed, Apr 20, 2011 at 12:46 PM, George P. Stathis <gstat...@traackr.com> wrote:
> On Wed, Apr 20, 2011 at 12:48 PM, Stack <st...@duboce.net> wrote:
>
>> On Tue, Apr 19, 2011 at 12:08 PM, George P. Stathis
>> <gstat...@traackr.com> wrote:
>> > We have several unit tests that have started mysteriously failing in
>> > random ways as soon as we migrated our EC2 CI build to the new 0.90
>> > CDH3. Those tests used to run against 0.89 and never failed before.
>> > They also run OK on our local macbooks. On EC2, we are seeing lots of
>> > issues where the setup data is not being persisted in time for the
>> > tests to assert against them. They are also not always being torn
>> > down properly.
>> >
>>
>> These are your tests George dependent on HBase. What are they asking
>> of HBase? You are spinning up a cluster and then the takedown is not
>> working? Want to pastebin some log? We might see something.
>>
>
> It's not practical to paste all the secondary-indexing code we have in
> place. It's very likely that there is an issue in our code though, so I
> don't want to send folks down a rabbit hole. I just wanted to validate
> that there are no new configs in 0.90 (from 0.89) that could affect
> read/write consistency.
>
> I created a test that simulates what most of our secondary-indexing code
> does:
>
> http://pastebin.com/M9qKv87u
>
> It's a simplified version and of course, this one does not fail, or
> rather, I have not been able to make it fail in the same way. The only
> thing that I've hit with this test in pseudo-distributed mode is
> hbase.zookeeper.property.maxClientCnxns which I bumped up and was able
> to force it past it. The issue we are seeing does not throw any errors
> in any of the master/regionserver/zookeeper logs, so, right now, all
> indications are that the problem is on our side. I just need to dig
> deeper.
>
> BTW, we are not spinning up a temporary mini-cluster to test; instead,
> we have a dedicated dev pseudo-distributed machine against which our CI
> tests run. That's the environment that is presenting issues at the
> moment. Again, the odd part is that we have set up our local instances
> the same way as our dev pseudo-distributed machine and the tests pass.
> The differences are that we run on macs and the dev instance is on EC2.
>
>>
>> > We first started seeing issues running our hudson build on the same
>> > machine as the hbase pseudo-cluster. We figured that was putting too
>> > much load on the box, so we created a separate large instance on EC2
>> > to host just the 0.90 stack. This migration nearly quadrupled the
>> > number of unit tests failing at times. The only difference between
>> > the first and second CI setup is the network in between.
>> >
>>
>> Yeah. EC2. But we should be able to manage with a flakey network
>> anyways.
>>
>
> Just wanted to make sure that this was indeed the case.
>
>>
>> > Before we start tearing down our code line by line, I'd like to see
>> > if there are latency related configuration tweaks we could try to
>> > make the setup more resilient to network lag. Are there any
>> > hbase/zookeeper settings that might help? For instance, we see things
>> > such as HBASE_SLAVE_SLEEP in hbase-env.sh. Can that help?
>> >
>>
>> You've seen that hbase uses a different config. when it runs tests;
>> it's in src/tests/resources/hbase-site.xml.
>>
>> But if stuff used to work on 0.89 w/ old config. this is probably not
>> it.
>>
>
> I reverted all our configs back to default but the issue remains. I'll
> take a look at the test config and see if any of those settings may help
> out. From what I can gather at first glance, the test settings are more
> aggressive actually, so they seem even less tolerant of delays.
>
> Will keep digging and I'll post an update when we get somewhere.
>
>>
>> > Any suggestions are more than welcome. Also, the overview above may
>> > not be enough to go on, so please let me know if I could provide more
>> > details.
>> >
>>
>> I think pastebin of a failing test, one that used to pass, with
>> description (or code) of what is being done would be the place to
>> start; we might recognize the diff in 0.89 to 0.90.
>>
>> St.Ack
>>
>