Inline.
J-D
1. I assume you've seen this benchmark by Yahoo (
http://www.brianfrankcooper.net/pubs/ycsb-v4.pdf and
http://www.brianfrankcooper.net/pubs/ycsb.pdf). They show three main
problems: latency goes up quite significantly when doing more operations,
operations/sec are capped at about half that of the other tested platforms,
and adding new nodes interrupts normal operation of the cluster for a while.
Do you consider these results a problem and, if so, are there any plans to
address them?
Please see our answer
http://www.search-hadoop.com/m?id=7c962aed1002091610q14f2d6f0gc420ddade319f...@mail.gmail.com
2. While running our tests (most were done using 0.20.2) we've had a few
incidents where a table went into "transition" without ever going out of it.
We had to restart the cluster to release the stuck tables. Is this a common
issue?
0.20.3 has a much better story, and 0.20.4 will include even more reliability fixes.
3. If I understand correctly, any major upgrade requires completely
shutting down the cluster while doing the upgrade, as well as deploying a
new version of the application compiled against the new client version. Did
I get that right? Is there any strategy for upgrading while the cluster is
still running?
There are lots of different reasons why: Hadoop RPC is versioned, a new
Hadoop major version requires a filesystem upgrade, etc.
So for HBase, you can currently do rolling restarts between minor
versions until told otherwise (in the release notes). See
http://wiki.apache.org/hadoop/Hbase/RollingRestart
Also, Hadoop RPC will probably be replaced with Avro in the future, and by
then all releases should be backward compatible (we hope).
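If you want to double-check what you're about to roll, one sanity check is to
compare the client jar's version with what the running master reports. This is
just a sketch against the 0.20-era Java API; I'm assuming ClusterStatus exposes
getHBaseVersion(), so adjust to whatever your client actually has:

import org.apache.hadoop.hbase.ClusterStatus;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.VersionInfo;

public class VersionCheck {
  public static void main(String[] args) throws Exception {
    HBaseAdmin admin = new HBaseAdmin(new HBaseConfiguration());
    // Version of the hbase jar on this client's classpath.
    String client = VersionInfo.getVersion();
    // Version the running master reports (assumed accessor).
    ClusterStatus status = admin.getClusterStatus();
    String cluster = status.getHBaseVersion();
    System.out.println("client=" + client + " cluster=" + cluster);
    // A rolling restart is only safe between minor versions, so stop
    // here and plan a full shutdown upgrade if they differ by more.
  }
}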
4. This is more a bug report than a question, but it seems that in 0.20.3
the master server doesn't stop cleanly and has to be killed manually. Is
anyone else seeing this too?
Can you provide more details? Logs and stack traces appreciated.
5. Are there any performance benchmarks for the Thrift gateway? Do you
have an estimate of the performance penalty of using the gateway compared to
using the native API?
The good thing with the Thrift servers is that they have long-lived
clients, so their cache is always full and HotSpot does its magic. In
our tests (we use Thrift servers in production here at StumbleUpon),
it adds maybe 1 or 2 ms per request...
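If you want numbers for your own workload, the simplest thing is to time the
same reads through the native client and again through the gateway. Here's a
rough sketch of the native side using the 0.20-era Java API (the table name,
keys, and counts are made up, just to show the shape of the test):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class NativeReadProbe {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(new HBaseConfiguration(), "test_table");
    int n = 10000;
    long start = System.nanoTime();
    for (int i = 0; i < n; i++) {
      // Spread the keys a bit so we're not hammering a single block.
      Result r = table.get(new Get(Bytes.toBytes("row-" + (i % 1000))));
    }
    double avgMs = (System.nanoTime() - start) / 1000000.0 / n;
    System.out.println("native avg: " + avgMs + " ms/get");
    // Run the same loop through the Thrift gateway and compare averages;
    // the difference is the per-request cost of the gateway.
  }
}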
6. Right now, my biggest concern about HBase is its administration
complexity and cost. If anyone can share their experience, that would be a
huge help. How many servers do you have in the cluster? How much ongoing
effort does it take to administer it? What uptime levels are you seeing
(including upgrades)? Do you have any good strategy for running one cluster
across two data centers, or replicating between two clusters in two
different DCs? Did you have any serious problems/crashes/downtime with
HBase?
HBase does require a knowledgeable admin, but which DB doesn't when used
on a very large scale? We have a full-time DBA here for our MySQL
clusters, but the difference is that those are easier to find than
HBase admins, right? So, some stats that we can make public:
- We have a production cluster, another one for processing, and a few
others for dev and testing (we have 3 HBase committers on staff so...
we need machines!). The production clusters have somewhat beefy nodes:
i7s with 24GB of RAM and 4x1TB in JBOD. None has more than 40 nodes.
- Cluster replication is actually a feature I'm working on. See
http://issues.apache.org/jira/browse/HBASE-1295. We currently have 2
clusters replicating to each other, each hosted in a different city,
and around 50M rows are sent each day (we aren't replicating
everything, though).
- We did have some good crashes, and we even run unofficial releases
sometimes, but since we are very knowledgeable we're able to fix
those issues ourselves, and we always get the fixes committed.
- I can't disclose our uptime since it would give hints about the uptime
of one of our products. I can say, though, that it's getting better with
every release, but eh, HBase is still very bleeding edge.
Thanks a lot,
Eran Kutner