Re: Don't Settle for Eventual Consistency

2017-02-20 Thread Edward Capriolo
On Mon, Feb 20, 2017 at 9:29 PM, Edward Capriolo 
wrote:

> AP systems are not available in practice.
>
> CP systems can be made highly available.
>
> Sounds like they are arguing AP is not AP, but somehow CP can be AP.
>
> Then Google can label failures as 'incidents' and CP and AP are unaffected.
>
> I swear FoundationDB claimed it solved CAP; too bad FoundationDB is
> UnavailableException.
>
> On Friday, February 17, 2017, Ted Yu  wrote:
>
>> Reference #8 at the end of the post is interesting.
>>
>> On Fri, Feb 17, 2017 at 9:23 AM, Robert Yokota 
>> wrote:
>>
>> > Hi,
>> >
>> > This may be helpful to those who are considering the use of HBase.
>> >
>> > https://yokota.blog/2017/02/17/dont-settle-for-eventual-consistency/
>> >
>>
>
>
> --
> Sorry this was sent from mobile. Will do less grammar and spell check than
> usual.
>


Totally fair comparison, by the way: call out figures from Google and
Facebook, companies with huge development budgets and data centers and teams
of tens or hundreds of developers building in-house software, then compare
those deployments to a deployment of Cassandra or Riak at Yammer.

Random thought: compare the availability of Amazon S3, built to leverage
eventual consistency on top of other eventually consistent systems, with,
say, Google Cloud Storage...

https://cloud.google.com/products/storage/
Available: All storage classes offer very high availability. Your data is
accessible when you need it. Multi-Regional offers 99.95% and Regional
storage offers 99.9% monthly availability in their Service Level Agreement.
Nearline and Coldline storage offer 99% monthly availability.

What happened to those 5 9s?


Don't Settle for Eventual Consistency

2017-02-20 Thread Edward Capriolo
AP systems are not available in practice.

CP systems can be made highly available.

Sounds like they are arguing AP is not AP, but somehow CP can be AP.

Then Google can label failures as 'incidents' and CP and AP are unaffected.

I swear FoundationDB claimed it solved CAP; too bad FoundationDB is
UnavailableException.

On Friday, February 17, 2017, Ted Yu  wrote:

> Reference #8 at the end of the post is interesting.
>
> On Fri, Feb 17, 2017 at 9:23 AM, Robert Yokota  wrote:
>
> > Hi,
> >
> > This may be helpful to those who are considering the use of HBase.
> >
> > https://yokota.blog/2017/02/17/dont-settle-for-eventual-consistency/
> >
>


-- 
Sorry this was sent from mobile. Will do less grammar and spell check than
usual.


Re: How to run two data nodes on one pc?

2012-05-09 Thread Edward Capriolo
You can run two DataNodes on the same machine, but it does not prove
much. It is like using two browsers instead of using tabs.

You will have to tweak many configuration properties to ensure they do
not overlap on directories or ports.

By "add the other node as a slave" I guess you mean list it twice in
the slaves file. I would just avoid using the slaves file altogether.
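
As a rough sketch of what has to differ for the second DataNode (property
names are the 0.20-era defaults; the paths and ports here are made up, and
in practice these overrides would live in a second hdfs-site.xml under a
separate conf directory rather than in code):

import org.apache.hadoop.conf.Configuration;

// Hypothetical overrides for a second DataNode on the same host.
public class SecondDataNodeConf {
  public static Configuration create() {
    Configuration conf = new Configuration();
    // Separate storage directory so the two DataNodes do not share block dirs.
    conf.set("dfs.data.dir", "/data2/dfs/dn");
    // Move the data-transfer, HTTP, and IPC ports off the defaults (50010/50075/50020).
    conf.set("dfs.datanode.address", "0.0.0.0:50011");
    conf.set("dfs.datanode.http.address", "0.0.0.0:50076");
    conf.set("dfs.datanode.ipc.address", "0.0.0.0:50021");
    return conf;
  }
}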

Edward

On Wed, May 9, 2012 at 2:17 PM, MapHadoop  wrote:
>
> Hi,
> Im trying to get Duo core computer to run 2 data nodes on Hadoop.
> What do I have to do to add the other node as a slave?
> --
> View this message in context: 
> http://old.nabble.com/How-to-run-two-data-nodes-on-one-pc--tp33763649p33763649.html
> Sent from the HBase User mailing list archive at Nabble.com.
>


Re: HBase cluster on heterogeneous filesystems

2011-11-11 Thread Edward Capriolo
On Fri, Nov 11, 2011 at 4:42 PM, Otis Gospodnetic <
otis_gospodne...@yahoo.com> wrote:

> Hello,
>
> I was wondering if anyone has done an experiment with HBase or HDFS/MR
> where machines in the cluster have heterogeneous underlying file systems?
> e.g.,
> * 10 nodes with xfs
> * 10 nodes with ext3
> * 10 nodes with ext4
>
> The goal being comparing performance of MapReduce jobs reading from and
> writing to HBase (or just HDFS).
>
>
> And does anyone have any reason to believe doing the above would be super
> risky and cause data loss?
>
> Thanks,
> Otis
> 
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/


Since Hadoop abstracts you from the filesystem guts, the underlying file
systems can be mixed and matched. You can even mix and match the disks on a
single machine.

I have found that ext3 performance gets noticeably poor as disks get full.
I captured system vitals from before and after an ext3-to-ext4 upgrade:

http://www.edwardcapriolo.com/roller/edwardcapriolo/entry/a_great_reason_to_use

Also, if you want to get the most out of your disks, read this:

http://allthingshadoop.com/2011/05/20/faster-datanodes-with-less-wait-io-using-df-instead-of-du/

XFS is usually described as on par with or slightly better than ext4.
However, anecdotally, most hardcore sysadmins I know can recount at least one
XFS "I lost my superblock" story :)


Re: [announce] Accord: A high-performance coordination service for write-intensive workloads

2011-09-23 Thread Edward Capriolo
The cages library http://code.google.com/p/cages/ seems to be similar.

On Fri, Sep 23, 2011 at 4:52 PM, Ted Dunning  wrote:

> This is not correct.  You can mix and match reads, writes and version
> checks
> in a multi.
>
> 2011/9/23 OZAWA Tsuyoshi 
>
> > - Limited Transaction APIs. ZK can only issue write operations (write,
> > del) in a transaction(multi-update).
> >
>


Re: HBase and Cassandra on StackOverflow

2011-09-01 Thread Edward Capriolo
On Wed, Aug 31, 2011 at 1:34 AM, Time Less  wrote:

> Most of your points are dead-on.
>
> > Cassandra is no less complex than HBase. All of this complexity is
> > "hidden" in the sense that with Hadoop/HBase the layering is obvious --
> > HDFS, HBase, etc. -- but the Cassandra internals are no less layered.
> >
> > Operationally, however, HBase is more complex.  Admins have to configure
> > and manage ZooKeeper, HDFS, and HBase.  Could this be improved?
> >
>
> I strongly disagree with the premise[1]. Having personally been involved in
> the Digg Cassandra rollout, and spent up until a couple months ago being in
> part-time weekly contact with the Digg Cassandra administrator, and having
> very close ties to the SimpleGeo Cassandra admin, I know it is a fickle
> beast. Having also spent a good amount of time at StumbleUpon and Mozilla
> (and now Riot Games) I also see first-hand that HBase is far more stable
> and
> -- dare I say it? -- operationally more simple.
>
> So okay, HBase is "harder to set up" if following a step-by-step guide on a
> wiki is "hard,"[2] but it's FAR easier to administer. Cassandra is rife
> with
> cascading cluster failure scenarios. I would not recommend running
> Cassandra
> in a highly-available high-volume data scenario, but don't hesitate to do
> so
> for HBase.
>
> I do not know if this is a guaranteed (provable due to architecture)
> result,
> or just the result of the Cassandra community being... how shall I say...
> hostile to administrators. But then, to me it doesn't matter. Results do.
>
> --
> Tim Ellis
> Data Architect, Riot Games
> [1] That said, the other part of your statement is spot-on, too. It's
> surely
> possible to improve the HBase architecture or simplify it.
> [2] I went from having never set up HBase nor ever used Chef to having
> functional Chef recipes that installed a functional HBase/HDFS cluster in
> about 2 weeks. From my POV, the biggest stumbling point was that HDFS by
> default stores critical data in the underlying filesystem's /tmp directory
> by default, which is, for lack of a better word, insane. If I had to
> suggest
> how to simplify "HBase installation," I'd ask for sane HDFS config files
> that are extremely common and difficult-to-ignore.
>

Why are you quoting "harder"? What was said was "more complex". Setting up N
things is more complex than setting up a single thing.

First, you have to learn:
1) Linux-HA
2) DRBD

Right out of the gate, just to have a redundant NameNode.

This is not easy, fast, or simple. In fact, this is quite a pain.
http://docs.google.com/viewer?a=v&q=cache:9rnx-eRzi1AJ:files.meetup.com/1228907/Hadoop%2520Namenode%2520High%2520Availability.pptx+linux+ha+namenode&hl=en&gl=us&pid=bl&srcid=ADGEESig5aJNVAXbLgBwyc311sPSd88jUJbKHx4z2PQtDKHnmM1FuCJpg2IUyqi5JrmUL3RbCb8QRYsjHnP74YuKQfOQXoUZxnhrCy6N1kVpiG1jNi4zhqoKlUTmoDaqS1NegCFb6-WM&sig=AHIEtbQbjN1Olwxui5JmywdWzhqv4Hq3tw&pli=1

Doing it properly involves setting up physical wires between servers or link
aggregation groups. You can't script having someone physically run crossover
cables; you need your switching engineer to set up LAGs.
Also, you may notice that everyone who describes this setup describes it
using Linux-HA v1, which has been deprecated for over two years. That also
demonstrates how complicated this process is: people tend to touch it once
and never touch it again because of how fragile it is.

You are also implying that following the wiki is easy. Personally, I find
that the wiki has fine detail, but it is confusing.
Here is why.

"1.3.1.2. hadoop

This version of HBase will only run on Hadoop 0.20.x. It will not run on
hadoop 0.21.x (nor 0.22.x). HBase will lose data unless it is running on an
HDFS that has a durable sync. Currently only the branch-0.20-append branch
has this attribute[1]. No official releases have been made from this branch
up to now so you will have to build your own Hadoop from the tip of this
branch. Michael Noll has written a detailed blog, Building an Hadoop 0.20.x
version for HBase 0.90.2, on how to build an Hadoop from branch-0.20-append.
Recommended.

Or rather than build your own, you could use Cloudera's CDH3. CDH has the
0.20-append patches needed to add a durable sync (CDH3 betas will suffice;
b2, b3, or b4)."

So the setup starts by recommending rolling your own Hadoop (a pain in the
ass) OR using a beta ( :(  ).

Then, when it gets to HBase, it branches into “Standalone HBase” and Section
1.3.2.2, “Distributed”.
Then it branches into "pseudo-distributed" and "fully distributed", and then
the ZooKeeper section offers you two options: "1.3.2.2.2.2. ZooKeeper" and
"1.3.2.2.2.2.1. Using existing ZooKeeper ensemble".

Not to say this is hard or impossible, but it is a lot of information to
digest, and all the branching decisions are hard to understand for a
first-time user.

Uppercasing the word FAR does not prove to me that HBase is easier to
administer, nor does your employment history or second-hand stories from
unnamed people you know.

Re: HBase and Cassandra on StackOverflow

2011-08-31 Thread Edward Capriolo
On Tue, Aug 30, 2011 at 1:42 PM, Joe Pallas wrote:

>
> On Aug 30, 2011, at 2:47 AM, Andrew Purtell wrote:
>
> > Better to focus on improving HBase than play whack a mole.
>
> Absolutely.  So let's talk about improving HBase.  I'm speaking here as
> someone who has been learning about and experimenting with HBase for more
> than six months.
>
> > HBase supports replication between clusters (i.e. data centers).
>
> That’s … debatable.  There's replication support in the code, but several
> times in the recent past when someone asked about it on this mailing list,
> the response was “I don't know of anyone actually using it.”  My
> understanding of replication is that you can't replicate any existing data,
> so unless you activated it on day one, it isn't very useful.  Do I
> misunderstand?
>
> > Cassandra does not have strong consistency in the sense that HBase
> provides. It can provide strong consistency, but at the cost of failing any
> read if there is insufficient quorum. HBase/HDFS does not have that
> limitation. On the other hand, HBase has its own and different scenarios
> where data may not be immediately available. The differences between the
> systems are nuanced and which to use depends on the use case requirements.
>
> That's fair enough, although I think your first two sentences nearly
> contradict each other :-).  If you use N=3, W=3, R=1 in Cassandra, you
> should get similar behavior to HBase/HDFS with respect to consistency and
> availability ("strong" consistency and reads do not fail if any one copy is
> available).
>
> A more important point, I think, is the one about storage.  HBase uses two
> different kinds of files, data files and logs, but HDFS doesn't know about
> that and cannot, for example, optimize data files for write throughput (and
> random reads) and log files for low latency sequential writes.  (For
> example, how could performance be improved by adding solid-state disk?)
>
> > Cassandra's RandomPartitioner / hash based partitioning means efficient
> MapReduce or table scanning is not possible, whereas HBase's distributed
> ordered tree is naturally efficient for such use cases, I believe explaining
> why Hadoop users often prefer it. This may or may not be a problem for any
> given use case.
>
> I don't think you can make a blanket statement that random partitioning
> makes efficient MapReduce impossible (scanning, yes).  Many M/R tasks
> process entire tables.  Random partitioning has definite advantages for some
> cases, and HBase might well benefit from recognizing that and adding some
> support.
>
> > Cassandra is no less complex than HBase. All of this complexity is
> "hidden" in the sense that with Hadoop/HBase the layering is obvious --
> HDFS, HBase, etc. -- but the Cassandra internals are no less layered.
>
> Operationally, however, HBase is more complex.  Admins have to configure
> and manage ZooKeeper, HDFS, and HBase.  Could this be improved?
>
> > With Cassandra, all RPC is via Thrift with various wrappers, so actually
> all Cassandra clients are second class in the sense that jbellis means when
> he states "Non-Java clients are not second-class citizens".
>
> That's disingenuous.  Thrift exposes all of the Cassandra API to all of the
> wrappers, while HBase clients who want to use all of the HBase API must use
> Java.  That can be fixed, but it is the status quo.
>
> joe
>
>
Hooked into another Cassandra/HBase thread...

Cassandra's RandomPartitioner / hash based partitioning means efficient
MapReduce or table scanning is not possible, whereas HBase's distributed
ordered tree is naturally efficient for such use cases, I believe explaining
why Hadoop users often prefer it. This may or may not be a problem for any
given use case.

Many people can and do benefit from this property of HBase. Efficient
map/reduce still strikes me as an oxymoron :) Yes, you can 'push down'
something like 'WHERE key > x AND key < y', and it is pretty nifty, but that
does not really bring you all the way to complex queries. Cassandra now has
support for built-in secondary indexes, and I think users will soon be able
to 'push down' WHERE clauses for 'efficient' map/reduce. Also, you can
currently range scan on columns (in both directions) in C*, which is
efficient. So if you can turn a key-ranging design into a column-ranging
design, you can get the same effect. With both systems, HBase and Cassandra,
you likely end up needing to design your data around your queries.
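
For what it is worth, the key-range 'push down' on the HBase side is just a
bounded Scan. A minimal sketch against the 0.90-era client API (the table
name and key values are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class KeyRangeScan {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable");   // hypothetical table
    Scan scan = new Scan();
    scan.setStartRow(Bytes.toBytes("x"));         // key >= x (start row is inclusive)
    scan.setStopRow(Bytes.toBytes("y"));          // key <  y (stop row is exclusive)
    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result r : scanner) {
        System.out.println(Bytes.toString(r.getRow()));
      }
    } finally {
      scanner.close();
      table.close();
    }
  }
}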

Cassandra is no less complex than HBase. All of this complexity is "hidden"
in the sense that with Hadoop/HBase the layering is obvious -- HDFS, HBase,
etc. -- but the Cassandra internals are no less layered.

*This is an opinion*. I will disagree on this one. For example, the
Cassandra gossip protocol exchanges two facts (IMHO): the state of the ring
(UP/DOWN) and the token ownership of nodes. This information only changes
when nodes join or leave the cluster. On the HBase side of things, many small
regions are splitting and moving 

Re: How can i disable filesystem cache

2011-08-25 Thread Edward Capriolo
On Thu, Aug 25, 2011 at 1:35 PM, Li Pi  wrote:

> You could just use a test size that always fits within the hbase cache.
> On Aug 25, 2011 12:50 AM, "Gui Zhang"  wrote:
> > I want test hbase cache perfermance, as i know linux alway try to cache
> the
> > file, so i would like to disable filesystem cache.
> > my hbase is build on top of hadoop, OS is redhat linux。
> >
> > So here i actually need disable fs cache for hadoop files.
> > Is this possible and how?
> >
> > Thanks
> > Gui
>

For a system with 16GB of RAM you can set your -Xms and -Xmx to a very high
number like 14 or 15GB. Or run some other process that hogs all your system
memory, like a C program that allocates a really large linked list and then
sleeps. You really do not want to take away ALL of your VFS file cache,
though; it gets very painful when your inode table is not cached or the
libraries that bash uses are not cached.
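
A minimal sketch of such a memory hog, in Java rather than C (the sizes are
arbitrary): it grabs most of the RAM in big chunks, touches it so it is
really resident, and then sleeps, which squeezes out the VFS cache.

import java.util.LinkedList;
import java.util.List;

// Crude memory hog: allocate ~14GB in 64MB chunks, touch each page, then sleep.
// Run with a heap big enough to hold it, e.g. java -Xmx15g MemoryHog
public class MemoryHog {
  public static void main(String[] args) throws InterruptedException {
    List<byte[]> hog = new LinkedList<byte[]>();
    long target = 14L * 1024 * 1024 * 1024;
    int chunk = 64 * 1024 * 1024;
    for (long held = 0; held < target; held += chunk) {
      byte[] block = new byte[chunk];
      for (int i = 0; i < block.length; i += 4096) {
        block[i] = 1;                 // touch each page so it is actually resident
      }
      hog.add(block);
    }
    Thread.sleep(Long.MAX_VALUE);     // hold the memory until the process is killed
  }
}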


Re: TTL for cell values

2011-08-14 Thread Edward Capriolo
On Sunday, August 14, 2011, Ian Varley  wrote:
>> "I don't think anyone is well served by that kind of shallow analysis."
>
>
> You're right, Andy; sorry if it came off sounding flip. My point was
simply that the idea of a persistent data store with a configuration setting
that makes the most current version of your data disappear without an
explicit delete is very counter-intuitve for traditional database folks like
me. Durability is the first, most inviolate rule, and this setting subverts
it in a way that is (at least for me) not obvious at first, and differs
drastically from the max versions setting. Maybe my confusion was due to the
fact that I was looking for specific behavior (HBASE-4071, essentially). I
totally see your point, though; putting it the way I did makes for a rather
alarming pull quote. :(
>
> I'm not at all suggesting we should alter the existing behavior (as if
that were even possible at this point); this is a useful setting for data
that's basically just a cache. But this is an area where the road from RDBMS
to HBase might be a little bumpy for folks, and adding a new option would
also have the advantage of making it even more clear what TTL is for.
>
> Ian
>
>
> On Aug 13, 2011, at 11:28 PM, "Andrew Purtell" 
wrote:
>
>>>  When I was talking to someone the other day about the current TTL
policy, he was like "WTF, who would want that, it eats your data?"
>>
>> I don't think anyone is well served by that kind of shallow analysis.
>>
>> The TTL feature was introduced for the convenience of having the system
automatically garbage collect transient data. If you set a TTL on a column
family, you are telling the system that the data shall expire after that
interval elapses, that the data is only useful for the configured time
period. If the data should not actually be considered transient, then
configuring a TTL is the wrong thing to do -- at least currently.
>>
>>>  "TTL except for most recent"
>>
>> HBASE-4071 is a useful and good idea.
>>
>> Best regards,
>>
>>
>> - Andy
>>
>>
>> Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)
>>
>>
>>> 
>>> From: Ian Varley 
>>> To: "user@hbase.apache.org" 
>>> Sent: Saturday, August 13, 2011 8:24 PM
>>> Subject: Re: TTL for cell values
>>>
>>> So, what you're saying is:
>>>
>>> http://lmgtfy.com/?q=hbase+ttl+remove+all+versions+except+most+recent
>>>
>>> :)
>>>
>>> I like the idea of making this pluggable (via the coprocessor framework,
or otherwise). But I also think this is a fundamental enough policy option
that making it hard-coded might be a good idea. When I was talking to
someone the other day about the current TTL policy, he was like, "WTF, who
would want that, it eats your data?". There's no such thing as a "keep 0
versions" option, and thus no way to accidentally lose your most current
data using that approach. But with the TTL version there is, which is (IMO)
counter-intuitive for those coming from an RDBMS background.
>>>
>>> Commented thusly in the JIRA. :)
>>>
>>> Ian
>>>
>>> On Aug 13, 2011, at 8:00 PM, lars hofhansl wrote:
>>>
>>> Hey Ian, (how are things :)
>>>
>>> I just stumbled across https://issues.apache.org/jira/browse/HBASE-4071.
>>>
>>> -- Lars
>>>
>>>
>>> 
>>> From: Ian Varley mailto:ivar...@salesforce.com>>
>>> To: "user@hbase.apache.org" <
user@hbase.apache.org>
>>> Sent: Saturday, August 13, 2011 6:51 PM
>>> Subject: TTL for cell values
>>>
>>> Hi all,
>>>
>>> Quick clarification on TTL for cells. The concept makes sense (instead
of "keep 3 versions" you say "keep versions more recent than time T"). But,
if there's only 1 value in the cell, and that value is older than the TTL,
will it also be deleted?
>>>
>>> If so, has there ever been discussion of a "TTL except for most recent"
option? (i.e. you want the current version to be permanently persistent, but
also want some time-based range of version history, so you can peek back and
get consistent snapshots within the last hour, 6 hours, 24 hours, etc). TTL
seems perfect for this, but not if it'll chomp the current version of cells
too! :)
>>>
>>> Thanks!
>>> Ian
>>>
>>>
>>>
>>>
>

I am slightly confused now. Time-to-live is used in networking: after n
hops, drop this packet. It is also used in memcache: expire this data n
seconds after insert.

I do not know of any specific TTL feature in an RDBMS, so I do not understand
why someone would expect TTL'd data to be permanently durable.
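
For reference, TTL in HBase is set per column family, in seconds. A minimal
sketch with the 0.90-era admin API (the table and family names are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class CreateTtlTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    HColumnDescriptor family = new HColumnDescriptor("metrics"); // hypothetical family
    // Expire everything older than 7 days -- including the most recent version,
    // which is the behavior being debated above.
    family.setTimeToLive(7 * 24 * 60 * 60);
    HTableDescriptor table = new HTableDescriptor("timeseries"); // hypothetical table
    table.addFamily(family);
    admin.createTable(table);
  }
}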


Re: Mongo vs HBase

2011-08-10 Thread Edward Capriolo
On Wed, Aug 10, 2011 at 4:26 PM, Li Pi  wrote:

> You'll have to build your own secondary indexes for now.
>
> On Wed, Aug 10, 2011 at 1:15 PM, Laurent Hatier  >wrote:
>
> > Yes, i have heard this index but is it available on hbase 0.90.3 ?
> >
> > 2011/8/10 Chris Tarnas 
> >
> > > Hi Laurent,
> > >
> > > Without more details on your schema and how you are finding that number
> > in
> > > your table it is impossible to fully answer the question. I suspect
> what
> > you
> > > are seeing is mongo's native support for secondary indexes. If you were
> > to
> > > add secondary indexes in HBase then retrieving that row should be on
> the
> > > order of 3-30ms. If that is you main query method then you could
> > reorganize
> > > your table to make that long number your row key, then you would get
> even
> > > faster reads.
> > >
> > > -chris
> > >
> > >
> > > On Aug 10, 2011, at 10:02 AM, Laurent Hatier wrote:
> > >
> > > > Hi all,
> > > >
> > > > I would like to know why MongoDB is faster than HBase to select
> items.
> > > > I explain my case :
> > > > I've inserted 4'000'000 lines into HBase and MongoDB and i must
> > calculate
> > > > the geolocation with the IP. I calculate a Long number with the IP
> and
> > i
> > > go
> > > > to find it into the 4'000'000 lines.
> > > > it's take 5 ms to select the right row with Mongo instead of HBase
> > takes
> > > 5
> > > > seconds.
> > > > I think that the reason is the method : cur.limit(1) with MongoDB but
> > is
> > > > there no function like this with HBase ?
> > > >
> > > > --
> > > > Laurent HATIER
> > > > Étudiant en 2e année du Cycle Ingénieur à l'EISTI
> > >
> > >
> >
> >
> > --
> > Laurent HATIER
> > Étudiant en 2e année du Cycle Ingénieur à l'EISTI
> >
>

http://www.xtranormal.com/watch/6995033/mongo-db-is-web-scale
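
Joking aside, Chris's suggestion upthread is the right one: if the computed
Long is the row key, the lookup becomes a single Get instead of a filter over
4,000,000 rows. A hedged sketch against the old client API (the table,
family, and qualifier names are invented):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class GeoLookup {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "geo");         // hypothetical table
    long ipAsLong = 3232235777L;                    // e.g. 192.168.1.1 packed into a long
    Get get = new Get(Bytes.toBytes(ipAsLong));     // row key == the computed long
    Result result = table.get(get);
    byte[] location = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("loc"));
    System.out.println(Bytes.toString(location));
    table.close();
  }
}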


Re: YCSB Benchmarking for HBase

2011-08-03 Thread Edward Capriolo
On Wed, Aug 3, 2011 at 6:10 AM, praveenesh kumar wrote:

> Hi,
>
> Anyone working on YCSB (Yahoo Cloud Service Benchmarking) for HBase ??
>
> I am trying to run it, its giving me error:
>
> $ java -cp build/ycsb.jar com.yahoo.ycsb.CommandLine -db
> com.yahoo.ycsb.db.HBaseClient
>
> YCSB Command Line client
> Type "help" for command line help
> Start with "-help" for usage info
> Exception in thread "main" java.lang.NoClassDefFoundError:
> org/apache/hadoop/conf/Configuration
>at java.lang.Class.getDeclaredConstructors0(Native Method)
>at java.lang.Class.privateGetDeclaredConstructors(Class.java:2406)
>at java.lang.Class.getConstructor0(Class.java:2716)
>at java.lang.Class.newInstance0(Class.java:343)
>at java.lang.Class.newInstance(Class.java:325)
>at com.yahoo.ycsb.CommandLine.main(Unknown Source)
> Caused by: java.lang.ClassNotFoundException:
> org.apache.hadoop.conf.Configuration
>at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
>at java.security.AccessController.doPrivileged(Native Method)
>at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
>at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
>at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
>at java.lang.ClassLoader.loadClass(ClassLoader.java:266)
>... 6 more
>
> By the error, it seems like its not able to get Hadoop-core.jar file, but
> its already in the class path.
> Has anyone worked on YCSB with hbase ?
>
> Thanks,
> Praveenesh
>


I just did
http://www.edwardcapriolo.com/roller/edwardcapriolo/entry/ycsb_cassandra_0_7_6.

For HBase I followed the steps here:
http://blog.lars-francke.de/2010/08/16/performance-testing-hbase-using-ycsb/

I also followed the comment at the bottom to make sure the hbase-site.xml
was on the classpath.

The startup script looks like this:

CP=build/ycsb.jar:db/hbase/conf/
for i in db/hbase/lib/* ; do
  CP=$CP:${i}
done
# -load loads the workload
# -t runs the workload
java -cp $CP com.yahoo.ycsb.Client -db com.yahoo.ycsb.db.HBaseClient -P workloads/workloadb \


Re: hbck -fix

2011-07-05 Thread Edward Capriolo
On Mon, Jul 4, 2011 at 1:28 PM, Stack  wrote:

> On Sun, Jul 3, 2011 at 12:39 AM, Andrew Purtell 
> wrote:
> > I've done exercises in the past like delete META on disk and recreate it
> with the earlier set of utilities (add_table.rb). This always "worked for
> me" when I've tried it.
> >
>
> We need to update add_table.rb at least.  The onlining of regions was
> done by the metascan.  It no longer exists in 0.90.  Maybe a
> disable/enable after an add_table.rb would do but probably better to
> revamp and merge it with hbck?
>
>
> > Results from torture tests that HBase was subjected to in the timeframe
> leading up to 0.90 also resulted in better handling of .META. table related
> errors. They are fortunately demonstrably now rare.
> >
>
> Agreed.
>
>
> >My concern here is getting repeatable results demonstrating HBCK
> weaknesses will be challenging.
> >
>
> Yes.  This is the tough one.  I was hoping Wayne had a snapshot of
> .META. to help at least characterize the problem.
>
> (This does sound like something our Dan Harvey ran into recently on an
> hbase 0.20.x hbase.  Let me go back to him.  He might have some input
> that will help here.)
>
> St.Ack
>

If the root of this issue is the master's disk filling up, it is not
entirely an HBase issue. If you search the Hadoop mailing list, you will find
people whose NameNode disk filled up and who had quite a catastrophic,
hard-to-recover-from failure.

Monitor the s#it out of your SPOFs. To throw something very anecdotal in
here: I find that not many data stores recover well from full-disk errors.


Re: hadoop 0.20.203.0 Java Runtime Environment Error

2011-07-01 Thread Edward Capriolo
That looks like an ancient version of Java. Get 1.6.0_u24 or _u25 from
Oracle.

Upgrade to a recent Java and possibly update your C libs.

Edward

On Fri, Jul 1, 2011 at 7:24 PM, Shi Yu  wrote:

> I had difficulty upgrading applications from Hadoop 0.20.2 to Hadoop
> 0.20.203.0.
>
> The standalone mode runs without problem.  In real cluster mode, the
> program freeze at map 0% reduce 0% and there is only one attempt  file in
> the log directory. The only information is contained in stdout file :
>
> #
> # An unexpected error has been detected by Java Runtime Environment:
> #
> #  SIGFPE (0x8) at pc=0x2ae751a87b83, pid=5801, tid=1076017504
> #
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (1.6.0-b105 mixed mode)
> # Problematic frame:
> # C  [ld-linux-x86-64.so.2+0x7b83]
> #
> # An error report file with more information is saved as hs_err_pid5801.log
> #
> # If you would like to submit a bug report, please visit:
> #   
> http://java.sun.com/webapps/**bugreport/crash.jsp
>
>
> (stderr and syslog are empty)
>
> So what is the problem in ld-linux-x86-64.so.2+0x7b83 ?
>
> The program I was testing uses identity Mapper and Reducer, and the input
> file is a single 1M plain text file.   Then I found several hs_err logs
> under the default directory of hadoop, I attach the log file in this email.
>
> The reason I upgrade from 0.20.2 is I had lots of disk check error when
> processing TB data when the disks still have plenty of space. But then I was
> stuck at getting a simple toy problem to work in 0.20.203.0.
>
> Shi
>
>
>


Re: Region replication?

2011-04-19 Thread Edward Capriolo
On Tue, Apr 19, 2011 at 4:09 PM, Ted Dunning  wrote:
> This is kind of true.
>
> There is only one regionserver to handle the reads, but there are
> multiple copies of the data to handle fail-over.
>
> On Tue, Apr 19, 2011 at 12:33 PM, Otis Gospodnetic
>  wrote:
>> My question has to do with one of the good comments from Edward Capriolo, who
>> pointed out that  some of the Configurations he described in his Cassandra  
>> as
>> Memcached talk (
>> http://www.edwardcapriolo.com/roller/edwardcapriolo/resource/memcache.odp ) 
>> are
>> not possible with HBase because in HBase there is only 1 copy of any given
>> Region and it lives on a single RegionServer (I'm assuming this is correct?),
>> thus making it impossible to spread reads of data from one Region over 
>> multiple
>> RegionServers:
>

It is not "kind of true". It "is" true.

A summary of slide 22:
Cassandra
20 nodes
Replication factor 20
Results in:
20 nodes capable of serving these reads!

With HBase, regardless of how many HDFS file copies exist, only one
RegionServer can actively serve a region.

Re: asynch api mailing list?

2011-02-27 Thread Edward Capriolo
On Sun, Feb 27, 2011 at 1:29 PM, Hiller, Dean  (Contractor)
 wrote:
> I had a question on the hbase asynch api.
>
>
>
> I find it odd that there is a PleaseThrottleException as on most asynch
> systems I have worked on, each node has a queue and when the queue fills
> up, the nic buffer in it fills up which means the remote nic buffer then
> fills up and then the client should be blocking all writes which means
> his incoming buffer fills up, etc. etc. (or he can spin needlessly but
> that is usually a very very bad choice...in fact, I have never seen that
> work out well and always was reverted).
>
>
>
> The nice thing I always found with asynch systems is you completely
> control your memory footprint of the server with the incoming
> queue...direct relationship between that queue size and the memory of
> the node used.  The other things that was always done was asynch reads
> but always do synch writes(or block on a lock if write can't go through
> until it can go through to slow down the upstream system and throttle
> it).ie. it is a self throttling system when done this way so there
> is no need for a PleaseThrottleException which I find odd.
>
>
>
> Maybe I am missing something though  As a client, I definitely want
> async reads but my writes should only return if there was room in the
> nic buffer to write it out, otherwise it should block and hold up my
> client so my client doesn't have to do any extra coding for a
> PleaseThrottleException.
>
>
>
> In this solution, most of my writes will be asynch right up until hbase
> starts becoming a bottleneck.
>
>
>
> Thanks,
>
> Dean
>
>
>
>
>
>
> This message and any attachments are intended only for the use of the 
> addressee and
> may contain information that is privileged and confidential. If the reader of 
> the
> message is not the intended recipient or an authorized representative of the
> intended recipient, you are hereby notified that any dissemination of this
> communication is strictly prohibited. If you have received this communication 
> in
> error, please notify us immediately by e-mail and delete the message and any
> attachments from your system.
>
>

Sorry to hijack the thread, but I noticed Thrift 0.5.0 generates both
synchronous and async stubs. Why would someone pick asynchbase vs. the
async Thrift stubs?


Re: Major compactions and OS cache

2011-02-16 Thread Edward Capriolo
On Wed, Feb 16, 2011 at 3:09 PM, Jason Rutherglen
 wrote:
> This comment 
> https://issues.apache.org/jira/browse/HDFS-347?focusedCommentId=12991734&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12991734
> is interesting as in Lucene the IO cache is relied on, one would
> assume that HBase'd be the same?
>
> On Wed, Feb 16, 2011 at 11:48 AM, Ryan Rawson  wrote:
>> That would be cool, I think we should probably also push for HSDF-347
>> while we are at it as well. The situation for HDFS improvements has
>> not been good, but might improve in the mid-future.
>>
>> Thanks for the pointer!
>> -ryan
>>
>> On Wed, Feb 16, 2011 at 11:40 AM, Jason Rutherglen
>>  wrote:
 One of my coworker is reminding me that major compactions do have the
 well know side effect of slowing down a busy system.
>>>
>>> I think where this is going is the system IO cache problem could be
>>> solved with something like DirectIOLinuxDirectory:
>>> https://issues.apache.org/jira/browse/LUCENE-2500  Of course the issue
>>> would be integrating DIOLD or it's underlying native implementation
>>> into HDFS somehow?
>>>
>>
>

This seems to be a common issue across the "write once and compact"
model; it tends to vaporize the page cache. Cassandra is working on
similar trickery at the filesystem level. Another interesting idea is
the concept of re-warming the cache after a compaction:
https://issues.apache.org/jira/browse/CASSANDRA-1878.

I would assume that users of HBase rely more on the HBase block cache
than on the VFS cache. Out of curiosity, for people who run with 24GB of
memory, is the split something like 4GB Xmx for the DataNode, 16GB Xmx for
the RegionServer (block cache), and 4GB left for the VFS cache?

I always suggest firing off the major compaction at a low-traffic time
(if you have such a time) so it has the least impact.
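
If you do schedule it, kicking off the major compaction is a one-liner. A
hedged sketch using the HBaseAdmin API of that era (the table name is made
up), which a cron-driven job could run during the low-traffic window:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;

// Intended to be run from cron (or any scheduler) during the quiet hours.
public class NightlyMajorCompact {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    admin.majorCompact("mytable");  // asynchronous request; compaction runs in the background
  }
}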


Re: Hadoop setup question.

2011-02-01 Thread Edward Capriolo
On Tue, Feb 1, 2011 at 3:17 PM, Joseph Coleman
 wrote:
> Sorry that was a typo on the amount of Master node although  what is the
> limitation of how many masters you can have? Thank you for the feedback on
> the JBOD however, I am a little lost on the setup of it. Looking at a Dell
> 1950 or 2950 I do not see that as an option in the raid controller setup
> nor do I see that as an option when setting up Ubuntu. Is this a hardware
> or software option after the fact? Do I just setup raid0 do a ext root vol
> and then run a command to convert to JBOD. Sorry for the ignorance this is
> just new to me and I want to get it right the first time.
>
> Thanks
>
>
>
>
> On 2/1/11 10:06 AM, "Sean Bigdatafun"  wrote:
>
>>- hbase-user
>>
>>No raid should be used, use JBOD instead. I do not think you can setup 3
>>master nodes in current Hadoop version, can you explain why you believe
>>so?
>>
>>Thanks,
>>
>>
>>On Tue, Feb 1, 2011 at 7:50 AM, Joseph Coleman <
>>joe.cole...@infinitecampus.com> wrote:
>>
>>> Hi all not sure where to ask this question but here it goes. I have been
>>> playing with Hadoop for a while now in a test environment before we
>>>setup
>>> and deploy a productions environment. I am using Hadoop 0.20.0  on
>>>Ubuntu
>>> 10.04 LTS install on Dell 1950's currently.
>>>
>>> My question is what raid should I be using for my data nodes? I haven't
>>> come across anything that clearly spells it out I have used raid1 and
>>>then
>>> EXT4 filesystem but I know this isn't right after further research but
>>>not
>>> sure what do do. I will be setting up 3 masters in a cluster which I
>>>will
>>> raid out. And roughly 10 datanodes running hdfs and hbase and a separate
>>> zookeeper cluster. Any thoughts or recommendations on the clustering
>>>would
>>> be much appreciated.
>>>
>>> Thanks,
>>> Joe
>>>
>>>
>>>
>>
>>
>>--
>>--Sean
>
>

JBOD in your case means either:
1) Do not use your RAID controller, or
2) Set up your RAID controller with N devices using a 1-to-1 mapping to
physical disks:
/dev/sda -> disk1
/dev/sdb -> disk2
...
Good luck.
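
Once the disks are presented individually, the DataNode just gets the list
of mount points. A sketch of the relevant property (the mount paths are
hypothetical, and in practice this goes in hdfs-site.xml rather than code):

import org.apache.hadoop.conf.Configuration;

// Each JBOD disk is mounted separately and listed in dfs.data.dir;
// the DataNode round-robins block writes across the directories.
public class JbodDataDirs {
  public static Configuration create() {
    Configuration conf = new Configuration();
    conf.set("dfs.data.dir",
             "/disk1/dfs/dn,/disk2/dfs/dn,/disk3/dfs/dn,/disk4/dfs/dn");
    return conf;
  }
}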


Re: Recommended Node Size Limits

2011-01-15 Thread Edward Capriolo
On Sat, Jan 15, 2011 at 8:45 AM, Wayne  wrote:
> Not everyone is looking for a distributed memcache. Many of us are looking
> for a database that scales up and out, and for that there is only one
> choice. HBase does auto partitioning with regions; this is the genius of the
> original bigtable design. Regions are logical units small enough to be fast
> to copy around/replicate, compact, and access with random disk I/O.
> Cassandra has NOTHING like this. Cassandra partitions great to the node, but
> then the node is one logical unit with some very very very big CF files.
> What DBA can sleep at night with their data in 3x 350GB files? Our
> definition of big data is big data ON the node and big data across the
> nodes. Cassandra can not handle large nodes; 30 hour compaction is real on a
> 1TB node and is a serious problem with scale. HBase compresses with LZOP so
> your data size is already much smaller, you can up the region size and
> handle up to 100 "active" regions and a heck of lot more inactive regions
> (perfect fit for time series data), and they all are compacted and accessed
> individually. There is no common java limitation here...
>
> I spent 6 months with Cassandra to get it to work perfectly, and then
> totally abandoned it due to fundamental design flaws like this. This in
> conjunction with having to ask ALL copies of data for a consistent read
> which causes 3x the disk i/o on reads. Cassandra is not a good choice if
> consistency is important or if scaling UP nodes is important. For those of
> us looking to scale out/up a relational database (and we can not afford to
> put 50TB in RAM) HBase is the only choice in the nosql space. Cassandra is a
> great piece of software and the people behind it are top notch. Jonathan
> Ellis has personally helped me with many problems and helped to get our
> cluster to work flawlessly. Cassandra has many great uses and is a great
> choice if you can keep most data in cache, but it is no relational database
> replacement. I was bought into the hype and I hope others will be able to
> get real assessments of differences to make the best decision for their
> needs.
>
>
> On Fri, Jan 14, 2011 at 10:50 PM, Edward Capriolo 
> wrote:
>
>> On Fri, Jan 14, 2011 at 2:02 PM, Jonathan Gray  wrote:
>> > How about Integer.MAX_VALUE (or I believe 0 works) to completely disable
>> splits?
>> >
>> > As far what we are running with today, we do have clusters with regions
>> over 10GB and growing.  There has been a lot of work in the compaction logic
>> to make these large regions more efficient with IO (by not compacting
>> big/old files and such).
>> >
>> > JG
>> >
>> >> -Original Message-
>> >> From: Ted Dunning [mailto:tdunn...@maprtech.com]
>> >> Sent: Friday, January 14, 2011 10:12 AM
>> >> To: user@hbase.apache.org
>> >> Subject: Re: Recommended Node Size Limits
>> >>
>> >> Way up = ??
>> >>
>> >> 1GB?
>> >>
>> >> 10GB?
>> >>
>> >> If 1GB, doesn't this mean that you are serving only 64GB of data per
>> node?
>> >>  That seems really, really small.
>> >>
>> >> On Fri, Jan 14, 2011 at 9:39 AM, Jonathan Gray  wrote:
>> >>
>> >> > Then you can turn your split size way up, effectively preventing
>> >> > further splits.  Again, this is for randomly distributed requests.
>> >> >
>> >
>>
>> I would think both systems (Cassandra || HBase) have large node
>> pitfalls. If you want >500GB a node storage, low latency (<5ms), and
>> random lookup you better have a lot of RAM!
>>
>> Current JVM's have diminishing returns past 24 GB of heap memory so in
>> JVM caching for that volume is out. Both benefit from VFS cache, but
>> both have cache churn from compacting.
>>
>> For applications that are mostly random read, the data to ram ratio is
>> a biggest factor. I have been curious for a while on what peoples ram
>> - data ration is.
>>
>


Definitely, the Cassandra compaction of large disk volumes is less elegant
than the small-region design. I think the best alternative there is to
lose your RAID-0/RAID-5 and just run one Cassandra instance per JBOD
disk. There is some overhead and management with that, but then the
largest compactions are not much of an issue any more (it's something
like a virtualized solution).

OK. It seemed from your post asking "can HBase handle 1k regions per
server" that you were not sure HBase could handle it either. However, it
sounds like you have it figured out.


Re: Recommended Node Size Limits

2011-01-14 Thread Edward Capriolo
On Fri, Jan 14, 2011 at 2:02 PM, Jonathan Gray  wrote:
> How about Integer.MAX_VALUE (or I believe 0 works) to completely disable 
> splits?
>
> As far what we are running with today, we do have clusters with regions over 
> 10GB and growing.  There has been a lot of work in the compaction logic to 
> make these large regions more efficient with IO (by not compacting big/old 
> files and such).
>
> JG
>
>> -Original Message-
>> From: Ted Dunning [mailto:tdunn...@maprtech.com]
>> Sent: Friday, January 14, 2011 10:12 AM
>> To: user@hbase.apache.org
>> Subject: Re: Recommended Node Size Limits
>>
>> Way up = ??
>>
>> 1GB?
>>
>> 10GB?
>>
>> If 1GB, doesn't this mean that you are serving only 64GB of data per node?
>>  That seems really, really small.
>>
>> On Fri, Jan 14, 2011 at 9:39 AM, Jonathan Gray  wrote:
>>
>> > Then you can turn your split size way up, effectively preventing
>> > further splits.  Again, this is for randomly distributed requests.
>> >
>

I would think both systems (Cassandra || HBase) have large-node
pitfalls. If you want >500GB of storage per node, low latency (<5ms), and
random lookups, you had better have a lot of RAM!

Current JVMs have diminishing returns past 24GB of heap memory, so
in-JVM caching for that volume is out. Both benefit from the VFS cache, but
both have cache churn from compacting.

For applications that are mostly random read, the data-to-RAM ratio is
the biggest factor. I have been curious for a while what people's
RAM-to-data ratio is.


Re: Stack assistance

2010-09-04 Thread Edward Capriolo
On Sun, Sep 5, 2010 at 12:27 AM, phil young  wrote:
> I'm interested in doing joins in Hive between HBase tables and between HBase
> and Hive tables.
>
> Can someone suggest an appropriate stack to do that? i.e.
> Is it possible to use HBase 0.89
> If I use HBase 0.20.6, do I still need to apply HBASE-2473
> Should I go with the trunk versions of any of these (e.g. Hive), or even
> CDH3 (which appears to not have the hive-hbase handler)?
>
> I'd appreciate input from anyone who has done this.
>
> Thanks
>


You would define two tables in Hive using the HBase storage handler.
Here is the test case:

http://svn.apache.org/viewvc/hadoop/hive/trunk/hbase-handler/src/test/queries/hbase_joins.q?revision=926818&view=markup

From there you should be able to join HBase to HBase, or HBase to Hive.

I can not say what version or what patch level you need, but the code
was committed recently, so a recent version should go smoothly. (I never
tried it, so I can not be sure.)

Remember that the HBase storage handler does full table scans of your
data, so if you plan on querying large amounts of data you are not
going to get low-latency results. (You probably already know that.)


Re: Please help me overcome HBase's weaknesses

2010-09-04 Thread Edward Capriolo
On Sun, Sep 5, 2010 at 12:07 AM, Jonathan Gray  wrote:
>> > But your boss seems rather to be criticizing the fact that our system
>> > is made of components.  In software engineering, this is usually
>> > considered a strength.  As to 'roles', one of the bigtable author's
>> > argues that a cluster of master and slaves makes for simpler systems
>> > [1].
>>
>> I definitely agree with you. However, my boss considers the simplicity
>> from
>> the users' viewpoint. More components make the system more complex for
>> users.
>
> Who are the users?  Are they deploying the software and responsible for 
> maintaining backend databases?
>
> Or are there backend developers, frontend developers, operations, etc?
>
> In my experience, the "users" are generally writing the applications and not 
> maintaining databases.  And in the case of HBase, and it's been said already 
> on this thread, that users generally have an easier time with the data and 
> consistency models.
>
> Above all, I think the point made by Stack earlier is extremely relevant.  
> Are you using HDFS already?  Do you have needs for ZK?  When you do, HBase in 
> an additional piece to this stack and generally fits in nicely.  From an 
> admin/ops POV, the learning curve is minimal once familiar with these other 
> systems.  And even if you aren't already using Hadoop, might you in the 
> future?
>
> If you don't and never will, then the single-component nature of Cassandra 
> may be more appealing.
>
> Also, vector clocks are nice but are still a distributed algorithm.  We've 
> been doing lots of work benchmarking and optimizing increments recently, 
> pushing extremely high throughput on relatively small clusters.  I would not 
> expect being able to achieve this level of performance or concurrency with 
> any kind of per-counter distribution.  Certainly not while providing the 
> strict atomicity and consistency guarantees that HBase provides.
>
> I've never implemented counters w/ vector clocks so I could be wrong.  But I 
> do know that I could explain how we implement counters in a performant, 
> consistent, atomic way and you wouldn't have to reach for Wikipedia once ;)
>
> JG
>

(I agree with just about everything on this thread except point 1.)

If this were a black-and-white issue, you would be either a user or a
developer. At this stage, with both Cassandra and HBase, very few people are
pure users. I feel that if you are checking out beta versions, applying
patches to a source tree, watching issues, or upgrading three times a year,
you are more of a developer than a user.

Modular software is great. But if two programs do roughly the same
function, and one is 7 pieces and the other is 1, it is hard to make
the case that modular is better.

cd /home/edward/hadoop/hadoop-0.20.2/src/
[edw...@ec src]$ find . | wc -l
2683

[edw...@ec apache-cassandra-0.6.3-src]$ find . | wc -l
609

I have been working with Hadoop for a while now. There is a ticket I
wanted to work on: reading the Hadoop configuration from LDAP. I
figured this would be a relatively quick thing. After all, the
Hadoop conf is just a simple XML file with name-value pairs.

[edw...@ec core]$ cd org/apache/hadoop/conf/
[edw...@ec conf]$ ls
Configurable.java  Configuration.java  Configured.java  package.html

[edw...@ec conf]$ wc -l Configuration.java
1301 Configuration.java

Holy crud! Now, a good portion of this file is comments, but still:
1301 lines to read and write XML files! The Hadoop conf has tons of
stuff to do variable interpolation, XInclude support, reading
configurations as streams, and handling of deprecated config file
names.

There is a method in Configuration with this signature:

 public <U> Class<? extends U> getClass(String name,
     Class<? extends U> defaultValue,
     Class<U> xface) {

My point is that all the modularity and flexibility do not
translate into much for end users, and for developers who just want
to jump in, I would rather jump into 600 files than 2,600 (and by the way,
that is NOT including HBase).


Re: CDH3 has 2 versions of hbase executable, one work, another doesn't

2010-09-03 Thread Edward Capriolo
Jimmy,

Many people across the forums seem to be confusing CDH with the
upstream Hadoop/Hive/HBase projects. In this case the problem is in
the CDH packaging and not in upstream HBase.

The proper place to take up packaging issues is with the packager.
When you are new to the software this might be confusing.

To be clear, Hadoop, Hive, and HBase are open source products that come
in a tar.gz. If you notice an issue in the core functionality, that
would be appropriate for -u...@apache, but if you are running
into a problem with a derivative package, this may not be the place to
get it resolved.

On Friday, September 3, 2010, Jinsong Hu  wrote:
> I noticed that CDH3 has 2 executable
>
> /usr/bin/hbase
>
> /usr/lib/hbase/bin/hbase
>
> I compared them and they are different. it turns out that I run
>
> /usr/bin/hbase shell
>
> and then list table, it works, but if I run
>
> /usr/lib/hbase/bin/hbase shell
>
> and list tables, it freezes. In the next distribution, please make 
> /usr/lib/hbase/bin/hbase
> a softlink to /usr/bin/hbase , or the otherway. and make sure the executable 
> works.
>
> Jimmy.
>


Re: response to "The problems with ACID, and how to fix them without going NoSQL"

2010-09-02 Thread Edward Capriolo
On Thu, Sep 2, 2010 at 3:39 PM, Ryan Rawson  wrote:
> The flaws with the paper are insanely obvious if you look at them:
>
> - their solution doesn't run on disk.  Many things get faster when you
> restrict yourself to RAM/flash
> - their solution doesn't scale!  Looks like a shared nothing sharding
> with global transaction ordering and no internal locks.
>
> Or am I missing something big here?
>
> I generally find it tiresome when people bash on bigtable, yet their
> "awesome" thing doesn't scale to multi-PB databases.  Reminds me of
> that "time for an architectural rewrite" which was essentially "if you
> do everything in 1 thread/CPU you dont need locks and are faster".
> This was just the same thing as far as I can tell from skimming the
> paper.
>
> On Thu, Sep 2, 2010 at 12:36 PM, Andrew Purtell  wrote:
>> I've tried to post the below comment twice at
>>
>>    The problems with ACID, and how to fix them without going NoSQL
>>    
>> http://dbmsmusings.blogspot.com/2010/08/problems-with-acid-and-how-to-fix-them.html
>>
>> For whatever reason, it has appeared in the comments section from my 
>> perspective briefly twice and then disappeared twice, so I will just post it 
>> here, because HBase is mentioned in the article a few times, and ... well, 
>> just read. :-)
>>
>
>>
>> Many earlier comments have covered much of what I would say. However, nobody 
>> to date has raised an objection to the mildly offensive contention that "the 
>> NoSQL decision to give up on ACID is the lazy solution to these scalability 
>> and replication issues." Possibly this was not meant in the pejorative 
>> sense, but it reads that way. I would argue the correct term of art here is 
>> pragmatism, not laziness.
>>
>> I am a contributor to the HBase project. HBase is an open source 
>> implementation of the BigTable architecture. Indeed our system does scale 
>> out by substantially relaxing the scope of ACID guarantees. But it is a 
>> gross generalization to suggest "NoSQL" is "NoACID", and somehow lazy in the 
>> pejorative sense, and this mars the argument of the authors. HBase at least 
>> in particular provides durability, row-level atomicity (agree here this is a 
>> nice convenient partition), and favors strong consistency in its design 
>> choices. In this regard, I would also like to bring to your attention that 
>> the authors made an error describing the scope of transactional atomicity 
>> available in BigTable -- the scope is actually the row, not each individual 
>> KV.
>>
>> Also, at least HBase in particular is a big project with several interesting 
>> design/research directions and so does not reduce to a convenient 
>> stereotype: a transactional layer that provides global ACID properties at 
>> user option (that does not scale out like the underlying system but is 
>> nonetheless available), exploration of notions of referential integrity, 
>> even consideration of optional relaxed consistency (read replicas) in the 
>> other direction.
>>
>> Back to the matter of pragmatism: While it is likely most structured data 
>> store users are not building systems on the scale of a globally distributed 
>> search engine, actually that is not too far off the mark for the design 
>> targets of some HBase installations. We indeed do need to work with very 
>> large mutating data sets today and nothing in the manner of a traditional 
>> relational database system is up to the task. The discussion here, while 
>> intriguing, is also rendered fairly academic by the "horrible" performance 
>> if spinning media is used. Flash will not be competitive with spinning media 
>> at high tera- or peta-scale for at least several years yet. Other commenters 
>> have also noticed apparent bottlenecks in the presented design which suggest 
>> a high scale implementation will be problematic.
>>
>> Anyway, it is my belief we are attacking the same set of problems but are 
>> starting at it on opposing sides of a continuum and, ultimately, we shall 
>> meet up somewhere in the middle.
>>
>> September 2, 2010 10:55 AM
>>
>> <<<
>>
>>   - Andy
>>
>>
>>
>>
>>
>>
>

Wait! I thought all the problems were solved 25 years ago? :)


http://cloud.pubs.dbs.uni-leipzig.de/node/27


Re: Initial region loads in hbase..

2010-08-30 Thread Edward Capriolo
On Mon, Aug 30, 2010 at 1:40 PM, Jean-Daniel Cryans  wrote:
> Vidhya,
>
> Thanks for keeping it up :)
>
> If you trace all the references to 003404803994 in the region
> server, what do you see? I think that most of the time is spent
> opening the region sequentially, it'd be nice to confirm.
>
> J-D
>
> On Mon, Aug 30, 2010 at 10:20 AM, Vidhyashankar Venkataraman
>  wrote:
>> Thank you for the responses guys.. I will be looking at the problem later 
>> this week, I guess. The initial eyeballing suggested that the RS's respond 
>> to the master's open very quickly.. And I saw our rrd stats (which is 
>> similar to ganglia) and I couldn't see any obvious IO/CPU/nw bottlenecks: 
>> which means I haven't explored enough :) I will be coming back to the 
>> problem soon and will let you know about my findings..
>>
>>  I have been running scans and planning to run incremental loads on the db 
>> and get some useful numbers.. Scans are blazing fast with just one storefile 
>> per region: 75-95 MBps per node! (this is if locality between RS and data is 
>> kind of preserved: i.e., after a major compact is issued)..
>>
>> I will update on my status soon..
>> Thank you again
>> Vidhya
>>
>> On 8/27/10 5:42 PM, "Jonathan Gray"  wrote:
>>
>> Vidhya,
>>
>> Could you post a snippet of an RS log during this time?  You should be able 
>> to see what's happening between when the OPEN message gets there and the 
>> OPEN completes.
>>
>> Like Stack said, it's probably that its single-threaded in the version 
>> you're using and with all the file opening, your NN and DNs are under heavy 
>> load.  Do you see io-wait or anything else jump up across the cluster at 
>> this time?  You have ganglia setup on this bad boy?
>>
>> JG
>>
>>> -Original Message-
>>> From: saint@gmail.com [mailto:saint@gmail.com] On Behalf Of
>>> Stack
>>> Sent: Friday, August 27, 2010 5:36 PM
>>> To: user@hbase.apache.org
>>> Subject: Re: Initial region loads in hbase..
>>>
>>> In 0.20, open on a regionserver is single-threaded.  Could that be it?
>>>  The server has lots of regions to open and its taking time?  Is the
>>> meta table being beat up?  Could this be holding up region opens?
>>>
>>> Good luck V,
>>> St.Ack
>>>
>>>
>>> On Fri, Aug 27, 2010 at 5:01 PM, Vidhyashankar Venkataraman
>>>  wrote:
>>> > Hi guys,
>>> >  A couple of days back, I had posted a problem on regions taking too
>>> much time to load when I restart Hbase.. I have a table that has around
>>> 80 K regions on 650 nodes (!) ..
>>> >  I was checking the logs in the master and I notice that the time it
>>> takes from assigning a region to a region server to the point when it
>>> recognizes that the region is open in that server takes around 20-30
>>> minutes!
>>> >   Apart from master being the bottleneck here, can you guys let me
>>> know what the other possible cases are as to why this may happen?
>>> >
>>> > Thank you
>>> > Vidhya
>>> >
>>> > Below is an example for region with start key 003404803994 where
>>> the assignment takes place at 22:59 while the confirmation that it got
>>> open came at 23:19...
>>> >
>>> > 2010-08-27 22:59:02,642 DEBUG
>>> org.apache.hadoop.hbase.master.BaseScanner: Current assignment of
>>> DocDB,003404803994,1282947439133.73c0f8fdb8ffbc20b9a239d325932ff1.
>>> is not valid;  serverAddress=, startCode=0 unknown.
>>> > 2010-08-27 22:59:02,643 DEBUG
>>> org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED
>>> region 73c0f8fdb8ffbc20b9a239d325932ff1 in state = M2ZK_REGION_OFFLINE
>>> > 2010-08-27 22:59:02,645 DEBUG org.apache.hadoop.hbase.master.HMaster:
>>> Event NodeCreated with state SyncConnected with path
>>> /hbase/UNASSIGNED/73c0f8fdb8ffbc20b9a239d325932ff1
>>> > 2010-08-27 22:59:02,645 DEBUG
>>> org.apache.hadoop.hbase.master.ZKMasterAddressWatcher: Got event
>>> NodeCreated with path
>>> /hbase/UNASSIGNED/73c0f8fdb8ffbc20b9a239d325932ff1
>>> > 2010-08-27 22:59:02,645 DEBUG
>>> org.apache.hadoop.hbase.master.ZKUnassignedWatcher: ZK-EVENT-PROCESS:
>>> Got zkEvent NodeCreated state:SyncConnected
>>> path:/hbase/UNASSIGNED/73c0f8fdb8ffbc20b9a239d325932ff1
>>> > 2010-08-27 22:59:02,645 DEBUG
>>> org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper:
>>> >> 3130640.yst.yahoo.net,b3130680.yst.yahoo.net:/hbase,org.apache.hadoop.h
>>> base.master.HMaster>Created ZNode
>>> /hbase/UNASSIGNED/73c0f8fdb8ffbc20b9a239d325932ff1 in ZooKeeper
>>> > 2010-08-27 22:59:02,646 DEBUG
>>> org.apache.hadoop.hbase.master.RegionManager: Created/updated
>>> UNASSIGNED zNode
>>> DocDB,003404803994,1282947439133.73c0f8fdb8ffbc20b9a239d325932ff1.
>>> in state M2ZK_REGION_OFFLINE
>>> > 2010-08-27 22:59:02,646 DEBUG
>>> org.apache.hadoop.hbase.master.ZKUnassignedWatcher: Got event type [
>>> M2ZK_REGION_OFFLINE ] for region 73c0f8fdb8ffbc20b9a239d325932ff1
>>> > 2010-08-27 22:59:02,646 DEBUG org.apache.hadoop.hbase.master.HMaster:
>>> Event NodeChildrenChanged with state SyncConnected with path
>>> /hbase/UNASSIGNED
>>> > 2010-08-

Re: major differences with Cassandra

2010-08-18 Thread Edward Capriolo
On Wed, Aug 18, 2010 at 2:17 PM, Ryan Rawson  wrote:
> Thanks for that bit of feedback.
>
> Right now stumbleupon operates a cluster that handles 20,000 requests a
> second 24/7 for about a year now. Even though we have hbase developers I
> don't think there is any special sauce and anyone could replicate the
> successes we've had. Mozilla is one candidate. There are others who are
> quieter about it.
>
> On Aug 18, 2010 11:11 AM, "Time Less"  wrote:
>> HBase is run by persons who understand (or are willing to hear) the
>> operational requirements of distributed databases in high-volume
>> environments, whereas the Cassandra project isn't.
>>
>> Talks about technical differences are really noise, because they're
> entirely
>> theoretical. When viewed with this knowledge, a lot of the disagreements,
>> flamewars, and shoutfests begin to make sense.
>>
>> As of today, I'm unaware of any major feature Cassandra claims that it
>> actually delivers outside of installations run by the developers
> themselves.
>> Specifically: multi-DC, hinted handoff, compaction, dynamic cluster
> resizing
>> are all fail. The developers will adamantly claim all such features work
>> just fine. Good luck getting any of it to work in YOUR environment.
>>
>> In stark contrast, I am intimately familiar with at least one large HBase
>> installation run by non-developers (at Mozilla).
>>
>> Disclaimers: I am very familiar with the Cassandra product internals,
>> developers, history, and community. I am less familiar with HBase. I might
>> therefore have a rosy view of the HBase community based on ignorance.
> Also,
>> in a low-volume environment, pretty much anything works. Including
>> Cassandra. Or anything else. Any NoSQL. Any SQL. Pick whatever you want
> and
>> run with it.
>>
>>
>> On Fri, Jul 30, 2010 at 9:03 PM, Otis Gospodnetic <
>> otis_gospodne...@yahoo.com> wrote:
>>
>>> I don't have the URL handy, but just the other day I read some
>>> Cassandra/HBase
>>> blog post where Cassandra was described as having no SPOF, but somebody
>>> left
>>> some very "strong comments" calling out that and a few other claims as
>>> false.
>>> Ah, I remember, here is the URL:
>>>
>>>
> http://blog.mozilla.com/data/2010/05/18/riak-and-cassandra-and-hbase-oh-my/
>>>
>>>
>>> Otis
>>> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
>>> Hadoop ecosystem search :: http://search-hadoop.com/
>>>
>>>
>>>
>>> - Original Message 
>>> > From: Jeff Zhang 
>>> > To: user@hbase.apache.org
>>> > Sent: Thu, July 8, 2010 1:34:18 AM
>>> > Subject: Re: major differences with Cassandra
>>> >
>>> > HBase do not have super column family.
>>> >
>>> > And I can list the following major difference between hbase and
>>> cassandra (
>>> > welcome any supplement) :
>>> >
>>> > 1. HBase is master-slave architecture, while cassandra has no master,
>>> and
>>> > you can consider it as p2p structure, and it has no single point of
>>> failure.
>>> > 2. HBase is strong consistency while cassandra is eventual consistency
>>> > (although you can tune it to be strong consistency)
>>> >
>>> >
>>> > On Thu, Jul 8, 2010 at 1:26 PM, S Ahmed  wrote:
>>> >
>>> > > Hello!
>>> > >
>>> > > I was hoping some has experiences with both Cassandra and HBase.
>>> > >
>>> > > What are the major differences between Cassandra and HBase?
>>> > >
>>> > > Does HBase have the concept of ColumnFamilies and SuperColumnFamilies
>>> like
>>> > > Cassandra?
>>> > >
>>> > > Where in the wiki does it go over designing a data model?
>>> > >
>>> > >
>>> > > thanks!
>>> > >
>>> >
>>> >
>>> >
>>> > --
>>> > Best Regards
>>> >
>>> > Jeff Zhang
>>> >
>>>
>>
>>
>>
>> --
>> timeless(ness)
>

You said:
As of today, I'm unaware of any major feature Cassandra claims that it
actually delivers outside of installations run by the developers themselves.
Specifically: multi-DC, hinted handoff, compaction, dynamic cluster resizing
are all fail. The developers will adamantly claim all such features work
just fine. Good luck getting any of it to work in YOUR environment.

Where to start with this statement:
Multi-DC support:
You are saying cassandra is bad at X but hbase does not even do X.
https://issues.apache.org/jira/browse/HBASE-1295

Hinted Handoff:
If I take down a Cassandra node, hints get written to other nodes.
When the failed node comes back online, the hints are delivered.

Compaction:
Compaction works. My tables compact at user-defined intervals.

Dynamic Cluster Resizing:
Joining a new node is more intensive in Cassandra, as data has to
physically move from one node to another. Yet I regularly add,
replace, and move nodes.

You said:
Talks about technical differences are really noise, because they're
entirely theoretical.

This statement is contradictory. You are saying technical differences
are theoretical. Small technical differences have profound
implications.

You said:
In stark contrast, I am intimately familiar with at least one large HBase
installation run by non-developers 

Re: Fw: namenode crash

2010-08-13 Thread Edward Capriolo
On Fri, Aug 13, 2010 at 3:03 PM, Ryan Rawson  wrote:
> We don't use centos here at Stumbleupon... your version looks quite
> old!  Our uname looks like:
>
> Linux host 2.6.28-14-generic #47-Ubuntu SMP Sat Jul 25 01:19:55 UTC
> 2009 x86_64 GNU/Linux
>
> I'd consider using something newer than 2.6.18!
>
> On Fri, Aug 13, 2010 at 11:54 AM, Jean-Daniel Cryans
>  wrote:
>> u18 should never be used.
>>
>> You say it's crashing on both u17 and u20? How is it crashing? (it's
>> kind of a vague word)
>>
>> Here with use both u14 and u17 on 20 nodes clusters without any issue.
>>
>> J-D
>>
>> On Fri, Aug 13, 2010 at 11:27 AM, Jinsong Hu  wrote:
>>>
>>>
>>> Hi, There:
>>>  does anybody know of a good combination of centos version and jdk version 
>>> that works stably ? I am using centos version
>>>
>>> Linux  2.6.18-194.8.1.el5.centos.plus #1 SMP Wed Jul 7 11:45:38 EDT 2010
>>>  x86_64 x86_64 x86_64 GNU/Linux
>>>
>>> jdk version
>>> Java(TM) SE Runtime Environment (build 1.6.0_17-b04)
>>> Java HotSpot(TM) 64-Bit Server VM (build 14.3-b01, mixed mode)
>>>
>>> and run the namenode with the following jvm config
>>> -Xmx1000m  -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode 
>>> -XX:+DisableExplicitGC -XX:+HeapDumpOnOutOfMemoryError 
>>> -XX:+UseCompressedOops -XX:+DoEscapeAnalysis -XX:+AggressiveOpts  -Xmx2G
>>>
>>> but it crashed silently after 16 hours.
>>>
>>> I used jdk
>>> Java(TM) SE Runtime Environment (build 1.6.0_20-b02)
>>> Java HotSpot(TM) 64-Bit Server VM (build 16.3-b01, mixed mode)
>>>
>>> with the same jvm config, and the namenode crashed in about 1 week. I 
>>> searched internet and people say 1.6.0_18 is not good.
>>> but does anybody can recommend a good combination of jdk and os version 
>>> that can run stably ?
>>>
>>>
>>> This crashing problem doesn't happen with a small cluster of 4 datanodes. 
>>> but it happens with a cluster of 17 datanodes.
>>>
>>> Jimmy.
>>>
>>>
>>>
>>
>

RedHat/CentOS backport kernel patches and attempt to keep the minor
version number relatively stable.

Something like 2.6.18-194 is probably closer to 2.6.28 than to 2.6.18.

Do you have any more free memory? Maybe, for fun, raise your heap to -Xmx4G.

Edward


Re: Secondary Index versus Full Table Scan

2010-08-03 Thread Edward Capriolo
On Tue, Aug 3, 2010 at 11:40 AM, Luke Forehand
 wrote:
> Thanks to the help of people on this mailing list and Cloudera, our team has
> managed to get our 3 data node cluster with HBase running like a top.  Our
> import rate is now around 3 GB per job which takes about 10 minutes.  This is
> great.  Now we are trying to tackle reading.
>
> With our current setup, a map reduce job with 24 mappers performing a full 
> table
> scan of ~150 million records takes ~1 hour.  This won't work for our use case,
> because not only are we continuing to add more data to this table, but we are
> asking many more questions in a day.  To increase performance, the first 
> thought
> was to use a secondary index table, and do range scans of the secondary index
> table, iteratively performing GET operations of the master table.
>
> In testing the average GET operation took 37 milliseconds.  At that rate with 
> 24
> mappers it would take ~1.5 hours to scan 3 million rows.  This still seems 
> like
> a lot of time.  37 milliseconds per GET is nice for "real time" access from a
> client, but not during massive GETs of data in a map reduce job.
>
> My question is, does it make sense to use secondary index tables in a map 
> reduce
> job of this scale?  Should we not be using HBase for input in these map reduce
> jobs and go with raw SequenceFile?  Do we simply need more nodes?
>
> Here are the specs for each of our 3 data nodes:
> 2x CPU (2.5 GHZ nehalem ep quad core)
> 24 GB RAM (4gb / region server )
> 4x 1tb hard drives
>
> Region size: 1GB
>
> Thanks,
>
> Luke Forehand
> Software Engineer
> http://www.networkedinsights.com
>
>

Generally speaking: if you are doing full range scans of a table,
indexes will not help. Adding indexes will make performance worse:
it will take longer to load your data, and fetching the data will now
involve two lookups instead of one.

If you are doing full range scans, adding more nodes should result in
roughly linear scale-up (a sketch of such a scan job follows below).
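To make that concrete, here is a minimal sketch (mine, not from this thread) of
a full-table-scan MapReduce job using the org.apache.hadoop.hbase.mapreduce API
of that era; the table name "mytable" and the caching value are made up for
illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;

public class FullScanJob {

  // Mapper that just counts the rows it sees; a real job would emit data.
  static class ScanMapper extends TableMapper<NullWritable, NullWritable> {
    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context context) {
      context.getCounter("scan", "rows").increment(1);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "full-scan");
    job.setJarByClass(FullScanJob.class);

    Scan scan = new Scan();
    scan.setCaching(1000);       // fetch many rows per RPC instead of the default 1
    scan.setCacheBlocks(false);  // avoid churning the block cache on a full scan

    TableMapReduceUtil.initTableMapperJob(
        "mytable", scan, ScanMapper.class,
        NullWritable.class, NullWritable.class, job);
    job.setNumReduceTasks(0);
    job.waitForCompletion(true);
  }
}

More mappers only help if there are more regions and region servers to feed
them, which is why adding nodes, not indexes, is the lever for this kind of
workload.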


Re: Memory Consumption and Processing questions

2010-08-02 Thread Edward Capriolo
On Mon, Aug 2, 2010 at 11:33 AM, Jacques  wrote:
> You're right, of course.  I shouldn't generalize too much.  I'm more trying
> to understand the landscape than pinpoint anything specific.
>
> Quick question: since the block cache is unaware of the location of files,
> wouldn't it overlap the os cache for hfiles once they are localized after
> compaction?  Any guidance on how to tune the two?
>
> thanks,
> Jacques
>
> On Sun, Aug 1, 2010 at 9:08 PM, Jonathan Gray  wrote:
>
>> One reason not to extrapolate that is that leaving lots of memory for the
>> linux buffer cache is a good way to improve overall performance of typically
>> i/o bound applications like Hadoop and HBase.
>>
>> Also, I'm unsure that "most people use ~8 for hdfs/mr".  DataNodes
>> generally require almost no significant memory (though generally run with
>> 1GB); their performance will improve with more free memory for the os buffer
>> cache.  As for MR, this completely depends on the tasks running.  The
>> TaskTrackers also don't require significant memory, so this completely
>> depends on the number of tasks per node and the memory requirements of the
>> tasks.
>>
>> Unfortunately you can't always generalize the requirements too much,
>> especially in MR.
>>
>> JG
>>
>> > -Original Message-
>> > From: Jacques [mailto:whs...@gmail.com]
>> > Sent: Sunday, August 01, 2010 5:30 PM
>> > To: user@hbase.apache.org
>> > Subject: Re: Memory Consumption and Processing questions
>> >
>> > Thanks, that was very helpful.
>> >
>> > Regarding 24gb-- I saw people using servers with 32gb of server memory
>> > (a
>> > recent thread here and hstack.org).  I extrapolated the use since it
>> > seems
>> > most people use ~8 for hdfs/mr.
>> >
>> > -Jacques
>> >
>> >
>> > On Sun, Aug 1, 2010 at 11:39 AM, Jonathan Gray 
>> > wrote:
>> >
>> > >
>> > >
>> > > > -Original Message-
>> > > > From: Jacques [mailto:whs...@gmail.com]
>> > > > Sent: Friday, July 30, 2010 1:16 PM
>> > > > To: user@hbase.apache.org
>> > > > Subject: Memory Consumption and Processing questions
>> > > >
>> > > > Hello all,
>> > > >
>> > > > I'm planning an hbase implementation and had some questions I was
>> > > > hoping
>> > > > someone could help with.
>> > > >
>> > > > 1. Can someone give me a basic overview of how memory is used in
>> > Hbase?
>> > > >  Various places on the web people state that 16-24gb is the minimum
>> > for
>> > > > region servers if they also operate as hdfs/mr nodes.  Assuming
>> > that
>> > > > hdfs/mr
>> > > > nodes consume ~8gb that leaves a "minimum" of 8-16gb for hbase.  It
>> > > > seems
>> > > > like lots of people suggesting use of even 24gb+ for hbase.  Why so
>> > > > much?
>> > > >  Is it simply to avoid gc problems?  Have data in memory for fast
>> > > > random
>> > > > reads? Or?
>> > >
>> > > Where exactly are you reading this from?  I'm not actually aware of
>> > people
>> > > using 24GB+ heaps for HBase.
>> > >
>> > > I would not recommend using less than 4GB for RegionServers.  Beyond
>> > that,
>> > > it very much depends on your application.  8GB is often sufficient
>> > but I've
>> > > seen as much as 16GB used in production.
>> > >
>> > > You need at least 4GB because of GC.  General experience has been
>> > that
>> > > below that the CMS GC does not work well.
>> > >
>> > > Memory is used primarily for the MemStores (write cache) and Block
>> > Cache
>> > > (read cache).  In addition, memory is allocated as part of normal
>> > operations
>> > > to store in-memory state and in processing reads.
>> > >
>> > > > 2. What types of things put more/less pressure on memory?  I saw
>> > > > insinuation
>> > > > that insert speed can create substantial memory pressure.  What
>> > type of
>> > > > relative memory pressure do scanners, random reads, random writes,
>> > > > region
>> > > > quantity and compactions cause?
>> > >
>> > > Writes are buffered and flushed to disk when the write buffer gets to
>> > a
>> > > local or global limit.  The local limit (per region) defaults to
>> > 64MB.  The
>> > > global limit is based on the total amount of heap available (default,
>> > I
>> > > think, is 40%).  So there is interplay between how much heap you have
>> > and
>> > > how many regions are actively written to.  If you have too many
>> > regions and
>> > > not enough memory to allow them to hit the local/region limit, you
>> > end up
>> > > flushing undersized files.
>> > >
>> > > Scanning/random reading will utilize the block cache, if configured
>> > to.
>> > >  The more room for the block cache, the more data you can keep in-
>> > memory.
>> > >  Reads from the block cache are significantly faster than non-cached
>> > reads,
>> > > obviously.
>> > >
>> > > Compactions are not generally an issue.
>> > >
>> > > > 2. How cpu intensive are the region servers?  It seems like most of
>> > > > their
>> > > > performance is based on i/o.  (I've noted the caution in starving
>> > > > region
>> > > > servers of cycles--which seems primarily focu

Re: Thousands of tables

2010-07-30 Thread Edward Capriolo
On Fri, Jul 30, 2010 at 12:41 PM, Jean-Daniel Cryans
 wrote:
>> I see. Usually a whole customer fits within a region. Actually, the
>> number of customers that doesn't fit in a single region are only two or 
>> three.
>>
>> But then another question comes up. Even if a put all the data in a single
>> table, given that the keys are written in order, and given that several
>> customers can fit in the same region, I'd had the exact same problem right?
>> I mean, if data from customer A to D sits in the same region within the same
>> table, the result is worse than having 4 different tables, as those can 
>> actually
>> sit in another region server right?
>>
>> Is there a way to move a region manually to another machine?
>
> If you expect that some contiguous rows would be really overused, then
> change the row key. UUIDs for example would spread them all over the
> regions.
>
> In 0.20 you can do a close_region in the shell, that will move the
> region to the first region servers that checks. In 0.90 we are working
> on better load balancing, more properly tuned to region traffic.
>
>>
>>> Client side? I don't believe so, there's almost nothing kept in memory.
>>
>> Even if all the htables are opened at the same time?
>>
>
> The only connections kept are with region servers, not with "tables",
> so even if you have 1k of them it's just 999 more objects to keep in
> memory (compared to the single table design). If you are afraid that
> it would be too much, you can shard the web servers per clients. In
> any case, why not test it?
>
> J-D
>

Usually, people who went to the "table per customer" approach in the
RDBMS world did so because their database did not offer built-in
partitioning, or because they wanted to offer quality-of-service features
such as "high-paying customers go to new fancy servers".

I feel this approach somewhat contradicts the scaling model. It also
introduces the issue of managing schema changes across X instances. With
cloud serving systems the schema is typically simpler, but replicating
changes X times could still be an issue.

There is a hybrid approach which I borrow from Hive bucketing. Rather
than making one partition per customer, bucket the customers by
calculating a hash of the customer id mod 64 (see the sketch below).
Customers are distributed randomly across the 64 buckets, and by that
randomness small customers and large customers balance out.
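Roughly what that bucketing looks like in code (my own sketch; the key layout
and ids are made up, not something from this thread):

import org.apache.hadoop.hbase.util.Bytes;

public class BucketedKey {
  private static final int NUM_BUCKETS = 64;

  // e.g. customerId "acme-corp", recordId "2010-07-30-000123"
  public static byte[] rowKey(String customerId, String recordId) {
    // Mask the sign bit so the modulo is always non-negative.
    int bucket = (customerId.hashCode() & Integer.MAX_VALUE) % NUM_BUCKETS;
    // Zero-pad the bucket so keys sort consistently, e.g. "07|acme-corp|2010-07-30-000123".
    return Bytes.toBytes(String.format("%02d|%s|%s", bucket, customerId, recordId));
  }
}

All rows for a customer still share a prefix inside their bucket, so
per-customer scans stay cheap, while the buckets keep any one region from
owning every customer.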

I do not like "table per customer", or even the bucket idea I introduced,
for NoSQL: I see it causing X times the pressure on the NameNode, X times
the work in monitoring, and your application servers will now be caching X
HTable connections (which does not seem practical).

"Premature optimization is the root of all evil."


Re: restarting region server which shutdown due to GC pause

2010-07-21 Thread Edward Capriolo
On Wed, Jul 21, 2010 at 1:40 PM, Ted Yu  wrote:
> J-D:
> Can you elaborate why ssh isn't preferred ?
>
> One solution is to create a light weight Java process that monitors region
> server log and perform this duty.
>
> On Wed, Jul 21, 2010 at 10:31 AM, Jean-Daniel Cryans 
> wrote:
>
>> The stance in the hbase dev community on that issue is to let users'
>> cluster management tools handle it.
>>
>> Also, how would you start a java process on a remote machine (if still
>> alive) from another java process (the master)? If you can find an
>> elegant way that doesn't rely on SSH, this would be a nice
>> contribution that users could enable.
>>
>> J-D
>>
>> On Wed, Jul 21, 2010 at 10:25 AM, Ted Yu  wrote:
>> > Thanks for the answer.
>> >
>> > GC pause seems to be a major cause for region server to come down:
>> > 2010-07-21 09:07:14,138 WARN org.apache.hadoop.hbase.util.Sleeper: We
>> slept
>> > 291505ms, ten times longer than scheduled: 1
>> >
>> > Is it possible for HBase Master to restart dead region server in this
>> case ?
>> >
>> > On Wed, Jul 21, 2010 at 10:02 AM, Jean-Daniel Cryans <
>> jdcry...@apache.org>wrote:
>> >
>> >> HBaseAdmin.getClusterStatus().getServers()
>> >>
>> >> J-D
>> >>
>> >> On Wed, Jul 21, 2010 at 9:56 AM, Ted Yu  wrote:
>> >> > Hi,
>> >> > Is there API to query the number of live region servers ?
>> >> >
>> >> > Thanks
>> >> >
>> >>
>> >
>>
>

Ted,
I just wrote two very relevant articles on my blog.

Using func instead of SSH keys:
http://www.edwardcapriolo.com/roller/edwardcapriolo/entry/func_hadoop_the_end_of

http://www.edwardcapriolo.com/roller/edwardcapriolo/entry/hadoop_secondarynamenode_puppet

In the second, I use the Puppet 'service' resource to do automatic restarts
of a SecondaryNameNode. You could use Puppet to accomplish auto restarts
for a region server.

I usually tweet my new recipes when I come out with them:
http://twitter.com/@edwardcapriolo
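For reference, a minimal sketch (mine, not from the thread) of the liveness
check J-D mentioned, HBaseAdmin.getClusterStatus().getServers(), which an
external watchdog (cron, Puppet, func) could run to decide whether a restart
is needed; the expected-count argument is just an illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class LiveServerCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    // Ask the master how many region servers are currently checked in.
    int live = admin.getClusterStatus().getServers();
    System.out.println("live region servers: " + live);
    // Exit non-zero if we are below the expected count, so a wrapper can react.
    int expected = args.length > 0 ? Integer.parseInt(args[0]) : 1;
    System.exit(live < expected ? 1 : 0);
  }
}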


Re: HBase 0.89 and JDK version

2010-07-20 Thread Edward Capriolo
On Tue, Jul 20, 2010 at 10:54 PM, Jonathan Gray  wrote:
> You think it enough to also include u20 with u18 as not recommended?  I 
> haven't heard anything else about u20 but maybe there's more issues out there.
>
>> -Original Message-
>> From: Edward Capriolo [mailto:edlinuxg...@gmail.com]
>> Sent: Tuesday, July 20, 2010 7:49 PM
>> To: user@hbase.apache.org
>> Subject: Re: HBase 0.89 and JDK version
>>
>> On Tue, Jul 20, 2010 at 10:31 PM, Jonathan Gray 
>> wrote:
>> >> I get lots of sigsegs on u21. Had better luck with u21.
>> >
>> > Ed, which is the typo? :)
>> >
>>
>> Doh sorry,
>>  (this was cassandra) I was running.
>>      -XX:+UseParNewGC \
>>         -XX:+UseConcMarkSweepGC \
>>         -XX:+CMSParallelRemarkEnabled \
>>         -XX:+UseCompressedOops \
>>         -XX:SurvivorRatio=8 \
>>         -XX:MaxTenuringThreshold=1 \
>>         -XX:+HeapDumpOnOutOfMemoryError \
>> with 1.6.0_u20.
>>
>> I had some data that was causing sigseg as only a a few nodes were
>> constantly suffering. I never found what the data was but after moving
>> to u21 it never happened again.
>

It happened to me, but that does not mean it is a widespread issue. I
feel that NoSQL pushes JVMs hard. These types of JVM failures typically do
not end up in your log4j logs. I had completely overlooked the JVM and
was trying to troubleshoot at a much higher level. Switching JVMs
can be as simple as moving some symlinks. While I do not believe
upgrading/downgrading the JVM should be your first recourse for
troubleshooting, you should consider it at some point.

For reference, sigsegv appears multiple times in the u21 bug-fix notes:
http://java.sun.com/javase/6/webnotes/BugFixes6u21.html


Re: HBase 0.89 and JDK version

2010-07-20 Thread Edward Capriolo
On Tue, Jul 20, 2010 at 10:31 PM, Jonathan Gray  wrote:
>> I get lots of sigsegs on u21. Had better luck with u21.
>
> Ed, which is the typo? :)
>

Doh, sorry.
 (This was Cassandra.) I was running:
 -XX:+UseParNewGC \
-XX:+UseConcMarkSweepGC \
-XX:+CMSParallelRemarkEnabled \
-XX:+UseCompressedOops \
-XX:SurvivorRatio=8 \
-XX:MaxTenuringThreshold=1 \
-XX:+HeapDumpOnOutOfMemoryError \
with 1.6.0_u20.

I had some data that was causing the SIGSEGV; only a few nodes were
constantly suffering. I never found what the data was, but after moving
to u21 it never happened again.


Re: HBase 0.89 and JDK version

2010-07-20 Thread Edward Capriolo
On Tue, Jul 20, 2010 at 7:27 PM, Jonathan Gray  wrote:
> You are seeing this error because JDK 1.6u18 has known issues, as the message 
> describes :)
>
> Pig and Hive apparently do not do this check.
>
> You should upgrade or downgrade your JVM.
>
>> -Original Message-
>> From: Syed Wasti [mailto:mdwa...@hotmail.com]
>> Sent: Tuesday, July 20, 2010 4:23 PM
>> To: user@hbase.apache.org; hbase-u...@hadoop.apache.org
>> Subject: HBase 0.89 and JDK version
>>
>>
>> Hi,
>>
>> We recently upgraded our QA cluster to Cloudera Version 3 (CDH3) which
>> has Hbase 0.89. Our cluster is running on JDK 1.6.0_18 version. On
>> trying to start up Hbase it basically gives an error "you're running
>> jdk 1.6.0_18 which has known bugs" even though Pig and Hive seems to
>> work fine with the version of JDK.
>>
>> Any thoughts on why I am seeing this error ?
>>
>> If there is a bug in this JDK version then what is recommended,
>> upgrading JDK to 19 or 20 or 21 (21 release this month) or downgrade
>> the jdk version ?
>>
>>
>>
>> Thanks for the support.
>>
>>
>>
>> Regards
>>
>> -SW
>>
>>
>>
>> #java -version
>>
>> java version "1.6.0_18"
>>
>> Java(TM) SE Runtime Environment (build 1.6.0_18-b07)
>>
>> Java HotSpot(TM) 64-Bit Server VM (build 16.0-b13, mixed mode)
>>
>>
>>
>>
>>
>>
>>
>

I get lots of sigsegs on u21. Had better luck with u21.


Re: web applications and multiple db calls per page view

2010-07-14 Thread Edward Capriolo
On Wed, Jul 14, 2010 at 11:30 AM, S Ahmed  wrote:
> I haven't built it yet, I am in the planning stages as I don't have a solid
> grasp of distributing databases just yet.
>
> I was just going through an average page lifecycyle, and was just concerned
> at a high level on what I should be looking out for and what expectations I
> should have.
>
>>> Or are you
>>>just asking if you saw that theoretical latency what would you do
>>>next? Future troubleshooting ?
> Well I have been reading some benchmarks, and if say the average query takes
> 30ms and I am making 10 calls, things can start adding up so I was concerned
> yes.
>
>
> On Wed, Jul 14, 2010 at 11:24 AM, Edward Capriolo 
> wrote:
>
>> On Wed, Jul 14, 2010 at 11:12 AM, S Ahmed  wrote:
>> > If I build a forum application like vbulletin on top on hbase, I can
>> forsee
>> > that each page view will have multiple calls to hbase.
>> >
>> > Let's picture someone viewing a particular thread (question) on the
>> website:
>> >
>> > 0. load the current website object (remember this is a multi-tenancy
>> > application, Saas forum application).
>> > 1. verify user is logged in.
>> > 2. get user profile / permissions if logged in
>> > 3. get current forum object
>> > 4. get current thread object
>> > 5. get all replies to thread
>> >
>> > And probably 2-3 calls will come into the picture as i progress.
>> >
>> > If each call takes 100ms, this isn't going to be a 'snappy' response now
>> is
>> > it?
>> >
>> > Or is there a way to batch calls to minimize the chattyness?
>> >
>>
>> If you are pooling/reusing hbase objects intelligently have a good
>> schema design, good caching, and enough hardware you should not be
>> seeing requests take that long. I do not want to promise 1ms
>> performance on every request but that is pretty common.
>>
>> You started your message with  'If I build a forum application'. Have
>> you done this? If so are you actually seeing 100ms latency? Or are you
>> just asking if you saw that theoretical latency what would you do
>> next? Future troubleshooting ?
>>
>

If you want to discuss performance, please reference what benchmark you
are talking about. A popular benchmark is:
http://www.brianfrankcooper.net/pubs/ycsb.pdf

As the amount of data per node grows, and the requests per node grow,
performance will degrade. However, HBase is a scale-out architecture,
so you combat data per node and requests per node by adding more
nodes. See the PDF above for its elastic speedup and scale-out results.


Re: web applications and multiple db calls per page view

2010-07-14 Thread Edward Capriolo
On Wed, Jul 14, 2010 at 11:12 AM, S Ahmed  wrote:
> If I build a forum application like vbulletin on top on hbase, I can forsee
> that each page view will have multiple calls to hbase.
>
> Let's picture someone viewing a particular thread (question) on the website:
>
> 0. load the current website object (remember this is a multi-tenancy
> application, Saas forum application).
> 1. verify user is logged in.
> 2. get user profile / permissions if logged in
> 3. get current forum object
> 4. get current thread object
> 5. get all replies to thread
>
> And probably 2-3 calls will come into the picture as i progress.
>
> If each call takes 100ms, this isn't going to be a 'snappy' response now is
> it?
>
> Or is there a way to batch calls to minimize the chattyness?
>

If you are pooling/reusing HBase client objects intelligently, have a good
schema design, good caching, and enough hardware, you should not be seeing
requests take that long. I do not want to promise 1ms performance on every
request, but that is pretty common.

You started your message with 'If I build a forum application'. Have
you done this? If so, are you actually seeing 100ms latency? Or are you
just asking what you would do next if you saw that theoretical latency?
Future troubleshooting?
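As a rough illustration of the "reuse your client objects" point (a sketch
only; the table, family, and row key names are made up, and HTable is not
thread-safe, so a real server would keep one instance per worker thread or
use a pool):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class ForumDao {
  private final HTable threads;  // created once, reused for every page view

  public ForumDao(Configuration conf) throws Exception {
    this.threads = new HTable(conf, "threads");
  }

  // One Get against an already-open table is typically low single-digit ms.
  public Result loadThread(String threadId) throws Exception {
    Get get = new Get(Bytes.toBytes(threadId));
    get.addFamily(Bytes.toBytes("content"));
    return threads.get(get);
  }

  public static void main(String[] args) throws Exception {
    ForumDao dao = new ForumDao(HBaseConfiguration.create());
    Result r = dao.loadThread("thread-12345");
    System.out.println("cells returned: " + r.size());
  }
}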


Re: why minimum 5 servers?

2010-07-14 Thread Edward Capriolo
On Wed, Jul 14, 2010 at 6:24 AM, S Ahmed  wrote:
> Is there a reason why 5 is the recommend number of servers (minimum) in a
> cluster?
>
> Why not 2 or 3?
>
> Just asking because 5 large ec2 instances (7.5gb ram) isn't *that* cheap :)
>
> Thanks!
>

Most of the answer is tied into the architecture of Hadoop. Normally
most people set dfs.replication to 3. You are not going to want your
NameNode to run on the same physical hardware as your DataNodes, so that is
already 4. You may want a dedicated ZooKeeper and HBase master, so that
is 5.

Performance-wise, if your dfs.replication = 3, then at 3 nodes you do
not have that 'critical mass' of servers for the scale-out effect. At
replication 3 with 3 nodes, every action (put, get) has some
effect on all servers. If you have replication 3 and 10 nodes, a single
put or get only affects roughly 30% of your cluster; with 100 nodes, 3%; and
so on.


Re: Need help trying to balance HBase RegionServer load

2010-06-17 Thread Edward Capriolo
On Thu, Jun 17, 2010 at 11:58 AM, Daniel Einspanjer  wrote:

>  Here is an example of a region split with both daughters being assigned to
> the same region.  Is this expected?
>
> 2010-06-17 08:34:53,060 INFO org.apache.hadoop.hbase.master.ServerManager:
> Processing MSG_REPORT_SPLIT_INCLUDES_DAUGHTERS:
> crash_reports,21006172700f355-1d02-485a-90d9-0e8182100617,1276776160508:
> Daughters;
> crash_reports,21006172700f355-1d02-485a-90d9-0e8182100617,1276788891647,
> crash_reports,21006172b7ec9f5-dcad-4c98-9dc5-969532100617,1276788891647 from
> cm-hadoop14.mozilla.org,60020,1276560962019; 1 of 1
> 2010-06-17 08:34:54,316 INFO org.apache.hadoop.hbase.master.RegionManager:
> Assigning region
> crash_reports,21006172700f355-1d02-485a-90d9-0e8182100617,1276788891647 to
> cm-hadoop15.mozilla.org,60020,1276778868841
> 2010-06-17 08:34:54,316 INFO org.apache.hadoop.hbase.master.RegionManager:
> Assigning region
> crash_reports,21006172b7ec9f5-dcad-4c98-9dc5-969532100617,1276788891647 to
> cm-hadoop15.mozilla.org,60020,12767788688412010-06-17 08:34:55,432 INFO
> org.apache.hadoop.hbase.master.ServerManager: Processing MSG_REPORT_OPEN:
> crash_reports,21006172700f355-1d02-485a-90d9-0e8182100617,1276788891647 from
> cm-hadoop15.mozilla.org,60020,1276778868841;
> 1 of 1
> 2010-06-17 08:34:55,432 INFO
> org.apache.hadoop.hbase.master.RegionServerOperation:
> crash_reports,21006172700f355-1d02-485a-90d9-0e8182100617,1276788891647 open
> on 10.2.72.74:60020
> 2010-06-17 08:34:55,436 INFO
> org.apache.hadoop.hbase.master.RegionServerOperation: Updated row
> crash_reports,21006172700f355-1d02-485a-90d9-0e8182100617,1276788891647 in
> region .META.,,1 with startcode=1276778868841, server=1
> 0.2.72.74:60020
> 2010-06-17 08:34:56,044 INFO org.apache.hadoop.hbase.master.ServerManager:
> Processing MSG_REPORT_OPEN:
> crash_reports,21006172b7ec9f5-dcad-4c98-9dc5-969532100617,1276788891647 from
> cm-hadoop15.mozilla.org,60020,1276778868841;
> 1 of 1
> 2010-06-17 08:34:56,044 INFO
> org.apache.hadoop.hbase.master.RegionServerOperation:
> crash_reports,21006172b7ec9f5-dcad-4c98-9dc5-969532100617,1276788891647 open
> on 10.2.72.74:60020
> 2010-06-17 08:34:56,048 INFO
> org.apache.hadoop.hbase.master.RegionServerOperation: Updated row
> crash_reports,21006172b7ec9f5-dcad-4c98-9dc5-969532100617,1276788891647 in
> region .META.,,1 with startcode=1276778868841, server=1
> 0.2.72.74:60020
>
>
>
> On 6/17/10 11:42 AM, Daniel Einspanjer wrote:
>
>>  Currently, in our production cluster, almost all of the traffic for a day
>> ends up assigned to a single RS and that causes the load on that machine to
>> be too high.
>>
>> With our last release, we salted our rowkeys so that rather than starting
>> with the date:
>> 100617
>> they now start with the first letter of the guid followed by the date:
>> e100617
>>
>> When I look at the region assignments though, I see a single server
>> assigned the following regions:
>> 0100617...
>> 1100617...
>> 2100617...
>> 3100617...
>> 4100617...
>> ...
>> d100617...
>> e100617...
>> f100617...
>>
>> Is there anything we can do to try to get the cluster to shuffle this up
>> some more?
>> We are getting compaction times in the minutes (one I saw was over 12
>> minutes) and this causes our clients to time out and shut down which causes
>> production outages.
>>
>> -Daniel
>>
>
Here comes a stone-age, stop-gap suggestion: if you shut down the region
server, the regions would move, but there is a period of time where each
region is inaccessible, so that is never good.


Re: Big machines or (relatively) small machines?

2010-06-10 Thread Edward Capriolo
On Thu, Jun 10, 2010 at 2:27 PM, Buttler, David  wrote:

> It turns out that we just received a quote from a supplier where a rack of
> 2U 128 GB machines with 16 cores (4x4 I think) and 8 1TB disks is cheaper
> than a rack of 1U machines with exactly half the spec (64 GB RAM, 8 core, 4
> 1TB disks).  My initial thought was that it would be better to have the 2U
> machines as it would give us more flexibility if we wanted to have some
> map/reduce jobs that use more than 8 GB per map task.
> The only worry is how it would affect HBase.  Would it be better to have 20
> region servers with a 16GB heap and 2 dedicated cores, or 40 region servers
> with a 8GB heap and one core? [Of course I realize we can't dedicate a core
> to a region server, but we can limit the number of map/reduce jobs so that
> there would be no more than 14 or 7 of them depending on the configuration]
>
> Finally, it seems like there are a bunch of related parameters that make
> sense to change together depending on heap size and avg row size.  Is there
> a single place that describes the interrelatedness of the parameters so that
> I don't have to guess or reconstruct good settings from 10-100 emails on the
> list?  If I understood the issues I would be happy to write it up, but I am
> afraid I don't.
>
> Thanks,
> Dave
>
> -Original Message-
> From: Ryan Rawson [mailto:ryano...@gmail.com]
> Sent: Monday, June 07, 2010 10:51 PM
> To: user@hbase.apache.org
> Subject: Re: Big machines or (relatively) small machines?
>
> I would take it one notch smaller, 32GB ram per node is probably more
> than enough...
>
> It would be hard to get full utilization of 128GB ram, and maybe even
> 64GB.  With 32GB you might even be able to get 2GB dimms (much
> cheaper).
>
> -ryan
>
> On Mon, Jun 7, 2010 at 10:48 PM, Sean Bigdatafun
>  wrote:
> > On Mon, Jun 7, 2010 at 1:13 PM, Todd Lipcon  wrote:
> >
> >> If those are your actual specs, I would definitely go with 16 of the
> >> smaller
> >> ones. 128G heaps are not going to work well in a JVM, you're better off
> >> running with more nodes with a more common configuration.
> >>
> >
> > I am not using one JVM on a machine, right? Each Map/Reduce task use one
> > JVM, I believe. And actually, my question can really be boiled down to
> > whether the current map/reduce scheduler is smart enough to make best use
> of
> > resources. If it is smart enough, I think virtualization does not make
> too
> > much sense; if it's not smart enough, I guess virtualization may help to
> > improve performance.
> >
> > But you are right,  here I was really making up a case -- "128G mem" is
> just
> > the number doubling the "smaller machine"'s memory.
> >
> >
> >
> >>
> >> -Todd
> >>
> >> On Mon, Jun 7, 2010 at 1:46 PM, Jean-Daniel Cryans  >> >wrote:
> >>
> >> > It really depends on your usage pattern, but there's a balance wrt
> >> > cost VS hardware you must achieve. At StumbleUpon we run with 2xi7,
> >> > 24GB, 4x 1TB and it works like a charm. The only thing I would change
> >> > is maybe more disks/node but that's pretty much it. Some relevant
> >> > questions:
> >> >
> >> >  - Do you have any mem-intensive jobs? If so, figure how many tasks
> >> > you'll run per node and make the RAM fit the load.
> >> >  - Do you plan to serve data out of HBase or will you just use it for
> >> > MapReduce? Or will it be a mix (not recommended)?
> >> >
> >> > Also, keep in mind that losing 1 machine over 8 compared to 1 over 16
> >> > drastically changes the performance of your system at the time of the
> >> > failure.
> >> >
> >> > About virtualization, it doesn't make sense. Also your disks should be
> in
> >> > JBOD.
> >> >
> >> > J-D
> >> >
> >> > On Wed, Jun 2, 2010 at 11:12 PM, Sean Bigdatafun
> >> >  wrote:
> >> > > I am thinking of the following problem lately. I started thinking of
> >> this
> >> > > problem in the following context.
> >> > >
> >> > > I have a predefined budget and I can either
> >> > >  -- A) purchase 8 more powerful servers (4cpu x 4 cores/cpu +  128GB
> >> mem
> >> > +
> >> > > 16 x 1TB disk) or
> >> > >  -- B) purchase 16 less powerful servers(2cpu x 4 cores/cpu +  64GB
> mem
> >> +
> >> > 8
> >> > > x 1TB disk)
> >> > >  NOTE: I am basically making up a half housepower scenario
> >> > >  -- Let's say I am going to use 10Gbps network switch and each
> machine
> >> > has
> >> > > a 10Gbps network card
> >> > >
> >> > > In the above scenario, does A or B perform better or relatively
> same?
> >> --
> >> > I
> >> > > guess this really depends on Hadoop's map/reduce's scheduler.
> >> > >
> >> > > And then I have a following question: does it make sense to
> virtualize
> >> a
> >> > > Hadoop datanode at all?  (if the answer to above question is
> >> "relatively
> >> > > same", I'd say it does not make sense)
> >> > >
> >> > > Thanks,
> >> > > Sean
> >> > >
> >> >
> >>
> >>
> >>
> >> --
> >> Todd Lipcon
> >> Software Engineer, Cloudera
> >>
> >
>

If you based your hardware purchase on someone e

Re: RowCounter example run time

2010-05-23 Thread Edward Capriolo
On Sun, May 23, 2010 at 10:36 AM, Michael Segel
wrote:

>
> J-D,
>
> Here's the problem.. you go to any relational database and do a select
> count(*) and you get a response back fairly quickly.
> The difference is that in HBase, you're doing a physical count and with the
> relational engine you're pulling it from meta data.
>
> I have a couple of ideas on how we could do this...
>
> -Mike
>
> > Date: Sat, 22 May 2010 09:25:51 -0700
> > Subject: Re: RowCounter example run time
> > From: jdcry...@apache.org
> > To: user@hbase.apache.org
> >
> > My first question would be, what do you expect exactly? Would 5 min be
> > enough? Or are you expecting something more like 1-2 secs (which is
> > impossible since this is mapreduce)?
> >
> > Then there's also Jon's questions.
> >
> > Finally, did you set a higher scanner caching on that job?
> > hbase.client.scanner.caching is the name of the config, which defaults
> > to 1. When mapping a HBase table, if you don't set it higher you're
> > basically benchmarking the RPC layer since it does 1 call per next()
> > invocation. Setting the right value depends on the size of your rows
> > eg are you storing 60 bytes or something high like 100KB? On our 13B
> > rows table (each row is a few bytes), we set it to 10k.
> >
> > J-D
> >
> > On Sat, May 22, 2010 at 8:40 AM, Andrew Nguyen
> >  wrote:
> > > Hello,
> > >
> > > I finally got some decent hardware to put together a 1 master, 4 slave
> Hadoop/HBase cluster.  However, I'm still waiting for space in the
> datacenter to clear out and only have 3 of the nodes deployed (master + 2
> slaves).  Each node is a quad-core AMD with 8G of RAM, running on a GigE
> network.  HDFS is configured to run on a separate (from the OS drive) U320
> drive.  The master has RAID1 mirrored drives only.
> > >
> > > I've installed HBase with slave1 and slave2 as regionservers and
> master, slave1, slave2 as the ZK quorom.  The master serves as the NN and JT
> and the slaves as DN and TT.
> > >
> > > Now my question:
> > >
> > > I've imported 22.5M rows into HBase, into a single table.  Each row has
> 8 or so columns.  I just ran the RowCounter MR example and it takes about 25
> minutes to complete.  Is a 3 node setup too underpowered to combat the
> overhead of Hadoop and HBase?  Or, could it be something with my
> configuration?  I've been playing around with Hadoop some but this is my
> first attempt at anything HBase.
> > >
> > > Thanks!
> > >
> > > --Andrew
>
>

Every system has its tradeoffs. In the example above:

>> select count(*) and you get a response back fairly quickly.

Try this with MyISAM: very fast. Try it with InnoDB: it takes a very
long time. Some systems maintain a row count and some do not.

Now, if you are using InnoDB there is a quick way to get an approximate row
count:

explain select count(*)

This causes the InnoDB engine to use its index statistics to estimate the
table size.

HBase does not maintain a row count. Counting rows is an intensive process
because it scans every row. Such is life.
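One possible workaround, not something proposed in this thread: have the
application keep its own counter with HBase's atomic increments at write
time, so reading the count is a single Get rather than a scan. A minimal
sketch, with made-up table, family, and qualifier names:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class CountedWrites {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable data = new HTable(conf, "mytable");
    HTable counters = new HTable(conf, "counters");

    // Write a row, then bump the shared counter atomically.
    Put put = new Put(Bytes.toBytes("row-1"));
    put.add(Bytes.toBytes("d"), Bytes.toBytes("payload"), Bytes.toBytes("hello"));
    data.put(put);
    counters.incrementColumnValue(Bytes.toBytes("mytable"),
        Bytes.toBytes("c"), Bytes.toBytes("rows"), 1L);

    // Reading the count back is a single Get, not a full table scan.
    Result r = counters.get(new Get(Bytes.toBytes("mytable")));
    long rows = Bytes.toLong(r.getValue(Bytes.toBytes("c"), Bytes.toBytes("rows")));
    System.out.println("approximate row count: " + rows);
  }
}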