Is it possible to get the table name at the Map phase?

2014-11-21 Thread yonghu
Hello all, I want to implement a difference operator using MapReduce. I read two tables using MultiTableInputFormat. In the Map phase, I need to tag each row with the name of its table, but how can I get the table name? One way I can think of is to create an HTable instance for each table in the

Current Deployment Sizes

2014-11-21 Thread Julian Wissmann
Hi, I'm currently writing my thesis, part of which is about HBase. I was wondering if there are some current numbers for large deployments, e.g. Facebook or Yahoo. I'm particularly interested in things like number of nodes, amount of data managed and (if available) query throughput. The most recent

Re: Current Deployment Sizes

2014-11-21 Thread Ted Yu
Have you looked at http://www.meetup.com/hbaseusergroup/files/ ? I think the following talks are relevant to your thesis: HBase-at-twitter http://files.meetup.com/1350427/HBase-at-twitter.pdf HBase Sizing Notes http://files.meetup.com/1350427/HBase%20Sizing%20Notes.pdf On Fri, Nov 21, 2014 at

Re: Is it possible to get the table name at the Map phase?

2014-11-21 Thread Ted Yu
This question has been asked a few times. Take a look at Nick's comment in HBASE-4587. Cheers On Fri, Nov 21, 2014 at 4:54 AM, yonghu yongyong...@gmail.com wrote: Hello all, I want to implement a difference operator using MapReduce. I read two tables using MultiTableInputFormat. In the
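
The tagging idea raised in this thread can be sketched without Hadoop or HBase. The example below is a local illustration only: it assumes each row is reduced to its row key, and the class name `TaggedDifference`, the helper names, and the table names `tableA` and `tableB` are all hypothetical. It does not show how to obtain the table name inside a real Mapper, which is the actual question here; it only shows what the tag is used for once it is available.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Local, Hadoop-free sketch of a tag-based difference (left table minus the rest). */
final class TaggedDifference {

    /** A row key tagged with the name of the table it was read from. */
    static final class TaggedRow {
        final String table;
        final String rowKey;

        TaggedRow(String table, String rowKey) {
            this.table = table;
            this.rowKey = rowKey;
        }
    }

    /** Stand-in for the map step: tag every row key with its source table. */
    static List<TaggedRow> tag(String table, List<String> rowKeys) {
        List<TaggedRow> out = new ArrayList<>();
        for (String key : rowKeys) {
            out.add(new TaggedRow(table, key));
        }
        return out;
    }

    /** Stand-in for the reduce step: keep keys whose only tag is leftTable. */
    static Set<String> difference(List<TaggedRow> rows, String leftTable) {
        // Group the table tags by row key, as the shuffle would do.
        Map<String, Set<String>> tagsByKey = new HashMap<>();
        for (TaggedRow r : rows) {
            tagsByKey.computeIfAbsent(r.rowKey, k -> new HashSet<>()).add(r.table);
        }
        // A key belongs to the difference if it was seen in leftTable and nowhere else.
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, Set<String>> e : tagsByKey.entrySet()) {
            Set<String> tags = e.getValue();
            if (tags.size() == 1 && tags.contains(leftTable)) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<TaggedRow> rows = new ArrayList<>();
        rows.addAll(tag("tableA", Arrays.asList("r1", "r2", "r3")));
        rows.addAll(tag("tableB", Arrays.asList("r2", "r4")));
        System.out.println(difference(rows, "tableA")); // prints [r1, r3]
    }
}
```

In a real job, `tag` would correspond to the Mapper emitting the table name alongside each row key, and `difference` to the Reducer inspecting the tags collected for one key.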

Re: Current Deployment Sizes

2014-11-21 Thread Julian Wissmann
Hi, thank you! The meetup link comes in handy. However, this is not the answer to the question I asked (or maybe I wasn't clear enough). I am well aware of the sizing notes etc. However, what I am looking for are some hard numbers concerning actual scale in the real world. I can write a lot about

RE: Current Deployment Sizes

2014-11-21 Thread Birdsall, Dave
Hi Julian, I don't have an answer to your question, but I want to better understand your question: You are looking for data on the largest HBase deployments in practice, correct? Regards, Dave -Original Message- From: Julian Wissmann [mailto:julianwissm...@gmail.com] Sent: Friday,

Re: Current Deployment Sizes

2014-11-21 Thread Shahab Yunus
I think your best bet, to get the latest and most accurate data possible, would be to directly contact the companies (through their Engineering channels) which are known to host large clusters. Most of these companies have public blogs and such, so it should not be hard to find an appropriate contact.

Re: Current Deployment Sizes

2014-11-21 Thread Julian Wissmann
Exactly! Regards, Julian 2014-11-21 17:16 GMT+01:00 Birdsall, Dave dave.birds...@hp.com: Hi Julian, I don't have an answer to your question, but I want to better understand your question: You are looking for data on the largest HBase deployments in practice, correct? Regards, Dave

Re: Current Deployment Sizes

2014-11-21 Thread Ted Yu
Take a look at slide #4 in this talk: http://www.slideshare.net/ddlatham/hbase-at-flurry Cheers On Fri, Nov 21, 2014 at 7:43 AM, Julian Wissmann julianwissm...@gmail.com wrote: Hi, thank you! The meetup link comes in handy. However this is not the answer to the question I asked (or maybe I

Thrift getRows API with column filtering?

2014-11-21 Thread JM Tremblay
Hi, I'm using the node.js HBase Thrift client. I can use getRows() to fetch specific rows with all their columns or getRowsWithColumns() to specify the columns or column families to return. But I can't figure out how to specify columns starting with a given prefix, as it seems to be possible

balance the tables across region servers

2014-11-21 Thread Arul Ramachandran
Hello, I'm on HBase 0.96. I notice that a couple of our tables are on only one of the region servers, and that server is handling orders of magnitude more requests per second than the others. Will setting hbase.master.loadbalance.bytable to true help in this situation? Also, if that is the case, wondering why this is not set to

Re: balance the tables across region servers

2014-11-21 Thread Ted Yu
Please take a look at TableSkewCostFunction in StochasticLoadBalancer (the default balancer): private static final String TABLE_SKEW_COST_KEY = "hbase.master.balancer.stochastic.tableSkewCost"; private static final float DEFAULT_TABLE_SKEW_COST = 35; You can increase the value
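
A minimal sketch of how that key could be raised, assuming it is set in hbase-site.xml on the master. The value 100 is an arbitrary illustration, not a recommendation; the default of 35 is the one quoted in the message above.

```xml
<!-- hbase-site.xml on the master -->
<property>
  <name>hbase.master.balancer.stochastic.tableSkewCost</name>
  <!-- Hypothetical value for illustration; the default quoted above is 35. -->
  <value>100</value>
</property>
```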

Re: how to explain read/write performance change after modifying the hfile.block.cache.size?

2014-11-21 Thread Nick Dimiduk
400mb blockcache? Ouch. What's your hbase-env.sh? Have you configured a heap size? My guess is you're using the unconfigured default of 1G. Should be at least 8G, and maybe more like 30G with this kind of host. How many users are sharing it and with what kinds of tasks? If there's no IO

Re: balance the tables across region servers

2014-11-21 Thread Arul Ramachandran
Hi Ted, You suggest this because StochasticLoadBalancer is the default in 0.96? What about setting hbase.master.loadbalance.bytable to true? Thanks, Arul On Fri, Nov 21, 2014 at 1:47 PM, Ted Yu yuzhih...@gmail.com wrote: Please take a look at TableSkewCostFunction in StochasticLoadBalancer

Re: balance the tables across region servers

2014-11-21 Thread Ted Yu
bq. StochasticLoadBalancer is the default in 0.96 True. Adjusting hbase.master.loadbalance.bytable is not recommended in 0.96+ Cheers On Fri, Nov 21, 2014 at 3:14 PM, Arul Ramachandran arkup...@gmail.com wrote: Hi Ted, You suggest this because StochasticLoadBalancer is the default in 0.96

Error and warnings on HBaseTestingUtil.shutdownMiniCluster

2014-11-21 Thread Stephen Boesch
Hi, I have created a test case that uses HBaseTestingUtility.startMiniCluster and .shutdownMiniCluster. The test passes, but the shutdownMiniCluster is not clean. The following is the output. There are two questions about the error/warnings: a) Is there a way to fix the Error