Re: Poor HBase map-reduce scan performance

2013-04-30 Thread Matt Corgan
Not that it's a long-term solution, but try major-compacting before running the benchmark. If the LSM tree is CPU bound in merging HFiles/KeyValues through the PriorityQueue, then reducing to a single file per region should help. The merging of HFiles during a scan is not heavily optimized yet.
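
The merge cost mentioned above can be sketched in plain Java. This is an illustration only, not HBase's scanner code: the class and method names are hypothetical, and sorted string lists stand in for HFiles. Each emitted key costs one heap poll and usually one heap insert, so the work per key grows with the number of files being merged; with a single file per region the heap holds one entry and the merge is trivial.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative only: a k-way merge of sorted runs (stand-ins for HFiles).
final class KWayMerge {
    // One heap entry: the current head key of a run plus an iterator over the rest.
    private static final class Head {
        final String key;
        final Iterator<String> rest;
        Head(String key, Iterator<String> rest) {
            this.key = key;
            this.rest = rest;
        }
    }

    // Merges already-sorted runs into one sorted list.
    static List<String> merge(List<List<String>> sortedRuns) {
        PriorityQueue<Head> heap = new PriorityQueue<>((a, b) -> a.key.compareTo(b.key));
        for (List<String> run : sortedRuns) {
            Iterator<String> it = run.iterator();
            if (it.hasNext()) {
                heap.add(new Head(it.next(), it));
            }
        }
        List<String> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            Head smallest = heap.poll();          // O(log k) per emitted key
            out.add(smallest.key);
            if (smallest.rest.hasNext()) {
                heap.add(new Head(smallest.rest.next(), smallest.rest));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> runs = Arrays.asList(
                Arrays.asList("a", "d", "g"),
                Arrays.asList("b", "e"),
                Arrays.asList("c", "f"));
        System.out.println(merge(runs)); // prints [a, b, c, d, e, f, g]
    }
}
```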

Re: Poor HBase map-reduce scan performance

2013-04-30 Thread lars hofhansl
If you can, try 0.94.4+; it should significantly reduce the number of bytes copied around in RAM during scanning, especially if you have wide rows and/or large key portions. That in turn makes scans scale better across cores, since RAM is a shared resource between cores (much like disk). It's n

HBase cluster replication firewall rules

2013-04-30 Thread Levy Meny
Hi, We are using HBase replication (over Apache 0.94.2) between two sites and we need to define firewall rules between the two sites. Can anyone provide some information regarding the ports that are used between the sites? Our understanding is: Replication is from site1 to site2. * Region

Re: Poor HBase map-reduce scan performance

2013-04-30 Thread Bryan Keller
The table has hashed keys so rows are evenly distributed amongst the regionservers, and load on each regionserver is pretty much the same. I also have per-table balancing turned on. I get mostly data local mappers with only a few rack local (maybe 10 of the 250 mappers). Currently the table is

Re: Schema Design Question

2013-04-30 Thread lars hofhansl
Same here. HBase is generally good at homing in on a small (maybe 10-100m rows) contiguous subset of an essentially unlimited dataset. If all you ever do is scan _everything_ and then throw it away, a straight scan (using Impala for example) or direct M/R on file(s) in HDFS is far better.

Re: Read access pattern

2013-04-30 Thread lars hofhansl
I do not want to be rude or anything... But how often do we need to have this discussion? When you salt your rowkeys with say 10 salt values then for each read you need to fork off 10 read requests, and each of them touches only 1/10th of the table (which works nicely with HBase's prefix scans). Obviou

Re: Poor HBase map-reduce scan performance

2013-04-30 Thread lars hofhansl
Your average row is 35k so scanner caching would not make a huge difference, although I would have expected some improvements by setting it to 10 or 50 since you have a wide 10ge pipe. I assume your table is split sufficiently to touch all RegionServer... Do you see the same load/IO on all regi
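
A rough back-of-the-envelope for the caching values discussed in this thread, as a hedged sketch: it assumes "35k" means about 35 KB per row, takes the "20 mil rows" figure from the original post, and treats one scanner round trip as carrying exactly `caching` rows. Real scans are also bounded by other limits, so the numbers are only indicative. The class and method names are hypothetical.

```java
// Illustrative arithmetic only; not part of the HBase API.
final class ScanCachingMath {
    // Round trips needed to stream `rows` rows at `caching` rows per call (ceiling division).
    static long rpcCount(long rows, int caching) {
        return (rows + caching - 1) / caching;
    }

    // Approximate payload carried by one round trip.
    static long bytesPerRpc(long avgRowBytes, int caching) {
        return avgRowBytes * caching;
    }

    public static void main(String[] args) {
        long rows = 20_000_000L;    // "20 mil rows" from the original post
        long avgRowBytes = 35_000L; // assumption: "35k" read as roughly 35 KB
        for (int caching : new int[] {1, 10, 50, 500}) {
            System.out.printf("caching=%d rpcs=%d bytesPerRpc=%d%n",
                    caching, rpcCount(rows, caching), bytesPerRpc(avgRowBytes, caching));
        }
    }
}
```

Under these assumptions a caching value of 10 already moves about 350 KB per call, which is consistent with the remark that going much higher gives diminishing returns for rows this large.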

Re: Poor HBase map-reduce scan performance

2013-04-30 Thread Bryan Keller
Yes, I have it enabled (forgot to mention that). On Apr 30, 2013, at 9:56 PM, Ted Yu wrote: > Have you tried enabling short circuit read ? > > Thanks > > On Apr 30, 2013, at 9:31 PM, Bryan Keller wrote: > >> Yes, I have tried various settings for setCaching() and I have >> setCacheBlocks(fa

Re: Poor HBase map-reduce scan performance

2013-04-30 Thread Ted Yu
Have you tried enabling short circuit read ? Thanks On Apr 30, 2013, at 9:31 PM, Bryan Keller wrote: > Yes, I have tried various settings for setCaching() and I have > setCacheBlocks(false) > > On Apr 30, 2013, at 9:17 PM, Ted Yu wrote: > >> From http://hbase.apache.org/book.html#mapreduce.

Re: Poor HBase map-reduce scan performance

2013-04-30 Thread Bryan Keller
Yes, I have tried various settings for setCaching() and I have setCacheBlocks(false) On Apr 30, 2013, at 9:17 PM, Ted Yu wrote: > From http://hbase.apache.org/book.html#mapreduce.example : > > scan.setCaching(500);// 1 is the default in Scan, which will > be bad for MapReduce jobs > sc

Re: Poor HBase map-reduce scan performance

2013-04-30 Thread Ted Yu
From http://hbase.apache.org/book.html#mapreduce.example : scan.setCaching(500);// 1 is the default in Scan, which will be bad for MapReduce jobs scan.setCacheBlocks(false); // don't set to true for MR jobs I guess you have used the above setting. 0.94.x releases are compatible. Have y

Poor HBase map-reduce scan performance

2013-04-30 Thread Bryan Keller
I have been attempting to speed up my HBase map-reduce scans for a while now. I have tried just about everything without much luck. I'm running out of ideas and was hoping for some suggestions. This is HBase 0.94.2 and Hadoop 2.0.0 (CDH4.2.1). The table I'm scanning: 20 mil rows Hundreds of col

Re: Very poor read performance with composite keys in hbase

2013-04-30 Thread James Taylor
Have you had a look at Phoenix (https://github.com/forcedotcom/phoenix)? It'll use all of the parts of your row key and depending on how much data you're returning back to the client, will query over 10 million rows in seconds. James @JamesPlusPlus http://phoenix-hbase.blogspot.com On Apr 30, 20

Re: Very poor read performance with composite keys in hbase

2013-04-30 Thread kulkarni.swar...@gmail.com
That depends on how dynamic your data is. If it is pretty static, you can also consider using something like Create Table As Select (CTAS) to create a snapshot of your data to HDFS and then run queries on top of that data. So your query might become something like: create table my_table as select

RE: Very poor read performance with composite keys in hbase

2013-04-30 Thread Rupinder Singh
Swarnim, Thanks. So this means custom map reduce is the viable option when working with hbase tables having composite keys, since it allows to set the start and stop keys. Hive+Hbase combination is out. Regards Rupinder From: kulkarni.swar...@gmail.com [mailto:kulkarni.swar...@gmail.com] Sent:

Re: HBase is not running.

2013-04-30 Thread Yves S. Garret
Hi Jean-Marc. Thanks for the tip. However, for the moment at least, I'm going to be abandoning my forays into HBase, I received direction to focus on Hive instead. Again, thank you. Should I need help in the near future, I'll be sure to send the mailing list an enquiry. On Tue, Apr 30, 2013 a

Re: checkAnd...

2013-04-30 Thread Lior Schachter
Hi, We have a simple HBase schema: row key = subscriber id. Column family A = counters - all kinds of aggregations. Event records have a UUID; in some scenarios we might get duplicate events. We should not count the duplicates. A possible solution was to keep event ids as qualifiers in another C
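
The "count each event id once" idea can be pictured with an in-memory analogy. This is a sketch only: `ConcurrentHashMap.putIfAbsent` stands in for an atomic check-and-put on the event id, the class name is hypothetical, and nothing here touches HBase or says how the poster's schema was finally designed.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative analogy: increment a counter only the first time an event id is seen.
final class DedupCounter {
    private final ConcurrentHashMap<String, Boolean> seenEventIds = new ConcurrentHashMap<>();
    private final AtomicLong count = new AtomicLong();

    // Returns true if the event was counted, false if it was a duplicate.
    boolean record(String eventUuid) {
        // putIfAbsent is atomic: only one caller per id sees null.
        if (seenEventIds.putIfAbsent(eventUuid, Boolean.TRUE) != null) {
            return false;
        }
        count.incrementAndGet();
        return true;
    }

    long count() {
        return count.get();
    }

    public static void main(String[] args) {
        DedupCounter counter = new DedupCounter();
        counter.record("event-1");
        counter.record("event-1"); // duplicate, ignored
        counter.record("event-2");
        System.out.println(counter.count()); // prints 2
    }
}
```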

Re: Very poor read performance with composite keys in hbase

2013-04-30 Thread kulkarni.swar...@gmail.com
Rupinder, Hive supports a filter pushdown[1] which means that the predicates in the where clause are pushed down to the storage handler level where either they get handled by the storage handler or delegated to hive if they cannot handle them. As of now, the HBaseStorageHandler only supports primi

Re: HBase and Datawarehouse

2013-04-30 Thread Michael Segel
Hmmm I don't recommend HBase in situations where you are not running a M/R Framework. Sorry, as much as I love HBase, IMHO there are probably better solutions for standalone NoSQL databases. (YMMV depending on your use case.) The strength of HBase is that it's part of the Hadoop Ecosystem. I

Re: HBase and Datawarehouse

2013-04-30 Thread Andrew Purtell
Running more than one RS on a host is an option for soaking up "extra" RAM, since that is what we are discussing, but I can't recommend it because I have no experience with that approach. I think I do want to experiment with it, but not on a box with less than something like 16 or 24 cores. On Tu

Re: HBase and Datawarehouse

2013-04-30 Thread Andrew Purtell
You wouldn't do that if colocating MR. It is one way to soak up "extra" RAM on a large RAM box, although I'm not sure I would recommend it (I have no personal experience trying it, yet). For more on this where people are actively considering it, see https://issues.apache.org/jira/browse/BIGTOP-732

Re: HBase and Datawarehouse

2013-04-30 Thread Amandeep Khurana
Multiple RS' per host gets you around the WAL bottleneck as well. But it's operationally less than ideal. Do you usually recommend this approach, Andy? I've shied away from it mostly. On Apr 30, 2013, at 10:38 AM, Andrew Purtell wrote: > Rules of thumb for starting off safely and for easing supp

Re: HBase and Datawarehouse

2013-04-30 Thread Michael Segel
Multiple RS per host? Huh? That seems very counter intuitive and potentially problematic w M/R jobs. Could you expand on this? Thx -Mike On Apr 30, 2013, at 12:38 PM, Andrew Purtell wrote: > Rules of thumb for starting off safely and for easing support issues are > really good to have, bu

Re: Very poor read performance with composite keys in hbase

2013-04-30 Thread kulkarni.swar...@gmail.com
Can you show your query that is taking 700 seconds? On Tue, Apr 30, 2013 at 12:48 PM, Rupinder Singh wrote: > Hi, > > I have an hbase cluster where I have a table with a composite key. I map > this table to a Hive external table using which I insert/select data > into/from this t

Re: Very poor read performance with composite keys in hbase

2013-04-30 Thread Sanjay Subramanian
My experience with hive + hbase has been about 8x slower on average. So I went ahead with the hive-only option. Sent from my iPhone On Apr 30, 2013, at 11:19 PM, "Rupinder Singh" <rsi...@care.com> wrote: Hi, I have an hbase cluster where I have a table with a composite key. I map this

RE: Very poor read performance with composite keys in hbase

2013-04-30 Thread Rupinder Singh
Here it is: select * from event where key.name='Signup' and key.dateCreated='2013-03-06 16:39:55.353' and key.uid='7af4c330-5988-4255-9250-924ce5864e3bf'; From: kulkarni.swar...@gmail.com [mailto:kulkarni.swar...@gmail.com] Sent: Tuesday, April 30, 2013 11:25 PM To: u...@hive.apache.org Cc: use

Very poor read performance with composite keys in hbase

2013-04-30 Thread Rupinder Singh
Hi, I have an hbase cluster where I have a table with a composite key. I map this table to a Hive external table using which I insert/select data into/from this table: CREATE EXTERNAL TABLE event(key struct, {more columns here}) ROW FORMAT DELIMITED COLLECTION ITEMS TERMINATED BY '~' STORED BY

Re: HBase and Datawarehouse

2013-04-30 Thread Andrew Purtell
Rules of thumb for starting off safely and for easing support issues are really good to have, but there are no hard barriers or singular approaches: use Java 7 + G1GC, disable HBase blockcache in lieu of OS blockcache, run multiple regionservers per host. It is going to depend on how the cluster is

Re: Read access pattern

2013-04-30 Thread Michael Segel
Sure. By definition, the salt number is a random seed that is not associated with the underlying record. A simple example is a round robin counter (mod the counter by 10 yielding [0..9] ) So you get a record, prepend your salt and you write it out to HBase. The salt will push the data out to
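
A minimal sketch of the round-robin salt described above, in plain Java with hypothetical names and a hypothetical `salt-key` layout. It shows the write side (prepend a counter mod 10) and the consequence for reads raised elsewhere in this thread: because the salt is unrelated to the record, a reader has to try every bucket.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: round-robin salting with 10 buckets.
final class RoundRobinSalt {
    static final int BUCKETS = 10;
    private int counter = 0;

    // Write path: prepend the next salt in [0..9]; the salt does not depend on the key.
    String saltedKey(String rowKey) {
        int salt = counter;
        counter = (counter + 1) % BUCKETS;
        return salt + "-" + rowKey;
    }

    // Read path: the salt cannot be derived from the key, so every bucket is a candidate.
    static List<String> candidateKeys(String rowKey) {
        List<String> keys = new ArrayList<>();
        for (int salt = 0; salt < BUCKETS; salt++) {
            keys.add(salt + "-" + rowKey);
        }
        return keys;
    }

    public static void main(String[] args) {
        RoundRobinSalt writer = new RoundRobinSalt();
        System.out.println(writer.saltedKey("row42"));        // prints 0-row42
        System.out.println(writer.saltedKey("row43"));        // prints 1-row43
        System.out.println(candidateKeys("row42").size());    // prints 10
    }
}
```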

Re: Read access pattern

2013-04-30 Thread James Taylor
bq. The downside that I see, is the bucket_number that we have to maintain both at time of reading/writing and update it in case of cluster restructuring. I agree that this maintenance can be painful. However, Phoenix (https://github.com/forcedotcom/phoenix) now supports salting, automating t

Re: Re: While starting 3-nodes cluster hbase: WARN org.apache.zookeeper.ClientCnxn: Session 0x0 for server null

2013-04-30 Thread Jean-Marc Spaggiari
Hi John, Thanks for sharing that. Might help other people who are facing the same issues. JM 2013/4/30 John Foxinhead > Now I post my configurations: > I use a 3 nodes cluster with all the nodes runnind hadoop, zookeeper and > hbase. Hbase master, a zookeeper daemon and Hadoop namenode run on

Re: Re: While starting 3-nodes cluster hbase: WARN org.apache.zookeeper.ClientCnxn: Session 0x0 for server null

2013-04-30 Thread John Foxinhead
I solved my problem with zookeeper. I don't know how, maybe it was a spell xD I did it this way: on a slave I removed the directory of hbase, and I copied the directory of hbase-pseudo-distribuited (which works). Then I copied all the configurations from the virtual machines which ran as master in

Re: Re: While starting 3-nodes cluster hbase: WARN org.apache.zookeeper.ClientCnxn: Session 0x0 for server null

2013-04-30 Thread John Foxinhead
I solved the last problem: I modified the file /etc/hostname and I replaced the default hostname, "debian01", with "namenode", "jobtracker", or "datanode", the hostnames I used in hbase conf files. Now I start hbase from master with "bin/start-hbase.sh" and regionservers, instead of trying to connec

RE: Read access pattern

2013-04-30 Thread ricla
Yes, I see, but this is quite expensive as the table is huge -Original Message- From: Jean-Marc Spaggiari [mailto:jean-m...@spaggiari.org] Sent: Monday, April 29, 2013 20:04 To: user@hbase.apache.org; ri...@laposte.net Subject: Re: Read access pattern HBASE-4811 is what you should be look

RE: Read access pattern

2013-04-30 Thread ricla
1. Change the schema If I understand correctly, in this scenario, I lose the ordering (changeDate desc). Moreover in my case, I could have 100k rows per objectId, meaning I would have to iterate a long list, but I understand the logic. If I only look for 24 hours before the original column hour

Re: Re: While starting 3-nodes cluster hbase: WARN org.apache.zookeeper.ClientCnxn: Session 0x0 for server null

2013-04-30 Thread John Foxinhead
Now I post my configurations: I use a 3-node cluster with all the nodes running hadoop, zookeeper and hbase. Hbase master, a zookeeper daemon and Hadoop namenode run on the same host. Hbase regionserver, a zookeeper daemon and hadoop datanode run on the other 2 nodes. I called one of the datanodes

Re: Read access pattern

2013-04-30 Thread Shahab Yunus
Well those are *some* words :) Anyway, can you explain in a bit more detail why you feel so strongly about this design/approach? The salting here is not the only option mentioned and static hashing can be used as well. Plus even in case of salting, wouldn't the distributed scan take care of it? The

Re: discp versus export

2013-04-30 Thread Suraj Varma
Read this: http://blog.sematext.com/2011/03/11/hbase-backup-options/ for the high level difference between export and distcp. The key factor here is the data in memstore that has not been flushed out to disk yet ... and the resultant inconsistency if you just do distcp. --Suraj On Tue, Apr 30, 20

Re: discp versus export

2013-04-30 Thread Asaf Mesika
The replication.html reference appears to contain a reference to a bug (2611) which was solved two years ago :) On Wed, Mar 6, 2013 at 12:15 AM, Damien Hardy wrote: > IMO the easier would be hbase export. For long term offline backup (for > disaster recovery). It can even be stored on a differe

Re: Read access pattern

2013-04-30 Thread Michael Segel
Geez that's a bad article. Never salt. And yes there's a difference between using a salt and using the first 2-4 bytes from your MD5 hash. (Hint: Salts are random. Your hash isn't. ) Sorry to be-itch but it's a bad idea and it shouldn't be propagated. On Apr 29, 2013, at 10:17 AM, Shahab Yu
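
The distinction drawn here, that a hash-derived prefix is deterministic while a salt is not, can be sketched as follows. This is an illustration with hypothetical names and a hypothetical `prefix-key` layout; it uses the first 2 bytes of the MD5 digest, the low end of the "2-4 bytes" mentioned above. Because the prefix is recomputed from the key itself, a reader can rebuild the full stored key and issue a single lookup.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustrative only: prefix a row key with the first 2 bytes of its MD5 digest.
final class HashPrefix {
    static String prefixedKey(String rowKey) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(rowKey.getBytes(StandardCharsets.UTF_8));
            // Same key in, same prefix out: no fan-out is needed on reads.
            return String.format("%02x%02x-%s", digest[0] & 0xff, digest[1] & 0xff, rowKey);
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(prefixedKey("abc")); // prints 9001-abc
    }
}
```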

Re: HBase and Datawarehouse

2013-04-30 Thread Michael Segel
Tell me why your RS needs to be that large? (> 8 GB. ) I think the answer is that it depends. Especially when you start to add in coprocessors. I'm not saying that there are not legitimate reasons, but that a lot of time, people just up the heap size without thinking about the problem. To Kevi

Re: Corrupt files

2013-04-30 Thread Jean-Marc Spaggiari
Bonjour Loïc, I don't think you can restore those blocks. If you have only one datanode and it doesn't have the missing blocks, there is nowhere for hadoop to get those blocks back. So unfortunately I don't think you can restore them. Also, this is more hadoop than hbase related. You might want to

Re: Corrupt files

2013-04-30 Thread Loic Talon
Hi Jean-Marc, Thanks. I have one datanode in my cluster. The node isn't down. How can I restore those blocks ? Loïc TALON mail.lta...@teads.tv Video Ads Solutions 2013/4/30 Jean-Marc Spaggiari > Hi Loïc, > > How many datanodes do you have on your cluster? Your replica

Re: HBase and Datawarehouse

2013-04-30 Thread Kevin O'dell
Asaf, The heap barrier is something of a legend :) You can ask 10 different HBase committers what they think the max heap is and get 10 different answers. This is my take on heap sizes from the many clusters I have dealt with: 8GB -> Standard heap size, and tends to run fine without any tunin

Re: HBase is not running.

2013-04-30 Thread Jean-Marc Spaggiari
Hi Yves, Your host file looks good. Don't even try the shell until you get the UI displayed correctly and the server logs saying that initialization is done. So what do you have on the logs when you are trying with this new host file? JM 2013/4/28 Asaf Mesika > Http://Devving.com has a good

Re: Corrupt files

2013-04-30 Thread Jean-Marc Spaggiari
Hi Loïc, How many datanodes do you have in your cluster? Your replication factor is set to 3 so I think you should have at least 3 datanodes? Is one of those nodes down? There are some blocks missing; maybe they are on a system which is down now? Bringing it back up might restore those blocks. JM

Re: Scala and Hbase, hbase-default.xml file seems to be for and old version of HBase (null)

2013-04-30 Thread Michel Segel
Aren't the defaults now embedded in the base jars? Sent from a remote device. Please excuse any typos... Mike Segel On Apr 29, 2013, at 11:55 PM, Håvard Wahl Kongsgård wrote: > Nope.. the system is clean only CDH4 on it. And I can't find > hbase-default.xml on the system. > > However, I solve

Re: max regionserver handler count

2013-04-30 Thread Viral Bajaria
Looked closely into the async API and there is no way to batch GETs to reduce the # of RPC calls and thus handlers. Will play around tomorrow with the handlers again and see if I can find anything interesting. On Tue, Apr 30, 2013 at 12:03 AM, Anoop John wrote: > If you can make use of the batch

Re: HBase Export MR - Some mappers getting Stuck

2013-04-30 Thread Ted Yu
Can you take a look at region server log when this happens and see if there is some clue ? jstack on region server side would help. Cheers On Mon, Apr 29, 2013 at 10:42 PM, Ashwanth Kumar < ashwanthku...@googlemail.com> wrote: > Hey, > > I have this issue where in some mappers get stuck mid-way

Re: HBase and Datawarehouse

2013-04-30 Thread Andrew Purtell
I don't wish to be rude, but you are making odd claims as fact as "mentioned in a couple of posts". It will be difficult to have a serious conversation. I encourage you to test your hypotheses and let us know if in fact there is a JVM "heap barrier" (and where it may be). On Monday, April 29, 2013

Re: max regionserver handler count

2013-04-30 Thread Anoop John
If you can make use of the batch API, i.e. get(List<Get>), you can reduce the handlers (and the number of RPC calls also). One batch will use one handler. >I am using asynchbase which does not have the notion of batch gets I have not checked with asynchbase. Just telling as a pointer.. -Anoop- On Tue, Apr 3

Re: Scala and Hbase, hbase-default.xml file seems to be for and old version of HBase (null)

2013-04-30 Thread Håvard Wahl Kongsgård
Nope.. the system is clean, only CDH4 on it. And I can't find hbase-default.xml on the system. However, I solved this issue by downloading http://hbase_master:60010/conf, renaming it to hbase-default.xml and adding that to the classpath. So maybe a bug in CDH4. On Mon, Apr 29, 2013 at 11:36 PM, S