Hbase schema design help

2012-02-14 Thread Raj N
Hi All, I am new to the NoSQL world and need help/suggestions to design an HBase schema for the requirement below. It is a report generation application using Hadoop. Now I want to store a particular user's report history in HBase. The user's email id will be used to track all his previously run report

Re: ERROR zookeeper.RecoverableZooKeeper: ZooKeeper exists failed after 3 retries

2012-02-14 Thread Bing Li
Dear Jean-Daniel, The issue is solved. I think the book HBase: The Definitive Guide does not give a sufficient description of the pseudo-distributed mode. Thanks so much! Bing On Tue, Feb 14, 2012 at 7:27 AM, Jean-Daniel Cryans jdcry...@apache.org wrote: Is zookeeper running properly?

Re: LeaseException while extracting data via pig/hbase integration

2012-02-14 Thread Mikael Sitruk
Hi, Well no, I can't figure out what the problem is, but I saw that someone else had the same problem (see email: LeaseException despite high hbase.regionserver.lease.period). What I can tell is the following: Last week the problem was consistent. 1. I updated hbase.regionserver.lease.period=30

Re: Hbase schema design help

2012-02-14 Thread Monish r
Hi, You can set the max versions for that table to Integer.MAX_VALUE, so that the records are identified uniquely by the timestamp (milliseconds) at which they are inserted. In HBase each and every cell in the table is indexed, so if you have a larger number of columns, you can store them as a
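
A minimal sketch of that table setup with the Java admin API of the time (the table and family names here are made up for illustration, not from the original thread):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class CreateReportTable {
      public static void main(String[] args) throws Exception {
        HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
        HTableDescriptor desc = new HTableDescriptor("report_history");
        HColumnDescriptor family = new HColumnDescriptor("app");
        // Keep every insert as its own version, distinguished by its timestamp.
        family.setMaxVersions(Integer.MAX_VALUE);
        desc.addFamily(family);
        admin.createTable(desc);
      }
    }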

investigating replacing RDBMS with HBase based solution - splitting daily data inflow?

2012-02-14 Thread Igor Lautar
Hi All, I'm doing an investigation into performance and scalability improvements for one of our solutions. I'm currently in a phase where I try to understand whether HBase (+MapReduce) could provide the scalability needed. This is the current situation: - assume a daily inflow of 10 GB of data (20+ million

HBase and Data Locality

2012-02-14 Thread Praveen Sripati
Hi, Lars' blog (1) mentions that data locality for the region servers is lost when the HBase cluster is restarted. It's also mentioned at the end that work is going on in HBase to assign regions to RSs taking data locality into consideration. The blog entry is 18 months old and so I would like to know if

Re: HBase and Data Locality

2012-02-14 Thread Brock Noland
Hi, On Tue, Feb 14, 2012 at 7:13 AM, Praveen Sripati praveensrip...@gmail.com wrote: Lars blog (1) mentions that data locality for the region servers is lost when HBase cluster is restarted. It's also mentioned at the end that work is going in HBase to assign regions to RS taking data locality

Re: Can coprocessor operate HDFS directly?

2012-02-14 Thread Sanel Zukan
AFAIK it is possible, just make sure regionservers can see hadoop jar (which is true by default). Actually, you can call anything from these methods ;) On Tue, Feb 14, 2012 at 9:15 AM, NNever nnever...@gmail.com wrote: As we know in HBase coprocessor methods such as prePut, we can operate

Re: HBase and Data Locality

2012-02-14 Thread Mikael Sitruk
Region allocation is kept across restarts ( https://issues.apache.org/jira/browse/HBASE-2896 ). This is also present in the CDH3 code. Nevertheless, if you have a server that did not start correctly, you will have regions that move off it and locality will not remain (even after you start

Re: how get() works

2012-02-14 Thread Vamshi Krishna
Thank you Doug. One more question: if a particular region is found by looking at the range handled by it, how is the search performed within that region to find the requested rowKey? Is it by linear search, binary search, or some other algorithm? Or, for every row in that region, is there any hash

Re: Hbase schema design help

2012-02-14 Thread Mikael Sitruk
Why don't you prefix the columns with an execution date (in reverse order, so the last execution is the first one)? That is: email id (row key) - (columns) appName:reportName, appName:executionDate_startDate, appName:executionDate_endDate, appName:executionDate_status. So all executions for a specific
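
A rough sketch of how one report execution could be written under that layout (the family, qualifier, table, and row values are illustrative assumptions, not from the thread):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class StoreReportRun {
      public static void main(String[] args) throws Exception {
        HTable table = new HTable(HBaseConfiguration.create(), "report_history");
        // Reverse the execution date so the newest run sorts first among the columns.
        long reversed = Long.MAX_VALUE - System.currentTimeMillis();
        String prefix = String.format("%019d", reversed);
        Put put = new Put(Bytes.toBytes("user@example.com"));  // row key = email id
        put.add(Bytes.toBytes("app"), Bytes.toBytes(prefix + "_reportName"), Bytes.toBytes("dailySales"));
        put.add(Bytes.toBytes("app"), Bytes.toBytes(prefix + "_startDate"), Bytes.toBytes("2012-02-01"));
        put.add(Bytes.toBytes("app"), Bytes.toBytes(prefix + "_endDate"), Bytes.toBytes("2012-02-14"));
        put.add(Bytes.toBytes("app"), Bytes.toBytes(prefix + "_status"), Bytes.toBytes("SUCCESS"));
        table.put(put);
        table.close();
      }
    }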

Re: Can coprocessor operate HDFS directly?

2012-02-14 Thread NNever
Thanks Sanel. I tried to use *FileSystem fs = FileSystem.get(HBaseConfiguration.create());* *fs.delete(new Path(...))* in the coprocessor's preDelete method. There is no exception, but the target-path file has not been deleted after that code either. I don't know why... It's late at night here now. I'll try

Re: information, whether a GET Request inside Map-Task is data local or not

2012-02-14 Thread Christopher Dorner
Hi, sorry for a very late reply on this topic, but I was busy and I promised to report back. I implemented your suggested hack :) It is actually only a few lines of code: one for getting the machine's hostname and one for retrieving the destination of the get request. Then I set up two counters,
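
A sketch of that kind of counter hack in a map task, assuming a TableMapper and the 0.92 client; the table name is a placeholder, and the accessor for the region's host differs slightly between HBase versions:

    import java.io.IOException;
    import java.net.InetAddress;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.io.NullWritable;

    public class LocalityCountingMapper extends TableMapper<NullWritable, NullWritable> {
      private HTable table;
      private String localHost;

      @Override
      protected void setup(Context context) throws IOException {
        // "mytable" stands in for whatever table the gets are issued against.
        table = new HTable(context.getConfiguration(), "mytable");
        localHost = InetAddress.getLocalHost().getHostName();
      }

      @Override
      protected void map(ImmutableBytesWritable row, Result value, Context context)
          throws IOException, InterruptedException {
        // Hostname of the region server currently hosting the region for this row.
        String regionHost = table.getRegionLocation(row.get()).getHostname();
        context.getCounter("GetLocality",
            localHost.equalsIgnoreCase(regionHost) ? "LOCAL" : "REMOTE").increment(1);
      }

      @Override
      protected void cleanup(Context context) throws IOException {
        table.close();
      }
    }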

strange PerformanceEvaluation behaviour

2012-02-14 Thread Oliver Meyn (GBIF)
Hi all, I've been trying to run a battery of tests to really understand our cluster's performance, and I'm employing PerformanceEvaluation to do that (picking up where Tim Robertson left off, elsewhere on the list). I'm seeing two strange things that I hope someone can help with: 1) With a

Re: how get() works

2012-02-14 Thread Doug Meil
Keys are stored in sorted order; it's basically a binary search. On 2/14/12 9:31 AM, Vamshi Krishna vamshi2...@gmail.com wrote: Thank you Doug. One more question: if a particular region is found by looking at the range handled by it, how is the search performed within that region to find

Re: Can coprocessor operate HDFS directly?

2012-02-14 Thread Stack
On Tue, Feb 14, 2012 at 6:35 AM, NNever nnever...@gmail.com wrote: Thanks Sanel. I try to use *FileSystem fs = FileSystem.get(HBaseConfiguration.create());* *fs.delete(new Path(...))* in the coprocessor's preDelete method. There is no exception, but the target-path file has not deleted after
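
One way to reach HDFS from inside a region observer (a sketch only, assuming the 0.92 coprocessor API, and not necessarily the exact fix arrived at in this thread) is to reuse the region server's own configuration instead of building a fresh one, so the default filesystem points at the cluster's HDFS; the path below is a placeholder:

    import java.io.IOException;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.client.Delete;
    import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
    import org.apache.hadoop.hbase.coprocessor.ObserverContext;
    import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
    import org.apache.hadoop.hbase.regionserver.wal.WALEdit;

    public class HdfsCleanupObserver extends BaseRegionObserver {
      @Override
      public void preDelete(ObserverContext<RegionCoprocessorEnvironment> e,
          Delete delete, WALEdit edit, boolean writeToWAL) throws IOException {
        // Use the region server's configuration rather than a freshly created one.
        FileSystem fs = FileSystem.get(e.getEnvironment().getConfiguration());
        // "/path/to/side/file" is a made-up path for whatever external file should be removed.
        fs.delete(new Path("/path/to/side/file"), false);
      }
    }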

Re: how get() works

2012-02-14 Thread Doug Meil
I say basically because inside a Region there are Stores, and for each Store there are StoreFiles. For more info see: http://hbase.apache.org/book.html#regions.arch On 2/14/12 11:06 AM, Doug Meil doug.m...@explorysmedical.com wrote: Keys are stored in sorted order, it's basically a
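
Just to illustrate the idea (this is not HBase's actual lookup code): with keys kept in sorted byte order, finding a row among them is a binary search rather than a linear scan.

    import java.util.Arrays;
    import org.apache.hadoop.hbase.util.Bytes;

    public class SortedKeyLookup {
      public static void main(String[] args) {
        // A handful of row keys, kept in the same sorted byte order HBase uses.
        byte[][] keys = {
            Bytes.toBytes("row-001"), Bytes.toBytes("row-017"),
            Bytes.toBytes("row-042"), Bytes.toBytes("row-099")
        };
        int idx = Arrays.binarySearch(keys, Bytes.toBytes("row-042"), Bytes.BYTES_COMPARATOR);
        System.out.println(idx >= 0 ? "found at index " + idx : "not present");
      }
    }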

Re: strange PerformanceEvaluation behaviour

2012-02-14 Thread Stack
On Tue, Feb 14, 2012 at 7:56 AM, Oliver Meyn (GBIF) om...@gbif.org wrote: 1) With a command line like 'hbase org.apache.hadoop.hbase.PerformanceEvaluation randomWrite 10' I see 100 mappers spawned, rather than the expected 10.  I expect 10 because that's what the usage text implies, and

length and size of a column family name or qualifier vs. amount of disk storage

2012-02-14 Thread Neil Yalowitz
Hi all, here's a (not-so) hypothetical question... How does a given column family name or qualifier impact storage? Would a long family or qualifier like this: my-descriptive-but-long-column-family-name:my-descriptive-but-long-qualifier -- vs. a short column family and qualifier --

Re: length and size of a column family name or qualifier vs. amount of disk storage

2012-02-14 Thread Jean-Daniel Cryans
We are assuming the longer cf/qual would be written to HDFS billions of times and would be wasteful.  Is that a correct assumption? Yes, also that's covered a bit in: http://hbase.apache.org/book.html#keysize Does the answer change if you use Snappy compression? Any compression will make it
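
A quick way to see the effect (a sketch using the long and short names from the question; the row and value bytes are made up): the family and qualifier bytes travel inside every KeyValue, so their length is paid per cell, not once per table.

    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.util.Bytes;

    public class KeyValueSizeCheck {
      public static void main(String[] args) {
        byte[] row = Bytes.toBytes("some-row-key");
        byte[] value = Bytes.toBytes("v");
        long now = System.currentTimeMillis();
        KeyValue longNames = new KeyValue(row,
            Bytes.toBytes("my-descriptive-but-long-column-family-name"),
            Bytes.toBytes("my-descriptive-but-long-qualifier"), now, value);
        KeyValue shortNames = new KeyValue(row, Bytes.toBytes("f"), Bytes.toBytes("q"), now, value);
        // The difference below is incurred for every one of those billions of cells.
        System.out.println("long names:  " + longNames.getLength() + " bytes per cell");
        System.out.println("short names: " + shortNames.getLength() + " bytes per cell");
      }
    }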

multiple partial scans in the row

2012-02-14 Thread James Young
Hi there, I am pretty new to HBase and I am trying to understand the best practice for doing a scan based on two/multiple partial scans of the row key. For example, I have a row key like: orderId-timeStamp-item. The orderId has nothing to do with the timeStamp and I have a requirement to scan rows

Re: multiple partial scans in the row

2012-02-14 Thread Ian Varley
James, Are your orderIds ordered? You say a range of orderIds, which implies that (i.e. they're sequential numbers like 001, 002, etc, not hashes or random values). If so, then a single scan can hit the rows for multiple contiguous orderIds (you'd set the start and stop rows based on a prefix
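
A sketch of that single-scan approach with the Java client, using made-up orderId boundaries in a row key shaped like orderId-timeStamp-item:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class OrderRangeScan {
      public static void main(String[] args) throws Exception {
        HTable table = new HTable(HBaseConfiguration.create(), "orders");
        Scan scan = new Scan();
        // Contiguous orderIds occupy a contiguous slice of the table, so one scan
        // with start/stop rows covers orderIds 000125 through 000129 here.
        scan.setStartRow(Bytes.toBytes("000125-"));
        scan.setStopRow(Bytes.toBytes("000130-"));   // stop row is exclusive
        ResultScanner scanner = table.getScanner(scan);
        for (Result r : scanner) {
          System.out.println(Bytes.toString(r.getRow()));
        }
        scanner.close();
        table.close();
      }
    }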

Re: information, whether a GET Request inside Map-Task is data local or not

2012-02-14 Thread Jean-Daniel Cryans
Hey Christopher, Thanks for reporting back. One thing about this is unless you have contention at your top of the rack switches, issuing a get on the local node or a remote one shouldn't be very different. What is going to make a big difference is if you have to hit disk or not. J-D On Tue, Feb

Re: length and size of a column family name or qualifier vs. amount of disk storage

2012-02-14 Thread Doug Meil
Also see here... http://hbase.apache.org/book.html#keyvalue Compression will make it better on disk, but it will inflate over the wire. On 2/14/12 12:40 PM, Jean-Daniel Cryans jdcry...@apache.org wrote: We are assuming the longer cf/qual would be written to HDFS billions of times and

Re: LeaseException while extracting data via pig/hbase integration

2012-02-14 Thread Jean-Daniel Cryans
On Tue, Feb 14, 2012 at 2:01 AM, Mikael Sitruk mikael.sit...@gmail.com wrote: Hi, Well no, I can't figure out what the problem is, but I saw that someone else had the same problem (see email: LeaseException despite high hbase.regionserver.lease.period). What I can tell is the following: Last

Re: ERROR zookeeper.RecoverableZooKeeper: ZooKeeper exists failed after 3 retries

2012-02-14 Thread Jean-Daniel Cryans
And what would be missing? It's all open source so this is the moment where you can forever leave a trace in HBase :) J-D On Tue, Feb 14, 2012 at 12:35 AM, Bing Li lbl...@gmail.com wrote: Dear Jean-Daniel, The issue is solved. I think the book HBase: The Definitive Guide does not give

Re: ERROR zookeeper.RecoverableZooKeeper: ZooKeeper exists failed after 3 retries

2012-02-14 Thread Doug Meil
To stress what JD just said, the HBase book/Ref Guide (i.e., the online book that is part of HBase) is open source, and the best source of the material (especially the Troubleshooting chapter) is user experience. Minor clarification: HBase: The Definitive Guide is a great book by O'Reilly, but

Re: LeaseException while extracting data via pig/hbase integration

2012-02-14 Thread Mikael Sitruk
Please see answers inline. Thanks, Mikael.S On Tue, Feb 14, 2012 at 8:30 PM, Jean-Daniel Cryans jdcry...@apache.org wrote: On Tue, Feb 14, 2012 at 2:01 AM, Mikael Sitruk mikael.sit...@gmail.com wrote: Hi, Well no, I can't figure out what the problem is, but I saw that someone else had the

Re: Improving HBase read performance (based on YCSB)

2012-02-14 Thread Bharath Ravi
Thanks Todd! I checked disk bandwidth by first running hdparm on it (this shows me a read b/w of around 56Mbps) and then running iftop while the benchmarks run (this shows me that reads are only around 10-15Mbps, but this could definitely be because random seeks are a bottleneck). The iostat output

Re: multiple partial scans in the row

2012-02-14 Thread James Young
Thank you Ian! Yes, the orderIds are ordered. I might try a timestamp filter, but it still doesn't provide the early-out feature, and I'm not sure what the performance would be. Do you think it might be worth having a custom filter to do two partial scans? Thanks again. James On Wed, Feb 15, 2012 at
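
One way to keep the early-out without writing a custom filter (a sketch under the assumption that the timestamp is encoded in fixed width right after the orderId, which the thread does not confirm): run one bounded scan per orderId, putting the time window into the start and stop rows so each scan ends as soon as it leaves the window.

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PerOrderTimeWindowScan {
      public static void main(String[] args) throws Exception {
        HTable table = new HTable(HBaseConfiguration.create(), "orders");
        String[] orderIds = {"000125", "000126", "000127"};  // the orderId range of interest
        String fromTs = "1329177600000";                      // window start, made-up epoch millis
        String toTs   = "1329264000000";                      // window end, exclusive
        for (String orderId : orderIds) {
          // Start and stop rows bound both the orderId and the timestamp portion of the key.
          Scan scan = new Scan(Bytes.toBytes(orderId + "-" + fromTs),
                               Bytes.toBytes(orderId + "-" + toTs));
          ResultScanner scanner = table.getScanner(scan);
          for (Result r : scanner) {
            System.out.println(Bytes.toString(r.getRow()));
          }
          scanner.close();  // each scan stops at its stop row, giving the early-out
        }
        table.close();
      }
    }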

Re: Can coprocessor operate HDFS directly?

2012-02-14 Thread NNever
It works. Thanks Stack and Sanel~ 2012/2/15 Stack st...@duboce.net On Tue, Feb 14, 2012 at 6:35 AM, NNever nnever...@gmail.com wrote: Thanks Sanel. I try to use *FileSystem fs = FileSystem.get(HBaseConfiguration.create());* *fs.delete(new Path(...))* in the coprocessor's preDelete

Re: Improving HBase read performance (based on YCSB)

2012-02-14 Thread Todd Lipcon
Yep, definitely bound on seeks - see the 100% util, and the r/s 100. The bandwidth provided by random IO from a disk is going to be much smaller than the sequential IO you see from hdparm -Todd On Tue, Feb 14, 2012 at 3:06 PM, Bharath Ravi bharathra...@gmail.com wrote: Thanks Todd! I check

Re: strange PerformanceEvaluation behaviour

2012-02-14 Thread Stack
On Tue, Feb 14, 2012 at 8:14 AM, Stack st...@duboce.net wrote: 2) With that same randomWrite command line above, I would expect a resulting table with 10 * (1024 * 1024) rows (so 10485760 = roughly 10M rows). Instead what I'm seeing is that the randomWrite job reports writing that many

Re: 0.92 in mvn repository somewhere?

2012-02-14 Thread Ulrich Staudinger
Hi St.Ack, I don't want to be a pain, but is there any progress on this? Cheers, Ulrich On Tue, Feb 7, 2012 at 8:40 PM, Stack st...@duboce.net wrote: This is my fault. I'm working on it. Will update the list when done. Sorry it's taking me so long. St.Ack On Tue, Feb 7, 2012 at 9:11 AM,