Re: Questions on timestamps, insights on how timerange/timestamp filter are processed?

2011-12-14 Thread Sam Seigal
That is an interesting comment. How would you enforce this in practice ? Can you give more details. On Wed, Dec 14, 2011 at 10:29 AM, Carson Hoffacker wrote: > The timerange scan is able to leverage metadata in each of the HFiles. Each > HFile should store information about the timerange associat
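The file-skipping idea behind the timerange metadata can be sketched in plain Python (illustrative only, not HBase code; the file names and timestamps are made up): a scan with a time range only needs to open the HFiles whose stored [min, max] timestamp interval intersects the query's interval.

```python
def file_overlaps(file_min_ts, file_max_ts, query_min_ts, query_max_ts):
    # True if the HFile's recorded timerange intersects the scan's timerange.
    return file_min_ts <= query_max_ts and query_min_ts <= file_max_ts

# Three hypothetical HFiles with their recorded (min, max) timestamps.
hfiles = [("hfile-a", 0, 99), ("hfile-b", 100, 199), ("hfile-c", 200, 299)]

# A scan over [150, 250] can skip hfile-a entirely without reading it.
to_read = [name for name, lo, hi in hfiles if file_overlaps(lo, hi, 150, 250)]
```

This is why timerange scans can be much cheaper than full scans when writes arrive in rough time order: whole files fall outside the queried interval.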

Re: regions and tables

2011-12-01 Thread Sam Seigal
one or more than one table is the same work: >>create the tables (one by one) with the list of split points. >> >>Lars >> >>On Dec 1, 2011, at 7:50 AM, Sam Seigal wrote: >> >>> HI, >>> >>> I had a question about the relationship  bet

regions and tables

2011-11-30 Thread Sam Seigal
Hi, I had a question about the relationship between regions and tables. Is there a way to pre-create regions for multiple tables, or does each table have its own set of regions, managed independently? I read on one of the threads that there is really no limit on the number of tables, but that we nee

Re: Strategies for aggregating data in a HBase table

2011-11-30 Thread Sam Seigal
What about "partitioning" at a table level. For example, create 12 tables for the given year. Design the row keys however you like, let's say using SHA/MD hashes. Place transactions in the appropriate table and then do aggregations based on that table alone (this is assuming you won't get transacti
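The month-per-table routing described here can be sketched in a few lines of Python (illustrative only; the `txn` base name is made up): derive the target table from the transaction's timestamp, then write to and aggregate over that table alone.

```python
from datetime import datetime, timezone

def table_for(base, ts):
    # Route a transaction to the table for its month, e.g. "txn_2011_11".
    d = datetime.fromtimestamp(ts, tz=timezone.utc)
    return f"{base}_{d.year}_{d.month:02d}"
```

An aggregation job for November 2011 then only has to touch `txn_2011_11`, regardless of how the row keys inside it are hashed.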

Re: snappy compression

2011-11-25 Thread Sam Seigal
Is there any concerns in applying the SNAPPY patch @ https://issues.apache.org/jira/browse/HBASE-3691 to 0.90.3 ? 2011/11/25 Gaojinchao : > You can search maillist about topic "Snappy for 0.90.4". > > > -邮件原件- > 发件人: saurabh@gmail.com [mailto:saurabh@gmail

snappy compression

2011-11-25 Thread Sam Seigal
Hi, The Compression.Algorithm enum does not have "SNAPPY" as an option in Hbase 0.90.3 (the version I am on). How can I create a table with SNAPPY compression via code ? Is this possible ? HColumnDescriptor.setCompressionType() takes Algorithm enumeration as a parameter. Thanks, Sam

Re: Region Splits

2011-11-22 Thread Sam Seigal
If you are prefixing your keys with predictable hashes, you can do range scans - i.e. create a scanner for each prefix and then merge results at the client. With unpredictable hashes and key reversals, this might not be entirely possible. I remember someone on the mailing list mentioning that Moz
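The per-prefix scanner trick can be sketched in Python (illustrative only, not HBase API; the bucket count is made up): when the salt prefixes are a known, fixed set, one logical range scan becomes one (startRow, stopRow) pair per prefix, and the results are merged client-side.

```python
BUCKETS = 4  # assumed fixed, known set of salt prefixes: "0-" .. "3-"

def scan_ranges(start: str, stop: str):
    # One (startRow, stopRow) pair per known prefix; run a scanner for each
    # and merge the sorted outputs at the client.
    return [(f"{b}-{start}", f"{b}-{stop}") for b in range(BUCKETS)]
```

With an unpredictable salt (e.g. a random UUID prefix) there is no such finite set of prefixes to enumerate, which is why the range-scan trick breaks down there.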

Re: Hotspotting questions

2011-11-19 Thread Sam Seigal
The question really is if your region server hosting the hot tail end of the region during sequential *writes* can take the load or not. If you find in the future that it cannot, manually splitting the regions is not going to fix the problem IMHO, since the tail end is always the one that is going

Re: Schema design question - Hot Key concerns

2011-11-18 Thread Sam Seigal
One of the concerns I see with this schema is if one of the shows becomes hot. Since you are maintaining your bookings at the column level, a hot "row" cannot be partitioned across regions. HBase is atomic at the row level. Therefore, different clients updating the same SHOW_ID will compete with

block caching

2011-11-17 Thread Sam Seigal
I have a table that I only use for generating indexes. It will rarely have random reads, but will have M/R jobs running against it constantly for generating indexes. Even on the index table, random reads will be rare. It will mostly be used for scanning blocks of data. According to HBase The Definit

Re: Metrics

2011-11-16 Thread Sam Seigal
I think that is expected: http://hbase.apache.org/metrics.html On Wed, Nov 16, 2011 at 1:10 PM, Mark wrote: > The only way I can get any metrics to work is if I append them to > HADOOP_HOME/conf/hadoop-metrics.properties. Is this expected? > > On 11/16/11 11:37 AM, Mark wrote: >> >> I've enabled

Re: Row get very slow

2011-11-14 Thread Sam Seigal
If you are not too concerned with random access time, but want more efficient scans, is increasing the block size then a good idea ? On Mon, Nov 14, 2011 at 11:24 AM, lars hofhansl wrote: > Did it speed up your queries? As you can see from the followup discussions > here, there is some general c

Re: querying questions

2011-10-28 Thread Sam Seigal
OpenTSDB does it, I believe: http://opentsdb.net/schema.html I am curious, though, about the difference between having strings as row keys, converting them into bytes, and then storing the key, as opposed to storing numerical values as native bytes with bit masks. On Fri, Oct 28, 2011
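The difference between the two encodings can be shown concretely in Python (illustrative only; the event id value is made up). A fixed-width big-endian binary encoding is both smaller and sorts correctly under bytewise comparison, whereas ASCII digit strings do not.

```python
import struct

event_id = 1234567890
as_string = str(event_id).encode("ascii")   # 10 bytes; sorts as text
as_binary = struct.pack(">q", event_id)     # 8 bytes; big-endian, so byte
                                            # order matches numeric order

# Bytewise, b"10" sorts before b"9", so ASCII keys break numeric range scans;
# big-endian fixed-width keys preserve numeric ordering.
```

Since HBase compares row keys as raw bytes, this ordering property is what matters for range scans over numeric ids.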

Re: pre splitting tables

2011-10-26 Thread Sam Seigal
On Tue, Oct 25, 2011 at 1:02 PM, Nicolas Spiegelberg wrote: >>According to my understanding, the way that HBase works is that on a >>brand new system, all keys will start going to a single region i.e. a >>single region server. Once that region >>reaches a max region size, it will split and then mo

Re: pre splitting tables

2011-10-25 Thread Sam Seigal
p-front. If you are building indexes using MR, then you > probably don¹t need range scan ability on your keys. > > Thanks > Karthik > > > > On 10/24/11 4:48 PM, "Sam Seigal" wrote: > >>According to my understanding, the way that HBase works is that on

Re: pre splitting tables

2011-10-24 Thread Sam Seigal
ions. > > > On 10/24/11 9:07 AM, "Stack" wrote: > >>On Mon, Oct 24, 2011 at 1:27 AM, Sam Seigal wrote: >>> According to the HBase book , pre splitting tables and doing manual >>> splits is a better long term strategy than letting HBase handle it. >&g

Re: pre splitting tables

2011-10-24 Thread Sam Seigal
Hi Stack, Inline. >> According to the HBase book , pre splitting tables and doing manual >> splits is a better long term strategy than letting HBase handle it. >> > > Its good for getting a table off the ground, yes. > > >> Since I do not know what the keys from the prod system are going to >> lo

pre splitting tables

2011-10-24 Thread Sam Seigal
According to the HBase book , pre splitting tables and doing manual splits is a better long term strategy than letting HBase handle it. I have done a lot of offline testing with HBase and I am at a stage now where I would like to hook my cluster into the production queue feeding data into our syst
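When the keys carry a hashed prefix, pre-split points can be computed without knowing the production keys at all, since the prefix space is uniform. A minimal sketch in Python (illustrative only; a single-byte prefix space is assumed):

```python
def split_points(num_regions, prefix_space=256):
    # Evenly spaced single-byte split keys over a hashed-prefix keyspace,
    # e.g. 4 regions -> splits at 0x40, 0x80, 0xC0.
    step = prefix_space // num_regions
    return [bytes([i * step]) for i in range(1, num_regions)]
```

These byte strings would be handed to table creation as the initial split keys, so writes spread across all regions from the start instead of funneling into one.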

Re: memory requirements for daemons

2011-10-14 Thread Sam Seigal
on RS. > > Arun > > On Oct 14, 2011, at 2:45 PM, Sam Seigal wrote: > >> Hi All, >> >> I have the Datanode, JobTracker and RegionServer daemons running on a >> fleet of machines. Each of these machines have 8 G of memory and are >> dedicated hardware for r

memory requirements for daemons

2011-10-14 Thread Sam Seigal
Hi All, I have the Datanode, JobTracker and RegionServer daemons running on a fleet of machines. Each of these machines has 8 G of memory and is dedicated hardware for running HBase. How do you guys decide which % of memory to assign to each? What should this number depend on? Thank yo

Re: Performance characteristics of scans using timestamp as the filter

2011-10-10 Thread Sam Seigal
Is it possible to do incremental processing without putting the timestamp in the leading part of the row key in a more efficient manner i.e. process data that came within the last hour/ 2 hour etc ? I can't seem to find a good answer to this question myself. On Mon, Oct 10, 2011 at 12:09 AM, Stei

Re: basic question for newbie

2011-10-09 Thread Sam Seigal
Start off with the HBase book, great resource for getting started: http://ofps.oreilly.com/titles/9781449396107/ On Sun, Oct 9, 2011 at 10:25 PM, Syg raf wrote: > Hello folks, > > I'm just starting with HBase and have a couple of rudimentary questions > about how to use it: > > I have a simple

Re: Using Scans in parallel

2011-10-05 Thread Sam Seigal
Scan object with start and row set to the region's > start and end key). > You probably want to group the regions by regionserver and have one thread > per region server, or something. > > > -- Lars > > From: Sam Seigal > To: hbase-u..

Using Scans in parallel

2011-10-05 Thread Sam Seigal
Hi, Is there a known way to do Scans in parallel (in different threads even) and then sort/combine the output? For a row key like: prefix-event_type-event_id prefix-event_type-event_id I want to declare two scan objects (for say event_id_type foo) Scan 1 => 0-foo Scan 2 => 1-fo
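The sort/combine step can be sketched in Python (illustrative only; the scanner outputs are made-up (row key, value) pairs). Each per-prefix scanner already yields rows sorted by key, so a streaming k-way merge keyed on the unsalted suffix produces one globally ordered stream.

```python
import heapq

# Output of two hypothetical per-prefix scanners, each sorted by row key.
scan0 = [("0-foo-001", 10), ("0-foo-007", 30)]
scan1 = [("1-foo-003", 20), ("1-foo-009", 40)]

def merged(*scans):
    # k-way merge ordered by the key after the salt prefix ("foo-001", ...),
    # valid because each input stream is already sorted on that suffix.
    return list(heapq.merge(*scans, key=lambda kv: kv[0].split("-", 1)[1]))
```

`heapq.merge` is lazy, so the client never has to buffer whole scanner outputs to combine them.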

Re: querying values by row

2011-09-30 Thread Sam Seigal
query you should consider populating a second table with > event_type-eventid as key, and timestamp as value. > Why is the timestamp part of the key? > > > -- Lars > > > - Original Message - > From: Sam Seigal > To: hbase-u...@hadoop.apache.org > Cc: >

Re: querying values by row

2011-09-29 Thread Sam Seigal
id is to first do a GET or a Scan to get the value, determine the exact timestamp for the record and then write the updated value. Is there a better way to do this in one server call ? Thanks ! Sam On Thu, Sep 29, 2011 at 6:27 PM, Sam Seigal wrote: > Hi, > > I am wondering what is t

querying values by row

2011-09-29 Thread Sam Seigal
Hi, I am wondering what the best way is to query a record when only the leading and trailing parts of a row key are known. For example, if my row looks something like: event_type-timestamp-eventid If I know the event_type and eventid, but do not really care about the timestamp, what is the most e
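One common answer can be sketched in Python (illustrative only; the rows are made up): scan by the known leading part (the prefix is a real key range), then filter on the known trailing part, since a suffix cannot narrow the scan range itself.

```python
# A stand-in for rows returned by a prefix scan over a sorted key space.
rows = sorted([
    "click-1000-ev42",
    "click-1005-ev17",
    "click-1009-ev42",
    "view-1001-ev42",
])

def find(event_type, event_id):
    # Prefix scan on the leading key part; client-side (or server-side
    # filter) match on the trailing part.
    prefix = event_type + "-"
    return [r for r in rows if r.startswith(prefix) and r.endswith("-" + event_id)]
```

The scan is bounded by the event_type prefix, so its cost grows with the number of rows per event type, not with the whole table.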

warning exceptions in log

2011-09-22 Thread Sam Seigal
Hi, I am running some tests with Hbase on some sample data. I keep on seeing this exception warning in the logs: Fri Sep 23 00:35:31 2011 GMT regionserver 7193-0@star1:0 [WARN] (IPC Server handler 2 on 60020) org.apache.hadoop.ipc.HBaseServer: IPC Server handler 2 on 60020 caught: java.nio.channe

Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ...

2011-09-16 Thread Sam Seigal
If an input split is too large and memory a concern, we can surely address this in TableInputFormat.getSplits() and limit the size ... On Fri, Sep 16, 2011 at 6:39 PM, Sam Seigal wrote: > Aren't there memory considerations with this approach ? I would assume > the HashMap can get pret

Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ...

2011-09-16 Thread Sam Seigal
writes would only happen once per > map-task, and not do it on a per-row basis (which would be really > expensive). > > A single region on a single RS could handle that no problem. > > > > > On 9/16/11 9:00 PM, "Sam Seigal" wrote: > >>I see what you ar

Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ...

2011-09-16 Thread Sam Seigal
ine that you would need to tune the >>temp-table for the job and pre-create regions. >> >>Doug >> >> >> >>On 9/16/11 8:16 PM, "Sam Seigal" wrote: >> >>>I am trying to do something similar with HBase Map/Reduce. >>> >>>

Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ...

2011-09-16 Thread Sam Seigal
I am trying to do something similar with HBase Map/Reduce. I have event ids and amounts stored in HBase in the following format: prefix-event_id_type-timestamp-event_id as the row key and amount as the value. I want to be able to aggregate the amounts based on the event id type and for this I am u
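The aggregation logic itself is simple to sketch in Python (illustrative only; the rows and amounts are made up): parse the event_id_type out of the row key and sum the amounts per type, which is essentially what the reducer would do.

```python
from collections import defaultdict

# Stand-ins for (row key, amount) cells with keys shaped
# prefix-event_id_type-timestamp-event_id.
cells = [
    ("00-purchase-1700000000-e1", 25),
    ("01-refund-1700000100-e2", -5),
    ("02-purchase-1700000200-e3", 10),
]

def aggregate(cells):
    # Sum amounts per event_id_type, the second '-'-separated key component.
    totals = defaultdict(int)
    for row_key, amount in cells:
        event_type = row_key.split("-")[1]
        totals[event_type] += amount
    return dict(totals)
```

In an MR job the key parsing would happen in the mapper (emitting event_type as the map output key) and the summation in the reducer.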

optimizing for map reduce jobs

2011-09-06 Thread Sam Seigal
Hi All, I would like to get your opinion on how to best optimize an HBase cluster for map reduce jobs. The main purpose that we would like to experiment with HBase is to do near real time aggregations for the data we receive. There is a service that writes a constant stream of data to HBase. I wou

Re: HBase and Cassandra on StackOverflow

2011-08-30 Thread Sam Seigal
> Problems worthy of attack prove their worth by hitting back. - Piet Hein > (via Tom White) > > > > > >From: Sam Seigal > >To: user@hbase.apache.org; Andrew Purtell > >Cc: "hbase-u...@hadoop.apache.org" > >Sen

Re: HBase and Cassandra on StackOverflow

2011-08-30 Thread Sam Seigal
A question inline: On Tue, Aug 30, 2011 at 2:47 AM, Andrew Purtell wrote: > Hi Chris, > > Appreciate your answer on the post. > > Personally speaking however the endless Cassandra vs. HBase discussion is > tiresome and rarely do blog posts or emails in this regard shed any light. > Often, Cassan

Re: question about regionserver going down and then coming back up

2011-08-23 Thread Sam Seigal
Ah .. thanks ! I am using 0.90.1 right now, so that explains it. On Tue, Aug 23, 2011 at 3:31 PM, Jean-Daniel Cryans wrote: > It was fixed in either 0.90.3 or 0.90.4 > > J-D > > On Mon, Aug 22, 2011 at 7:47 PM, Sam Seigal wrote: > > Hi All, > > > > I had a r

question about regionserver going down and then coming back up

2011-08-22 Thread Sam Seigal
Hi All, I had a regionserver go down in my cluster. When I ran "status" in the hbase shell I got 4 live servers and 1 dead (which is correct). However, when the machine came back up and I started the regionserver on it and ran "status" in the hbase shell, the output showed 5 live serve

operational overhead for HBase

2011-08-16 Thread Sam Seigal
Hi All, I had a question about the operational overhead of maintaining HBase in production. Would someone care to share their experiences ? We have a team of 3 DBAs dedicated to maintaining our Oracle cluster. I am curious to know if we would need the same for HBase. I am talking of a small clust

Re: quick query about importtsv

2011-08-02 Thread Sam Seigal
reason to use the same version timestamp for all lines passed into the mapper ? On Tue, Aug 2, 2011 at 5:13 PM, Sam Seigal wrote: > Hi All, > > I am using the importtsv tool to load some data into an hbase cluster. Some > of the row keys + cf:qualifier might occur more than once with

quick query about importtsv

2011-08-02 Thread Sam Seigal
Hi All, I am using the importtsv tool to load some data into an hbase cluster. Some of the row keys + cf:qualifier might occur more than once with a different value in the files I have generated. I would expect this to just create two versions of the record with the different values. However, I am
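The behavior being asked about follows from how versions are addressed, which can be sketched with a plain dict (illustrative only): a cell version is identified by (row, family, qualifier, timestamp), so two writes with all four coordinates equal overwrite each other instead of stacking as versions.

```python
# Toy model of a versioned store keyed by (row, family, qualifier, timestamp).
store = {}

def put(row, fam, qual, ts, value):
    store[(row, fam, qual, ts)] = value

put("r1", "cf", "q", 100, "first")
put("r1", "cf", "q", 100, "second")  # same timestamp: replaces "first"
put("r1", "cf", "q", 101, "third")   # new timestamp: a second version
```

This is why importtsv runs that assign one timestamp to all lines collapse duplicate row+qualifier entries to a single surviving value.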

compressions and security

2011-07-16 Thread Sam Seigal
Hi All, A quick question on compression. I saw that HBase can use LZO compression for storing data into the HFile. Has anyone done experiments with using compression at the application level instead of letting HBase handle it? Are there advantages/disadvantages to this approach? Is it

Re: HBase region size

2011-07-01 Thread Sam Seigal
On Thu, Jun 30, 2011 at 11:33 PM, Stack wrote: > On Mon, Jun 27, 2011 at 11:37 PM, Aditya Karanth A > wrote: > >> I have heard that bigger the size of the regionserver, more time it > takes > >> for region splitting and slower the reads are. Is this true? > > (I have not been able to experiment

descaling hbase

2011-06-28 Thread Sam Seigal
Hi All, I have a 14 node cluster setup for HBase. Someone else in my office needs to use some of these machines and I would like to descale my cluster from 14 to 6 machines. Is there an efficient way to do this ? Since there is data residing on the machines I want to get rid of, are there utilitie

checkAndPut() failing with NotServingRegionException

2011-06-22 Thread Sam Seigal
Hi, I am loading data into my HBase cluster and running into two issues - During my import, I received the following exception -> org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 53484 actions: servers with issues: spock7001:60020, at org.apache.hadoop.hbase.cl

check existence of row through checkAndPut()

2011-06-20 Thread Sam Seigal
Hi, I had a question about how to check for existence of a record in HBase. I went through some threads discussing the various techniques , mainly - row locks and checkAndPut(). My schema looks like the following -> --- The reason I am adding the prefix is to avoid hot spotting due to increasin

Re: Insert a lot of data in HBase

2011-06-20 Thread Sam Seigal
When using the write cache and setting setAutoFlush() to false, is there a risk of data loss, even if WAL is enabled ? On Mon, Jun 20, 2011 at 12:27 PM, Jeff Whiting wrote: > There is the possibility that your keys have the same timestamp -- > especially if you are running multi-threaded. If th

checkAndPut() and idempotence handling in Hbase

2011-06-16 Thread Sam Seigal
Hi All, I am trying to load data from my OLTP system into HBase. I am using checkAndPut() to do this. The reason I am using checkAndPut() and not put() is because the system I am writing has idempotence requirements i.e. a value will be initially written with a start state, and then with an end s
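The compare-and-set semantics that make checkAndPut useful for idempotence can be sketched with a dict (illustrative only, not HBase API; the state names are made up): a write succeeds only if the current value matches the expected one, so replayed messages become no-ops.

```python
def check_and_put(table, row, expected, new_value):
    # Write new_value only if the current value equals `expected`;
    # None means the row must not exist yet. Returns whether it wrote.
    if table.get(row) == expected:
        table[row] = new_value
        return True
    return False

table = {}
first = check_and_put(table, "e1", None, "START")   # initial write succeeds
replay = check_and_put(table, "e1", None, "START")  # re-delivered event: no-op
done = check_and_put(table, "e1", "START", "END")   # transition fires exactly once
```

The start-then-end state machine in the post maps directly onto two such conditional writes per record.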

hbase architecture question

2011-06-14 Thread Sam Seigal
Hi All, I had some questions about the hbase architecture that I am a little confused about. After doing reading over the internet / HBase book etc, my understanding of regions is the following -> When the cluster initially starts up (with no data), the regionservers come online. When the data

question about scanning data

2011-06-10 Thread Sam Seigal
Hi All, I had a question about a certain kind of query I would like to do in hbase. I am storing records in HBase that transition from an initial state "A" to an end state "B" . Initially, the record I will store will look like the following -> t1 rowid:columnFamily:A when I get a notificatio

Re: hbase hashing algorithm and schema design

2011-06-09 Thread Sam Seigal
sage > From: Sam Seigal > To: user@hbase.apache.org > Cc: j...@cloudera.com; tsuna...@gmail.com > Sent: Wed, June 8, 2011 4:54:24 PM > Subject: Re: hbase hashing algorithm and schema design > > On Wed, Jun 8, 2011 at 12:40 AM, tsuna wrote: > > > On Tue, Jun 7,

Re: hbase hashing algorithm and schema design

2011-06-09 Thread Sam Seigal
multiple scanners ? Thanks a lot for your help. -- *From:* Joey Echeverria *To:* Sam Seigal *Sent:* Wed, June 8, 2011 5:08:32 PM *Subject:* Re: hbase hashing algorithm and schema design A better option than a uuid would be to take a hash of the eventid-timestamp
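The advantage of the hashed prefix over a uuid can be shown in a short Python sketch (illustrative only; MD5 and the two-character prefix width are assumptions, not the list's prescription): because the prefix is derived from the data itself, any reader can rebuild the exact row key later and do a direct GET, which a random uuid prefix makes impossible.

```python
import hashlib

def row_key(event_id, timestamp):
    # Salt prefix computed from the key's own contents, so a later
    # reader can recompute the full row key deterministically.
    logical = f"{event_id}-{timestamp}"
    prefix = hashlib.md5(logical.encode()).hexdigest()[:2]
    return f"{prefix}-{logical}"
```

The writer and a later reader compute identical keys independently; with a uuid salt, the reader would have to scan to find the row instead.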

Re: hbase hashing algorithm and schema design

2011-06-08 Thread Sam Seigal
On Wed, Jun 8, 2011 at 12:40 AM, tsuna wrote: > On Tue, Jun 7, 2011 at 7:56 PM, Kjew Jned wrote: > > I was studying the OpenTSDB example, where they also prefix the row keys > with > > event id. > > > > I further modified my row keys to have this -> > > > > > > > > The uuid is fairly unique

Re: hbase hashing algorithm and schema design

2011-06-03 Thread Sam Seigal
gt; prefix each key with a hash of the key. The downside is sequential scans now > have to be performed with multiple scanners and re-ordered client side. > > -Joey > > On Jun 3, 2011, at 3:35, Sam Seigal wrote: > > > Hi, > > > > I am not able to find information regard

hbase hashing algorithm and schema design

2011-06-03 Thread Sam Seigal
Hi, I am not able to find information regarding the algorithm that decides which region a particular row belongs to in an HBase cluster. Does the algorithm take into account the number of physical nodes ? Where can I find more details about it ? I went through the HBase book and the OpenTSDB sche
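The short answer the thread converges on is that placement is by key range, not by a node-count-aware hash, and it can be sketched in Python (illustrative only; the start keys are made up): a row belongs to the region whose [startKey, nextStartKey) range contains it, found by binary search.

```python
import bisect

# Sorted region start keys; region i covers [starts[i], starts[i+1]).
starts = ["", "g", "n", "t"]

def region_for(row_key):
    # Regions are key ranges, independent of physical node count:
    # binary-search the sorted start keys for the containing range.
    return bisect.bisect_right(starts, row_key) - 1
```

Because the mapping ignores the number of physical nodes, adding servers moves whole regions around but never changes which region a given key belongs to.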

follow up question on row key schema design

2011-06-02 Thread Sam Seigal
Hi, I am not able to find information regarding the algorithm that decides which region a particular row belongs to in an HBase cluster. Does the algorithm take into account the number of physical nodes ? Where can I find more details about it ? I went through the HBase book and the OpenTSDB sche