perform).
Option #1 should be better. HBase is smart about scanning only the HFiles
necessary for the key range you provide (Category + Timestamp).
-- Lars
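For illustration, a minimal sketch of option #1 (rowkey = category + timestamp), assuming a fixed-width category prefix and a long timestamp; the table and variable names are hypothetical, not from this thread:

    import java.io.IOException;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    // Scan only the key range [category+startTs, category+endTs); HBase can skip
    // HFiles/blocks whose key range cannot overlap this interval.
    static void scanCategory(Connection conn, String category, long startTs, long endTs)
        throws IOException {
      byte[] startRow = Bytes.add(Bytes.toBytes(category), Bytes.toBytes(startTs));
      byte[] stopRow  = Bytes.add(Bytes.toBytes(category), Bytes.toBytes(endTs));
      Scan scan = new Scan(startRow, stopRow);
      try (Table table = conn.getTable(TableName.valueOf("events"));
           ResultScanner scanner = table.getScanner(scan)) {
        for (Result r : scanner) {
          // process one row
        }
      }
    }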
From: Kamal Bahadur
To: user ; Dhaval Shah
Sent: Monday, December 23, 2013 3:47 PM
Subject: Re: Schema Design Newbie Question
Hi Dhaval,
Thanks for the quick response!
Why do you think having more files is not a good idea? Is it because of OS
restrictions?
I get around 50 million records a day and each record contains ~25
columns. Values for each column are ~30 characters.
Kamal
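For context, a rough back-of-the-envelope for that volume (values only, ignoring rowkeys, qualifiers, and per-cell overhead): 50 million records/day x 25 columns is roughly 1.25 billion cells/day, and at ~30 bytes per value that is about 37.5 GB of raw values per day, so per-cell key overhead will likely dominate the on-disk footprint.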
On Mon, Dec 23, 2013 at 3:35 PM, Dha
1000 CFs in HBase does not sound like a good idea.
Category + timestamp sounds like the better of the two options you have
thought of.
Can you tell us a little more about your data?
Regards,
Dhaval
From: Kamal Bahadur
To: user@hbase.apache.org
Sent:
In terms of scalability, yes, but we use HBase for other stuff as well:
time series, counters, and a few future ideas around analytics. So it's nice if
we can put everything in the same deployment.
We don't want users to care about the physical storage (keep them productive
in Java land). The point here of b
This doesn't make sense in that the OP wants a schema-less structure, yet wants
filtering on columns. The issue is that you do have a limited schema, so
"schema-less" is a misnomer.
In order to do filtering, you need to enforce an object type within a column,
which requires a schema to be enforced.
A
Yep. Other DBs like
Mongo may have the stuff you need out of the box.
Another option is to encode the whole class using Avro, and write a
filter on top of that.
You basically use one column and store it there.
Yes, you pay the penalty of loading your entire class and extracting the
fields you need to filter on.
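A rough sketch of the Avro-in-one-column idea using Avro reflection (the MyConfig class, column family "d", and qualifier "avro" are made-up names, not from this thread):

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import org.apache.avro.io.BinaryEncoder;
    import org.apache.avro.io.EncoderFactory;
    import org.apache.avro.reflect.ReflectDatumWriter;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    // Serialize the whole object into a single cell; server-side filtering then
    // means deserializing this blob (e.g. in a custom Filter or coprocessor).
    static Put toPut(String rowKey, MyConfig obj) throws IOException {
      ReflectDatumWriter<MyConfig> writer = new ReflectDatumWriter<>(MyConfig.class);
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
      writer.write(obj, enc);
      enc.flush();
      Put put = new Put(Bytes.toBytes(rowKey));
      put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("avro"), out.toByteArray());
      return put;
    }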
Hi,
I see. BTW, isn't HBase overkill for < 1M rows?
Note that Lucene is schemaless and both Solr and Elasticsearch can
detect field types, so in a way they are schemaless, too.
Otis
--
Performance Monitoring -- http://sematext.com/spm
On Fri, Jun 28, 2013 at 2:53 PM, Kristoffer Sjögren wr
@Otis
HBase is a natural fit for my use case because it's schemaless. I'm building a
configuration management system and there is no need for advanced
filtering/querying capabilities, just basic predicate logic and pagination
that scales to < 1 million rows with reasonable performance.
Thanks for th
Kristoffer,
You could also consider using something other than HBase, something
that supports "secondary indices", like anything that is Lucene based
- Solr and ElasticSearch for example. We recently compared how we
aggregate data in HBase (see my signature) and how we would do it if
we were to u
Why is it that if all you have is a hammer, everything looks like a nail? ;-)
On Jun 27, 2013, at 8:55 PM, James Taylor wrote:
> Hi Kristoffer,
> Have you had a look at Phoenix (https://github.com/forcedotcom/phoenix)? You
> could model your schema much like an O/R mapper and issue SQL querie
Interesting. I'm actually building something similar.
A full-blown SQL implementation is a bit overkill for my particular use case,
and the query API is the final piece of the puzzle. But I'll definitely have
a look for some inspiration.
Thanks!
On Fri, Jun 28, 2013 at 3:55 AM, James Taylor wrote:
>
Hi Kristoffer,
Have you had a look at Phoenix (https://github.com/forcedotcom/phoenix)? You
could model your schema much like an O/R mapper and issue SQL queries through
Phoenix for your filtering.
James
@JamesPlusPlus
http://phoenix-hbase.blogspot.com
On Jun 27, 2013, at 4:39 PM, "Kristoffer S
Thanks for your help Mike. Much appreciated.
I don't store rows/columns in JSON format. The schema is exactly that of a
specific java class, where the rowkey is a unique object identifier with
the class type encoded into it. Columns are the field names of the class
and the values are that of the ob
Ok...
If you want to do type checking and schema enforcement...
You will need to do this as a coprocessor.
The quick and dirty way (not recommended) would be to hard-code the schema
into the coprocessor code.
A better way... at startup, load up ZK to manage the set of known table
schemas
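A very rough sketch of that coprocessor idea against the HBase 1.x API (the column family, qualifier, and type rule below are hard-coded only to keep the example short; the post above suggests loading known schemas from ZK instead):

    import java.io.IOException;
    import java.util.List;
    import org.apache.hadoop.hbase.Cell;
    import org.apache.hadoop.hbase.CellUtil;
    import org.apache.hadoop.hbase.DoNotRetryIOException;
    import org.apache.hadoop.hbase.client.Durability;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
    import org.apache.hadoop.hbase.coprocessor.ObserverContext;
    import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
    import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
    import org.apache.hadoop.hbase.util.Bytes;

    public class SchemaEnforcer extends BaseRegionObserver {
      private static final byte[] CF = Bytes.toBytes("f");
      private static final byte[] PORT = Bytes.toBytes("port"); // example rule: must be a 4-byte int

      @Override
      public void prePut(ObserverContext<RegionCoprocessorEnvironment> ctx,
                         Put put, WALEdit edit, Durability durability) throws IOException {
        List<Cell> cells = put.getFamilyCellMap().get(CF);
        if (cells == null) return;
        for (Cell cell : cells) {
          if (Bytes.equals(CellUtil.cloneQualifier(cell), PORT)
              && CellUtil.cloneValue(cell).length != Bytes.SIZEOF_INT) {
            // rejecting the write enforces the "schema" server-side
            throw new DoNotRetryIOException("port must be a 4-byte int");
          }
        }
      }
    }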
I see your point. Everything is just bytes.
However, the schema is known and every row is formatted according to this
schema, although some columns may not exist, that is, no value exists for
this property on this row.
So if I'm able to apply these "typed comparators" to the right cell values
it ma
You have to remember that HBase doesn't enforce any sort of typing.
That's why this can be difficult.
You'd have to write a coprocessor to enforce a schema on a table.
Even then YMMV if you're writing JSON structures to a column because while the
contents of the structures could be the same, t
I realize standard comparators cannot solve this.
However, I do know the type of each column, so writing custom list
comparators for boolean, char, byte, short, int, long, float, double seems
quite straightforward.
Long arrays, for example, are stored as a byte array with 8 bytes per item
so a comp
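For illustration, the decoding step such a comparator would need, assuming the cell value is a packed long[] written as 8 bytes per element (as Bytes.toBytes(long) does):

    import org.apache.hadoop.hbase.util.Bytes;

    // true if any element of the packed long[] equals target
    static boolean containsLong(byte[] value, long target) {
      for (int off = 0; off + Bytes.SIZEOF_LONG <= value.length; off += Bytes.SIZEOF_LONG) {
        if (Bytes.toLong(value, off) == target) {
          return true;
        }
      }
      return false;
    }

Wrapping this in an HBase ByteArrayComparable/Filter additionally needs the serialization plumbing (toByteArray/parseFrom) so the region servers can instantiate it.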
Not an easy task.
You first need to determine how you want to store the data within a column
and/or apply a type constraint to a column.
Even if you use JSON records to store your data within a column, does an
equality comparator exist? If not, you would have to write one.
(I kinda think tha
HDFS
is far better.

-- Lars

From: Michel Segel
To: "user@hbase.apache.org"
Cc: "user@hbase.apache.org"
Sent: Monday, April 29, 2013 6:52 AM
Subject: Re: Schema Design Question
I would have to agree.
The use case doesn't make much sense for HBase and sounds a bit more like a
problem for Hive.
The OP indicated that the data was disposable after a round of processing.
IMHO Hive is a better fit.
Sent from a remote device. Please excuse any typos...
Mike Segel
On Apr
I actually don't see the benefit of saving the data into HBase if all you
do is read per job id and purge it. Why not accumulate into HDFS per job
id and then dump the file? The way I see it, HBase is good for querying
parts of your data, even if it is only 10 rows. In your case your average
is 1
Hi,
Interesting use case. I think it depends on how many jobIds you expect to
have. If it is on the order of thousands, I would caution against going the
one-table-per-jobId approach, since for every table there is some master
overhead, as well as file structures in HDFS. If jobIds are managabl
My understanding of your use case is that data for different jobIds would
be continuously loaded into the underlying table(s).
Looks like you can have one table per job. This way you drop the table
after map reduce is complete. In the single table approach, you would
delete many rows in the table
I think the main problem is that all CFs have to be flushed if one gets
large enough to require a flush.
(Does anyone remember why exactly that is? And do we still need that now
that the memstoreTS is stored in the HFiles?)
So things are fine as long as all CFs have roughly the same size. But if
you have one that gets a lot of data and many others that are smaller,
we'd end up with a lot of unnecessary and small store files from the
smaller CFs.
Anything else known that is bad about many column families?
-- Lars
From: Andrew Purtell
To: "user@hbase.apache.org"
Sent: Sunday, April 7, 2013 3:52 PM
Subject: Re: schema design: rows vs wide columns
StAck,
Just because FB does something doesn't mean it's necessarily a good idea for
others to do the same. FB designs specifically for their needs and their use
cases may not match those of others.
To your point though, I agree that Ted's number of 3 is more of a rule of thumb
and not a hard
I agree with Andrew here and also Stack's comment on FB usage with 15 CFs
is interesting.
Whenever people read that line from the doc, they used to ask why it is
so, and I also thought that the restriction of having max 3 CFs was one
factor which sometimes made schema design a bit challengin
I think this whole idea of not going over a certain number of column
families was a 2+ year old story. I remember hearing numbers like 5 or 6
(not 3) come up when talking at Hadoop conferences with engineers who were
at companies that were heavy HBase users. I agree with Andrew's suggestion
that we
Is there a pointer to evidence/experiment backed analysis of this question?
I'm sure there is some basis for this text in the book but I recommend we
strike it. We could replace it with YCSB or LoadTestTool driven latency
graphs for different workloads maybe. Although that would also be a big
simpl
On Sun, Apr 7, 2013 at 3:27 PM, Ted Yu wrote:
> From http://hbase.apache.org/book.html#number.of.cfs :
>
> HBase currently does not do well with anything above two or three column
> families so keep the number of column families in your schema low.
>
We should add more to that section. FB run w
From http://hbase.apache.org/book.html#number.of.cfs :
HBase currently does not do well with anything above two or three column
families so keep the number of column families in your schema low.
Cheers
On Sun, Apr 7, 2013 at 3:04 PM, Stack wrote:
> On Sun, Apr 7, 2013 at 11:58 AM, Ted wrote:
On Sun, Apr 7, 2013 at 11:58 AM, Ted wrote:
> With regard to number of column families, 3 is the recommended maximum.
>
How did you come up w/ the number '3'? Is it a 'hard' 3? Or does it
depend? If the latter, on what does it depend?
Thanks,
St.Ack
If you store service Id by month, how do you deal with a time range in a query
that spans partial month(s)?
With regard to number of column families, 3 is the recommended maximum.
Cheers
On Apr 7, 2013, at 1:03 AM, shawn du wrote:
> Hello,
>
> I am newer for hbase, but i have some experience o
stian
----- Original Message -----
From: Christian Schäfer
To: "user@hbase.apache.org"
CC:
Sent: 22:54 Monday, 20 August 2012
Subject: RE: Schema Design - Move second column family to new table
Thanks Pranav for the Schema Design resource...will check this soon.
Thanks Ia
To: "user@hbase.apache.org"
CC: Christian Schäfer
Sent: 16:37 Monday, 20 August 2012
Subject: Re: Schema Design - Move second column family to new table
Christian,
Column families are really more "within" rows, not the other way around
(they're really just a way to physically partition sets of columns in a table).
In your example, then, it's more correct to say that table1 has millions /
billions of rows, but only hundreds of them have any colu
This might be useful -
http://java.dzone.com/videos/hbase-schema-design-things-you
On Mon, Aug 20, 2012 at 5:17 PM, Christian Schäfer wrote:
> Currently I'm about to design HBase tables.
>
> In my case there is table1 with CF1 holding millions/billions of rows and
> CF2 with hundreds of rows.
> R
Hi,
OK...
First a caveat... I haven't seen your initial normalized schema, so take what I
say with a grain of salt...
The problem you are trying to solve is one which can be solved better on an
RDBMS platform and does not fit well in a NoSQL space.
Your scalability issue would probably be bet
're probably going to want to split your data into two different
tables and then write some ACID compliance at your APP level.
Just a quick thought before I pop out for lunch...
> Date: Fri, 18 Nov 2011 10:02:54 -0800
> Subject: Re: Schema design question - Hot Key concerns
> From: selek...@yahoo.com
> To: user@hbase.apache.org
One of the concerns I see with this schema is if one of the shows
becomes hot. Since you are maintaining your bookings at the column
level, a hot "row" cannot be partitioned across regions. HBase is atomic
at the row level. Therefore, different clients updating the same SHOW_ID
will compete with each other.
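One common mitigation - not what the schema above proposes - is to salt the rowkey so writes for a single hot SHOW_ID spread across N row buckets (reads must then merge the N buckets). Names here are hypothetical:

    import org.apache.hadoop.hbase.util.Bytes;

    // prefix a 1-byte salt derived from the seat, so one show's bookings
    // land in several rows (and therefore potentially several regions)
    static byte[] saltedRowKey(String showId, String seatId, int buckets) {
      byte salt = (byte) ((seatId.hashCode() & 0x7fffffff) % buckets);
      return Bytes.add(new byte[] { salt }, Bytes.toBytes(showId), Bytes.toBytes(seatId));
    }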
I think that your mileage will definitely vary on this point. Your
design may work very well. Or not. I would worry just a bit if your
data points are large enough to create a really massive row (greater
than about a megabyte).
On Sun, Apr 17, 2011 at 11:48 PM, Yves Langisch wrote:
> So I wond
Yes, you're right. They have a row for each 10-minute period. Inside a row they
work with offsets in seconds within this 10-minute period. This leads to a
maximum of 10*60 columns per row. Normally you have fewer columns, as you don't
have a datapoint for each second.
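A sketch of that layout (the metric id, column family, and value encoding are illustrative): rowkey = metric id + 10-minute base time, qualifier = offset in seconds within the bucket.

    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    static Put dataPointPut(byte[] metricId, long tsSeconds, double value) {
      long base = tsSeconds - (tsSeconds % 600);      // start of the 10-minute bucket
      short offset = (short) (tsSeconds - base);      // 0..599
      Put put = new Put(Bytes.add(metricId, Bytes.toBytes(base)));
      put.addColumn(Bytes.toBytes("t"), Bytes.toBytes(offset), Bytes.toBytes(value));
      return put;
    }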
So I wonder if the query per
TsDB has more columns than it appears at first glance. They store all of
the observations for a relatively long time interval in a single row.
You may have spotted that right off (I didn't).
On Sat, Apr 16, 2011 at 1:27 AM, Yves Langisch wrote:
> As I'm about to plan a similar app I have studi
Not sure if the secondary index helps his use case or not. Does anyone have
experience with that?
On Sun, Jan 2, 2011 at 12:34 AM, Hari Sreekumar wrote:
> Ultimately it depends on how you will be accessing your data. If you need
> to
> query on the contract time frequently, then this approach wouldn't
Ultimately it depends on how you will be accessing your data. If you need to
query on the contract time frequently, then this approach wouldn't be great.
You have to identify the frequent queries and design schema according to
that. What are your frequent queries like?
Hari
On Sun, Jan 2, 2011 at
I think so. Unless you have some way to index the contract time (in HBase,
the only way of doing so is to encode that information into your row-key), you
have to use MapReduce to examine items one by one.
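For illustration, a sketch of encoding the contract time into the rowkey so that "overdue" becomes a contiguous range scan instead of a full-table MapReduce (table and field names are hypothetical):

    import org.apache.hadoop.hbase.HConstants;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    // rowkey = due date (epoch millis, big-endian) + contract id
    static byte[] contractRowKey(long dueDateMillis, String contractId) {
      return Bytes.add(Bytes.toBytes(dueDateMillis), Bytes.toBytes(contractId));
    }

    // every contract whose due date is before "now" falls in [empty, now)
    static Scan overdueScan() {
      return new Scan(HConstants.EMPTY_START_ROW, Bytes.toBytes(System.currentTimeMillis()));
    }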
On Tue, Dec 28, 2010 at 4:46 PM, Valter Nogueira wrote:
> And what about searching such contents?
>
> H
If there are (many) other tables besides table A, the data may not be evenly
distributed across the cluster.
See https://issues.apache.org/jira/browse/HBASE-3373
On Sat, Jan 1, 2011 at 2:46 AM, Eric wrote:
> I have little experience with HBase so far, but my feeling says it should
> not matter how m
I have little experience with HBase so far, but my feeling says it should
not matter how many rows you store, and that it's better to save on CPU time
and bandwidth. HBase will distribute the data evenly over your cluster and
should be very good at making rows accessible quickly by key because it's
Consider using TableInputFormat.
For serialization, one more choice is Avro.
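A minimal sketch of wiring a scan into a MapReduce job via TableInputFormat / TableMapReduceUtil (the table name and ContractMapper, a TableMapper subclass, are hypothetical):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.mapreduce.Job;

    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      Job job = Job.getInstance(conf, "scan-contracts");
      Scan scan = new Scan();
      scan.setCaching(500);         // bigger batches per RPC for a full scan
      scan.setCacheBlocks(false);   // don't pollute the block cache from MR
      TableMapReduceUtil.initTableMapperJob("contracts", scan, ContractMapper.class,
          ImmutableBytesWritable.class, Result.class, job);
      job.waitForCompletion(true);
    }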
On Tue, Dec 28, 2010 at 4:46 PM, Valter Nogueira wrote:
> And what about searching such contents?
>
> How to search for overdued contracts?
>
> I could read every contract thru map-reduce, select overdued contracts and
And what about searching such contents?
How to search for overdue contracts?
I could read every contract through MapReduce, select overdue contracts and
build a table with such contracts - is that the right approach?
Valter
2010/12/28 Sean Bigdatafun
> I'd suggest json object, or xml, or any
I'd suggest a JSON object, or XML, or any binary protocol buffer such as
Google PB, Facebook Thrift PB.
If you use any of those, you will have much better control over version
upgrades.
On Tue, Dec 28, 2010 at 4:16 PM, Valter Nogueira wrote:
> Since contract has attributes such NUMBER, TOTAL, ACC
Since contract has attributes such as NUMBER, TOTAL, ACCOUNT and so on,
when doing the following:

row_key  || CF: Contract
---------------------------------------------------------
valter   || 'C11' | info_for_11 | 'C12' | info_for_12
---------------------------------------------------------
1. customer_table:
row_key --> column_family : (customer --> contract)
An example row,
row_key  || CF: Contract
---------------------------------------------------------
valter   || 'C11' | info_for_11 | 'C12' | info_for_12
---------------------------------------------------------
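A sketch of a write against that layout (the value encoding is illustrative; "table" is assumed to be an org.apache.hadoop.hbase.client.Table handle for customer_table):

    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    // one row per customer, one column per contract in CF "Contract"
    Put put = new Put(Bytes.toBytes("valter"));
    put.addColumn(Bytes.toBytes("Contract"), Bytes.toBytes("C11"), Bytes.toBytes("info_for_11"));
    put.addColumn(Bytes.toBytes("Contract"), Bytes.toBytes("C12"), Bytes.toBytes("info_for_12"));
    table.put(put);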
Another approach is to denormalize everything into the customer table.
On Tue, Dec 28, 2010 at 3:26 PM, Valter Nogueira wrote:
> I have a small JAVA system using relational database.
>
> Basically, the app have 3 entities: CUSTOMER has many CONTRACTs and each
> CONTRACT has many INSTALLMENTS
>
>
> From: jg...@fb.com
> To: user@hbase.apache.org
> Subject: RE: Schema design, one-to-many question
> Date: Tue, 30 Nov 2010 16:11:14 +
>
> I'm not sure I agree that "you can not think of relationships".
>
> There is in fact a one-to-ma
ater. With column-orientation,
you can have the user as the row and stuff all of his relations into that same
row.
JG
> -Original Message-
> From: Michael Segel [mailto:michael_se...@hotmail.com]
> Sent: Tuesday, November 30, 2010 5:32 AM
> To: user@hbase.apache.org
> Subje
I'm sorry if this has already been answered, but I'll share my $0.02 anyway...
First, you and everyone have to stop thinking of HBase in terms of a relational
model. Because HBase doesn't have the concept of joins, you cannot think of
relationships.
If you have two tables where the primary ke
rya...@gmail.com]
> Sent: Monday, November 29, 2010 5:13 PM
> To: user@hbase.apache.org
> Subject: Re: Schema design, one-to-many question
>
> I am using 0.89 currently, does it include those optimizations set for
> 0.90? If so, great news, the wide table approach is what I preferred
I am using 0.89 currently, does it include those optimizations set for 0.90? If
so, great news, the wide table approach is what I preferred.
On Nov 29, 2010, at 4:14 PM, Jonathan Gray wrote:
> Hey Bryan,
>
> All of these approaches could work and seem sane.
>
> My preference these days would b
Hey Bryan,
All of these approaches could work and seem sane.
My preference these days would be the wide-table approach (#2, 3, 4) rather
than the tall table. Previously #1 was more efficient but in 0.90 and beyond
the same optimizations exist for both tall and wide tables.
For #2, I would pro
We have a similar use case: millions of users, each with a
different number of goods, from one to tens of thousands. We use
approach 2.
Bryan Keller wrote:
I have read comments on modeling one-to-many relationships in HBase and
wanted to get some feedback. I have millions of customers, and each