Re: crafting your key - scan vs. get

2012-10-17 Thread Michael Segel
Neil, 


Since you asked 
Actually your question is kind of a boring question. ;-) [Note I will probably 
get flamed for saying it, even if it is the truth!]

Having said that...
Boring as it is, it's an important topic that many still seem to trivialize in 
terms of its impact on performance. 

Before answering your question, let's take a step back and ask a more important 
question... 
"What data do you want to capture and store in HBase?"
and then ask yourself...
"How do I plan on accessing the data?"

From what I can tell, you want to track certain events made by a user. 
So you're recording at Time X, user A did something. 

Then the question is how do you want to access the data.

Do you primarily say "Show me all the events in the past 15 minutes and 
organize them by user?" 
Or do you say "Show me the most recent events by user A" ?

Here's the issue. 

If you are more interested in, and will frequently ask, the question "Show me the 
most recent events by user A," 

then you would want to do the following:
Key = User ID (hashed if necessary) 
Column Family: Data (For lack of a better name) 

Then store each event in a separate column where the column name is something 
like "event" + (max Long - timestamp).

This will place the most recent event first.

The reason I say "event" + the long is that you may want to place user-specific 
information in a column of its own, and you would want to make sure it sorts in 
front of the event data.
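This ordering trick can be sketched in plain Java without an HBase cluster: a TreeMap stands in for the lexicographic sort HBase applies to qualifiers within a row. The qualifier and value names here are invented for illustration, not taken from anyone's actual schema:

```java
import java.util.TreeMap;

public class ReverseTsQualifiers {
    // Build a qualifier so that lexicographic order == newest-first.
    // Zero-pad to 19 digits (the width of Long.MAX_VALUE) so that
    // string comparison matches numeric comparison for any timestamp.
    static String eventQualifier(long tsMillis) {
        return "event" + String.format("%019d", Long.MAX_VALUE - tsMillis);
    }

    public static void main(String[] args) {
        // HBase sorts qualifiers within a row lexicographically;
        // a TreeMap<String, String> models that ordering.
        TreeMap<String, String> row = new TreeMap<>();
        row.put("aboutUser", "user-specific info"); // "a..." sorts before "event..."
        row.put(eventQualifier(1350172800000L), "oldest event");
        row.put(eventQualifier(1350259200000L), "middle event");
        row.put(eventQualifier(1350345600000L), "newest event");

        // Iteration order: aboutUser first, then events newest-first.
        for (String q : row.keySet()) {
            System.out.println(q + " -> " + row.get(q));
        }
    }
}
```

With the real client API, these strings would simply become the qualifier bytes in a Put (something like `put.add(Bytes.toBytes("data"), Bytes.toBytes(qualifier), value)` in the API of that era).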

Now if your access pattern were more along the lines of "Show me the events that 
occurred in the past 15 minutes," then you would lead with the time stamp and then 
have to worry about hotspotting and region splits. But then you could get your 
data from a simple start/stop row scan. 

In the first case you can use get(); while it's still a scan under the hood, it's 
a very efficient fetch. 
In the second, you will always need to do a scan. 
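The contrast between the two access patterns can be modeled without HBase at all. In the sketch below (plain Java, with a TreeMap standing in for a region's row keys sorted lexicographically; all keys and values are invented), the user-keyed design answers its question with a single-row lookup, while the time-keyed design answers its question with a start/stop range:

```java
import java.util.NavigableMap;
import java.util.TreeMap;

public class KeyDesignDemo {
    public static void main(String[] args) {
        // Design 1: row key = user ID, one row per user.
        // "Most recent events for user A" is a point lookup (a get()).
        TreeMap<String, String> byUser = new TreeMap<>();
        byUser.put("userA", "events for A");
        byUser.put("userB", "events for B");
        System.out.println(byUser.get("userA")); // prints "events for A"

        // Design 2: row key leads with the timestamp, one row per event.
        // "Events in the past 15 minutes" is a start/stop row scan.
        TreeMap<String, String> byTime = new TreeMap<>();
        byTime.put("1350345600-userA", "eventA1");
        byTime.put("1350345700-userB", "eventB1");
        byTime.put("1350346500-userA", "eventA2");
        NavigableMap<String, String> window =
                byTime.subMap("1350345650", true, "1350346550", true);
        System.out.println(window.size()); // prints 2
    }
}
```

Note the flip side the thread mentions: because Design 2's keys lead with a monotonically increasing timestamp, all new writes land at the end of the key space, which is exactly the hotspotting/region-split concern.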

Having said that, there are other things to think about, including frequency and 
how wide your rows will get over time. 
(Mainly in terms of the first example I gave.) 

The reason I said that your question is boring is that it's been asked numerous 
times, and every time it's asked, the initial question doesn't provide enough 
information to actually give a good answer...

HTH

-Mike



On Oct 16, 2012, at 4:53 PM, Neil Yalowitz wrote:

> Hopefully this is a fun question.  :)
> 
> Assume you could architect an HBase table from scratch and you were
> choosing between the following two key structures.
> 
> 1)
> 
> The first structure creates a unique row key for each PUT.  The rows are
> events related to a user ID.  There may be up to several hundred events for
> each user ID (probably not thousands, an average of perhaps ~100 events per
> user).  Each key would be made unique with a reverse-order-timestamp or
> perhaps just random characters (we don't particularly care about using ROT
> for sorting newest here).
> 
> key
> 
> AA + some-unique-chars
> 
> The table will look like this:
> 
> key     val (cf:mycf)   ts
> ----------------------------------
> AA...   myval1          1350345600
> AA...   myval2          1350259200
> AA...   myval3          1350172800
> 
> 
> Retrieving these values will use a Scan with startRow and stopRow.  In
> hbase shell, it would look like:
> 
> $ scan 'mytable',{STARTROW=>'AA', ENDROW=>'AA_'}
> 
> 
> 2)
> 
> The second structure choice uses only the user ID as the key and relies on
> row versions to store all the events.  For example:
> 
> key     val (cf:mycf)   ts
> ----------------------------------
> AA      myval1          1350345600
> AA      myval2          1350259200
> AA      myval3          1350172800
> 
> Retrieving these values will use a Get with VERSIONS = somebignumber.  In
> hbase shell, it would look like:
> 
> $ get 'mytable','AA',{COLUMN=>'cf:mycf', VERSIONS=>999}
> 
> ...although this probably violates a comment in the HBase documentation:
> 
> "It is not recommended setting the number of max versions to an exceedingly
> high level (e.g., hundreds or more) unless those old values are very dear
> to you because this will greatly increase StoreFile size."
> 
> ...found here: http://hbase.apache.org/book/schema.versions.html
> 
> 
> So, are there any performance considerations between Scan vs. Get in this
> use case?  Which choice would you go for?
> 
> 
> 
> Neil Yalowitz
> neilyalow...@gmail.com



Re: crafting your key - scan vs. get

2012-10-17 Thread Neil Yalowitz
This is a helpful response, thanks.  Our use case fits the "Show me the
most recent events by user A" you described.

So using the first example, a table populated with events of user ID AA.

ROW    COLUMN+CELL
AA     column=data:event, timestamp=1350420705459, value=myeventval1
AA     column=data:event9998, timestamp=1350420704490, value=myeventval2
AA     column=data:event9997, timestamp=1350420704567, value=myeventval3

NOTE1: I replaced the TS stuff with ...9997 for brevity, and the
example user ID "AA" would actually be hashed to avoid hotspotting
NOTE2: I assume I should shorten the chosen column family and qualifier
before writing it to a large production table (for instance, d instead of
data and e instead of event)

I hope I have that right.  Thanks for the response!

As for including enough description for the question to be "not-boring,"
I'm never quite sure when an email will grow so long that no one will read
it.  :)  So to give more background: Each event is about 1KB of data.  The
frequency is highly variable... over any given period of time, some users
may only log one event and no more, some users may log a few events (10 to
100), in some rare cases a user may log many events (1000+).  The width of
the column is some concern for the users with many events, but I'm thinking
a few rare rows with 1KB x 1000+ width shouldn't kill us.

If I may ask a couple of followup questions about your comments:

> Then store each event in a separate column where the column name is
something like "event" + (max Long - Time Stamp) .
>
> This will place the most recent event first.

Although I know row keys are sorted, I'm not sure what this means for a
qualifier.  The scan result can depend on what cf:qual is used?  ...and
that determines which column value is "first"?  Is this related to using
setMaxResultsPerColumnFamily(1)?  (ie-- only return one column value, so
sort on qualifier and return the first val found)

> The reason I say "event" + the long, is that you may want to place user
specific information in a column and you would want to make sure it was in
front of the event data.

Same question as above: I'm not sure what would place a column "in front."
Am I missing something?

> In the first case, you can use get() while still a scan, its a very
efficient fetch.
> In the second, you will always need to do a scan.

This is the core of my original question.  My anecdotal tests in hbase
shell showed a Get executing about 3x faster than a Scan with
start/stoprow, but I don't trust my crude testing much and hoped someone
could describe the performance trade-off between Scan vs. Get.


Thanks again for anyone who read this far.


Neil Yalowitz
neilyalow...@gmail.com

On Wed, Oct 17, 2012 at 10:45 AM, Michael Segel wrote:

Re: crafting your key - scan vs. get

2012-10-18 Thread Michael Segel
Neil, 

I've pointed you in the right direction. 
The rest of the exercise is left to the student. :-) 

While you framed it as a fun question, your question is boring. *^1
The fun part is for you to play with it now and see why I suggested the 
importance of column order.

Sorry, but that really is the fun part of your question... figuring out the 
rest of the answer on your own. 

From your response, you clearly understand it, but you need to spend more time 
wrapping your head around the solution and taking ownership of it. 

Have fun, 

-Mike


*^1  The reason I say that the question is boring is that once you fully 
understand the problem and the solution, you can easily apply it to other 
problems. The fun is in actually taking the time to experiment and work through 
the problem on your own. Seriously, that *is* the fun part.


On Oct 17, 2012, at 10:53 PM, Neil Yalowitz wrote:

Re: crafting your key - scan vs. get

2012-10-18 Thread Ian Varley
Hi Neil,

Mike summed it up well, as usual. :) Your choices of where to describe this 
"dimension" of your data (a one-to-many between users and events) are:

 - one row per event
 - one row per user, with events as columns
 - one row per user, with events as versions on a single cell

The first two are the best choices, since the third is sort of a perversion of 
the time dimension (it isn't one thing that's changing, it's many things over 
time), and might make things counter-intuitive when combined with deletes, 
compaction, etc. You can do it, but caveat emptor. :)

Since you have on the order of 100s or 1000s of events per user, it's reasonable to 
use the 2nd (columns). And with 1KB cell sizes, even extreme cases (thousands of 
events) won't kill you.

That said, the main plus you get out of using columns over rows is ACID 
properties; you could get & set all the stuff for a single user atomically if 
it's columns in a single row, but not if it's separate rows. That's nice, but 
I'm guessing you probably don't need to do that, and instead would write out 
the events as they happen (i.e., you would rarely be doing PUTs for multiple 
events for the same user at the same time, right?).

In theory, tall tables (the row-wise model) should have a slight performance 
advantage over wide tables (the column-wise model), all other things being 
equal; the shape of the data is nearly the same, but the row-wise version 
doesn't have to do any work preserving consistency. Your informal tests about 
GET vs SCAN perf seem a little suspect, since a GET is actually implemented as 
a one-row SCAN; but the devil's in the details, so if you see that happening 
repeatably with data that's otherwise identical, raise it up to the dev list 
and people should look at it.

The key thing is to try it for yourself and see. :)

Ian

ps - Sorry Mike was rude to you in his response. Your question was well-phrased 
and not at all boring. Mike, you can explain all you want, but saying "Your 
question is boring" is straight up rude; please don't do that.


From: Neil Yalowitz <neilyalow...@gmail.com>
Date: Tue, Oct 16, 2012 at 2:53 PM
Subject: crafting your key - scan vs. get
To: user@hbase.apache.org





Re: crafting your key - scan vs. get

2012-10-19 Thread Neil Yalowitz
Thanks Ian!  Very helpful breakdown.

For this use case, I think the multi-version row structure is ruled out.
We will investigate the one-key, many-column approach.  Also, the more I study
the mechanics behind a SCAN vs GET, the more I believe the informal test I
did is inaccurate.  What does warrant a look, however, are the filters on
the scan.  We are already filtering on CF, but we can now look at filtering
on qualifiers as well.

Thanks again,

Neil Yalowitz
neilyalow...@gmail.com

On Thu, Oct 18, 2012 at 4:59 PM, Ian Varley wrote: