RE: Schema for sorted results

Cristofer Weber Tue, 24 Jul 2012 08:22:22 -0700

Hi Hari,

Using the date as the column qualifier is nice, but I ran into a drawback in a 
scenario where I left the date window open: I kept a large range of dates per 
RowKey, and the number of rows per region became lower and lower as I started 
to split regions.


You can manage this with TTL if you don't need this data after some time, using 
HDFS to store older data (or even a different table or a different RowKey 
pattern). You can also keep the date as part of your RowKey as you showed us 
before; there's nothing wrong with that, since you realized that category fits 
better as the first component of your RowKey. Or you can create a hybrid, with 
year+month in your RowKey and days as Column Qualifiers.
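
As a rough illustration of the hybrid layout, here is a minimal sketch in plain 
Java (no HBase client calls). The class name, the "|" separator borrowed from 
the keys discussed in this thread, and the "sports" category are assumptions 
made for the example, not part of anyone's actual schema.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical helper: category + year/month in the row key, day in the qualifier.
public class HybridKey {
    private static final DateTimeFormatter YEAR_MONTH =
            DateTimeFormatter.ofPattern("yyyyMM");

    // Row key starts with the category so one day's writes are spread
    // across categories, followed by year+month.
    public static String rowKey(String category, LocalDate date) {
        return category + "|" + date.format(YEAR_MONTH);
    }

    // Column qualifier is the zero-padded day of month, so qualifiers
    // sort lexicographically in calendar order within a row.
    public static String qualifier(LocalDate date) {
        return String.format(Locale.ROOT, "%02d", date.getDayOfMonth());
    }

    public static void main(String[] args) {
        LocalDate d = LocalDate.of(2012, 7, 24);
        System.out.println(rowKey("sports", d) + " / " + qualifier(d));
    }
}
```

With this layout a row holds at most 31 day qualifiers per category and month, 
which bounds the row width instead of letting it grow with an open date range.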

The way you query your data should be in your design considerations.

Regards,
Cristofer

-----Original Message-----
From: Hari Prasanna [mailto:h...@slideshare.com] 
Sent: Tuesday, July 24, 2012 11:50
To: user@hbase.apache.org
Subject: Re: Schema for sorted results

JM - I am searching for the top N urls in date+category, so this rowkey does 
work well for my purpose.
Cristofer - I realize that having the raw date at the beginning of the rowkey 
makes all the writes in a day rush to the same region server.
Maybe I could have the rowkey start with the category (which is more
distributed) and have the date in the column qualifier.
I just went through the slides. They were very enlightening, thanks for that.

Thanks again!

On Tue, Jul 24, 2012 at 7:59 PM, Jean-Marc Spaggiari <jean-m...@spaggiari.org> 
wrote:
> Hi Hari,
>
> Why do you think it's wasteful?
>
> Let's imagine this situation.
> Key=<date>|<category>|<padded_visits>|<url> Value = nothing.
>
> And this one:
> Key=<url> Value = <date>|<category>|<padded_visits>
>
> Both situations will, in the end, take almost the same amount of space in 
> the database.
>
> You can also do something like this:
> Key=<url> ColumnFamilyName=<date> Value=<category>|<padded_visits>
>
> It's just that the first option will allow you to retrieve the information 
> you are looking for very quickly.
>
> Now, are you sure that this key is really what you need? What will be 
> the access model for your database? With the key you are using, you 
> will have to search by date first. So if you want to find all the 
> entries for one URL, you will have to scan the entire table, jumping 
> to the next date each time you find it.
>
> If you are searching by date, then this key is good.
>
> So you really need to think first about the way you are going to read 
> your data, and then, you will be able to design a key to match your 
> needs.
>
> JM
>
> 2012/7/24, Minh Duc Nguyen <mdngu...@gmail.com>:
>> Hari,
>>
>>    According to the HBase book: 
>> http://hbase.apache.org/book.html#dm.sort
>>
>> All data model operations in HBase return data in sorted order. First by 
>> row, then by ColumnFamily, followed by column qualifier, and finally 
>> timestamp (sorted in reverse, so newest records are returned first).
>>
>>     ~ Minh
>>
>> On Tue, Jul 24, 2012 at 9:50 AM, Hari Prasanna <h...@slideshare.com> wrote:
>>
>>> Hello -
>>>
>>> I'm using HBase for web server log processing and I'm trying to save 
>>> the top N urls per category per day in a sorted manner in HBase. 
>>> From what I've read, the only sortable structure that HBase offers 
>>> is the lexicographic sort in the row keys. So, here is the rowkey 
>>> format I'm currently using <date>|<category>|<padded_visits>|<url>
>>> where,  padded_visits = Long.MAX_VALUE - visits
>>>
>>> This seems wasteful because of the long rowkeys. Is there any other 
>>> approach to maintain sorted results in HBase?
>>>
>>> Thanks
>>> Hari Prasanna
>>>
>>
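
The row key from the original question, <date>|<category>|<padded_visits>|<url> 
with padded_visits = Long.MAX_VALUE - visits, can be sketched in plain Java as 
below. This is only an illustration under stated assumptions: a TreeMap stands 
in for HBase's lexicographically sorted rows (equivalent to byte order only for 
ASCII keys), the 19-digit zero padding is an assumption about how the value is 
rendered as text, and the class, method names and sample data are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.TreeMap;

// Hypothetical sketch: a TreeMap simulates sorted row keys, no HBase API is used.
public class TopUrls {
    // Long.MAX_VALUE has 19 decimal digits; zero-padding to 19 digits makes
    // string order match numeric order, and subtracting from Long.MAX_VALUE
    // makes higher visit counts sort first. Assumes visits >= 0.
    public static String paddedVisits(long visits) {
        return String.format(Locale.ROOT, "%019d", Long.MAX_VALUE - visits);
    }

    public static String rowKey(String date, String category, long visits, String url) {
        return date + "|" + category + "|" + paddedVisits(visits) + "|" + url;
    }

    // Walks keys in sorted order starting at the date|category| prefix and
    // returns the URLs of the first n matching keys (most visited first).
    public static List<String> topN(TreeMap<String, String> table,
                                    String date, String category, int n) {
        String prefix = date + "|" + category + "|";
        List<String> urls = new ArrayList<>();
        for (String key : table.tailMap(prefix, true).keySet()) {
            if (!key.startsWith(prefix) || urls.size() == n) {
                break;
            }
            // Skip the prefix, the 19 padded digits and the separator.
            urls.add(key.substring(prefix.length() + 20));
        }
        return urls;
    }

    public static void main(String[] args) {
        TreeMap<String, String> table = new TreeMap<>();
        table.put(rowKey("20120724", "news", 120, "/a"), "");
        table.put(rowKey("20120724", "news", 950, "/b"), "");
        table.put(rowKey("20120724", "news", 40, "/c"), "");
        table.put(rowKey("20120724", "sports", 5000, "/d"), "");
        System.out.println(topN(table, "20120724", "news", 2)); // prints [/b, /a]
    }
}
```

Because the date comes first in this key, a top-N lookup for one date and 
category is a short prefix scan, while finding every entry for a single URL 
would still require walking the whole table, as JM points out.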



--
Hari
