I am also working on a system that stores logs from hundreds of systems.
In my scenario, most queries look like this: "let's look at the login logs (category 
EQ) of that proxy (host EQ) between this Monday and Wednesday (time range)."
My data model is like this:
. Only 1 CF; that's enough for this scenario.
. Group the logs from each host and day into one row. The key format is 
"hostname.category.date".
. Store each log entry as a super column; the super column name is the TimeUUID of the 
log, and each attribute is a column under it.
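
A minimal write-path sketch of this model, assuming a Python client like pycassa, placeholder keyspace/CF/host/category names, and a super column family whose super column names compare as TimeUUIDType:

    import time

    from pycassa.pool import ConnectionPool
    from pycassa.columnfamily import ColumnFamily
    from pycassa.util import convert_time_to_uuid

    # Hypothetical keyspace and super column family names.
    pool = ConnectionPool('Logs', ['localhost:9160'])
    logs = ColumnFamily(pool, 'LogsByHostCategoryDay')

    def row_key(hostname, category, ts):
        # Row key format: "hostname.category.date", one row per host/category/day.
        return '%s.%s.%s' % (hostname, category,
                             time.strftime('%Y%m%d', time.gmtime(ts)))

    def write_log(hostname, category, ts, attributes):
        # One super column per log entry, named by the entry's TimeUUID;
        # each attribute of the entry becomes a column under it.
        entry_id = convert_time_to_uuid(ts)
        logs.insert(row_key(hostname, category, ts), {entry_id: attributes})

    write_log('proxy01', 'login', time.time(),
              {'user': 'alice', 'result': 'success', 'message': 'login ok'})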

Then this query can be done as 3 GETs, with no need for a key range scan.
That lets me use RP (RandomPartitioner) instead of OPP (OrderPreservingPartitioner). 
With OPP I would have to worry about load balancing myself, and I hate that.
If I still need time-range access within a day, I can use a column slice.
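
Continuing that sketch, the example query ("login logs of that proxy between Monday and Wednesday") is three point gets, and a narrower range within a day is a slice on the TimeUUID super column names; the day keys and timestamps below are made up for illustration:

    import calendar

    # Three point gets, one per day row -- no key range scan, so RP is fine.
    for day in ('20110124', '20110125', '20110126'):        # Mon..Wed
        row = logs.get('proxy01.login.%s' % day, column_count=1000)
        # row is {TimeUUID: {attribute: value}}

    # Time-range access within a day: slice on the TimeUUID super column names.
    start = convert_time_to_uuid(calendar.timegm((2011, 1, 24, 9, 0, 0, 0, 0, 0)))
    finish = convert_time_to_uuid(calendar.timegm((2011, 1, 24, 17, 0, 0, 0, 0, 0)),
                                  lowest_val=False)
    entries = logs.get('proxy01.login.20110124',
                       column_start=start, column_finish=finish)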

An additional benefit is that I can clean up old logs very easily. We only keep logs 
for 1 year, and simply deleting by key does this job well.
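
With the same hypothetical names, retention is just a row delete per expired host/category/day key:

    # Drop an expired day of 'login' logs for proxy01 in one call.
    logs.remove('proxy01.login.20100126')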

I think storing all logs for a host in a single row is not a good choice, for 2 
reasons:
1. Too few keys, so your data will not distribute well.
2. The data under a key keeps growing, so Cassandra has to do more SSTable 
compaction.

-----Original Message-----
From: William R Speirs [mailto:bill.spe...@gmail.com] 
Sent: January 27, 2011 9:15
To: user@cassandra.apache.org
Subject: Re: Schema Design

It makes sense that the single row for a system (with a growing number of 
columns) will reside on a single machine.

With that in mind, here is my updated schema:

- A single column family for all the messages. The row key will be the TimeUUID 
of the message, with the following columns: date/time (in UTC POSIX), system 
name/id (with an index for fast/easy gets), and the actual message payload.

- A column family for each system. The row keys will be UTC POSIX time with 1 
second (maybe 1 minute) bucketing, and the column names will be the TimeUUID of 
any messages that were logged during that time bucket.
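
A rough sketch of that layout, assuming a Python client like pycassa, hypothetical CF names ('Messages' plus one 'system_<name>' CF per system), BytesType row keys for Messages, and a TimeUUID comparator for the per-system CFs:

    import time

    from pycassa.pool import ConnectionPool
    from pycassa.columnfamily import ColumnFamily
    from pycassa.util import convert_time_to_uuid

    pool = ConnectionPool('Logs', ['localhost:9160'])
    messages = ColumnFamily(pool, 'Messages')   # one row per message

    def log_message(system, ts, payload):
        msg_id = convert_time_to_uuid(ts)
        # Message row, keyed by the raw bytes of its TimeUUID.
        messages.insert(msg_id.bytes,
                        {'time': str(int(ts)), 'system': system, 'message': payload})

        # Per-system CF; row key is the 1-second bucket (use int(ts) // 60 for
        # 1-minute buckets), column name is the message TimeUUID, value left empty.
        timeline = ColumnFamily(pool, 'system_%s' % system)
        timeline.insert(str(int(ts)), {msg_id: ''})

    log_message('proxy01', time.time(), 'login ok')

Reading a time window would then mean slicing the bucket rows for that window in the per-system CF and fetching the message rows by the TimeUUIDs found there.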

My only hesitation with this design is that buddhasystem warned that each column 
family "is allocated a piece of memory on the server." I'm not sure what the 
implications of this are, and/or whether this would be a problem if I had a number 
of systems on the order of hundreds.

Thanks...

Bill-

On 01/26/2011 06:51 PM, Shu Zhang wrote:
> Each row can have a maximum of 2 billion columns, which a logging system will 
> probably hit eventually.
>
> More importantly, you'll only have 1 row per set of system logs. Every row is 
> stored on the same machine(s), which means you'll definitely not be able 
> to distribute your load very well.
> ________________________________________
> From: Bill Speirs [bill.spe...@gmail.com]
> Sent: Wednesday, January 26, 2011 1:23 PM
> To: user@cassandra.apache.org
> Subject: Re: Schema Design
>
> I like this approach, but I have 2 questions:
>
> 1) what are the implications of continually adding columns to a single
> row? I'm unsure how Cassandra handles that growth. I realize you can have
> a virtually infinite number of columns, but what are the implications
> of growing the number of columns over time?
>
> 2) maybe it's just a restriction of the CLI, but how do I issue a
> slice request? Also, what if the start (or end) columns don't exist? I'm
> guessing it's smart enough to get the columns in that range.
>
> Thanks!
>
> Bill-
>
> On Wed, Jan 26, 2011 at 4:12 PM, David McNelis
> <dmcne...@agentisenergy.com>  wrote:
>> I would say in that case you might want to try a single column family
>> where the key to the column family is the system name.
>> Then, you could name your columns as the timestamp. Then when retrieving
>> information from the data store you can, in your slice request, specify
>> your start column as X and end column as Y.
>> Then you can use the stored column name to know when an event occurred.
>>
>> On Wed, Jan 26, 2011 at 2:56 PM, Bill Speirs<bill.spe...@gmail.com>  wrote:
>>>
>>> I'm looking to use Cassandra to store log messages from various
>>> systems. A log message only has a message (UTF8Type) and a data/time.
>>> My thought is to create a column family for each system. The row key
>>> will be a TimeUUIDType. Each row will have 7 columns: year, month,
>>> day, hour, minute, second, and message. I then have indexes setup for
>>> each of the date/time columns.
>>>
>>> I was hoping this would allow me to answer queries like: "What are all
>>> the log messages that were generated between X & Y?" The problem is
>>> that I can ONLY use the equals operator on these column values. For
>>> example, issuing: get system_x where month > 1; gives me this
>>> error: "No indexed columns present in index clause with operator EQ."
>>> The equals operator works as expected though: get system_x where month
>>> = 1;
>>>
>>> What schema would allow me to get date ranges?
>>>
>>> Thanks in advance...
>>>
>>> Bill-
>>>
>>> * ColumnFamily description *
>>>     ColumnFamily: system_x_msg
>>>       Columns sorted by: org.apache.cassandra.db.marshal.UTF8Type
>>>       Row cache size / save period: 0.0/0
>>>       Key cache size / save period: 200000.0/3600
>>>       Memtable thresholds: 1.1671875/249/60
>>>       GC grace seconds: 864000
>>>       Compaction min/max thresholds: 4/32
>>>       Read repair chance: 1.0
>>>       Built indexes: [proj_1_msg.646179, proj_1_msg.686f7572,
>>> proj_1_msg.6d696e757465, proj_1_msg.6d6f6e7468,
>>> proj_1_msg.7365636f6e64, proj_1_msg.79656172]
>>>       Column Metadata:
>>>         Column Name: year (year)
>>>           Validation Class: org.apache.cassandra.db.marshal.IntegerType
>>>           Index Type: KEYS
>>>         Column Name: month (month)
>>>           Validation Class: org.apache.cassandra.db.marshal.IntegerType
>>>           Index Type: KEYS
>>>         Column Name: second (second)
>>>           Validation Class: org.apache.cassandra.db.marshal.IntegerType
>>>           Index Type: KEYS
>>>         Column Name: minute (minute)
>>>           Validation Class: org.apache.cassandra.db.marshal.IntegerType
>>>           Index Type: KEYS
>>>         Column Name: hour (hour)
>>>           Validation Class: org.apache.cassandra.db.marshal.IntegerType
>>>           Index Type: KEYS
>>>         Column Name: day (day)
>>>           Validation Class: org.apache.cassandra.db.marshal.IntegerType
>>>           Index Type: KEYS
>>
>>
>>
>> --
>> David McNelis
>> Lead Software Engineer
>> Agentis Energy
>> www.agentisenergy.com
>> o: 630.359.6395
>> c: 219.384.5143
>> A Smart Grid technology company focused on helping consumers of energy
>> control an often under-managed resource.
>>
>>
