Certainly.  There is also a performance penalty to unbounded row
sizes.  That penalty is your nodes OOMing.  I strongly recommend you
abandon that direction.

On Sat, Aug 7, 2010 at 9:06 PM, Mark <static.void....@gmail.com> wrote:
> On 8/7/10 7:04 PM, Benjamin Black wrote:
>>
>> Certainly it matters: your previous version is not bounded in time, so it
>> will grow without bound.  Ergo, it is not a good fit for Cassandra.
>>
>> On Sat, Aug 7, 2010 at 2:51 PM, Mark<static.void....@gmail.com>  wrote:
>>
>>>
>>> On 8/7/10 2:33 PM, Benjamin Black wrote:
>>>
>>>>
>>>> Right, this is an index row per time interval (your previous email was
>>>> not).
>>>>
>>>> On Sat, Aug 7, 2010 at 11:43 AM, Mark<static.void....@gmail.com>
>>>>  wrote:
>>>>
>>>>
>>>>>
>>>>> On 8/7/10 11:30 AM, Mark wrote:
>>>>>
>>>>>
>>>>>>
>>>>>> On 8/7/10 4:22 AM, Thomas Heller wrote:
>>>>>>
>>>>>>
>>>>>>>>
>>>>>>>> Ok, I think the part I was missing was the concatenation of the key
>>>>>>>> and partition to do the lookups. Is this the preferred way of
>>>>>>>> accomplishing needs such as this? Are there alternative ways?
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> Depending on your needs you can concat the row key or use super
>>>>>>> columns.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> How would one then "query" over multiple days? Same question for all
>>>>>>>> days. Should I use range_slice or multiget_slice? And if it's
>>>>>>>> range_slice, does that mean I need OrderPreservingPartitioner?
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> The last 3 days is pretty simple: ['2010-08-07', '2010-08-06',
>>>>>>> '2010-08-05'], as is 7, 31, etc. Just generate the keys in your app
>>>>>>> and use multiget_slice.
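>>>>>>>
>>>>>>> Rough sketch with the ruby cassandra gem (the keyspace and CF names
>>>>>>> here are made up, and exact method signatures may differ between gem
>>>>>>> versions):
>>>>>>>
>>>>>>>   require 'cassandra'
>>>>>>>   require 'date'
>>>>>>>
>>>>>>>   client = Cassandra.new('Logs', '127.0.0.1:9160')
>>>>>>>
>>>>>>>   # one index row per day; fetch the last 3 days in a single call
>>>>>>>   days = (0...3).map { |n| (Date.today - n).strftime('%Y-%m-%d') }
>>>>>>>   rows = client.multi_get(:SearchLogByDay, days)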
>>>>>>>
>>>>>>> If you want to get all days where a specific ip address had some
>>>>>>> requests you'll just need another CF where the row key is the addr
>>>>>>> and
>>>>>>> column names are the days (values optional again). Pretty much the
>>>>>>> same all over again, just add another CF and insert the data you
>>>>>>> need.
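>>>>>>>
>>>>>>> Sketch again (the :DaysByIP CF name is made up, client as above):
>>>>>>>
>>>>>>>   # record that this addr made a request on this day (value unused)
>>>>>>>   client.insert(:DaysByIP, '127.0.0.1', { '2010-08-07' => '' })
>>>>>>>
>>>>>>>   # later: every day on which this addr had requests
>>>>>>>   days_seen = client.get(:DaysByIP, '127.0.0.1').keys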
>>>>>>>
>>>>>>> get_range_slice in my experience is better used for "offline" tasks
>>>>>>> where you really want to process every row there is.
>>>>>>>
>>>>>>> /thomas
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> Ok... as an example, for looking up logs by IP for a certain
>>>>>> timeframe/range, would this work?
>>>>>>
>>>>>> <ColumnFamily Name="SearchLog"/>
>>>>>>
>>>>>> <ColumnFamily Name="IPSearchLog"
>>>>>>                           ColumnType="Super"
>>>>>>                           CompareWith="UTF8Type"
>>>>>>                           CompareSubcolumnsWith="TimeUUIDType"/>
>>>>>>
>>>>>> Resulting in a structure like:
>>>>>>
>>>>>> {
>>>>>>  "127.0.0.1" : {
>>>>>>       "2010080711" : {
>>>>>>            uuid1 : ""
>>>>>>            uuid2: ""
>>>>>>            uuid3: ""
>>>>>>       }
>>>>>>      "2010080712" : {
>>>>>>            uuid1 : ""
>>>>>>            uuid2: ""
>>>>>>            uuid3: ""
>>>>>>       }
>>>>>>   }
>>>>>>  "some.other.ip" : {
>>>>>>       "2010080711" : {
>>>>>>            uuid1 : ""
>>>>>>       }
>>>>>>   }
>>>>>> }
>>>>>>
>>>>>> Where each uuid is the key used for SearchLog.  Is there anything wrong
>>>>>> with this? I know there is a 2 billion column limit, but in this case
>>>>>> that would never be exceeded because each column represents an hour.
>>>>>> However, does the above "schema" imply that for any given IP there can
>>>>>> only be a maximum of 2GB of data stored?
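>>>>>>
>>>>>> For reading it back I'm imagining something like this (sketch only;
>>>>>> the keyspace name is made up and the exact gem calls for super column
>>>>>> reads may differ by version):
>>>>>>
>>>>>>   client = Cassandra.new('Logs', '127.0.0.1:9160')
>>>>>>
>>>>>>   # subcolumn names under one IP / one hour bucket are SearchLog keys
>>>>>>   uuids  = client.get(:IPSearchLog, '127.0.0.1', '2010080711').keys
>>>>>>   events = client.multi_get(:SearchLog, uuids)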
>>>>>>
>>>>>>
>>>>>
>>>>> Or should I invert the ip with the time slices? The limitation of this
>>>>> seems to be that there can only be 2 billion unique ips per hour, which
>>>>> is more than enough for our application :)
>>>>>
>>>>> {
>>>>>  "2010080711" : {
>>>>>       "127.0.0.1" : {
>>>>>            uuid1 : ""
>>>>>            uuid2: ""
>>>>>            uuid3: ""
>>>>>       }
>>>>>      "some.other.ip" : {
>>>>>            uuid1 : ""
>>>>>            uuid2: ""
>>>>>            uuid3: ""
>>>>>       }
>>>>>   }
>>>>>  "2010080712" : {
>>>>>       "127.0.0.1" : {
>>>>>            uuid1 : ""
>>>>>       }
>>>>>   }
>>>>> }
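>>>>>
>>>>> With this layout I'd expect the lookup to go roughly like this (sketch;
>>>>> the :SearchLogByHour CF name is made up, client as above):
>>>>>
>>>>>   hours = ['2010080711', '2010080712']
>>>>>   rows  = client.multi_get(:SearchLogByHour, hours)
>>>>>   # collect the uuids recorded for one IP across those hours
>>>>>   uuids = rows.values.map { |ips| (ips['127.0.0.1'] || {}).keys }.flatten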
>>>>>
>>>>>
>>>>>
>>>>>
>>>
>>> In the end does it really matter which one to go with? I kind of like the
>>> previous version so I don't have to build up all the keys for the
>>> multi_get and instead I can just provide a start & finish for the columns
>>> (time frames).
>>>
>>>
>
> Is there any performance penalty for a multi_get that includes x keys versus
> a get on 1 key with a start/finish range of x?
>
> Using your gem,
>
> multi_get("SearchLog", ["20090101"..."20100807"], "127.0.0.1")
> vs
> get("SearchLog", "127.0.0.1", :start => "20090101", :finish => ""127.0.0.1")
>
> Thanks
>
