On 8/7/10 2:33 PM, Benjamin Black wrote:
Right, this is an index row per time interval (your previous email was not).

On Sat, Aug 7, 2010 at 11:43 AM, Mark<static.void....@gmail.com>  wrote:
On 8/7/10 11:30 AM, Mark wrote:
On 8/7/10 4:22 AM, Thomas Heller wrote:
Ok, I think the part I was missing was the concatenation of the key and
partition to do the lookups. Is this the preferred way of accomplishing
needs such as this? Are there alternative ways?
Depending on your needs you can concat the row key or use super columns.

How would one then "query" over multiple days? Same question for all
days.
Should I use range_slice or multiget_slice? And if it's range_slice, does
that mean I need OrderPreservingPartitioner?
The last 3 days is pretty simple: ['2010-08-07', '2010-08-06',
'2010-08-05'], as is 7, 31, etc. Just generate the keys in your app
and use multiget_slice.
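
As a rough sketch of the "generate the keys in your app" step, assuming day
row keys in the 'YYYY-MM-DD' form shown above. The helper name
last_n_day_keys is made up for illustration, and the actual multiget_slice
call is left out because it depends on the client library in use:

```python
from datetime import date, timedelta

def last_n_day_keys(n, today=None):
    """Return row keys for the last n days, newest first, as 'YYYY-MM-DD'."""
    today = today or date.today()
    return [(today - timedelta(days=i)).isoformat() for i in range(n)]

# Pinning "today" to the date of this thread gives the list quoted above.
keys = last_n_day_keys(3, today=date(2010, 8, 7))
```

The resulting list is what would be handed to multiget_slice as the set of
row keys; 7 or 31 days is the same call with a different n.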

If you want to get all days where a specific ip address had some
requests you'll just need another CF where the row key is the addr and
column names are the days (values optional again). Pretty much the
same all over again, just add another CF and insert the data you need.
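
To illustrate the "just add another CF" idea, here is a small in-memory
model using plain dicts rather than a real Cassandra client. The names
requests_by_day, days_by_addr and record_request are hypothetical, and what
the first CF stores per column is an assumption:

```python
from collections import defaultdict

# Stand-ins for two column families: row key -> {column name: value}.
requests_by_day = defaultdict(dict)  # row key: day, columns: request ids (assumed)
days_by_addr = defaultdict(dict)     # row key: addr, columns: days, empty values

def record_request(day, addr, request_id):
    """Write the same event into both CFs at insert time."""
    requests_by_day[day][request_id] = addr
    days_by_addr[addr][day] = ""  # the column name carries the information

def days_with_requests(addr):
    """All days on which addr made at least one request, in sorted order."""
    return sorted(days_by_addr.get(addr, {}))
```

The point of the second CF is that the lookup by address becomes a single
row read instead of a scan over every day row.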

get_range_slice in my experience is better used for "offline" tasks
where you really want to process every row there is.

/thomas
Ok... as an example, would this work for looking up logs by IP for a
certain timeframe/range?

<ColumnFamily Name="SearchLog"/>

<ColumnFamily Name="IPSearchLog"
                           ColumnType="Super"
                           CompareWith="UTF8Type"
                           CompareSubcolumnsWith="TimeUUIDType"/>

Resulting in a structure like:

{
  "127.0.0.1" : {
       "2010080711" : {
            uuid1 : ""
            uuid2: ""
            uuid3: ""
       }
      "2010080712" : {
            uuid1 : ""
            uuid2: ""
            uuid3: ""
       }
   }
  "some.other.ip" : {
       "2010080711" : {
            uuid1 : ""
       }
   }
}

Here each uuid is the key used for SearchLog. Is there anything wrong
with this? I know there is a 2 billion column limit, but in this case that
would never be exceeded because each column represents an hour. However, does
the above "schema" imply that for any given IP there can only be a maximum
of 2GB of data stored?
Or should I invert the ip with the time slices? The limitation of this seems
like there can only be 2 billion unique ips per hour which is more than
enough for our application :)

{
  "2010080711" : {
       "127.0.0.1" : {
            uuid1 : ""
            uuid2: ""
            uuid3: ""
       }
      "some.other.ip" : {
            uuid1 : ""
            uuid2: ""
            uuid3: ""
       }
   }
  "2010080712" : {
       "127.0.0.1" : {
            uuid1 : ""
       }
   }
}


In the end, does it really matter which one to go with? I kind of like the
previous version, so I don't have to build up all the keys for the
multiget_slice and can instead just provide a start & finish for the
columns (time frames).
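
A small sketch of the difference between the two layouts, using plain dicts
instead of a real client. Both helper names are made up for illustration.
With IP row keys, a time frame is one slice over hour-named super columns;
with hour row keys, the app has to build every hour key itself:

```python
from datetime import datetime, timedelta

def hour_keys(start, finish):
    """Hour-bucket names 'YYYYMMDDHH' from start to finish, inclusive."""
    keys = []
    current = start.replace(minute=0, second=0, microsecond=0)
    while current <= finish:
        keys.append(current.strftime("%Y%m%d%H"))
        current += timedelta(hours=1)
    return keys

def slice_columns(row, start, finish):
    """Mimic a column slice: names within [start, finish], in sorted order.

    Plain string comparison is enough because the names are fixed-width digits.
    """
    return {name: row[name] for name in sorted(row) if start <= name <= finish}

# IP-keyed layout: one row, selected with just a start and a finish.
ip_row = {"2010080711": {"uuid1": ""}, "2010080712": {"uuid2": ""},
          "2010080713": {"uuid3": ""}}
hits = slice_columns(ip_row, "2010080711", "2010080712")

# Hour-keyed layout: the same time frame needs an explicit list of row keys.
row_keys = hour_keys(datetime(2010, 8, 7, 11), datetime(2010, 8, 7, 12))
```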
