The conversation around the partitioner sidetracks a bit from your original
question.

You originally asked:
>>
Business case: Show me all events for a given customer in a given time frame

In RDBMS it will be

(Query1)
where customer_id = '289'
and event_time >= '2016-03-01 18:45:00+0000' and event_time <= '2016-03-12
19:05:00+0000'   ;

But C* does not allow >= <= on PKY cols
<<

Actually, C* does allow range queries on _some_ primary key columns, just
not on the partition key portion of the primary key. The primary key that
you are looking for is probably: ((customer_id), event_time). Structuring
the key like this uses customer_id as the partition key, on which you can
use an equality clause as shown above (customer_id = '289'), followed by
range clauses on event_time, which is now treated as a _clustering_column_.
Clustering columns are a concept that you probably want to look into
further to wrap your head around this kind of query pattern.
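A minimal sketch of what that might look like. The table and non-key column
names here are assumptions for illustration; only customer_id and event_time
come from the thread:

```sql
-- Partition by customer_id; order rows within each partition by event_time.
CREATE TABLE customer_events (
    customer_id text,
    event_time  timestamp,
    payload     text,
    PRIMARY KEY ((customer_id), event_time)
) WITH CLUSTERING ORDER BY (event_time ASC);

-- Equality on the partition key, range on the clustering column.
-- This is Query1 from the original question, now valid CQL.
SELECT * FROM customer_events
WHERE customer_id = '289'
  AND event_time >= '2016-03-01 18:45:00+0000'
  AND event_time <= '2016-03-12 19:05:00+0000';
```

With this layout the slice is read sequentially from a single partition,
which is why the range predicate is allowed there but not on the partition
key itself.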



On Thu, Mar 10, 2016 at 5:02 PM, Rakesh Kumar <dcrunch...@aim.com> wrote:

> thanks. that explains it.
>
>
> -----Original Message-----
> From: Jack Krupansky <jack.krupan...@gmail.com>
> To: user <user@cassandra.apache.org>
> Sent: Thu, Mar 10, 2016 5:28 pm
> Subject: Re: What is wrong in this token function
>
> From the doc: "When using the RandomPartitioner or Murmur3Partitioner,
> Cassandra rows are ordered by the hash of their value and hence the order
> of rows is not meaningful... The ByteOrdered partitioner arranges tokens
> the same way as key values, but the RandomPartitioner and
> Murmur3Partitioner distribute tokens in a completely unordered manner. The
> token function makes it possible to page through these unordered
> partitioner results."
>
> See:
> https://docs.datastax.com/en/cql/3.1/cql/cql_using/paging_c.html (for 2.1)
> https://docs.datastax.com/en/cql/3.3/cql/cql_using/usePaging.html (for
> 2.2 and 3.x)
>
>
> -- Jack Krupansky
>
> On Thu, Mar 10, 2016 at 5:14 PM, Rakesh Kumar <dcrunch...@aim.com> wrote:
>
>> I am using default Murmur3.  So are you saying in case of Murmur3 the
>> following two queries
>>
>> select count(*)
>> where customer_id = '289'
>> and event_time >= '2016-03-01 18:45:00+0000' and event_time <=
>> '2016-03-12 19:05:00+0000'   ;
>> and
>> select count(*)
>> where token(customer_id,event_time) >= token('289','2016-03-01
>> 18:45:00+0000')
>> and token(customer_id,event_time) <= token('289','2016-03-12
>> 19:05:00+0000')  ;
>>
>> are not same ?
>>
>> And yes I am aware of how to change the clustering_key to get the first
>> query. This question is more of academic exercise for me.
>>
>>
>> -----Original Message-----
>> From: Jack Krupansky <jack.krupan...@gmail.com>
>> To: user <user@cassandra.apache.org>
>> Sent: Thu, Mar 10, 2016 4:55 pm
>> Subject: Re: What is wrong in this token function
>>
>> What partitioner are you using? The default partitioner is not "ordered",
>> so it will randomly order the hashes/tokens, so that tokens will not be
>> ordered even if your PKs are ordered. You probably want to use customer as
>> your partition key and event time as a clustering column - then you can use
>> RDBMS-like WHERE conditions to select a slice of the partition.
>>
>> -- Jack Krupansky
>>
>> On Thu, Mar 10, 2016 at 4:45 PM, Rakesh Kumar <dcrunch...@aim.com> wrote:
>>
>>>
>>> typo: the primary key was (customer_id + event_time )
>>>
>>>
>>> -----Original Message-----
>>> From: Rakesh Kumar <dcrunch...@aim.com>
>>> To: user <user@cassandra.apache.org>
>>> Sent: Thu, Mar 10, 2016 4:44 pm
>>> Subject: What is wrong in this token function
>>>
>>> C*  3.0.3
>>>
>>> I have a table table1 which has the primary key on
>>> ((customer_id,event_id)).
>>>
>>> I loaded 1.03 million rows from a csv file.
>>>
>>> Business case: Show me all events for a given customer in a given time
>>> frame
>>>
>>> In RDBMS it will be
>>>
>>> (Query1)
>>> where customer_id = '289'
>>> and event_time >= '2016-03-01 18:45:00+0000' and event_time <=
>>> '2016-03-12 19:05:00+0000'   ;
>>>
>>> But C* does not allow >= <= on PKY cols. It suggested token function.
>>>
>>> So I did this:
>>>
>>> (Query2)
>>> where token(customer_id,event_time) >= token('289','2016-03-01
>>> 18:45:00+0000')
>>> and token(customer_id,event_time) <= token('289','2016-03-12
>>> 19:05:00+0000')  ;
>>>
>>> I am seeing 75% more rows than what it should be. It should be 99K rows,
>>> it shows 163K.
>>>
>>> I checked the output with the csv file itself.  To double check I loaded
>>> the csv in another table
>>> with modified PKY so that the first query (Query1) can be executed. It
>>> also showed 99K rows.
>>>
>>> Am I using token function incorrectly ?
>>>
>>>
>>>
>>>
>>
>
