Thank you Crisan.
Using SASI does seem a better solution. Although it is officially
not supported in production
<https://docs.datastax.com/en/dse/5.1/cql/cql/cql_using/useSASIIndex.html>,
I think this is the optimal solution in this case.

On Mon, Oct 9, 2017 at 11:01 PM, Valentina Crisan <
valentina.cri...@gmail.com> wrote:

> Not really, my suggested primary key is similar to the one you have in
> your proposed MV. The only difference is that with an MV it is Cassandra
> that takes care of data synchronization, while with manual denormalization
> you would need to do it yourself. Example with MV: if you had username
> 'andreas1988' with last_seen '2017-09-11 23:58:23' in your base table, and
> then this user accesses the service and last_seen is updated to
> '2017-10-09 23:58:23' in your base table, what will happen in the
> background is that for the MV Cassandra will, in a batch operation, delete
> the username 'andreas1988' from the partition '2017-09-11 23:58:23' and
> add it to the partition '2017-10-09 23:58:23'. Only when this batch
> finishes will Cassandra update the base table.
> If you denormalize manually, it is you who will need to create the batch
> operations and make these changes yourself, making sure that you only keep
> the last value of last_seen in your table. You will obtain the same result
> in the end, only the operational effort will be bigger.
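>
> As a rough sketch of that manual batch (users_by_last_seen is a
> placeholder name for the simple denormalized table with partition key
> last_seen and clustering key username, as suggested below in this thread;
> the epoch-millis values correspond to the UTC timestamps in the example):
>
> BEGIN BATCH
>   -- remove the user from the old last_seen partition (2017-09-11 23:58:23)
>   DELETE FROM users_by_last_seen
>     WHERE last_seen = 1505174303000 AND username = 'andreas1988';
>   -- add the user to the new last_seen partition (2017-10-09 23:58:23)
>   INSERT INTO users_by_last_seen (last_seen, username)
>     VALUES (1507593503000, 'andreas1988');
>   -- keep only the latest value in the base table
>   UPDATE users SET last_seen = 1507593503000
>     WHERE username = 'andreas1988';
> APPLY BATCH;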
> I understand why an MV would be good for your requirements, but I have
> seen from the discussions that MVs are not recommended for production,
> mainly because it is not possible to check whether a view is out of sync
> with the base table. Check the older discussions (one or two weeks ago)
> for details on MV usage in production.
>
> One other solution could be to work on your users table with a secondary
> index on the last_seen field (from Cassandra 3.4 onwards, SASI allows
> operators like < and > as well as indexing multiple columns) - clearly
> better than ALLOW FILTERING - but still, the whole cluster would be
> contacted most of the time for your queries. Maybe combining Cassandra
> SASI with Spark data locality could solve this better. But first you could
> try SASI and see the query performance, for example with something like
> the sketch below.
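>
> A minimal sketch of that option (the index name is my own, SPARSE mode is
> my assumption since a timestamp column tends to have few rows per value,
> and the literal in the query is just an example epoch-millis cutoff):
>
> CREATE CUSTOM INDEX users_last_seen_idx ON users (last_seen)
> USING 'org.apache.cassandra.index.sasi.SASIIndex'
> WITH OPTIONS = {'mode': 'SPARSE'};
>
> -- range query served by the SASI index, no ALLOW FILTERING needed
> SELECT username FROM users WHERE last_seen < 1507500000000;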
>
> Valentina
>
>
> On Mon, Oct 9, 2017 at 7:56 PM, Avi Levi <a...@indeni.com> wrote:
>
>> Thanks Crisan.
>> I understand what you're saying. But according to your suggestion I will
>> have a record for every entry, while I am interested only in the last
>> entry. So the proposed solution actually keeps much more data than needed.
>>
>> On Oct 9, 2017 8:40 PM, "Valentina Crisan" <valentina.cri...@gmail.com>
>> wrote:
>>
>> Allow filtering is almost never the answer, especially when you want to
>> do a full table scan (there might be some cases where the query is limited
>> to a partition and allow filtering could be used). And you would like to
>> run this query every minute - thus extremely good performance is required.
>> Allow filtering basically brings the whole table content locally to your
>> coordinator and performs local filtering of the data before answering
>> your query. Performance-wise, such an implementation is not recommended.
>>
>> For a query running every minute you need to answer it with one partition
>> read (according to Cassandra data modeling rules), and that can be done
>> with denormalization (manual or materialized views). As far as I know,
>> and also from the discussions on this list, MVs should still be used with
>> caution in production environments. Thus, the best option in my opinion
>> is manual denormalization of the data: building a table with partition
>> key last_seen and clustering key username, and adding/updating data
>> accordingly. Furthermore, since I understand last_seen can be any
>> time/hour of the day, you could consider building partitions per day:
>> partition key = (last_seen, day), primary key = ((last_seen, day),
>> username). A sketch of such a table is below.
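>>
>> Something like this, following that primary key as written (the table
>> name is a placeholder of mine, and I am assuming day is derived from
>> last_seen, e.g. '2017-10-09'):
>>
>> CREATE TABLE users_last_seen_by_day (
>>     last_seen bigint,
>>     day text,
>>     username text,
>>     PRIMARY KEY ((last_seen, day), username)
>> );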
>>
>> Valentina
>>
>> On Mon, Oct 9, 2017 at 1:13 PM, Avi Levi <a...@indeni.com> wrote:
>>
>>> Hi
>>>
>>> I have the following table:
>>>
>>> CREATE TABLE users (
>>>     username text,
>>>     last_seen bigint,
>>>     PRIMARY KEY (username)
>>> );
>>>
>>> where *last_seen* is basically the writetime. The number of records in
>>> the table is approx 10 million. Inserts are pretty much straightforward:
>>> insert into users (username, last_seen) VALUES ([username], now)
>>>
>>> I want to do some processing on users that were not seen for the past
>>> XXX (where XXX can be hours/days ...) by querying the last_seen column
>>> (this query runs every minute), e.g.:
>>>
>>> select username from users where last_seen < (now - 1 day).
>>>
>>> I have two options as I see it:
>>>
>>>    1. use a materialized view:
>>>
>>> CREATE MATERIALIZED VIEW users_last_seen AS
>>> SELECT last_seen, username
>>> FROM users
>>> WHERE last_seen IS NOT NULL AND username IS NOT NULL
>>> PRIMARY KEY (last_seen, username);
>>>
>>>
>>> and simply query:
>>>
>>> select username from users_last_seen where last_seen < (now - 1 day)
>>>
>>>    2. query the users table:
>>>
>>>    select username from users where last_seen < (now - 1 day) ALLOW
>>>    FILTERING
>>>
>>> Which one is more efficient? Any other options?
>>>
>>> Any help will be greatly appreciated
>>>
>>> Best
>>>
>>> Avi
>>>
>>
>>
>>
>
