Thank you, Crisan. Using SASI does seem the better solution. Although it is officially not supported in production (<https://docs.datastax.com/en/dse/5.1/cql/cql/cql_using/useSASIIndex.html>), I think it is the optimal solution in this case.
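For reference, the SASI index discussed in this thread could be created roughly as follows. This is a sketch: the index name is illustrative, and the choice of SPARSE mode is an assumption that fits a column like last_seen, where each timestamp value maps to only a few rows.

```sql
-- Sketch: SASI index on last_seen (Cassandra 3.4+).
-- 'SPARSE' mode (an assumed choice) is intended for columns where
-- each value corresponds to few rows, and supports range operators.
CREATE CUSTOM INDEX users_last_seen_idx ON users (last_seen)
USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS = { 'mode': 'SPARSE' };

-- The range query then works without ALLOW FILTERING:
SELECT username FROM users WHERE last_seen < 1507500000000;
```

As noted below, the query may still fan out to most of the cluster, since the index is local to each node.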
On Mon, Oct 9, 2017 at 11:01 PM, Valentina Crisan <valentina.cri...@gmail.com> wrote:

> Not really, my suggested primary key is similar to the one you have in
> your proposed MV. The only difference is that with an MV it is Cassandra
> that takes care of data synchronization; with manual denormalization you
> would need to do it yourself. Example with MV: if you had username
> 'andreas1988' with last_seen '2017-09-11 23:58:23' in your base table, and
> then this user accessed the service and last_seen was updated to
> '2017-10-09 23:58:23' in your base table, what happens in the background is
> that Cassandra will, in a batch operation, delete username 'andreas1988'
> from the MV partition '2017-09-11 23:58:23' and add it to partition
> '2017-10-09 23:58:23'. Only when this batch finishes will Cassandra update
> the base table.
> If you denormalize manually, it will be you who needs to create these
> batch operations and make the changes yourself, making sure that you only
> keep the last value of last_seen in your table. You will obtain the same
> result in the end, only the operational effort will be bigger.
>
> I understand why MV would be good for your requirements, but I have seen
> from the discussions that MV is not recommended for production, mainly
> because it is not possible to check whether a view is out of sync with the
> base table. Check older discussions on this list (one or two weeks ago) for
> details on MV usage in production.
>
> One other solution could be to work on your users table with a secondary
> index on the last_seen field (Cassandra 3.4 onwards; SASI allows operators
> like < and > and multi-column indexing), clearly better than allow
> filtering, but still the whole cluster would be contacted most of the time
> for your queries. Maybe combining Cassandra SASI with Spark data locality
> could solve this better. But first you could try SASI and see the query
> performance.
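The manual denormalization described above (deleting the username from the old last_seen partition and inserting it into the new one) could be sketched like this, assuming a hypothetical denormalized table named users_by_last_seen:

```sql
-- Sketch, assuming this illustrative denormalized table:
-- CREATE TABLE users_by_last_seen (
--     last_seen bigint,
--     username text,
--     PRIMARY KEY (last_seen, username)
-- );

-- Moving a user from the old last_seen partition to the new one,
-- mirroring what the MV would do in the background:
BEGIN BATCH
    DELETE FROM users_by_last_seen
        WHERE last_seen = 1505174303000 AND username = 'andreas1988';
    INSERT INTO users_by_last_seen (last_seen, username)
        VALUES (1507586303000, 'andreas1988');
APPLY BATCH;
```

Note that this is a multi-partition logged batch: it gives atomicity for the delete/insert pair, at a higher coordination cost than single-partition writes.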
>
> Valentina
>
> On Mon, Oct 9, 2017 at 7:56 PM, Avi Levi <a...@indeni.com> wrote:
>
>> Thanks, Crisan.
>> I understand what you're saying, but according to your suggestion I
>> would have a record for every entry, while I am interested only in the
>> last entry. So the proposed solution actually keeps much more data than
>> needed.
>>
>> On Oct 9, 2017 8:40 PM, "Valentina Crisan" <valentina.cri...@gmail.com>
>> wrote:
>>
>> Allow filtering is almost never the answer, especially when you want to
>> do a full table scan (there might be some cases where the query is
>> limited to a partition and allow filtering could be used). And you would
>> like to run this query every minute, so extremely good performance is
>> required. Allow filtering basically brings the whole table content
>> locally to your coordinator and performs local filtering of the data
>> before answering your query. Performance-wise, such an implementation is
>> not recommended.
>>
>> For a query running every minute you need to answer it with one
>> partition read (according to Cassandra data modeling rules), and that can
>> be done with denormalization (manual or via materialized views). As far
>> as I know, and also from the discussions on this list, MV should still be
>> used with caution in production environments. Thus, the best option in my
>> opinion is manual denormalization of the data: building a table with
>> partition key last_seen and clustering key username, and adding/updating
>> data accordingly. Furthermore, since last_seen, as I understand it, can
>> be any time of day, you could consider building partitions per day:
>> partition key = (last_seen, day),
>> primary key = ((last_seen, day), username).
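The per-day partitioning suggested at the end of that message could be written out as the following schema. This is a sketch of the suggested primary key only; the table name is illustrative:

```sql
-- Sketch of the suggested per-day schema:
-- partition key (last_seen, day), clustering key username.
CREATE TABLE users_by_last_seen_day (
    last_seen bigint,
    day text,          -- e.g. '2017-10-09'
    username text,
    PRIMARY KEY ((last_seen, day), username)
);
```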
>>
>> Valentina
>>
>> On Mon, Oct 9, 2017 at 1:13 PM, Avi Levi <a...@indeni.com> wrote:
>>
>>> Hi,
>>>
>>> I have the following table:
>>>
>>> CREATE TABLE users (
>>>     username text,
>>>     last_seen bigint,
>>>     PRIMARY KEY (username)
>>> );
>>>
>>> where *last_seen* is basically the writetime. The number of records in
>>> the table is approx 10 million. Insert is pretty much straightforward:
>>> insert into users (username, last_seen) VALUES ([username], now)
>>>
>>> I want to do some processing on users that were not seen for the past
>>> XXX (where XXX can be hours/days...) by querying the last_seen column
>>> (this query runs every minute), e.g.:
>>>
>>> select username from users where last_seen < (now - 1 day)
>>>
>>> I have two options as I see it:
>>>
>>> 1. Use a materialized view:
>>>
>>> CREATE MATERIALIZED VIEW users_last_seen AS
>>>     SELECT last_seen, username
>>>     FROM users
>>>     WHERE last_seen IS NOT NULL
>>>     PRIMARY KEY (last_seen, username);
>>>
>>> and simply query:
>>>
>>> select username from users_last_seen where last_seen < (now - 1 day)
>>>
>>> 2. Query the users table directly:
>>>
>>> select username from users where last_seen < (now - 1 day) ALLOW
>>> FILTERING
>>>
>>> Which one is more efficient? Any other options?
>>>
>>> Any help will be greatly appreciated.
>>>
>>> Best,
>>>
>>> Avi