MV build is also async. In the end it's MV maintenance cost vs Lucene index maintenance cost. I don't have clear figure to judge which one is better. Maybe you should benchmark yourself. Anyway I'll be interested by the results
On Tue, Oct 4, 2016 at 3:05 PM, Dorian Hoxha <dorian.ho...@gmail.com> wrote: > On lucene you can query+filter+sort on a single shard, so it should be > better than MV/sasi. The index building is a little async though. > > On Tue, Oct 4, 2016 at 2:29 PM, Benjamin Roth <benjamin.r...@jaumo.com> > wrote: > >> Thanks guys! >> >> Good to know, that my approach is basically right, but I will check that >> lucene indices by time. >> >> 2016-10-04 14:22 GMT+02:00 DuyHai Doan <doanduy...@gmail.com>: >> >>> "What scatter/gather? " >>> >>> http://www.slideshare.net/doanduyhai/sasi-cassandra-on-the-f >>> ull-text-search-ride-voxxed-daybelgrade-2016/23 >>> >>> "If you partition your data by user_id then you query only 1 shard to >>> get sorted by time visitors for a user" >>> >>> Exact, but in this case, you're using a 2nd index only for sorting right >>> ? For SASI it's not even possible. Maybe it can work with Statrio Lucene >>> impl >>> >>> On Tue, Oct 4, 2016 at 2:15 PM, Dorian Hoxha <dorian.ho...@gmail.com> >>> wrote: >>> >>>> @DuyHai >>>> >>>> What scatter/gather? If you partition your data by user_id then you >>>> query only 1 shard to get sorted by time visitors for a user. >>>> >>>> On Tue, Oct 4, 2016 at 2:09 PM, DuyHai Doan <doanduy...@gmail.com> >>>> wrote: >>>> >>>>> MV is right now your best choice for this kind of sorting behavior. >>>>> >>>>> Secondary index (whatever the impl, SASI or Lucene) has a cost of >>>>> scatter-gather if your cluster scale out. With MV you're at least >>>>> guaranteed to hit a single node everytime >>>>> >>>>> On Tue, Oct 4, 2016 at 1:56 PM, Dorian Hoxha <dorian.ho...@gmail.com> >>>>> wrote: >>>>> >>>>>> Can you use the lucene index https://github.com/Stratio/cas >>>>>> sandra-lucene-index ? >>>>>> >>>>>> On Tue, Oct 4, 2016 at 1:27 PM, Benjamin Roth < >>>>>> benjamin.r...@jaumo.com> wrote: >>>>>> >>>>>>> Hi! >>>>>>> >>>>>>> I have a frequently used pattern which seems to be quite costly in >>>>>>> CS. The pattern is always the same: I have a unique key and a sorting >>>>>>> by a >>>>>>> different field. >>>>>>> >>>>>>> To give an example, here a real life example from our model: >>>>>>> CREATE TABLE visits.visits_in ( >>>>>>> user_id int, >>>>>>> user_id_visitor int, >>>>>>> created timestamp, >>>>>>> PRIMARY KEY (user_id, user_id_visitor) >>>>>>> ) WITH CLUSTERING ORDER BY (user_id_visitor ASC) >>>>>>> >>>>>>> CREATE MATERIALIZED VIEW visits.visits_in_sorted_mv AS >>>>>>> SELECT user_id, created, user_id_visitor >>>>>>> FROM visits.visits_in >>>>>>> WHERE user_id IS NOT NULL AND created IS NOT NULL AND >>>>>>> user_id_visitor IS NOT NULL >>>>>>> PRIMARY KEY (user_id, created, user_id_visitor) >>>>>>> WITH CLUSTERING ORDER BY (created DESC, user_id_visitor DESC) >>>>>>> >>>>>>> This simply represents people, that visited my profile sorted by >>>>>>> date desc but only one entry per visitor. >>>>>>> Other examples with the same pattern could be a whats-app-like inbox >>>>>>> where the last message of each sender is shown by date desc. There are >>>>>>> lots >>>>>>> of examples for that pattern. >>>>>>> >>>>>>> E.g. in redis I'd just use a sorted set, where the key could be like >>>>>>> "visits_${user_id}", set key would be user_id_visitor and score >>>>>>> the created timestamp. >>>>>>> In MySQL I'd create the table with PK on user_id + user_id_visitor >>>>>>> and create an index on user_id + created >>>>>>> In C* i use an MV. >>>>>>> >>>>>>> Is this the most efficient approach? >>>>>>> I also could have done this without an MV but then the situation in >>>>>>> our app would be far more complex. >>>>>>> I know that denormalization is a common pattern in C* and I don't >>>>>>> hesitate to use it but in this case, it is not as simple as it's not an >>>>>>> append-only case but updates have to be handled correctly. >>>>>>> If it is the first visit of a user, it's that simple, just 2 inserts >>>>>>> in base table + denormalized table. But on a 2nd or 3rd visit, the 1st >>>>>>> or >>>>>>> 2nd visit has to be deleted from the denormalized table before. >>>>>>> Otherwise >>>>>>> the visit would not be unique any more. >>>>>>> Handling this case without an MV requires a lot more effort, I guess >>>>>>> even more effort than just using an MV. >>>>>>> 1. You need kind of app-side locking to deal with race conditions >>>>>>> 2. Read before write is required to determine if an old record has >>>>>>> to be deleted >>>>>>> 3. At least CL_QUORUM is required to make sure that read before >>>>>>> write is always consistent >>>>>>> 4. Old record has to be deleted on update >>>>>>> >>>>>>> I guess, using an MV here is more efficient as there is less >>>>>>> roundtrip between C* and the app to do all that and the MV does not >>>>>>> require >>>>>>> strong consistency as MV updates are always local and are eventual >>>>>>> consistent when the base table is. So there is also no need for >>>>>>> distributed >>>>>>> locks. >>>>>>> >>>>>>> I ask all this as we now use CS 3.x and have been advised that 3.x >>>>>>> is still not considered really production ready. >>>>>>> >>>>>>> I guess in a perfect world, this wouldn't even require an MV if SASI >>>>>>> indexes could be created over more than 1 column. E.g. in MySQL this >>>>>>> case >>>>>>> is nothing else than a BTree. AFAIK SASI indices are also BTrees, >>>>>>> filtering >>>>>>> by Partition Key (which should to be done anyway) and sorting by a field >>>>>>> would perfectly do the trick. But from the docs, this is not possible >>>>>>> right >>>>>>> now. >>>>>>> >>>>>>> Does anyone see a better solution or are all my assumptions correct? >>>>>>> >>>>>>> -- >>>>>>> Benjamin Roth >>>>>>> Prokurist >>>>>>> >>>>>>> Jaumo GmbH · www.jaumo.com >>>>>>> Wehrstraße 46 · 73035 Göppingen · Germany >>>>>>> Phone +49 7161 304880-6 · Fax +49 7161 304880-1 >>>>>>> AG Ulm · HRB 731058 · Managing Director: Jens Kammerer >>>>>>> >>>>>> >>>>>> >>>>> >>>> >>> >> >> >> -- >> Benjamin Roth >> Prokurist >> >> Jaumo GmbH · www.jaumo.com >> Wehrstraße 46 · 73035 Göppingen · Germany >> Phone +49 7161 304880-6 · Fax +49 7161 304880-1 >> AG Ulm · HRB 731058 · Managing Director: Jens Kammerer >> > >