Hi Duy, great questions.

> 1) SASI was pretty inefficient indexing wide partitions because the index
> structure only retains the partition token, not the clustering columns. As
> per design doc SAI has row id mapping to partition offset, can we hope that
> indexing wide partition will be more efficient with SAI ? One detail that
> worries me is that in the beginning of the design doc, it is said that the
> matching rows are post filtered while scanning the partition. Can you
> confirm or infirm that SAI is efficient with wide partitions and provides
> the partition offsets to the matching rows ?

As of now, SAI indexes the partition offset, same as SASI. But during design we took row-level indexing into consideration, and row-awareness is being prototyped.

For the record, partition-level indexing works nicely when most rows in the wide partition match the indexed value. After switching to a row-level index, when a query matches most rows in a wide partition, the index engine needs to fall back to partition-level index behavior (scanning the entire partition + post-filter) instead of fetching single rows many times.

> 2) About space efficiency, one of the biggest drawback of SASI was the huge
> space required for index structure when using CONTAINS logic because of the
> decomposition of text columns into n-grams. Will SAI suffer from the same
> issue in future iterations ? I'm anticipating a bit

Tokenization wasn't part of the CEP scope. Off the top of my head, I think tokenization requires more space, as both SAI and SASI need to store matches for every decomposed value. But with frame-of-reference encoding on row ids, SAI should require less disk space than SASI.

> 3) If I'm querying using SAI and providing complete partition key, will it
> be more efficient than querying without partition key. In other words, does
> SAI provide any optimisation when partition key is specified ?

Yes.

* On the coordinator, it will find replicas with the PK.
* On the replica side:
  - it will skip to the given PK token
  - there is some pruning based on the min/max key of index segments.

> 4) Are collections, static columns, composite partition key component and
> UDT indexings (at any depth) on the roadmap of SAI ? I strongly believe
> that those features are the bare minimum to make SAI an interesting
> replacement for the native 2nd index as well as SASI. SASI limited support
> for those advanced data structures has hindered its wide adoption (among
> other issues and bugs)

Collections, static columns, and composite partition keys are supported.
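
As an aside on question 2, here is a rough sketch of why frame-of-reference encoding on row ids can save space. This is not SAI's actual on-disk format: the function names and the 9-byte block header (an 8-byte base plus a 1-byte bit width) are assumptions made purely for illustration.

```python
def for_encode(row_ids):
    """Frame-of-reference encode one block of row ids.

    Returns (base, bits, offsets): the smallest id in the block, the number
    of bits needed per offset, and each id expressed as an offset from base.
    """
    if not row_ids:
        raise ValueError("block must not be empty")
    base = min(row_ids)
    offsets = [r - base for r in row_ids]
    # Every offset is stored with the width of the largest one (at least 1 bit).
    bits = max(max(offsets).bit_length(), 1)
    return base, bits, offsets


def for_decode(base, offsets):
    """Rebuild the original row ids from the base and the offsets."""
    return [base + o for o in offsets]


def packed_size_bytes(bits, count, header_bytes=9):
    """Size of a bit-packed block; header_bytes is a hypothetical header size."""
    return header_bytes + (bits * count + 7) // 8


block = [1_000_000, 1_000_003, 1_000_010, 1_000_255]
base, bits, offsets = for_encode(block)
raw_bytes = 8 * len(block)                      # four plain 64-bit row ids
packed_bytes = packed_size_bytes(bits, len(block))
```

In this toy block the offsets fit in 8 bits each, so the packed form takes 13 bytes against 32 bytes for raw 64-bit ids. The gain depends on how closely clustered the row ids in a block are.
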
As for "UDT indexing (at any depth)", I think it can be added because there is no architectural limitation in SAI or SASI.

I have invited you to Slack #cassandra-sai; I really appreciate your participation.

On Tue, 18 Aug 2020 at 19:33, DuyHai Doan <doanduy...@gmail.com> wrote:

> Last but not least
>
> 4) Are collections, static columns, composite partition key component and
> UDT indexings (at any depth) on the roadmap of SAI ? I strongly believe
> that those features are the bare minimum to make SAI an interesting
> replacement for the native 2nd index as well as SASI. SASI limited support
> for those advanced data structures has hindered its wide adoption (among
> other issues and bugs)
>
> Regards
>
> Duy Hai DOAN
>
> On Tue, 18 Aug 2020 at 13:02, Jasonstack Zhao Yang
> <jasonstack.z...@gmail.com> wrote:
>
> > Mick thanks for your questions.
> >
> > > During the 4.0 beta phase this was intended to be addressed, i.e.
> > > defining more specific QA guidelines for 4.0-rc. This would be an
> > > important step towards QA guidelines for all changes and CEPs post-4.0.
> >
> > Agreed, I think CASSANDRA-15536
> > <https://issues.apache.org/jira/browse/CASSANDRA-15536> (4.0 Quality:
> > Components and Test Plans) has set a good example of QA/Testing.
> >
> > > - How will this be tested, how will its QA status and lifecycle be
> > > defined? (per above)
> >
> > SAI will follow the same QA/Testing guideline as in CASSANDRA-15536.
> >
> > > - With existing C* code needing to be changed, what is the proposed
> > > plan for making those changes ensuring maintained QA, e.g. is there
> > > separate QA cycles planned for altering the SPI before adding a new SPI
> > > implementation?
> >
> > The plan is to have interface changes and their new implementations to be
> > reviewed/tested/merged at once to reduce overhead.
> >
> > But if having interface changes reviewed/tested/merged separately helps
> > quality, I don't think anyone will object.
> >
> > > - Despite being out of scope, it would be nice to have some idea from
> > > the CEP author of when users might still choose afresh 2i or SASI over
> > > SAI
> >
> > I'd like SAI to be the only index for users, but this is a decision to be
> > made by the community.
> >
> > > - Who fills the roles involved?
> >
> > Contributors that are still active on C* or related projects:
> >
> > Andres de la Peña
> > Caleb Rackliffe
> > Dan LaRocque
> > Jason Rutherglen
> > Mike Adamson
> > Rocco Varela
> > Zhao Yang
> >
> > I will shepherd.
> >
> > Anyone that is interested in C* index, feel free to join us at slack
> > #cassandra-sai.
> >
> > > - Is there a preference to use gdoc instead of the project's wiki, and
> > > why? (the CEP process suggest a wiki page, and feedback on why another
> > > approach is considered better helps evolve the CEP process itself)
> >
> > Didn't notice wiki is required. Will port CEP to wiki.
> >
> >
> > On Tue, 18 Aug 2020 at 17:39, Mick Semb Wever <m...@apache.org> wrote:
> >
> > > >
> > > > We are looking forward to the community's feedback and suggestions.
> > > >
> > >
> > > What comes immediately to mind is testing requirements. It has been
> > > mentioned already that the project's testability and QA guidelines are
> > > inadequate to successfully introduce new features and refactorings to
> > > the codebase. During the 4.0 beta phase this was intended to be
> > > addressed, i.e. defining more specific QA guidelines for 4.0-rc. This
> > > would be an important step towards QA guidelines for all changes and
> > > CEPs post-4.0.
> > >
> > > Questions from me
> > > - How will this be tested, how will its QA status and lifecycle be
> > > defined? (per above)
> > > - With existing C* code needing to be changed, what is the proposed
> > > plan for making those changes ensuring maintained QA, e.g. is there
> > > separate QA cycles planned for altering the SPI before adding a new SPI
> > > implementation?
> > > - Despite being out of scope, it would be nice to have some idea from
> > > the CEP author of when users might still choose afresh 2i or SASI over
> > > SAI,
> > > - Who fills the roles involved? Who are the contributors in this
> > > DataStax team? Who is the shepherd? Are there other stakeholders willing
> > > to be involved?
> > > - Is there a preference to use gdoc instead of the project's wiki, and
> > > why? (the CEP process suggest a wiki page, and feedback on why another
> > > approach is considered better helps evolve the CEP process itself)
> > >
> > > cheers,
> > > Mick
> > >