Re: [DISCUSS] CEP-7 Storage Attached Index

Jasonstack Zhao Yang Wed, 19 Aug 2020 00:44:01 -0700

Hi Duy, great questions.

> 1) SASI was pretty inefficient at indexing wide partitions because the index
> structure only retains the partition token, not the clustering columns. As
> per the design doc, SAI has a row id mapping to the partition offset; can we
> hope that indexing wide partitions will be more efficient with SAI? One
> detail that worries me is that at the beginning of the design doc, it is said
> that the matching rows are post-filtered while scanning the partition. Can
> you confirm or deny that SAI is efficient with wide partitions and provides
> the partition offsets to the matching rows?


As of now, SAI indexes the partition offset, the same as SASI. However, the
design took row-level indexing into consideration, and row-awareness is being
prototyped.

For the record, partition-level indexing works nicely when most rows in the
wide partition match the indexed value. After switching to a row-level index,
when most rows in a wide partition match, the index engine needs to fall back
to partition-level index behavior (scanning the entire partition +
post-filter) instead of fetching single rows many times.
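
To illustrate the trade-off above, here is a minimal sketch of how an engine
might choose between the two read paths. It is not SAI code: the function
name, the cost constants, and the idea of comparing estimated costs are
assumptions made purely for illustration.

```python
def choose_read_strategy(matching_rows, partition_rows,
                         seek_cost=10.0, scan_cost=1.0):
    """Pick the cheaper read path for one wide partition.

    seek_cost and scan_cost are hypothetical per-row costs: a single-row
    fetch is assumed to be more expensive than reading one row during a
    sequential scan of the partition.
    """
    row_fetch_cost = matching_rows * seek_cost
    partition_scan_cost = partition_rows * scan_cost
    if row_fetch_cost < partition_scan_cost:
        return "row-level fetch"
    # Most rows match: scan the whole partition once and post-filter.
    return "partition scan + post-filter"


print(choose_read_strategy(matching_rows=5, partition_rows=1000))
print(choose_read_strategy(matching_rows=900, partition_rows=1000))
```

With these made-up costs, 5 matches out of 1000 rows favor single-row
fetches, while 900 matches out of 1000 favor the scan.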

> 2) About space efficiency, one of the biggest drawbacks of SASI was the huge
> space required for the index structure when using CONTAINS logic, because of
> the decomposition of text columns into n-grams. Will SAI suffer from the same
> issue in future iterations? I'm anticipating a bit.

Tokenization wasn't part of the CEP scope.

Off the top of my head, I think tokenization does require more space, as both
SAI and SASI need to store matches for every decomposed value. But with
frame-of-reference encoding on row ids, SAI should require less disk space
than SASI.
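
As a rough illustration of why frame-of-reference encoding saves space, here
is a generic sketch of the technique. It is not SAI's on-disk format: the
block size, the 64-bit base, the 8-bit width header, and all function names
are assumptions chosen for the example.

```python
def for_encode(row_ids, block_size=4):
    """Encode sorted row ids as blocks of (base, bit_width, deltas)."""
    blocks = []
    for start in range(0, len(row_ids), block_size):
        block = row_ids[start:start + block_size]
        base = block[0]  # smallest value, because the input is sorted
        deltas = [row_id - base for row_id in block]
        # Every delta in the block is stored with the width of the largest one.
        bit_width = max(deltas).bit_length()
        blocks.append((base, bit_width, deltas))
    return blocks


def for_decode(blocks):
    """Rebuild the original row ids from the encoded blocks."""
    return [base + delta for base, _, deltas in blocks for delta in deltas]


def encoded_bits(blocks, header_bits=64 + 8):
    """Bits used: one header per block (base + width) plus the packed deltas."""
    return sum(header_bits + width * len(deltas) for _, width, deltas in blocks)


row_ids = [1000000, 1000003, 1000007, 1000012,
           2000000, 2000001, 2000005, 2000009]
blocks = for_encode(row_ids)
assert for_decode(blocks) == row_ids
print(encoded_bits(blocks), "bits encoded vs", 64 * len(row_ids), "bits raw")
```

For these eight row ids the sketch needs 176 bits instead of 512, because
each block stores one base value and only 4 bits per delta.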

> 3) If I'm querying using SAI and providing the complete partition key, will
> it be more efficient than querying without the partition key? In other
> words, does SAI provide any optimisation when the partition key is specified?

Yes.

* On coordinator, it will find replicas with PK.
* On replica side:
 - it will skip to given PK token
 - there is some pruning based on min/max key of index segments.
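
The min/max pruning mentioned above can be sketched as follows. This is only
an illustration under assumptions: the `Segment` class, its fields, and the
use of plain integers as keys are hypothetical and do not reflect SAI's
actual data structures.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical index segment that records the key range it covers."""
    name: str
    min_key: int
    max_key: int


def prune_segments(segments, key):
    """Keep only the segments whose [min_key, max_key] range can contain key."""
    return [s for s in segments if s.min_key <= key <= s.max_key]


segments = [
    Segment("seg-0", -100, -1),
    Segment("seg-1", 0, 50),
    Segment("seg-2", 40, 90),
]
# Only segments that may hold the requested key need to be searched.
print([s.name for s in prune_segments(segments, 45)])
```

A segment whose range does not cover the requested key is skipped without
being read; overlapping ranges simply mean more than one segment is searched.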

> 4) Are collections, static columns, composite partition key components and
> UDT indexing (at any depth) on the roadmap of SAI? I strongly believe
> that those features are the bare minimum to make SAI an interesting
> replacement for the native 2nd index as well as SASI. SASI's limited support
> for those advanced data structures has hindered its wide adoption (among
> other issues and bugs).

Collections, static columns, and composite partition key components are
supported.

I think "UDT indexing (at any depth)" can be added because there is no
architectural limitation in SAI or SASI.

I have invited you to the #cassandra-sai Slack channel; I really appreciate
your participation.


On Tue, 18 Aug 2020 at 19:33, DuyHai Doan <doanduy...@gmail.com> wrote:

> Last but not least
>
> 4) Are collections, static columns, composite partition key components and
> UDT indexing (at any depth) on the roadmap of SAI? I strongly believe
> that those features are the bare minimum to make SAI an interesting
> replacement for the native 2nd index as well as SASI. SASI's limited support
> for those advanced data structures has hindered its wide adoption (among
> other issues and bugs).
>
> Regards
>
> Duy Hai DOAN
>
> On Tue, 18 Aug 2020 at 13:02, Jasonstack Zhao Yang <
> jasonstack.z...@gmail.com> wrote:
>
> > Mick, thanks for your questions.
> >
> > > During the 4.0 beta phase this was intended to be addressed, i.e.
> > > defining more specific QA guidelines for 4.0-rc. This would be an
> > > important step towards QA guidelines for all changes and CEPs post-4.0.
> >
> > Agreed, I think CASSANDRA-15536
> > <https://issues.apache.org/jira/browse/CASSANDRA-15536> (4.0 Quality:
> > Components and Test Plans) has set a good example of QA/Testing.
> >
> > >  - How will this be tested, how will its QA status and lifecycle be
> > > defined? (per above)
> >
> > SAI will follow the same QA/testing guidelines as in CASSANDRA-15536.
> >
> > >  - With existing C* code needing to be changed, what is the proposed
> > > plan for making those changes ensuring maintained QA, e.g. are there
> > > separate QA cycles planned for altering the SPI before adding a new SPI
> > > implementation?
> >
> > The plan is to have interface changes and their new implementations
> > reviewed, tested, and merged together to reduce overhead.
> >
> > But if having interface changes reviewed/tested/merged separately helps
> > quality, I don't think anyone will object.
> >
> > > - Despite being out of scope, it would be nice to have some idea from
> > > the CEP author of when users might still choose afresh 2i or SASI over
> > > SAI
> >
> > I'd like SAI to be the only index for users, but this is a decision to be
> > made by the community.
> >
> > > - Who fills the roles involved?
> >
> > Contributors that are still active on C* or related projects:
> >
> > Andres de la Peña
> > Caleb Rackliffe
> > Dan LaRocque
> > Jason Rutherglen
> > Mike Adamson
> > Rocco Varela
> > Zhao Yang
> >
> > I will shepherd.
> >
> > Anyone interested in C* indexing is welcome to join us in the
> > #cassandra-sai Slack channel.
> >
> > > - Is there a preference to use gdoc instead of the project's wiki, and
> > > why? (the CEP process suggests a wiki page, and feedback on why another
> > > approach is considered better helps evolve the CEP process itself)
> >
> > I didn't notice that a wiki page is required. I will port the CEP to the
> > wiki.
> >
> >
> > On Tue, 18 Aug 2020 at 17:39, Mick Semb Wever <m...@apache.org> wrote:
> >
> > > >
> > > > We are looking forward to the community's feedback and suggestions.
> > > >
> > >
> > >
> > > What comes immediately to mind is testing requirements. It has been
> > > mentioned already that the project's testability and QA guidelines are
> > > inadequate to successfully introduce new features and refactorings to
> > > the codebase. During the 4.0 beta phase this was intended to be
> > > addressed, i.e. defining more specific QA guidelines for 4.0-rc. This
> > > would be an important step towards QA guidelines for all changes and
> > > CEPs post-4.0.
> > >
> > > Questions from me
> > >  - How will this be tested, how will its QA status and lifecycle be
> > > defined? (per above)
> > >  - With existing C* code needing to be changed, what is the proposed
> > > plan for making those changes ensuring maintained QA, e.g. are there
> > > separate QA cycles planned for altering the SPI before adding a new SPI
> > > implementation?
> > >  - Despite being out of scope, it would be nice to have some idea from
> > > the CEP author of when users might still choose afresh 2i or SASI over
> > > SAI.
> > >  - Who fills the roles involved? Who are the contributors in this
> > > DataStax team? Who is the shepherd? Are there other stakeholders willing
> > > to be involved?
> > >  - Is there a preference to use gdoc instead of the project's wiki, and
> > > why? (the CEP process suggests a wiki page, and feedback on why another
> > > approach is considered better helps evolve the CEP process itself)
> > >
> > > cheers,
> > > Mick
> > >
> >
>
