Re: [DISCUSS] CEP-7 Storage Attached Index

Joshua McKenzie Tue, 25 Aug 2020 07:04:06 -0700

>
> Does the community plan to open another discussion or CEP on modularization?


We should probably have a discussion on the ML or monthly contrib call
about it first to see how aligned the interested contributors are. We could do
that through a CEP as well, but CEPs (at least thus far, sans the k8s operator)
tend to start with a strong, deeply thought-out point of view being
expressed.

On Tue, Aug 25, 2020 at 3:26 AM Jasonstack Zhao Yang <
[email protected]> wrote:

> >>> SASI's performance, specifically the search in the B+ tree component,
> >>> depends a lot on the component file's header being available in the
> >>> pagecache. SASI benefits from (needs) nodes with lots of RAM. Is SAI
> bound
> >>> to this same or similar limitation?
>
> SAI also benefits from more memory: it keeps block info on heap for
> searching on-disk components, and having cross-index files in the page cache
> improves read performance of different indexes on the same table.
>
>
> >>> Flushing of SASI can be CPU+IO intensive, to the point of saturation,
> >>> pauses, and crashes on the node. SSDs are a must, along with a bit of
> >>> tuning, just to avoid bringing down your cluster. Beyond reducing space
> >>> requirements, does SAI improve on these things? Like SASI how does SAI,
> in
> >>> its own way, change/narrow the recommendations on node hardware specs?
>
> SAI won't crash the node during compaction and requires less CPU/IO.
>
> * SAI defines a global memory limit for compaction instead of the per-index
>   memory limit used by SASI.
>   For example, if compactions are running on 10 tables and each has 10
>   indexes, SAI will cap memory usage at the global limit, while SASI may
>   use up to 100 * the per-index limit.
>
> * After flushing in-memory segments to disk, SAI won't merge on-disk
> segments while SASI
>   attempts to merge them at the end.
>
>   There are pros and cons of not merging segments:
>     ** Pros: compaction runs faster and requires fewer resources.
>     ** Cons: small segments reduce compression ratio.
>
> * SAI on-disk format with row ids compresses better.
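
The global versus per-index limit arithmetic in the first bullet can be sketched as follows. This is only an illustration, not SAI's or SASI's actual code: the function, the `GlobalBudget` class, and the 64 MiB and 1 GiB figures are hypothetical names and values chosen for the example.

```python
MIB = 1024 * 1024


def per_index_worst_case(tables: int, indexes_per_table: int, per_index_limit: int) -> int:
    """Worst-case memory when every index under compaction has its own budget."""
    return tables * indexes_per_table * per_index_limit


class GlobalBudget:
    """One shared pool that all index builders draw from (hypothetical sketch)."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.used = 0

    def try_reserve(self, nbytes: int) -> bool:
        """Reserve nbytes if the pool has room; False means the caller must flush first."""
        if self.used + nbytes > self.limit:
            return False
        self.used += nbytes
        return True

    def release(self, nbytes: int) -> None:
        """Return memory to the pool after a segment has been flushed."""
        self.used = max(0, self.used - nbytes)


# 10 tables x 10 indexes, each with an assumed 64 MiB per-index limit.
worst_case = per_index_worst_case(10, 10, 64 * MIB)

# The same 100 builders sharing an assumed 1 GiB global limit.
budget = GlobalBudget(1024 * MIB)
granted = sum(budget.try_reserve(64 * MIB) for _ in range(100))
```

With per-index budgets the worst case grows with the number of indexes being compacted; with a shared pool, builders that cannot reserve memory have to flush a segment before continuing, so total usage stays under the one limit.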
>
>
> >>> I understand the desire in keeping out of scope the longer term
> deprecation
> >>> and migration plan, but… if SASI provides functionality that SAI
> doesn't,
> >>> like tokenisation and DelimiterAnalyzer, yet introduces a body of code
> >>> ~somewhat similar, shouldn't we be roughly sketching out how to reduce
> the
> >>> maintenance surface area?
>
> Agreed that we should reduce the maintenance area if possible, but only a
> very limited part of the code base (e.g. RangeIterator, QueryPlan) can be
> shared. The rest of the code base is quite different because of the on-disk
> format and cross-index files.
>
> The goal of this CEP is to get community buy-in on SAI's design.
> Tokenization and DelimiterAnalyzer should be straightforward to implement on
> top of SAI.
>
> >>> Can we list what configurations of SASI will become deprecated once SAI
> >>> becomes non-experimental?
>
> Except for "Like", "Tokenisation", "DelimiterAnalyzer", the rest of SASI
> can
> be replaced by SAI.
>
> >>> Given a few bugs are open against 2i and SASI, can we provide some
> >>> overview, or rough indication, of how many of them we could "triage
> away"?
>
> I believe most of the known bugs in 2i/SASI either have been addressed in
> SAI or
> don't apply to SAI.
>
> >>> And, is it time for the project to start introducing new SPI
> >>> implementations as separate sub-modules and jar files that are only
> loaded
> >>> at runtime based on configuration settings? (sorry for the conflation
> on
> >>> this one, but maybe it's the right time to raise it :shrug:)
>
> Agreed that modularization is the way to go and will speed up module
> development.
>
> Does the community plan to open another discussion or CEP on modularization?
>
>
> On Mon, 24 Aug 2020 at 16:43, Mick Semb Wever <[email protected]> wrote:
>
> > Adding to Duy's questions…
> >
> >
> > * Hardware specs
> >
> > SASI's performance, specifically the search in the B+ tree component,
> > depends a lot on the component file's header being available in the
> > pagecache. SASI benefits from (needs) nodes with lots of RAM. Is SAI
> bound
> > to this same or similar limitation?
> >
> > Flushing of SASI can be CPU+IO intensive, to the point of saturation,
> > pauses, and crashes on the node. SSDs are a must, along with a bit of
> > tuning, just to avoid bringing down your cluster. Beyond reducing space
> > requirements, does SAI improve on these things? Like SASI how does SAI,
> in
> > its own way, change/narrow the recommendations on node hardware specs?
> >
> >
> > * Code Maintenance
> >
> > I understand the desire in keeping out of scope the longer term
> deprecation
> > and migration plan, but… if SASI provides functionality that SAI doesn't,
> > like tokenisation and DelimiterAnalyzer, yet introduces a body of code
> > ~somewhat similar, shouldn't we be roughly sketching out how to reduce
> the
> > maintenance surface area?
> >
> > Can we list what configurations of SASI will become deprecated once SAI
> > becomes non-experimental?
> >
> > Given a few bugs are open against 2i and SASI, can we provide some
> > overview, or rough indication, of how many of them we could "triage
> away"?
> >
> > And, is it time for the project to start introducing new SPI
> > implementations as separate sub-modules and jar files that are only
> loaded
> > at runtime based on configuration settings? (sorry for the conflation on
> > this one, but maybe it's the right time to raise it :shrug:)
> >
> > regards,
> > Mick
> >
> >
> > On Tue, 18 Aug 2020 at 13:05, DuyHai Doan <[email protected]> wrote:
> >
> > > Thank you Zhao Yang for starting this topic
> > >
> > > After reading the short design doc, I have a few questions
> > >
> > > 1) SASI was pretty inefficient at indexing wide partitions because the
> > > index structure only retains the partition token, not the clustering
> > > columns. As per the design doc, SAI has a row id mapping to partition
> > > offset; can we hope that indexing wide partitions will be more efficient
> > > with SAI? One detail that worries me is that at the beginning of the
> > > design doc, it is said that the matching rows are post-filtered while
> > > scanning the partition. Can you confirm or refute that SAI is efficient
> > > with wide partitions and provides the partition offsets to the matching
> > > rows?
> > >
> > > 2) About space efficiency, one of the biggest drawbacks of SASI was the
> > > huge space required for the index structure when using CONTAINS logic,
> > > because of the decomposition of text columns into n-grams. Will SAI
> > > suffer from the same issue in future iterations? I'm anticipating a bit.
> > >
> > > 3) If I'm querying using SAI and providing the complete partition key,
> > > will it be more efficient than querying without the partition key? In
> > > other words, does SAI provide any optimisation when the partition key is
> > > specified?
> > >
> > > Regards
> > >
> > > Duy Hai DOAN
> > >
> > > On Tue, 18 Aug 2020 at 11:39, Mick Semb Wever <[email protected]> wrote:
> > >
> > > > >
> > > > > We are looking forward to the community's feedback and suggestions.
> > > > >
> > > >
> > > >
> > > > What comes immediately to mind is testing requirements. It has been
> > > > mentioned already that the project's testability and QA guidelines
> are
> > > > inadequate to successfully introduce new features and refactorings to
> > the
> > > > codebase. During the 4.0 beta phase this was intended to be
> addressed,
> > > i.e.
> > > > defining more specific QA guidelines for 4.0-rc. This would be an
> > > important
> > > > step towards QA guidelines for all changes and CEPs post-4.0.
> > > >
> > > > Questions from me
> > > >  - How will this be tested, how will its QA status and lifecycle be
> > > > defined? (per above)
> > > >  - With existing C* code needing to be changed, what is the proposed
> > > > plan for making those changes while maintaining QA, e.g. are there
> > > > separate QA cycles planned for altering the SPI before adding a new SPI
> > > > implementation?
> > > >  - Despite being out of scope, it would be nice to have some idea from
> > > > the CEP author of when users might still choose afresh 2i or SASI over
> > > > SAI.
> > > >  - Who fills the roles involved? Who are the contributors in this
> > > DataStax
> > > > team? Who is the shepherd? Are there other stakeholders willing to be
> > > > involved?
> > > >  - Is there a preference to use gdoc instead of the project's wiki, and
> > > > why? (the CEP process suggests a wiki page, and feedback on why another
> > > > approach is considered better helps evolve the CEP process itself)
> > > >
> > > > cheers,
> > > > Mick
> > > >
> > >
> >
>
