[ https://issues.apache.org/jira/browse/CASSANDRASC-94?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17809735#comment-17809735 ]
Paulo Motta commented on CASSANDRASC-94:
----------------------------------------

Cool, thanks for clarifying! I can create a follow-up sidecar ticket if there's movement on CASSANDRA-18111.

> Reduce filesystem calls while streaming SSTables
> ------------------------------------------------
>
>                 Key: CASSANDRASC-94
>                 URL: https://issues.apache.org/jira/browse/CASSANDRASC-94
>             Project: Sidecar for Apache Cassandra
>          Issue Type: Improvement
>          Components: Configuration
>            Reporter: Francisco Guerrero
>            Assignee: Francisco Guerrero
>            Priority: Normal
>              Labels: pull-request-available
>
> When streaming snapshotted SSTables from Cassandra Sidecar, Sidecar
> performs multiple filesystem calls:
> - Traverse the data directories to determine the keyspace / table path
> - Once found, determine whether the SSTable file exists under the snapshots
> directory
> - Read the filesystem to obtain the file type and file size
> - Read the requested range of the file and stream it
> The number of filesystem calls is manageable when streaming a single SSTable,
> but when one or more clients read multiple SSTables, for example in the case
> of Cassandra Analytics bulk reads, hundreds to thousands of requests are
> performed, and every request repeats the filesystem calls above.
> This improvement proposes introducing several caches to reduce the
> number of filesystem calls while streaming SSTables.
> - *snapshot list cache*: maintains a cache of recently listed snapshot
> files under a snapshot directory. This cache avoids accessing the
> filesystem every time a bulk read client lists the snapshot directory.
> - *table dir cache*: maintains a cache of recently streamed table directory
> paths. This cache avoids traversing the filesystem in search of the table
> directory, for example while running bulk reads. Since bulk reads can
> stream tens to hundreds of SSTable components from a snapshot directory,
> this cache avoids resolving the table directory each time.
> - *snapshot path cache*: maintains a cache of recently streamed snapshot
> SSTable components. This cache avoids resolving the snapshot SSTable
> component path during bulk reads. Since bulk reads stream sub-ranges of an
> SSTable component, the resolution can happen multiple times during bulk
> reads of a single SSTable component.
> - *file props cache*: maintains a cache of the FileProps of recently
> streamed files. This cache avoids re-validating file properties during bulk
> reads, where sub-ranges of an SSTable component are streamed and the file
> properties of the same file would otherwise be read multiple times.
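
To illustrate the *file props cache* idea from the description above, here is a minimal sketch in Java. It is not Sidecar's implementation: the class `FilePropsCacheSketch`, its nested `Props` type (a stand-in for the `FileProps` named in the ticket), the size bound, and the read counter are all hypothetical names chosen for this example. It assumes a small LRU map is enough to show the effect: repeated sub-range requests for the same file trigger a single attribute read.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of a bounded file-properties cache; not Sidecar code.
final class FilePropsCacheSketch {

    // Hypothetical stand-in for the FileProps mentioned in the ticket.
    static final class Props {
        final boolean regularFile;
        final long size;

        Props(boolean regularFile, long size) {
            this.regularFile = regularFile;
            this.size = size;
        }
    }

    private final Map<Path, Props> lru;
    private int filesystemReads = 0;

    FilePropsCacheSketch(final int maxEntries) {
        // An access-ordered LinkedHashMap drops the least recently used
        // entry once the map grows past maxEntries.
        this.lru = new LinkedHashMap<Path, Props>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Path, Props> eldest) {
                return size() > maxEntries;
            }
        };
    }

    // Returns cached properties, touching the filesystem only on a miss.
    synchronized Props get(Path file) throws IOException {
        Props cached = lru.get(file);
        if (cached != null) {
            return cached;
        }
        BasicFileAttributes attrs =
            Files.readAttributes(file, BasicFileAttributes.class);
        filesystemReads++;
        Props props = new Props(attrs.isRegularFile(), attrs.size());
        lru.put(file, props);
        return props;
    }

    // Number of cache misses, i.e. actual attribute reads performed.
    synchronized int filesystemReads() {
        return filesystemReads;
    }
}
```

The sketch never expires entries, so it would serve stale properties if a snapshot were cleared while cached. A real cache would likely need time-based expiry or explicit invalidation in addition to the size bound.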