Currently the CassandraIO connector lets a user specify a table, and the
CassandraSource object generates a list of queries from the table's token
ranges and groups those queries by token range.

I often need to run generated queries (sometimes more than a million)
against a subset of a table.  For this it is easier and much more
performant to supply a collection of queries along with their tokens, which
are used to both partition and group the queries, than to let CassandraIO
naively run over the entire table or apply a simple filter.

I propose that, in addition to the current method of supplying a table and
filter, the user also be allowed to pass in a collection of queries and
tokens.  The way CassandraSource currently breaks up the table could then
be rebuilt on top of the proposed implementation, which would also reduce
code duplication.  If this sounds like an acceptable alternative way of
using the CassandraIO connector, I don't mind giving it a shot with a pull
request.
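
To make the idea concrete, here is a rough sketch of the grouping step in
plain Java, with no Beam or Cassandra driver dependencies.  Every name in it
(QuerySplitSketch, TokenQuery, groupByTokenRange, splitIndex) is made up for
illustration and is not part of CassandraIO.  It also assumes the cluster
uses Murmur3Partitioner, whose tokens span -2^63 to 2^63-1, and that the
caller already knows the token for each query.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch only; none of these types exist in CassandraIO.
final class QuerySplitSketch {

  /** A user-supplied CQL query plus the token of the partition it reads. */
  static final class TokenQuery {
    final String cql;
    final long token;

    TokenQuery(String cql, long token) {
      this.cql = cql;
      this.token = token;
    }
  }

  /**
   * Groups queries into numSplits equal, contiguous slices of the token ring,
   * so that queries hitting nearby tokens end up in the same group.
   */
  static Map<Integer, List<String>> groupByTokenRange(
      List<TokenQuery> queries, int numSplits) {
    Map<Integer, List<String>> groups = new TreeMap<>();
    for (TokenQuery q : queries) {
      int index = splitIndex(q.token, numSplits);
      groups.computeIfAbsent(index, k -> new ArrayList<>()).add(q.cql);
    }
    return groups;
  }

  /** Maps a token in [-2^63, 2^63-1] to a slice index in [0, numSplits). */
  static int splitIndex(long token, int numSplits) {
    if (numSplits < 1) {
      throw new IllegalArgumentException("numSplits must be at least 1");
    }
    // Shift the token so the ring starts at 0 and ends at 2^64 - 1.
    BigInteger offset =
        BigInteger.valueOf(token).subtract(BigInteger.valueOf(Long.MIN_VALUE));
    // floor(offset * numSplits / 2^64)
    return offset.multiply(BigInteger.valueOf(numSplits)).shiftRight(64).intValue();
  }
}
```

Each resulting group could then back one source split, in the same spirit
as the per-token-range grouping CassandraSource does today.  Equal-width
slices are a simplification; a real implementation would more likely use the
token ranges reported by the cluster.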

If there is a better way of doing this, I'm eager to hear and learn.
Thanks for reading!
