Hello Pablo, thank you for the response, and apologies for the delay. I had some work and also wanted to prove out what I was proposing with our own code at my workplace.
Here is a small gist of what I'm proposing.
https://gist.github.com/vmarquez/204b8f44b1279fdbae97b40f8681bc25

I'm happy to explain more or even write up an official design doc if you think that would be helpful explaining things.

--Vincent

On 2019/10/04 18:03:23, Pablo Estrada <[email protected]> wrote:
> Hi Vincent!
> Do you think you could add some code snippets / pseudocode as to what this
> looks like? Feel free to do it on email, gist, google doc, etc?
> Best
> -P.
>
> On Thu, Oct 3, 2019 at 4:16 PM Vincent Marquez <[email protected]>
> wrote:
>
> > Currently the CassandraIO connector allows a user to specify a table, and
> > the CassandraSource object generates a list of queries based on token
> > ranges of the table, along with grouping them by the token ranges.
> >
> > I often need to run (generated, sometimes a million+) queries against a
> > subset of a table. Instead of providing a filter, it is easier and much
> > more performant to supply a collection of queries along with their tokens
> > to both partition and group by, instead of letting CassandraIO naively run
> > over the entire table or with a simple filter.
> >
> > I propose in addition to the current method of supplying a table and
> > filter, also allowing the user to pass in a collection of queries and
> > tokens. The current way CassandraSource breaks up the table could be
> > modified to build on top of the proposed implementation to reduce code
> > duplication as well. If this sounds like an acceptable alternative way of
> > using the CassandraIO connector, I don't mind giving it a shot with a pull
> > request.
> >
> > If there is a better way of doing this, I'm eager to hear and learn.
> > Thanks for reading!
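
For anyone reading the thread without opening the gist, the grouping idea in the quoted proposal (supply queries together with their tokens, then partition and group by token range) can be sketched roughly as below. This is only an illustration under stated assumptions: it is not the code in the gist and it does not use the CassandraIO API. The names TokenGroupingSketch, TokenQuery, splitIndex and groupBySplit are hypothetical, the token values in main are made up rather than real hashes, and it assumes Murmur3Partitioner-style tokens covering the signed 64-bit range.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: group user-supplied queries by the token range they fall into.
final class TokenGroupingSketch {

    // Hypothetical pairing of a CQL statement with the token of its partition key.
    static final class TokenQuery {
        final String cql;
        final long token;

        TokenQuery(String cql, long token) {
            this.cql = cql;
            this.token = token;
        }
    }

    // Maps a token to one of numSplits equal slices of [Long.MIN_VALUE, Long.MAX_VALUE].
    // Assumption: tokens span the full signed 64-bit range (Murmur3Partitioner-style).
    static int splitIndex(long token, int numSplits) {
        if (numSplits <= 0) {
            throw new IllegalArgumentException("numSplits must be positive");
        }
        // Shift the token into 0 .. 2^64 - 1 so the arithmetic cannot overflow.
        BigInteger offset = BigInteger.valueOf(token).subtract(BigInteger.valueOf(Long.MIN_VALUE));
        BigInteger ringSize = BigInteger.ONE.shiftLeft(64);
        return offset.multiply(BigInteger.valueOf(numSplits)).divide(ringSize).intValue();
    }

    // Groups queries by slice, so each group could be handed to one worker as a unit.
    static Map<Integer, List<TokenQuery>> groupBySplit(List<TokenQuery> queries, int numSplits) {
        Map<Integer, List<TokenQuery>> groups = new TreeMap<>();
        for (TokenQuery q : queries) {
            groups.computeIfAbsent(splitIndex(q.token, numSplits), k -> new ArrayList<>()).add(q);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Made-up tokens for illustration; real tokens would come from the partition keys.
        List<TokenQuery> queries = List.of(
            new TokenQuery("SELECT * FROM ks.tbl WHERE id = 1", Long.MIN_VALUE),
            new TokenQuery("SELECT * FROM ks.tbl WHERE id = 2", -1L),
            new TokenQuery("SELECT * FROM ks.tbl WHERE id = 3", 0L),
            new TokenQuery("SELECT * FROM ks.tbl WHERE id = 4", Long.MAX_VALUE));

        groupBySplit(queries, 4).forEach((split, group) ->
            System.out.println("split " + split + " -> " + group.size() + " queries"));
    }
}
```

The point of the sketch is only that the queries, rather than the whole table, become the unit being split; how the groups would be turned into sources or bundles inside CassandraIO is left open here.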
