Hi Vincent,

I think it makes sense to have some sort of `readAll` for CassandraIO that can receive multiple queries and execute each of them. This would also be consistent with other IOs that we have, such as FileIOs. I suspect that doing this may require rearchitecting the whole IO from a BoundedSource-based one to a ParDo-based one, so it would be a large change, and we'd need to make sure that we don't lose scalability because of it.
Adding Ismael/JB/Etienne who've done a lot of the work on CassandraIO. Thoughts?
-P.

On Mon, Oct 14, 2019 at 3:32 PM Vincent Marquez <[email protected]> wrote:

> Hello Pablo, thank you for the response, and apologies for the delay. I
> had some work and also wanted to prove out what I was proposing with our
> own code at my workplace.
>
> Here is a small gist of what I'm proposing.
>
> https://gist.github.com/vmarquez/204b8f44b1279fdbae97b40f8681bc25
>
> I'm happy to explain more or even write up an official design doc if you
> think that would be helpful explaining things.
>
> --Vincent
>
> On 2019/10/04 18:03:23, Pablo Estrada <[email protected]> wrote:
> > Hi Vincent!
> > Do you think you could add some code snippets / pseudocode as to what this
> > looks like? Feel free to do it on email, gist, google doc, etc?
> > Best
> > -P.
> >
> > On Thu, Oct 3, 2019 at 4:16 PM Vincent Marquez <[email protected]> wrote:
> >
> > > Currently the CassandraIO connector allows a user to specify a table, and
> > > the CassandraSource object generates a list of queries based on token
> > > ranges of the table, along with grouping them by the token ranges.
> > >
> > > I often need to run (generated, sometimes a million+) queries against a
> > > subset of a table. Instead of providing a filter, it is easier and much
> > > more performant to supply a collection of queries along with their tokens
> > > to both partition and group by, instead of letting CassandraIO naively run
> > > over the entire table or with a simple filter.
> > >
> > > I propose in addition to the current method of supplying a table and
> > > filter, also allowing the user to pass in a collection of queries and
> > > tokens. The current way CassandraSource breaks up the table could be
> > > modified to build on top of the proposed implementation to reduce code
> > > duplication as well. If this sounds like an acceptable alternative way of
> > > using the CassandraIO connector, I don't mind giving it a shot with a pull
> > > request.
> > >
> > > If there is a better way of doing this, I'm eager to hear and learn.
> > > Thanks for reading!
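
For readers following the thread, the idea in the original message can be illustrated with a rough, plain-Java sketch. It shows (a) splitting the token ring into contiguous ranges and emitting one scan query per range, and (b) grouping user-supplied queries by the range their token falls into. This is not the code from the gist and not CassandraIO's implementation: the class `TokenRangeQueries`, its methods `split`, `scanQuery` and `groupByRange`, and the table and column names are all hypothetical, and the sketch assumes the Murmur3 partitioner, whose tokens span the full signed 64-bit range.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper for illustration only; not part of CassandraIO. */
final class TokenRangeQueries {

  /** A closed token range [start, end] (assumes Murmur3 tokens, i.e. signed 64-bit). */
  static final class TokenRange {
    final long start;
    final long end;

    TokenRange(long start, long end) {
      this.start = start;
      this.end = end;
    }

    boolean contains(long token) {
      return token >= start && token <= end;
    }
  }

  /** Splits [Long.MIN_VALUE, Long.MAX_VALUE] into n contiguous, non-overlapping ranges. */
  static List<TokenRange> split(int n) {
    if (n <= 0) {
      throw new IllegalArgumentException("n must be positive");
    }
    BigInteger min = BigInteger.valueOf(Long.MIN_VALUE);
    // The ring holds 2^64 tokens; BigInteger avoids overflowing long arithmetic.
    BigInteger step = BigInteger.ONE.shiftLeft(64).divide(BigInteger.valueOf(n));
    List<TokenRange> ranges = new ArrayList<>();
    for (int i = 0; i < n; i++) {
      long start = min.add(step.multiply(BigInteger.valueOf(i))).longValueExact();
      // The last range absorbs any remainder so the whole ring is covered.
      long end = (i == n - 1)
          ? Long.MAX_VALUE
          : min.add(step.multiply(BigInteger.valueOf(i + 1)))
              .subtract(BigInteger.ONE)
              .longValueExact();
      ranges.add(new TokenRange(start, end));
    }
    return ranges;
  }

  /** Builds a full-table scan query restricted to one token range. */
  static String scanQuery(String table, String partitionKey, TokenRange r) {
    return String.format(
        Locale.ROOT,
        "SELECT * FROM %s WHERE token(%s) >= %d AND token(%s) <= %d",
        table, partitionKey, r.start, partitionKey, r.end);
  }

  /** Groups caller-supplied queries by the index of the range containing their token. */
  static Map<Integer, List<String>> groupByRange(
      Map<String, Long> queryTokens, List<TokenRange> ranges) {
    Map<Integer, List<String>> grouped = new TreeMap<>();
    for (Map.Entry<String, Long> e : queryTokens.entrySet()) {
      for (int i = 0; i < ranges.size(); i++) {
        if (ranges.get(i).contains(e.getValue())) {
          grouped.computeIfAbsent(i, k -> new ArrayList<>()).add(e.getKey());
          break;
        }
      }
    }
    return grouped;
  }

  public static void main(String[] args) {
    List<TokenRange> ranges = split(4);
    // Table-scan mode: one query per token range.
    for (TokenRange r : ranges) {
      System.out.println(scanQuery("ks.events", "id", r));
    }
    // Query-collection mode: the caller supplies queries plus their tokens.
    Map<String, Long> queries = new LinkedHashMap<>();
    queries.put("SELECT * FROM ks.events WHERE id = 'a'", -5L);
    queries.put("SELECT * FROM ks.events WHERE id = 'b'", 10L);
    System.out.println(groupByRange(queries, ranges));
  }
}
```

In this sketch the table-scan case is just one particular way of producing queries, which is roughly the sense in which the existing splitting could sit on top of a query-collection API. A real implementation would also need to consider the cluster's actual token ownership and the choice of partitioner, which are ignored here.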
