[ https://issues.apache.org/jira/browse/BEAM-9008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ismaël Mejía updated BEAM-9008:
-------------------------------
    Summary: Add readAll() method to CassandraIO  (was: add readAll() method to CassandraIO)

> Add readAll() method to CassandraIO
> -----------------------------------
>
>                 Key: BEAM-9008
>                 URL: https://issues.apache.org/jira/browse/BEAM-9008
>             Project: Beam
>          Issue Type: New Feature
>          Components: io-java-cassandra
>    Affects Versions: 2.16.0
>            Reporter: vincent marquez
>            Assignee: vincent marquez
>            Priority: Minor
>          Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> When querying a large Cassandra database, it's often *much* more useful to
> programmatically generate the queries that need to be run rather than reading
> all partitions and attempting some filtering.
> As an example:
> {code:java}
> public class Event {
>   @PartitionKey(0) public UUID accountId;
>   @PartitionKey(1) public String yearMonthDay;
>   @ClusteringColumn public UUID eventId;
>   // other data...
> }{code}
> If there is ten years' worth of data, you may want to query only one year's
> worth. Here each token range would represent a single 'token' but would cover
> all events for that day.
> {code:java}
> Set<UUID> accounts = getRelevantAccounts();
> Set<String> dateRange = generateDateRange("2018-01-01", "2019-01-01");
> PCollection<TokenRange> tokens = generateTokens(accounts, dateRange);
> {code}
>
> I propose an additional _readAll()_ PTransform that takes a PCollection
> of token ranges and returns a PCollection<T> of what the queries would
> return.
> *Question: How much code should be in common between both methods?*
> Currently the read connector already groups all partitions into a List of
> Token Ranges, so it would be simple to refactor the current read()-based
> method into a 'ParDo'-based one and have them both share the same function.
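The example in the description calls generateDateRange(...) without defining it. A minimal sketch of what such a helper might look like follows, assuming a half-open range [start, end) of ISO yyyy-MM-dd strings. The class name PartitionKeySketch, the EventPartition holder and partitionsToQuery are hypothetical names introduced only for illustration. Turning each (accountId, yearMonthDay) pair into a TokenRange is left out, since that depends on the cluster's partitioner.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.UUID;

public class PartitionKeySketch {

  /** Hypothetical holder for the two partition key columns of Event. */
  public static final class EventPartition {
    public final UUID accountId;
    public final String yearMonthDay;

    public EventPartition(UUID accountId, String yearMonthDay) {
      this.accountId = accountId;
      this.yearMonthDay = yearMonthDay;
    }
  }

  /** Assumed semantics: every day in [startInclusive, endExclusive), as yyyy-MM-dd. */
  public static Set<String> generateDateRange(String startInclusive, String endExclusive) {
    LocalDate end = LocalDate.parse(endExclusive);
    Set<String> days = new LinkedHashSet<>();
    for (LocalDate d = LocalDate.parse(startInclusive); d.isBefore(end); d = d.plusDays(1)) {
      days.add(d.toString()); // LocalDate.toString() yields ISO-8601 yyyy-MM-dd
    }
    return days;
  }

  /** One entry per (account, day) pair, i.e. one Cassandra partition each. */
  public static List<EventPartition> partitionsToQuery(Set<UUID> accounts, Set<String> days) {
    List<EventPartition> partitions = new ArrayList<>(accounts.size() * days.size());
    for (UUID account : accounts) {
      for (String day : days) {
        partitions.add(new EventPartition(account, day));
      }
    }
    return partitions;
  }
}
```

Under these assumptions, one year of data for N accounts yields 365 * N partitions to query, rather than a scan over every partition in the table.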
> Reasons against sharing code between read and readAll:
> * Not having the read()-based method return a BoundedSource connector would
> mean losing the ability to know the size of the data returned.
> * Currently the CassandraReader executes all the grouped TokenRange queries
> *asynchronously*, which is (maybe?) fine when all that's happening is
> splitting up all the partition ranges, but terrible for executing potentially
> millions of queries.
> Reasons _for_ sharing code would be a simplified code base, and that both of
> the above issues would most likely have a negligible performance impact.
>
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)