Hey Martin,

Thanks for giving this so much thought!
We were also talking about a multi-stage solution similar to what you suggested. While I think it would work, it would be a lot of overhead given that we have a few hundred tables in our database (we don't care about all of them for this application, but a generic solution is appealing).

I'm also curious what typical Kafka topic setups look like for other people pushing database changes through it. Is every table its own topic? Is something like user_id denormalized onto every table so that they can all be partitioned evenly? How many topics can the average cluster handle? (If we end up with a lot of Samza jobs that are all producing Kafka topics, does it fall over at some point?) Are there any good resources for that kind of information?

David

On Thu, Nov 20, 2014 at 6:48 PM, Martin Kleppmann <[email protected]> wrote:

> On 19 Nov 2014, at 20:21, David Pick <[email protected]> wrote:
> > Right now it's hundreds of gig. Like I mentioned before we'd love to
> > partition our data better, but because we don't have a common join key on
> > every table we haven't really come up with a good scheme for doing this.
>
> Your many-to-many join example is interesting. I think you could probably
> do it as a multi-stage pipeline. Say you want to generate a list of all the
> transactions for a particular customer. Your input streams are:
>
> * Customers, partitioned by customer_id
> * Customer_Transactions, partitioned by customer_id
> * Transactions, partitioned by transaction_id
>
> The data flow is:
>
> 1. Join Customers and Customer_Transactions on customer_id, and emit
>    messages of the form (Customer, transaction_id), partitioned by
>    transaction_id.
> 2. Join output of step 1 with Transactions on transaction_id, and emit
>    messages of the form (Customer, Transaction), partitioned by customer_id.
> 3. Consume output of step 2, and group them by customer_id to get all the
>    transactions for one customer in one place.
> That may seem a bit convoluted, but it might actually perform quite well.
> At least it will allow you to break the data into many partitions. (I'm
> hoping we can build higher-level tools in future which will abstract away
> this complexity.)
>
> > This is great, though I think ideally what we would want is to have a
> > kafka topic per database table, and a Samza task that could consume all
> > of one topic for a table that's fairly small (e.g. customers) and a
> > single partition of a big table (e.g. posts).
>
> This may be even closer to what you need:
> https://issues.apache.org/jira/browse/SAMZA-402
>
> Best,
> Martin
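P.S. For anyone following along, here is a rough in-memory sketch of the three-stage repartitioning join Martin describes above. The sample data and dictionary shapes are made up for illustration; real Samza jobs would consume and produce Kafka topics keyed by the partition key at each stage, but the data flow is the same:

```python
from collections import defaultdict

# Hypothetical sample inputs (illustrative only).
customers = {1: {"customer_id": 1, "name": "Alice"},
             2: {"customer_id": 2, "name": "Bob"}}

# Customer_Transactions: (customer_id, transaction_id) link table.
customer_transactions = [(1, 101), (1, 102), (2, 103)]

transactions = {101: {"transaction_id": 101, "amount": 25.0},
                102: {"transaction_id": 102, "amount": 40.0},
                103: {"transaction_id": 103, "amount": 10.0}}

# Stage 1: join Customers and Customer_Transactions on customer_id,
# emit (Customer, transaction_id) messages keyed by transaction_id.
stage1 = [(txn_id, customers[cust_id])
          for cust_id, txn_id in customer_transactions]

# Stage 2: join stage-1 output with Transactions on transaction_id,
# emit (Customer, Transaction) messages keyed by customer_id.
stage2 = [(cust["customer_id"], (cust, transactions[txn_id]))
          for txn_id, cust in stage1]

# Stage 3: group by customer_id, collecting each customer's transactions.
by_customer = defaultdict(list)
for cust_id, (cust, txn) in stage2:
    by_customer[cust_id].append(txn)

print({cid: [t["transaction_id"] for t in txns]
       for cid, txns in by_customer.items()})
# → {1: [101, 102], 2: [103]}
```

The repartitioning happens at the key chosen for each emitted message: stage 1 re-keys by transaction_id so the Transactions join is co-partitioned, and stage 2 re-keys back by customer_id so the final grouping sees all of one customer's rows in one place.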
