Hey everyone, First off, if this is the wrong place to ask these kinds of questions, please let me know. I tried on IRC but didn't get an answer within a few hours, so I'm trying here.
I had a couple of questions about implementing a table-to-table join with data coming from a database changelog through Kafka. Let's say I've got two tables, users and posts, in my primary DB, where the posts table has a user_id column. I've written a Samza job that joins those two tables together by storing every user record and the merged document in LevelDB, and then outputting the resulting document to the changelog Kafka topic (I've pasted a stripped-down sketch of the task at the end of this mail). Is this the right way to implement that kind of job?

It seems that even with a decent partitioning scheme, the LevelDB instance in each task will get quite large, especially if we're joining several tables that each have millions of rows (our real-world use case would be 7 or 8 tables, each with many millions of records).

Also, given a task that's processing that much data, where do you recommend running Samza? Should we spin up another set of boxes, or is it OK to run it on the Kafka brokers? (I heard it mentioned that this is how LinkedIn is running Samza.)

Thanks! David
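In case it makes the question clearer, here's roughly what the task looks like right now. It's a simplified sketch: the store names, topic names, and the Map-based message format are placeholders for what our changelog actually produces, and serdes/config are omitted.

import java.util.Map;

import org.apache.samza.config.Config;
import org.apache.samza.storage.kv.KeyValueStore;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.InitableTask;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskContext;
import org.apache.samza.task.TaskCoordinator;

/**
 * Joins the "users" and "posts" changelog topics, both partitioned by user_id.
 * Every user record and every merged document is kept in the task's
 * LevelDB-backed key-value stores.
 */
public class UserPostJoinTask implements StreamTask, InitableTask {

  private static final SystemStream OUTPUT = new SystemStream("kafka", "user-posts-joined");

  private KeyValueStore<String, Map<String, Object>> users;
  private KeyValueStore<String, Map<String, Object>> joined;

  @Override
  @SuppressWarnings("unchecked")
  public void init(Config config, TaskContext context) {
    // Stores are configured as LevelDB key-value stores in the job properties.
    users = (KeyValueStore<String, Map<String, Object>>) context.getStore("users-store");
    joined = (KeyValueStore<String, Map<String, Object>>) context.getStore("joined-store");
  }

  @Override
  @SuppressWarnings("unchecked")
  public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) {
    String stream = envelope.getSystemStreamPartition().getStream();
    Map<String, Object> row = (Map<String, Object>) envelope.getMessage();

    if ("users".equals(stream)) {
      // Remember the latest version of the user row.
      users.put(row.get("id").toString(), row);
      // (The real task also re-emits previously joined posts for this user here.)
    } else if ("posts".equals(stream)) {
      String userId = row.get("user_id").toString();
      String postId = row.get("id").toString();
      Map<String, Object> user = users.get(userId);
      if (user != null) {
        row.put("user", user);    // merge the user record into the post
        joined.put(postId, row);  // keep the merged doc so it can be re-emitted later
        collector.send(new OutgoingMessageEnvelope(OUTPUT, postId, row));
      }
    }
  }
}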
