Hey everyone, First off, if this is the wrong place to ask these kinds of questions, please let me know. I tried on IRC but didn't get an answer within a few hours, so I'm trying here.
I had a couple of questions about implementing a table-to-table join with data coming from a database changelog through Kafka. Let's say I've got two tables, users and posts, in my primary DB, where the posts table has a user_id column. I've written a Samza job that joins those two tables together by storing every user record and the merged document in LevelDB, and then outputting the resulting document to the changelog Kafka topic (I've pasted a stripped-down sketch of the task at the end of this mail). Is this the right way to implement that kind of job?

It seems that even with a decent partitioning scheme, the LevelDB instance in each task will get quite large, especially if we're joining several tables that each have millions of rows (our real-world use case would be 7 or 8 tables, each with many millions of records).

Also, given a task that's processing that much data, where do you recommend running Samza? Should we spin up another set of boxes, or is it OK to run it on the Kafka brokers? (I heard it mentioned that this is how LinkedIn is running Samza.)

Thanks! David
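In case it makes the question clearer, here's roughly what the task looks like right now. It's a simplified sketch: the store names, topic names, and the Map-based message format are placeholders for what our changelog actually produces, and serdes/config are omitted.

import java.util.Map;

import org.apache.samza.config.Config;
import org.apache.samza.storage.kv.KeyValueStore;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.InitableTask;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskContext;
import org.apache.samza.task.TaskCoordinator;

/**
 * Joins the "users" and "posts" changelog topics, both partitioned by user_id.
 * Every user record and every merged document is kept in the task's
 * LevelDB-backed key-value stores.
 */
public class UserPostJoinTask implements StreamTask, InitableTask {

  private static final SystemStream OUTPUT = new SystemStream("kafka", "user-posts-joined");

  private KeyValueStore<String, Map<String, Object>> users;
  private KeyValueStore<String, Map<String, Object>> joined;

  @Override
  @SuppressWarnings("unchecked")
  public void init(Config config, TaskContext context) {
    // Stores are configured as LevelDB key-value stores in the job properties.
    users = (KeyValueStore<String, Map<String, Object>>) context.getStore("users-store");
    joined = (KeyValueStore<String, Map<String, Object>>) context.getStore("joined-store");
  }

  @Override
  @SuppressWarnings("unchecked")
  public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) {
    String stream = envelope.getSystemStreamPartition().getStream();
    Map<String, Object> row = (Map<String, Object>) envelope.getMessage();

    if ("users".equals(stream)) {
      // Remember the latest version of the user row.
      users.put(row.get("id").toString(), row);
      // (The real task also re-emits previously joined posts for this user here.)
    } else if ("posts".equals(stream)) {
      String userId = row.get("user_id").toString();
      String postId = row.get("id").toString();
      Map<String, Object> user = users.get(userId);
      if (user != null) {
        row.put("user", user);    // merge the user record into the post
        joined.put(postId, row);  // keep the merged doc so it can be re-emitted later
        collector.send(new OutgoingMessageEnvelope(OUTPUT, postId, row));
      }
    }
  }
}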
