Hi David,

I am also wondering why you are storing the merged document in local LevelDB. If you need to check for duplicates, how about using a Bloom filter for that instead?
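As a minimal sketch of what I mean, assuming Guava's BloomFilter (the key type, expected insertions, and false-positive rate below are placeholders you would tune for your actual data volume):

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;

public class DuplicateChecker {
    // Sizing is a placeholder: ~10M expected keys, ~1% false-positive rate.
    private final BloomFilter<CharSequence> seen = BloomFilter.create(
        Funnels.stringFunnel(StandardCharsets.UTF_8), 10_000_000, 0.01);

    /** Returns true the first time a key is observed (probabilistically). */
    public boolean isNew(String key) {
        if (seen.mightContain(key)) {
            return false; // possibly a duplicate; false positives can occur
        }
        seen.put(key);
        return true;
    }
}

Bear in mind that a Bloom filter can report false positives and this one lives only in memory, so it only fits if occasionally treating a genuinely new record as a duplicate is acceptable and the filter can be rebuilt on restart.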
Thanks,
Milinda

On Wed, Nov 19, 2014 at 8:50 AM, Martin Kleppmann <[email protected]> wrote:

> Hi David,
>
> On 19 Nov 2014, at 04:52, David Pick <[email protected]> wrote:
> > First off, if this is the wrong place to ask these kinds of questions,
> > please let me know. I tried in IRC but didn't get an answer within a
> > few hours, so I'm trying here.
>
> It's the right place, and they're good questions :)
>
> > I had a couple of questions around implementing a table-to-table join
> > with data coming from a database changelog through Kafka.
>
> Cool! Out of curiosity, can I ask what database you're using and how
> you're getting the changelog?
>
> > Let's say I've got two tables, users and posts, in my primary DB, where
> > the posts table has a user_id column. I've written a Samza job that
> > joins those two tables together by storing every user record and the
> > merged document in LevelDB, and then outputting the resulting document
> > to the changelog Kafka topic.
>
> Not sure exactly what you mean by a merged document here. Is it a list
> of all the posts by a particular user? Or post records augmented with
> user information? Or something else?
>
> > Is this the right way to implement that kind of job?
>
> On a high level, this sounds reasonable. One detail question is whether
> you're using the changelog feature for the LevelDB store. If the
> contents of the store can be rebuilt by reprocessing the input stream,
> you could get away with turning off the changelog on the store, and
> making the input stream a bootstrap stream instead. That will save some
> overhead on writes, but still give you fault tolerance.
>
> Another question is whether the values you're storing in LevelDB are
> several merged records together, or whether you store each record
> individually and use a range query to retrieve them if necessary. You
> could benchmark which works best for you.
>
> > It seems that even with a decent partitioning scheme, the LevelDB
> > instance in each task will get quite large, especially if we're
> > joining several tables together that have millions of rows (our
> > real-world use case would be 7 or 8 tables, each with many millions
> > of records).
>
> The goal of Samza's local key-value stores is to support a few gigabytes
> per partition. So millions of rows distributed across (say) tens of
> partitions should be fine, assuming the rows aren't huge. You might want
> to try the new RocksDB support, which may perform better than LevelDB.
>
> > Also, given a task that's processing that much data, where do you
> > recommend running Samza? Should we spin up another set of boxes, or is
> > it OK to run it on the Kafka brokers (I heard it mentioned that this
> > is how LinkedIn is running Samza)?
>
> IIRC LinkedIn saw some issues with the page cache when Kafka and LevelDB
> were on the same machine, which was resolved by putting them on
> different machines. But I'll let one of the LinkedIn folks comment on
> that.
>
> Best,
> Martin

--
Milinda Pathirage

PhD Student | Research Assistant
School of Informatics and Computing | Data to Insight Center
Indiana University

twitter: milindalakmal
skype: milinda.pathirage
blog: http://milinda.pathirage.org
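For readers of the archive, a minimal sketch of the kind of stream-table join task discussed above, against the Samza 0.8-era API. The system name, stream names, store name, and map-shaped JSON messages are illustrative assumptions, not David's actual job:

import java.util.Map;
import org.apache.samza.config.Config;
import org.apache.samza.storage.kv.KeyValueStore;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.InitableTask;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskContext;
import org.apache.samza.task.TaskCoordinator;

/**
 * Joins a "users" changelog stream with a "posts" changelog stream.
 * Both streams must be partitioned by user_id so that a user record
 * and that user's posts land in the same task instance.
 */
public class UserPostJoinTask implements StreamTask, InitableTask {
  private static final SystemStream OUTPUT =
      new SystemStream("kafka", "user-posts-joined");

  // Local store holding the latest user record per user_id.
  private KeyValueStore<String, Map<String, Object>> userStore;

  @Override
  @SuppressWarnings("unchecked")
  public void init(Config config, TaskContext context) {
    userStore = (KeyValueStore<String, Map<String, Object>>)
        context.getStore("user-store");
  }

  @Override
  @SuppressWarnings("unchecked")
  public void process(IncomingMessageEnvelope envelope,
                      MessageCollector collector,
                      TaskCoordinator coordinator) {
    String stream = envelope.getSystemStreamPartition().getStream();
    Map<String, Object> record = (Map<String, Object>) envelope.getMessage();

    if (stream.equals("users")) {
      // Remember the latest version of the user record.
      userStore.put((String) record.get("id"), record);
    } else if (stream.equals("posts")) {
      // Augment the post with the current user record, if we have it.
      Map<String, Object> user = userStore.get((String) record.get("user_id"));
      if (user != null) {
        record.put("user", user);
      }
      collector.send(new OutgoingMessageEnvelope(OUTPUT, record));
    }
  }
}

Note that this sketch only emits a joined document when a post arrives. A fully materialized join would also keep posts in a local store (e.g. keyed by user_id plus post_id) and re-emit joined documents when a user record changes, which is where the range query Martin mentions comes in.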
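And a sketch of the corresponding job configuration Martin describes: no store changelog, the users stream consumed as a bootstrap stream, and the RocksDB storage engine instead of LevelDB. Stream and store names match the sketch above and are equally hypothetical:

# Treat the users stream as a bootstrap stream: on startup, rewind it and
# consume it fully before processing any other input.
systems.kafka.streams.users.samza.bootstrap=true
systems.kafka.streams.users.samza.reset.offset=true
systems.kafka.streams.users.samza.offset.default=oldest

# Local key-value store, using the RocksDB engine. Note there is no
# stores.user-store.changelog entry: the store is rebuilt from the
# bootstrap stream on restart, saving the changelog write overhead.
stores.user-store.factory=org.apache.samza.storage.kv.RocksDbKeyValueStorageEngineFactory
stores.user-store.key.serde=string
stores.user-store.msg.serde=json

serializers.registry.string.class=org.apache.samza.serializers.StringSerdeFactory
serializers.registry.json.class=org.apache.samza.serializers.JsonSerdeFactory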
