Hi Andy,

One option would be to take the original transaction data, enrich it with
the customer ID, and then send the result to a new topic, partitioned
appropriately. That would then allow you to do a join between that topic
and the ledger data (otherwise I don't see how you can join, given that the
transactions do not have the customer ID). It's worth mentioning that in
Kafka trunk the repartitioning happens automatically, while in 0.10.0.0 the
user needs to repartition topics manually.
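
Roughly, that option might look like this in the Streams DSL (a sketch
only, against the 0.10.0.0 API; the topic names, the Transaction type, its
accessors and transactionSerde are invented here for illustration):

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KeyValue;
    import org.apache.kafka.streams.kstream.KStream;
    import org.apache.kafka.streams.kstream.KStreamBuilder;
    import org.apache.kafka.streams.kstream.KTable;

    KStreamBuilder builder = new KStreamBuilder();

    // Mapping table (bank identifier -> customer ID), materialized from a
    // compacted topic and keyed the same way as the raw transactions.
    KTable<String, String> idMapping =
        builder.table(Serdes.String(), Serdes.String(), "id-mapping");

    // Raw bank transactions, keyed by the bank's identifier.
    KStream<String, Transaction> transactions =
        builder.stream(Serdes.String(), transactionSerde, "bank-transactions");

    // Enrich each transaction with its customer ID, re-key by that ID, and
    // write the result to a topic partitioned on customer ID. The map/through
    // pair is the manual repartition step that trunk performs automatically.
    transactions
        .leftJoin(idMapping, (txn, customerId) -> txn.withCustomerId(customerId))
        .map((bankId, txn) -> new KeyValue<>(txn.getCustomerId(), txn))
        .through(Serdes.String(), transactionSerde, "transactions-by-customer");

The stream that through() returns is co-partitioned with the ledger topics,
so the join against the ledger data becomes possible from there.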

Eno

Begin forwarded message:

From: Andy Chambers <achambers.h...@gmail.com>
Subject: Re: Partitioning at the edges
Date: 3 September 2016 at 17:57:53 BST
To: users@kafka.apache.org
Reply-To: users@kafka.apache.org

Hi Eno,

I'll try. We have a feed of transaction data from the bank, each record of
which we must try to associate with a customer in our system. Unfortunately
the transaction data doesn't include the customer-id itself, but rather a
variety of other identifiers that we can use to look up the customer-id in a
mapping table (which is also in Kafka).
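
(Roughly, since the mapping table is a compacted topic, replaying it from
the beginning reconstructs the identifier -> customer-id map. A sketch, with
the topic name and string types assumed:)

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");
    props.put("group.id", "customer-id-lookup");
    props.put("auto.offset.reset", "earliest");
    props.put("key.deserializer",
        "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("value.deserializer",
        "org.apache.kafka.common.serialization.StringDeserializer");

    // Replay the compacted mapping topic into an in-memory map; for a sketch
    // we stop at the first empty poll, i.e. once we have caught up.
    Map<String, String> idToCustomer = new HashMap<>();
    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
        consumer.subscribe(Collections.singletonList("customer-id-mapping"));
        ConsumerRecords<String, String> records;
        while (!(records = consumer.poll(1000)).isEmpty()) {
            for (ConsumerRecord<String, String> r : records) {
                idToCustomer.put(r.key(), r.value()); // last write wins
            }
        }
    }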

Once we have the customer-id, we produce some records to a "ledger-request"
topic (which is partitioned by customer-id). The exact sequence of records
produced depends on the type of transaction and on whether we were able to
find the customer-id, but if all goes well, the ledger will produce events
on a "ledger-result" topic for each request record.

This is where we have a problem. The ledger is the bit that has to be super
performant, so these topics must be highly partitioned on customer-id. But
then we'd like to join the results with the original bank transaction
(which, remember, doesn't even have a customer-id) so we can mark it as
handled or not when the result comes back.
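
For illustration only (nothing we have built, and the windowed-join
signature varies a bit across versions): if each ledger request carried the
original transaction id and the ledger echoed it back on the result, the
results could be re-keyed by that id so they end up co-partitioned with the
raw transactions, which are assumed here to be keyed by that same id.
Result, resultSerde and the accessors are hypothetical:

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KeyValue;
    import org.apache.kafka.streams.kstream.JoinWindows;
    import org.apache.kafka.streams.kstream.KStream;
    import org.apache.kafka.streams.kstream.KStreamBuilder;

    KStreamBuilder builder = new KStreamBuilder();

    // Re-key the ledger results by the original transaction id.
    KStream<String, Result> resultsByTxn =
        builder.stream(Serdes.String(), resultSerde, "ledger-result")
               .map((customerId, result) -> new KeyValue<>(result.getTxnId(), result))
               .through(Serdes.String(), resultSerde, "ledger-result-by-txn");

    KStream<String, Transaction> rawTxns =
        builder.stream(Serdes.String(), transactionSerde, "bank-transactions");

    // Windowed stream-stream join: mark each transaction handled (or not)
    // when its result arrives. Both topics need the same partition count.
    rawTxns.join(resultsByTxn,
                 (txn, result) -> txn.markHandled(result),
                 JoinWindows.of("txn-result-join").within(60 * 60 * 1000L))
           .to("transactions-with-status");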

Current plan is to just use a single partition for most topics, but then
"wrap" the ledger system with a process that re-partitions its input as
necessary for scale.
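
As a sketch, that wrapper is little more than a consume/re-key/produce loop
(the topic names and extractCustomerId helper are made up, and
consumerProps/producerProps are the usual client configs):

    import java.util.Collections;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    // Drain the single-partition input and re-emit each record keyed by
    // customer-id, so the default partitioner spreads it across the ledger's
    // many partitions.
    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
         KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
        consumer.subscribe(Collections.singletonList("ledger-request-unpartitioned"));
        while (true) {
            for (ConsumerRecord<String, String> record : consumer.poll(100)) {
                String customerId = extractCustomerId(record.value()); // hypothetical helper
                producer.send(new ProducerRecord<>("ledger-request", customerId, record.value()));
            }
        }
    }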

Cheers,
Andy

On Sat, Sep 3, 2016 at 6:13 AM, Eno Thereska <eno.there...@gmail.com> wrote:

Hi Andy,

Could you share a bit more info or pseudocode so that we can understand the
scenario a bit better, especially around the streams at the edges? How are
they created, and what is the join meant to do?

Thanks
Eno

On 3 Sep 2016, at 02:43, Andy Chambers <achambers.h...@gmail.com> wrote:

Hey Folks,

We are having quite a bit of trouble modelling the flow of data through a
very Kafka-centric system.

As I understand it, every stream you might want to join with another must
be partitioned the same way. But often streams at the edges of a system
*cannot* be partitioned the same way, because they don't have the partition
key yet (often the work for this process is to find the key in some lookup
table based on some other key we don't control).

We have come up with a few solutions, but everything seems to add
complexity and back our designs into a corner.

What is frustrating is that most of the data is not really that big, but we
have a handful of topics we expect to require a lot of throughput.

Is this just unavoidable complexity associated with scale, or am I thinking
about this in the wrong way? We're going all in on the "turning the
database inside out" architecture, but we end up spending more time
thinking about how stuff gets broken up into tasks and distributed than
about our business.

Do these problems seem familiar to anyone else? Did you find any patterns
that helped keep the complexity down?

Cheers
