That's a great suggestion too, Pedro!

Sounds like both are ultimately achieving the same thing. I just didn't know everything that was possible inside Kafka Streams ;). Thanks for sharing.

On 4/16/18 2:33 PM, Pedro Boado wrote:
I guess this thread is not about Kafka Streams, but what Josh suggested is basically my last-resort plan versus building it in Kafka Streams, as you'll be constrained by the HBase/Phoenix upsert rate (you'll be doing 5x the number of upserts).

In my experience Kafka Streams is not bad at all at this kind of join, either windowed or KTable-based. As long as you're under 100M rows per stream and have a few GB of disk space available per processing node, it should be doable.

On Mon, 16 Apr 2018, 18:49 Rabin Banerjee, <dev.rabin.baner...@gmail.com <mailto:dev.rabin.baner...@gmail.com>> wrote:

    Thanks Josh !

    On Mon, Apr 16, 2018 at 11:16 PM, Josh Elser <els...@apache.org
    <mailto:els...@apache.org>> wrote:

        Please keep communication on the mailing list.

        Remember that you can execute partial-row upserts with Phoenix.
        As long as you can generate the primary key from each stream,
        you don't need to do anything special in Kafka streams. You can
        just submit 5 UPSERTS (one for each stream), and the Phoenix
        table will eventually have the aggregated row when you are finished.
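
        A minimal sketch of that pattern (the table and column names here are hypothetical, just for illustration): each stream's consumer upserts only the columns it owns, keyed by the shared id, and Phoenix merges the partial writes into a single row.

```sql
-- Hypothetical summary table: one row per shared id,
-- one column per source stream.
CREATE TABLE IF NOT EXISTS EVENT_SUMMARY (
    ID        BIGINT NOT NULL PRIMARY KEY,
    S1_VALUE  VARCHAR,
    S2_VALUE  VARCHAR,
    S3_VALUE  VARCHAR,
    S4_VALUE  VARCHAR,
    S5_VALUE  VARCHAR
);

-- The consumer for stream 1 writes only its own column:
UPSERT INTO EVENT_SUMMARY (ID, S1_VALUE) VALUES (42, 'from-stream-1');

-- The consumer for stream 2 does the same, independently
-- and in any order relative to the other streams:
UPSERT INTO EVENT_SUMMARY (ID, S2_VALUE) VALUES (42, 'from-stream-2');

-- Once all five consumers have written for a given id,
-- selecting that row returns the fully denormalized record,
-- with no JOIN needed:
SELECT * FROM EVENT_SUMMARY WHERE ID = 42;
```

        Each UPSERT touches only the listed columns, so the writes don't clobber each other, and ordering between the five consumers doesn't matter.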

        On 4/16/18 1:30 PM, Rabin Banerjee wrote:

            Actually, I haven't finalised anything; I'm just looking at
            different options.

            Basically, I want to join 5 streams to create a denormalized
            stream. The problem is that if stream 1's output for the
            current window is keys 1,2,3,4,5, the other streams may have
            already emitted those keys in an earlier window, so I cannot
            join them with Kafka Streams without maintaining the whole
            state for all the streams. So I need to collect keys
            1,2,3,4,5 from all the streams and generate a combined
            record in as close to real time as possible.


            On Mon, Apr 16, 2018 at 9:04 PM, Josh Elser
            <els...@apache.org <mailto:els...@apache.org>> wrote:

                 Short answer: no.

                 You're going to be much better off denormalizing your
            five tables into one table and eliminating the need for
            this JOIN.

                 What made you decide to want to use Phoenix in the
            first place?


                 On 4/16/18 6:04 AM, Rabin Banerjee wrote:

                     Hi all,

                     I am new to Phoenix. If I have to join 5 huge
                     tables that are all keyed on the same id (i.e. one
                     id column is common to all of them), is there any
                     optimization that would make this join faster,
                     given that all the data for a particular key
                     across all 5 tables will reside on the same
                     region server?

                     To explain it a bit more: suppose we have 5
                     streams, all sharing a common id that we can join
                     on, being stored in 5 different HBase tables. We
                     want to join them with Phoenix, but we don't want
                     a cross-region shuffle, since we already know the
                     key is common to all 5 tables.


                     Thanks //


