Re: Optimisation on join in case of all the data to be joined present in the same machine (region server)

2018-04-16 Thread Josh Elser

Short answer: no.

You're going to be much better off de-normalizing your five tables into 
one table and eliminating the need for this JOIN.
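
For illustration, a minimal sketch of what that denormalized table could 
look like through the Phoenix JDBC driver. The table name (EVENTS), the 
per-stream column families (S1..S5), and the connection URL are all 
hypothetical:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class CreateDenormalizedTable {
        public static void main(String[] args) throws Exception {
            // Hypothetical connection string; point it at your ZK quorum.
            try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost:2181");
                 Statement stmt = conn.createStatement()) {
                // One wide table keyed on the shared id, with one column
                // family per original stream, instead of five tables and
                // a 5-way JOIN.
                stmt.execute(
                    "CREATE TABLE IF NOT EXISTS EVENTS ("
                    + " ID VARCHAR NOT NULL PRIMARY KEY,"
                    + " S1.VAL VARCHAR,"  // data formerly in table 1
                    + " S2.VAL VARCHAR,"
                    + " S3.VAL VARCHAR,"
                    + " S4.VAL VARCHAR,"
                    + " S5.VAL VARCHAR)");
                conn.commit();
            }
        }
    }

Reads then become a point lookup on ID instead of a join across five 
tables.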


What made you want to use Phoenix in the first place?

On 4/16/18 6:04 AM, Rabin Banerjee wrote:

Hi all,

I am new to Phoenix. If I have to join 5 huge tables that are all keyed 
on the same id (i.e. one id column is common to all of them), is there 
any optimization that would make this join faster, given that all the 
data for a particular key across all 5 tables will reside on the same 
region server?


To explain a bit more: suppose we have 5 streams, all sharing a common 
id that we can join on, which are being stored in 5 different HBase 
tables. We want to join them with Phoenix, but we don't want a 
cross-region shuffle since we already know the key is common to all 
5 tables.



Thanks //


Re: Optimisation on join in case of all the data to be joined present in the same machine (region server)

2018-04-16 Thread Josh Elser

Please keep communication on the mailing list.

Remember that you can execute partial-row upserts with Phoenix. As long 
as you can generate the primary key from each stream, you don't need to 
do anything special in Kafka Streams. You can just submit 5 UPSERTs (one 
for each stream), and the Phoenix table will eventually have the 
aggregated row when you are finished.
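
A rough sketch of that pattern, reusing the hypothetical EVENTS table 
from earlier in the thread; each stream's consumer writes only its own 
column, always against the same row key:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    public class PartialRowUpsert {
        // Called by the consumer of one stream; 'column' would be one of
        // S1.VAL .. S5.VAL. Names are hypothetical, matching the sketch above.
        static void upsert(Connection conn, String column, String id, String value)
                throws Exception {
            String sql = "UPSERT INTO EVENTS (ID, " + column + ") VALUES (?, ?)";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setString(1, id);
                ps.setString(2, value);
                ps.executeUpdate();
            }
            conn.commit(); // Phoenix buffers mutations until commit
        }

        public static void main(String[] args) throws Exception {
            try (Connection conn =
                     DriverManager.getConnection("jdbc:phoenix:localhost:2181")) {
                // Five independent partial writes to the same row key...
                upsert(conn, "S1.VAL", "id-1", "from-stream-1");
                upsert(conn, "S2.VAL", "id-1", "from-stream-2");
                // ...and the row for 'id-1' ends up fully populated, no JOIN needed.
            }
        }
    }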


On 4/16/18 1:30 PM, Rabin Banerjee wrote:

Actually I haven't finalised anything; I'm just looking at different options.

Basically, I want to join 5 streams and create a denormalized stream. 
The problem is that if Stream 1's output for the current window is keys 
1,2,3,4,5, it may be that the other streams already emitted those keys 
in an earlier window, so I cannot join them with Kafka Streams without 
maintaining the whole state for all the streams. So I need to look up 
keys 1,2,3,4,5 across all the streams and generate a combined record in 
as close to real time as possible.




Re: Optimisation on join in case of all the data to be joined present in the same machine (region server)

2018-04-16 Thread Rabin Banerjee
Thanks, Josh!



Re: Optimisation on join in case of all the data to be joined present in the same machine (region server)

2018-04-16 Thread Pedro Boado
I guess this thread is not about Kafka Streams, but what Josh suggested 
is basically my last-resort plan when building Kafka Streams 
applications, as you'll be constrained by the HBase/Phoenix upsert rate 
(you'll be doing 5x the number of upserts).

In my experience Kafka Streams is not bad at all at this kind of join, 
either windowed or based on KTables. As long as you're under ~100M rows 
per stream and have a few GB of disk space available per processing 
node, it should be doable.
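
For reference, a minimal sketch of the KTable-based variant Pedro 
describes, shown with three streams for brevity; the topic names, 
default String serdes, and the surrounding configuration are assumptions:

    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.kstream.KTable;

    public class KTableJoinSketch {
        public static void main(String[] args) {
            StreamsBuilder builder = new StreamsBuilder();

            // Each input topic must be keyed on the shared id (and
            // co-partitioned). table() materializes the latest value per key
            // in local RocksDB state, which is where the "few GB of disk per
            // node" requirement comes from.
            KTable<String, String> t1 = builder.table("stream1");
            KTable<String, String> t2 = builder.table("stream2");
            KTable<String, String> t3 = builder.table("stream3");

            // KTable-KTable joins re-emit whenever either side updates, so a
            // key that arrived on one stream long before the others still
            // produces a combined record once the last stream catches up.
            KTable<String, String> joined = t1
                    .join(t2, (v1, v2) -> v1 + "|" + v2)
                    .join(t3, (v12, v3) -> v12 + "|" + v3);

            joined.toStream().to("combined");
            // new KafkaStreams(builder.build(), props).start();  // props omitted
        }
    }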



Re: Optimisation on join in case of all the data to be joined present in the same machine (region server)

2018-04-16 Thread Josh Elser

That's a great suggestion too, Pedro!

Sounds like both are ultimately achieving the same thing. I just didn't 
know what all was possible inside of Kafka Streams ;). Thanks for sharing.

