Re: Salting based on partial rowkeys

2018-09-16 Thread Gerald Sangudi
Jaanai, Thomas,

Thanks for the feedback. My colleague or I will reply in this thread on the
dev list.

Gerald


On Thu, Sep 13, 2018 at 10:01 PM, Thomas D'Silva 
wrote:

> For the usage example that you provided, when you write data, how do the
> values of id_1, id_2 and other_key vary?
> I assume id_1 and id_2 remain the same while other_key is monotonically
> increasing, and that's why the table is salted.
> If you create the salt bucket only on id_2, then wouldn't you run into
> region server hotspotting during writes?


Re: Salting based on partial rowkeys

2018-09-14 Thread Sergey Soldatov
Thomas is absolutely right that there will be a possibility of hotspotting.
Salting is the mechanism that should prevent that in all cases (because all
rowkeys are different). The partitioning described above can actually be
implemented by using id_2 as the first column of the PK and presplitting by
values, without salting. The only difference is that with the suggested
approach we don't need to know the range of values for that particular
column(s). If we want to implement that (I remember several cases where
people asked how to presplit a table without knowing the range of values
for the PK columns, to improve bulk load into a new table without the
performance loss that salting causes for some queries), it would be better
to separate it from salting and call it 'partitioning' or something like
that.
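
As a rough illustration of the difference described above, the following
Python sketch contrasts presplitting by known values with hash-based
bucketing. It is only a sketch under stated assumptions: presplit_points,
region_by_range and region_by_hash are hypothetical helpers, the id-NNN
values are made up, and zlib.crc32 is a stand-in for whatever hash a real
implementation would use.

```python
import bisect
import zlib


def presplit_points(known_values, num_regions):
    """Pick num_regions - 1 boundaries from the known values (hypothetical helper)."""
    ordered = sorted(set(known_values))
    step = len(ordered) / num_regions
    return [ordered[int(step * i)] for i in range(1, num_regions)]


def region_by_range(value, split_points):
    """Range partitioning: a value equal to a split point starts the next region."""
    return bisect.bisect_right(split_points, value)


def region_by_hash(value, num_regions):
    """Hash partitioning: no knowledge of the value range is required.

    zlib.crc32 is an assumption for illustration, not Phoenix's hash.
    """
    return zlib.crc32(value.encode("utf-8")) % num_regions


# Presplitting needs the values (or at least their range) up front.
known_id2 = [f"id-{i:03d}" for i in range(100)]
splits = presplit_points(known_id2, 4)

print(splits)
print(region_by_range("id-010", splits), region_by_hash("id-010", 4))
```

Range presplitting needs a sample of the id_2 values to choose boundaries,
while the hash variant needs only the number of buckets.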

Thanks,
Sergey

On Thu, Sep 13, 2018 at 10:09 PM Thomas D'Silva 
wrote:

> For the usage example that you provided, when you write data, how do the
> values of id_1, id_2 and other_key vary?
> I assume id_1 and id_2 remain the same while other_key is monotonically
> increasing, and that's why the table is salted.
> If you create the salt bucket only on id_2, then wouldn't you run into
> region server hotspotting during writes?


Re: Salting based on partial rowkeys

2018-09-14 Thread Josh Elser

Yeah, I think that's his point :)

For a fine-grained facet, the hotspotting is desirable because it
co-locates the data for the query. Here is an example to drive the point home:


Consider a primary key constraint (col1, col2, col3, col4).

If I defined the SALT_HASH on col1 alone, you'd get terrible
hotspotting. In contrast, when we have SALT_HASH on col1, col2, col3,
and col4, we have no row-oriented data locality (we have to check *all*
salt buckets for every query).


If you define the SALT_HASH on col1, col2, and col3, all values for col4
where col1-3 are fixed are co-located, which would make faceted search
queries much faster (from one RPC per salt bucket down to a single RPC).


Concretely: if I'm on Amazon searching for "water bottle" "1L size" 
"plastic composition" (col1, col2, and col3), it's really fast to give 
me "manufacturer" (col4) given my other three constraints.
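
To illustrate the co-location argument, here is a minimal Python sketch.
It is not Phoenix code: salt_bucket is a hypothetical helper, NUM_BUCKETS
is an arbitrary choice, the maker-N values are made up, and zlib.crc32
stands in for the real salting hash.

```python
import zlib

NUM_BUCKETS = 8  # hypothetical SALT_BUCKETS value


def salt_bucket(key_parts, num_buckets=NUM_BUCKETS):
    """Map the chosen key columns to a bucket; crc32 is a stand-in hash."""
    joined = "\x00".join(key_parts).encode("utf-8")
    return zlib.crc32(joined) % num_buckets


# Rows share col1-col3 and differ only in col4 (the manufacturer).
rows = [("water bottle", "1L", "plastic", f"maker-{i}") for i in range(50)]

# Salting on the full key scatters the rows across buckets.
full_key_buckets = {salt_bucket(r) for r in rows}

# Salting on (col1, col2, col3) only puts every row in the same bucket.
partial_key_buckets = {salt_bucket(r[:3]) for r in rows}

print("buckets to scan with full-key salt:", len(full_key_buckets))
print("buckets to scan with partial-key salt:", len(partial_key_buckets))
```

With the partial-key salt, a query that fixes col1-3 can compute the single
bucket it needs; with the full-key salt it has to fan out to every bucket.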


Hopefully I'm getting this right too. Tell me to shut up, Gerald, if I'm 
not :)


On 9/14/18 1:01 AM, Thomas D'Silva wrote:
> For the usage example that you provided, when you write data, how do the
> values of id_1, id_2 and other_key vary?
> I assume id_1 and id_2 remain the same while other_key is monotonically
> increasing, and that's why the table is salted.
> If you create the salt bucket only on id_2, then wouldn't you run into
> region server hotspotting during writes?







Re: Salting based on partial rowkeys

2018-09-13 Thread Thomas D'Silva
For the usage example that you provided, when you write data, how do the
values of id_1, id_2 and other_key vary?
I assume id_1 and id_2 remain the same while other_key is monotonically
increasing, and that's why the table is salted.
If you create the salt bucket only on id_2, then wouldn't you run into
region server hotspotting during writes?

On Thu, Sep 13, 2018 at 8:02 PM, Jaanai Zhang 
wrote:

> Sorry, I don't understand your purpose. From your proposal, it seems
> this can't be achieved. What you need is hash partitioning. However, to
> clarify: HBase is a range-partitioned engine, and salt buckets are used to
> avoid hotspots. In other words, HBase as a storage engine can't support
> hash partitioning.


Re: Salting based on partial rowkeys

2018-09-13 Thread Jaanai Zhang
Sorry, I don't understand your purpose. From your proposal, it seems
this can't be achieved. What you need is hash partitioning. However, to
clarify: HBase is a range-partitioned engine, and salt buckets are used to
avoid hotspots. In other words, HBase as a storage engine can't support
hash partitioning.
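
To make the relationship between salting and range partitioning concrete,
here is a minimal Python sketch. It assumes the salt is a single leading
byte derived from a hash of the rowkey; salted_key is a hypothetical
helper, NUM_BUCKETS is arbitrary, and zlib.crc32 is a stand-in hash rather
than the one Phoenix uses.

```python
import zlib
from collections import Counter

NUM_BUCKETS = 4  # hypothetical SALT_BUCKETS value


def salted_key(rowkey: bytes, num_buckets: int = NUM_BUCKETS) -> bytes:
    """Prepend one hash-derived byte so the sort order starts with the bucket."""
    bucket = zlib.crc32(rowkey) % num_buckets  # stand-in hash (assumption)
    return bytes([bucket]) + rowkey


# Monotonically increasing rowkeys would all land at the tail of one range.
keys = [f"{i:08d}".encode("ascii") for i in range(1000)]

# With the salt byte in front, a range-partitioned store sees NUM_BUCKETS
# separate key ranges, and the writes are spread across them.
writes_per_bucket = Counter(salted_key(k)[0] for k in keys)

for bucket in sorted(writes_per_bucket):
    print(bucket, writes_per_bucket[bucket])
```

The store is still range partitioned; the salt byte only changes which
range each key sorts into.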


   Jaanai Zhang
   Best regards!



Gerald Sangudi wrote on Thu, Sep 13, 2018 at 11:32 PM:

> Hi folks,
>
> Any thoughts or feedback on this?
>
> Thanks,
> Gerald


Re: Salting based on partial rowkeys

2018-09-13 Thread Gerald Sangudi
Hi folks,

Any thoughts or feedback on this?

Thanks,
Gerald

On Mon, Sep 10, 2018 at 1:56 PM, Gerald Sangudi 
wrote:

> Hello folks,
>
> We have a requirement for salting based on partial, rather than full,
> rowkeys. My colleague Mike Polcari has identified the requirement and
> proposed an approach.
>
> I found an already-open JIRA ticket for the same issue:
> https://issues.apache.org/jira/browse/PHOENIX-4757. I can provide more
> details from the proposal.
>
> The JIRA proposes a syntax of SALT_BUCKETS(col, ...) = N, whereas Mike
> proposes SALT_COLUMN=col or SALT_COLUMNS=col, ... .
>
> The benefit at issue is that users gain more control over partitioning,
> and this can be used to push some additional aggregations and hash joins
> down to region servers.
>
> I would appreciate any go-ahead / thoughts / guidance / objections /
> feedback. I'd like to be sure that the concept at least is not
> objectionable. We would like to work on this and submit a patch down the
> road. I'll also add a note to the JIRA ticket.
>
> Thanks,
> Gerald
>
>


Salting based on partial rowkeys

2018-09-10 Thread Gerald Sangudi
Hello folks,

We have a requirement for salting based on partial, rather than full,
rowkeys. My colleague Mike Polcari has identified the requirement and
proposed an approach.

I found an already-open JIRA ticket for the same issue:
https://issues.apache.org/jira/browse/PHOENIX-4757. I can provide more
details from the proposal.

The JIRA proposes a syntax of SALT_BUCKETS(col, ...) = N, whereas Mike
proposes SALT_COLUMN=col or SALT_COLUMNS=col, ... .

The benefit at issue is that users gain more control over partitioning, and
this can be used to push some additional aggregations and hash joins down
to region servers.
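
As a rough sketch of the aggregation pushdown, the Python below groups rows
by a bucket computed from a single column and aggregates inside each
bucket. Everything in it is an assumption for illustration: bucket_for is
a hypothetical helper, the (id_2, amount) rows are made up, and zlib.crc32
is a stand-in for the real salting hash.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical SALT_BUCKETS value


def bucket_for(value, num_buckets=NUM_BUCKETS):
    """Bucket chosen from the salt column only; crc32 is a stand-in hash."""
    return zlib.crc32(value.encode("utf-8")) % num_buckets


# Made-up rows of (id_2, amount); id_2 plays the role of the salt column.
rows = [(f"user-{i % 10}", i) for i in range(100)]

# Each bucket computes SUM(amount) GROUP BY id_2 over its own rows only.
per_bucket = defaultdict(lambda: defaultdict(int))
for id_2, amount in rows:
    per_bucket[bucket_for(id_2)][id_2] += amount

# Because a given id_2 always hashes to the same bucket, the partial results
# have disjoint group keys and can be concatenated without a merge step.
merged = {}
for partial in per_bucket.values():
    overlap = merged.keys() & partial.keys()
    if overlap:
        raise ValueError(f"group keys span buckets: {overlap}")
    merged.update(partial)

print(merged["user-0"])
```

With full-rowkey salting the same group key could appear in several
buckets, so the per-bucket results would have to be combined afterwards.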

I would appreciate any go-ahead / thoughts / guidance / objections /
feedback. I'd like to be sure that the concept at least is not
objectionable. We would like to work on this and submit a patch down the
road. I'll also add a note to the JIRA ticket.

Thanks,
Gerald