Thank you all.
1.
The other query is:
select keyword, sum(cnt) as count
from my_table
where regymdt >= to_timestamp('2017-04-03 00:00') and regymdt <=
to_timestamp('2017-04-03 23:00')
and *obj_id in ( obj_ids comes here... )*
and obj_grp = '100'
group by keyword
order by count desc;
This query is pretty fast because of the first two PKs.
The reason obj_id comes at second PK is, we first assumed 'select with
obj_id' will be called much more often, and then 'select with obj_grp' is
also required.
2.
4.8.0-HBase-1.1 version is used.
I'm testing on HBase server for our product, so that upgrading looks not
easy. : (
If there definitely exist index bug fixes, I could try though.
3.
With Jonathan's reply and optional PK, this schema looks worthy to try:
create table my_table (
obj_grp varchar(50),
obj_id varchar(50) not null,
regymdt timestamp not null,
keyword varchar not null,
cnt integer,
post_tp varchar(20)
constraint pk primary key (obj_grp, obj_id, regymdt, keyword)
)
Thanks,
NaHeon
2017-04-05 15:17 GMT+09:00 Jonathan Leech :
> Also, your schema as defined seems to have a pretty common hbase
> anti-pattern, using a steadily increasing value as the first part of the
> rowkey (assuming your data is also loaded in order of time). This will lead
> to region hotspotting on the load, region splitting, etc. Additionally,
> your queries may not take advantage of the parallelism of the cluster if
> they end up on a small number of regions. If the queries are more likely to
> be on the most recent data, as is often the case, you'll end up with a
> single server doing the bulk the work in the cluster.
>
> - Jonathan
>
> On Apr 4, 2017, at 10:10 PM, James Taylor wrote:
>
> What other queries do you want to be fast? What version of Phoenix are you
> using? Have you seen our new Tuning Guide [1]?
>
> You could try moving OBJ_ID to end of PK and adding OBJ_GRP as the
> leading PK column (it's fine to include it in PK even if it's optional).
>
> Your index should be used, though. Can you try doing an explain on the
> query with 4.10 and file a JIRA if the index isn't being used?
>
> Thanks,
> James
>
> [1] http://phoenix.apache.org/tuning_guide.html
>
> On Tue, Apr 4, 2017 at 8:01 PM NaHeon Kim wrote:
>
>> Hi all,
>>
>> My team has a Phoenix table containing over 30,000,000 rows and try to
>> speed up its query performance.
>> It'll be perfect if all queries could be done within 1~2 seconds.
>>
>> - table schema -
>>
>> CREATE TABLE MY_TABLE (
>>REGYMDT timestamp not null,
>>OBJ_ID varchar(50) not null,
>>KEYWORD varchar not null,
>>OBJ_GRP varchar(50),
>>CNT integer,
>>POST_TP varchar(20),
>>CONSTRAINT PK PRIMARY KEY (REGYMDT, OBJ_ID, KEYWORD)
>> ) IMMUTABLE_ROWS = true;
>>
>> create index objgrp_reg_kwd on my_table(obj_grp, regymdt, keyword)
>> include (cnt);
>>
>> - tuning-needed query -
>>
>> select keyword, sum(cnt) as count
>> from my_table
>> where regymdt >= to_timestamp('2017-04-03 00:00') and regymdt <=
>> to_timestamp('2017-04-03 23:00')
>> and obj_grp = '100'
>> group by keyword
>> order by count desc;
>>
>>
>> The reason why obj_grp isn't PK is that it's optional value.
>> I could use obj_grp's default value and propagate it to PK, but still not
>> think it's desired pattern.
>>
>> The query above takes more than 20 seconds!
>> It does *not use* objgrp_reg_kwd index, doing just regymdt range scan.
>>
>> Thanks in advance!
>> NaHeon
>>
>>