The keyspace is just random strings of length > 10. I'll try to reuse HBase utility to split my string keyspace like they do it for bytes.
2016-02-02 23:23 GMT+01:00 Mujtaba Chohan <[email protected]>: > If you know your key space then you can use *SPLIT ON* in your table > create DDL. See http://phoenix.apache.org/language > > On Tue, Feb 2, 2016 at 11:54 AM, Serega Sheypak <[email protected]> > wrote: > >> Hm... and what is the right to presplit table then? >> >> 2016-02-02 18:30 GMT+01:00 Mujtaba Chohan <[email protected]>: >> >>> If your filter matches few rows due to filter on leading part of PK then >>> your data might only reside in a single block which leads to less >>> overall disk reads for non-salted case vs need for multiple blocks reads for >>> salted one. >>> >>> >>> On Tuesday, February 2, 2016, Serega Sheypak <[email protected]> >>> wrote: >>> >>>> > then you would be better off not using salt buckets all together >>>> rather than having 100 parallel scan and block reads in your case. I >>>> Didn't understand you correctly. What is difference between salted/not >>>> salted table in case of "primary key leading-part select"? >>>> >>>> 2016-02-02 1:18 GMT+01:00 Mujtaba Chohan <[email protected]>: >>>> >>>>> If you are filtering on leading part of row key which is highly >>>>> selective then you would be better off not using salt buckets all together >>>>> rather than having 100 parallel scan and block reads in your case. In our >>>>> test with billion+ row table, non-salted table offer much better >>>>> performance since it ends up reading fewer blocks from a single region. >>>>> >>>>> //mujtaba >>>>> >>>>> On Mon, Feb 1, 2016 at 1:16 PM, Serega Sheypak < >>>>> [email protected]> wrote: >>>>> >>>>>> Hi, here is my table DDL: >>>>>> CREATE TABLE IF NOT EXISTS id_ref >>>>>> ( >>>>>> id1 VARCHAR NOT NULL, >>>>>> value1 VARCHAR, >>>>>> >>>>>> id2 VARCHAR NOT NULL, >>>>>> value2 VARCHAR >>>>>> CONSTRAINT id_ref_pk PRIMARY KEY (id1, id2) >>>>>> )IMMUTABLE_ROWS=true,SALT_BUCKETS=100, VERSIONS=1, TTL=691200 >>>>>> >>>>>> I'm trying to analyze result of explain: >>>>>> >>>>>> explain select id1, value1, id2, value2 from id_ref where id1 = 'xxx' >>>>>> >>>>>> . . . . . . . . . . . . . . . . . . . . . . .> ; >>>>>> >>>>>> *+------------------------------------------+* >>>>>> >>>>>> *| ** PLAN ** |* >>>>>> >>>>>> *+------------------------------------------+* >>>>>> >>>>>> *| *CLIENT 100-CHUNK PARALLEL 100-WAY RANGE SCAN OVER ID_REF >>>>>> [0,'1fd5c44a75549162ca1602dda55f6d129cab61a6']* |* >>>>>> >>>>>> *| *CLIENT MERGE SORT * |* >>>>>> >>>>>> *+------------------------------------------+* >>>>>> >>>>>> >>>>>> What happens? Client spawns 100 parallel scans (because of bucketing) >>>>>> and waits for 100 responses? >>>>>> >>>>>> Is it effective? What is the right way to optimize such query >>>>>> pattern: "select by first part of primary key"? Reduce the amount of >>>>>> buckets? I get exeption a while after restarting app: >>>>>> >>>>>> >>>>>> *Task org.apache.phoenix.job.JobManager$JobFutureTask@60a40644 >>>>>> rejected from org.apache.phoenix.job.JobManager$1@58e3fe9aRunning, pool >>>>>> size = 128, active threads = 121, queued tasks = 5000, completed tasks = >>>>>> 2629565* >>>>>> >>>>>> >>>>>> >>>>> >>>> >> >
