Actually, sorry, but I spoke too soon. The reason the skip scan is not being
used is that it's only used when you specify all of the PK columns in the
IN clause. Since k4 is in the PK constraint but not included in your IN
clause, the skip scan won't be used.
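
For example (a sketch against your schema, with made-up values), a query
that includes all four PK columns, like

select count(*) from MY_TABLE where ("key1","key2","key3","key4") in
(('k1','k2','k3','k4'));

would qualify for the skip scan, while the three-column version does not.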

The SkipScanFilter could be enhanced to handle this if you're interested in
contributing. There's a JIRA-to-be here:
https://github.com/forcedotcom/phoenix/issues/176. FWIW, the skip scan does
handle a subset of your PK columns when they appear in separate IN clauses
joined by AND, like this: WHERE k1 IN ('a','b') AND k2 IN ('c','d')
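
For instance, against your table (a sketch with made-up values):

select count(*) from MY_TABLE where "key1" IN ('a','b')
AND "key2" IN ('c','d') AND "key3" IN ('e','f');

Note that this form matches the cross product of the listed values rather
than specific tuples, so it's not always a drop-in replacement for the row
value constructor version.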

Thanks,
James


On Wed, Feb 26, 2014 at 10:18 PM, James Taylor <[email protected]> wrote:

> Hi Dana,
> Comments inline below. Thanks,
>      James
>
>
> On Wed, Feb 26, 2014 at 11:00 AM, Livni, Dana <[email protected]> wrote:
>
>>
>>
>> Hi,
>>
>>
>>
>> I have a table with 50M records (the primary key is composed of 4 columns)
>> and there is only one other column.
>>
>> The definition of the table is:
>>
>> CREATE TABLE MY_TABLE ("key1" VARCHAR, "key2" VARCHAR, "key3" VARCHAR,
>> "key4" VARCHAR,"data"."valueColumn" VARCHAR CONSTRAINT pk PRIMARY KEY
>> ("key1","key2","key3","key4"));
>>
>>
>>
>> I have several issues:
>>
>> 1.      When performing a query on this table using the first 3 keys:
>>
>> select count(*) from MY_TABLE where ("key1","key2","key3") in
>> (('k1','k2','k3')); this query counts 1.5M records and its run time is
>> about 3 seconds. Does that sound reasonable? Is there a way to improve it?
>>
> I'd play around with two things: salting (
> http://phoenix.incubator.apache.org/salted.html) and your block size.
> Just tack a SALT_BUCKETS=16 onto your DDL and rebuild your table. Salting
> helps by evenly distributing your read and write load. You could
> alternatively pre-split your table by adding a SPLIT ON clause to the end
> of your DDL. Setting the block size lower is a trade-off: it helps random
> read times, but makes compaction more expensive as your table grows.
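>
> For example, a sketch based on your DDL (the bucket count and split
> points below are just illustrative):
>
> CREATE TABLE MY_TABLE ("key1" VARCHAR, "key2" VARCHAR, "key3" VARCHAR,
> "key4" VARCHAR, "data"."valueColumn" VARCHAR CONSTRAINT pk PRIMARY KEY
> ("key1","key2","key3","key4")) SALT_BUCKETS=16;
>
> -- or, to pre-split instead of salting:
> CREATE TABLE MY_TABLE ("key1" VARCHAR, "key2" VARCHAR, "key3" VARCHAR,
> "key4" VARCHAR, "data"."valueColumn" VARCHAR CONSTRAINT pk PRIMARY KEY
> ("key1","key2","key3","key4")) SPLIT ON ('f','m','t');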
>
>
>>  Is there any HBase or Phoenix configuration I should play with? Any
>> other ideas?
>>
>>
>>
>> 2.      If I change the query to use a different aggregate function:
>>
>> select min("valueColumn") from MY_TABLE where ("key1","key2","key3") in
>> (('k1','k2','k3')); the run time increases significantly, to around 8
>> seconds. I can see in the execution plan that the difference comes from
>> not using the FIRST KEY ONLY filter. Is there a way to improve the
>> performance? At least make it the same as the count query.
>>
> There was a bug here, causing a range scan to occur instead of a skip scan
> for row value constructors. Can you try with the latest RC at
> http://people.apache.org/~jamestaylor/phoenix-2.2.3-incubating-RC3/ and
> let us know if you see improvements? The explain plan should say SKIP SCAN
> instead of RANGE SCAN.
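>
> To double-check, you can prefix the query with EXPLAIN (a sketch; the
> exact plan text will vary with your data):
>
> EXPLAIN select min("valueColumn") from MY_TABLE
> where ("key1","key2","key3") in (('k1','k2','k3'));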
>
>>
>>  3.      When I use multiple tuples in the IN clause:
>>
>> select count(*) from MY_TABLE where ("key1","key2","key3") in
>> (('k1','k2','k3'), ('k4','k5','k6')); this query returns twice as many
>> records (3M) but takes around 4x the time, about 14 seconds.
>>
>> I can see that the execution plan changes from
>>
>> RANGE SCAN OVER DEVICE_MEASUREMENT ['k1','k2','k3'] - [*]
>>     SERVER FILTER BY FIRST KEY ONLY
>>
>> to
>>
>> RANGE SCAN OVER DEVICE_MEASUREMENT ['k1\x00k2\x00k3'] - ['k4\x00k5\x00k6']
>>     SERVER FILTER BY FIRST KEY ONLY AND (key1, key2, key3) IN
>>     ([114,101,112,111,114,116,0,72,117,109,105,100,105,116,121,0,100,97,110,97,49,48,126,115,101,108,102],[114,101,
>>
>>                 Is there a way to make it use the scan differently? Am I
>> doing something wrong? Again, is there a way to improve the execution time?
>>
> Same as above - try with the latest.
>
>
>>  4. Joins:
>>
>> When trying to perform a join between this big table (50M records) and a
>> much smaller table (10 records), the structure of the two tables is:
>>
>> CREATE TABLE MY_TABLE ("key1" VARCHAR, "key2" VARCHAR, "key3" VARCHAR,
>> "key4" VARCHAR, "data"."valueColumn" VARCHAR CONSTRAINT pk PRIMARY KEY
>> ("key1","key2","key3","key4"));
>>
>> CREATE TABLE SMALL_TABLE ("key1" VARCHAR, "key3" VARCHAR,
>> "data"."property" VARCHAR CONSTRAINT pk PRIMARY KEY ("key1","key3"));
>>
>> Each record in the small table has approximately 4.5M records connected
>> to it.
>>
>> I'm using the master version from Jan 29.
>>
>> When trying to run the following join:
>>
>> Select count(*) from SMALL_TABLE a join MY_TABLE b on a."key1"=b."key1"
>> and a."key3"=b."key3" where a."property" = 'val'
>>
>> I can see in the execution plan that a full scan is being performed on
>> the big table (MY_TABLE), hence the run time is very long, around 1 minute.
>>
> Put the smaller table last in the join. Since we don't have stats yet to
> drive this, you should list tables from biggest to smallest, as they're
> processed in reverse order.
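>
> For example, a sketch of your query with the tables reordered:
>
> select count(*) from MY_TABLE b join SMALL_TABLE a
> on a."key1" = b."key1" and a."key3" = b."key3"
> where a."property" = 'val';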
>
>>  I have tried to add an index on the small table on the property column,
>> but it didn't change the fact that a full scan was performed on the big
>> table, and the run time stayed long.
>>
>> Is there a way to improve this? Are there any big changes expected in
>> this area in future versions? Any estimate of when?
>>
>>
>>
>> Thanks in advance
>>
>> Dana.
>>
> HTH. Thanks,
> James
>
>>
>>
>>
>
>
