[ https://issues.apache.org/jira/browse/PHOENIX-3560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15817308#comment-15817308 ]

Lars Hofhansl edited comment on PHOENIX-3560 at 1/11/17 6:03 AM:
-----------------------------------------------------------------

The FirstKeyOnlyFilter would still work and be effective; it's just that HBase 
cannot seek over as many bytes as before (in the simple COUNT(*) case).
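For reference, here is a minimal sketch (using the plain HBase client API, not 
Phoenix's actual code path) of the kind of scan a COUNT(*) boils down to; the 
table name "T" and the connection setup are placeholders:

{code:java}
import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter;

public class FirstKeyCount {
    public static void main(String[] args) throws IOException {
        try (Connection conn = ConnectionFactory.createConnection();
             Table table = conn.getTable(TableName.valueOf("T"))) {
            // Return only the first KeyValue of each row; after reading it,
            // HBase can seek straight to the start of the next row.
            Scan scan = new Scan();
            scan.setFilter(new FirstKeyOnlyFilter());
            long count = 0;
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result r : scanner) {
                    count++;
                }
            }
            System.out.println("rows: " + count);
        }
    }
}
{code}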

Imagine a row with 10000 columns, each 50 bytes. The total size would be 500KB. 
In the COUNT(*) case we can use the FirstKeyOnlyFilter. Without encoding, 
HBase loads the first block, which will by default be at most 65K + 499 bytes 
(let's just say 64K). So it will load the block, look at the first key of the 
first KeyValue, and then seek to the next row, i.e. to the first KeyValue of 
the next row. Now it can seek past the rest of the 500KB without ever loading 
those blocks.
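Spelling the arithmetic out (ignoring per-KeyValue key overhead for simplicity):

{noformat}
10000 columns x 50 bytes  = 500KB per row
500KB / 64KB block size   = ~8 blocks per row

unencoded + FirstKeyOnlyFilter:  load 1 block, read the first KeyValue,
                                 seek past the remaining ~7 blocks
encoded (row packed into one     the first (and only) block is ~500KB
KeyValue):                       and must be loaded in full
{noformat}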

In the encoded case, the first block would be 500KB in size, since HBase will 
not break up a KeyValue between blocks; so HBase has to load the whole 500KB 
in order to read the first key of the first KeyValue.

I do not see a way out of this, other than saying that this is a fairly 
contrived case.
The default blocksize is 64KB, and the default maximum KeyValue (Cell) size is 
1MB. So if the row size falls between these two sizes, simple scans like 
COUNT(*) might be slower.
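For completeness, a sketch of the two knobs involved; the 128KB value below is 
purely illustrative, and whether raising the block size actually helps here is 
untested:

{code:java}
import org.apache.hadoop.hbase.HColumnDescriptor;

public class BlockSizeKnobs {
    public static void main(String[] args) {
        // Per-column-family HFile block size (HBase default: 64KB).
        // "0" is Phoenix's default column family name.
        HColumnDescriptor fam = new HColumnDescriptor("0");
        fam.setBlocksize(128 * 1024); // illustrative value only
        System.out.println("block size: " + fam.getBlocksize());

        // The maximum client-written KeyValue size is governed by the
        // hbase.client.keyvalue.maxsize property in the configuration.
    }
}
{code}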

[~samarthjain], how is the encoding dealing with the 1MB limit? Does it (1) 
simply fail, or will it (2) split the encoding into multiple Cells accordingly? 
If the latter, one could simply do that at smaller sizes as well.



> Aggregate query performance is worse with encoded columns for schema with 
> large number of columns
> -------------------------------------------------------------------------------------------------
>
>                 Key: PHOENIX-3560
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-3560
>             Project: Phoenix
>          Issue Type: Sub-task
>            Reporter: Mujtaba Chohan
>            Assignee: Samarth Jain
>             Fix For: 4.10.0
>
>         Attachments: DataGenerator.java, PHOENIX-3565.patch
>
>
> Schema with 5K columns
> {noformat}
> create table t (k1 integer, k2 integer, c1 varchar ... c5000 varchar CONSTRAINT 
> PK PRIMARY KEY (K1, K2)) 
> VERSIONS=1, MULTI_TENANT=true, IMMUTABLE_ROWS=true
> {noformat}
> In this test, there are no null columns and each column contains 200 chars, 
> i.e. ~1MB of data per row.
> COUNT(*) aggregation is about 5X slower with encoded columns when compared to 
> a table with non-encoded columns using the same schema.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
