Hi,
I did some tests on ORC table by creating a simple ORC table as below
CREATE TABLE orctest
(
PROD_ID bigint ,
CUST_ID bigint ,
TIME_ID timestamp ,
CHANNEL_ID bigint ,
PROMO_ID bigint ,
QUANTITY_SOLD decimal(10) ,
AMOUNT_SOLD decimal(10)
)
CLUSTERED BY (PROD_ID) INTO 256 BUCKETS
STORED AS ORC
--TBLPROPERTIES ( "orc.compress"="SNAPPY",
TBLPROPERTIES (
"orc.create.index"="true",
"orc.stripe.size"="16777216",
"orc.row.index.stride"="10000" )
;
show create table orctest;
insert into orctest
select * from smallsales limit 5000000
;
analyze table orctest compute statistics;
I deliberately set the stripe size to be 16MB and row index size to be
10,000
Table is created with 5Million rows as below
NFO : Table oraclehadoop.orctest stats: [numFiles=1, numRows=5000000,
totalSize=32140689, rawDataSize=1480000000]
Now if I dump the content of ORC file I see as below
Questions I have:
1. ORC will use be default ZLIB. That means that oRC file will use some
fo
2. I specified the stripe size of 16MB. It has created 3 stripes of 172,0000
+ 172,0000 + 156,000 rows. Does it decide on the number of stripes based on
the total table size which seems to be 30MB (File length: 32140689 bytes)
3. There is no mention of statistics for row index in batch of 10,000? Is
this correct
4. Finally under Stripes:what are those statistics? Are they offset values?
hive --orcfiledump /user/hive/warehouse/oraclehadoop.db/orctest/000000_0
Compression: ZLIB
Compression size: 262144
Type:
struct<_col0:bigint,_col1:bigint,_col2:timestamp,_col3:bigint,_col4:bigint,_
col5:decimal(10,0),_col6:decimal(10,0)>
Stripe Statistics:
Stripe 1:
Column 0: count: 1720000 hasNull: false
Column 1: count: 1720000 hasNull: false min: 13 max: 148 sum: 134770570
Column 2: count: 1720000 hasNull: false min: 2 max: 101000 sum:
12541768461
Column 3: count: 1720000 hasNull: false min: 883612800000 max:
1009756800000
Column 4: count: 1720000 hasNull: false min: 2 max: 9 sum: 4923112
Column 5: count: 1720000 hasNull: false min: 33 max: 999 sum: 1678552095
Column 6: count: 1720000 hasNull: false min: 1 max: 1 sum: 1720000
Column 7: count: 1720000 hasNull: false min: 6 max: 1783 sum: 184490864
Stripe 2:
Column 0: count: 1720000 hasNull: false
Column 1: count: 1720000 hasNull: false min: 13 max: 148 sum: 134793536
Column 2: count: 1720000 hasNull: false min: 2 max: 101000 sum:
12545167155
Column 3: count: 1720000 hasNull: false min: 883612800000 max:
1009756800000
Column 4: count: 1720000 hasNull: false min: 2 max: 9 sum: 4924236
Column 5: count: 1720000 hasNull: false min: 33 max: 999 sum: 1678720214
Column 6: count: 1720000 hasNull: false min: 1 max: 1 sum: 1720000
Column 7: count: 1720000 hasNull: false min: 6 max: 1783 sum: 184666196
Stripe 3:
Column 0: count: 1560000 hasNull: false
Column 1: count: 1560000 hasNull: false min: 13 max: 148 sum: 122265414
Column 2: count: 1560000 hasNull: false min: 2 max: 101000 sum:
11367296072
Column 3: count: 1560000 hasNull: false min: 883612800000 max:
1009756800000
Column 4: count: 1560000 hasNull: false min: 2 max: 9 sum: 4465701
Column 5: count: 1560000 hasNull: false min: 33 max: 999 sum: 1522502365
Column 6: count: 1560000 hasNull: false min: 1 max: 1 sum: 1560000
Column 7: count: 1560000 hasNull: false min: 6 max: 1783 sum: 167524289
File Statistics:
Column 0: count: 5000000 hasNull: false
Column 1: count: 5000000 hasNull: false min: 13 max: 148 sum: 391829520
Column 2: count: 5000000 hasNull: false min: 2 max: 101000 sum:
36454231688
Column 3: count: 5000000 hasNull: false min: 883612800000 max:
1009756800000
Column 4: count: 5000000 hasNull: false min: 2 max: 9 sum: 14313049
Column 5: count: 5000000 hasNull: false min: 33 max: 999 sum: 4879774674
Column 6: count: 5000000 hasNull: false min: 1 max: 1 sum: 5000000
Column 7: count: 5000000 hasNull: false min: 6 max: 1783 sum: 536681349
Stripes:
Stripe: offset: 3 data: 11039881 rows: 1720000 tail: 137 index: 14370
Stream: column 0 section ROW_INDEX start: 3 length 30
Stream: column 1 section ROW_INDEX start: 33 length 1853
Stream: column 2 section ROW_INDEX start: 1886 length 2415
Stream: column 3 section ROW_INDEX start: 4301 length 2284
Stream: column 4 section ROW_INDEX start: 6585 length 1451
Stream: column 5 section ROW_INDEX start: 8036 length 1674
Stream: column 6 section ROW_INDEX start: 9710 length 1714
Stream: column 7 section ROW_INDEX start: 11424 length 2949
Stream: column 1 section DATA start: 14373 length 1797330
Stream: column 2 section DATA start: 1811703 length 3280925
Stream: column 3 section DATA start: 5092628 length 3592057
Stream: column 3 section SECONDARY start: 8684685 length 1797
Stream: column 4 section DATA start: 8686482 length 590101
Stream: column 5 section DATA start: 9276583 length 124066
Stream: column 6 section DATA start: 9400649 length 7653
Stream: column 6 section SECONDARY start: 9408302 length 1797
Stream: column 7 section DATA start: 9410099 length 1642358
Stream: column 7 section SECONDARY start: 11052457 length 1797
Encoding column 0: DIRECT
Encoding column 1: DIRECT_V2
Encoding column 2: DIRECT_V2
Encoding column 3: DIRECT_V2
Encoding column 4: DIRECT_V2
Encoding column 5: DIRECT_V2
Encoding column 6: DIRECT_V2
Encoding column 7: DIRECT_V2
Stripe: offset: 11054391 data: 11041929 rows: 1720000 tail: 137 index:
14292
Stream: column 0 section ROW_INDEX start: 11054391 length 30
Stream: column 1 section ROW_INDEX start: 11054421 length 1858
Stream: column 2 section ROW_INDEX start: 11056279 length 2398
Stream: column 3 section ROW_INDEX start: 11058677 length 2242
Stream: column 4 section ROW_INDEX start: 11060919 length 1443
Stream: column 5 section ROW_INDEX start: 11062362 length 1669
Stream: column 6 section ROW_INDEX start: 11064031 length 1714
Stream: column 7 section ROW_INDEX start: 11065745 length 2938
Stream: column 1 section DATA start: 11068683 length 1797175
Stream: column 2 section DATA start: 12865858 length 3283080
Stream: column 3 section DATA start: 16148938 length 3591894
Stream: column 3 section SECONDARY start: 19740832 length 1797
Stream: column 4 section DATA start: 19742629 length 590201
Stream: column 5 section DATA start: 20332830 length 123640
Stream: column 6 section DATA start: 20456470 length 7653
Stream: column 6 section SECONDARY start: 20464123 length 1797
Stream: column 7 section DATA start: 20465920 length 1642895
Stream: column 7 section SECONDARY start: 22108815 length 1797
Encoding column 0: DIRECT
Encoding column 1: DIRECT_V2
Encoding column 2: DIRECT_V2
Encoding column 3: DIRECT_V2
Encoding column 4: DIRECT_V2
Encoding column 5: DIRECT_V2
Encoding column 6: DIRECT_V2
Encoding column 7: DIRECT_V2
Stripe: offset: 22110749 data: 10016151 rows: 1560000 tail: 138 index:
13096
Stream: column 0 section ROW_INDEX start: 22110749 length 30
Stream: column 1 section ROW_INDEX start: 22110779 length 1704
Stream: column 2 section ROW_INDEX start: 22112483 length 2201
Stream: column 3 section ROW_INDEX start: 22114684 length 2051
Stream: column 4 section ROW_INDEX start: 22116735 length 1333
Stream: column 5 section ROW_INDEX start: 22118068 length 1534
Stream: column 6 section ROW_INDEX start: 22119602 length 1552
Stream: column 7 section ROW_INDEX start: 22121154 length 2691
Stream: column 1 section DATA start: 22123845 length 1628303
Stream: column 2 section DATA start: 23752148 length 2980652
Stream: column 3 section DATA start: 26732800 length 3257688
Stream: column 3 section SECONDARY start: 29990488 length 1633
Stream: column 4 section DATA start: 29992121 length 535364
Stream: column 5 section DATA start: 30527485 length 112530
Stream: column 6 section DATA start: 30640015 length 6934
Stream: column 6 section SECONDARY start: 30646949 length 1633
Stream: column 7 section DATA start: 30648582 length 1489781
Stream: column 7 section SECONDARY start: 32138363 length 1633
Encoding column 0: DIRECT
Encoding column 1: DIRECT_V2
Encoding column 2: DIRECT_V2
Encoding column 3: DIRECT_V2
Encoding column 4: DIRECT_V2
Encoding column 5: DIRECT_V2
Encoding column 6: DIRECT_V2
Encoding column 7: DIRECT_V2
File length: 32140689 bytes
Padding length: 0 bytes
Padding ratio: 0%
Dr Mich Talebzadeh
LinkedIn
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABU
rV8Pw>
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUr
V8Pw
Sybase ASE 15 Gold Medal Award 2008
A Winning Strategy: Running the most Critical Financial Data on ASE 15
<http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908
.pdf>
http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.
pdf
Author of the books "A Practitioner's Guide to Upgrading to Sybase ASE 15",
ISBN 978-0-9563693-0-7.
co-author "Sybase Transact SQL Guidelines Best Practices", ISBN
978-0-9759693-0-4
Publications due shortly:
Complex Event Processing in Heterogeneous Environments, ISBN:
978-0-9563693-3-8
Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume
one out shortly
<http://talebzadehmich.wordpress.com/> http://talebzadehmich.wordpress.com
NOTE: The information in this email is proprietary and confidential. This
message is for the designated recipient only, if you are not the intended
recipient, you should destroy it immediately. Any information in this
message shall not be understood as given or endorsed by Peridale Technology
Ltd, its subsidiaries or their employees, unless expressly so stated. It is
the responsibility of the recipient to ensure that this email is virus free,
therefore neither Peridale Technology Ltd, its subsidiaries nor their
employees accept any responsibility.