I love to see these ORC table optimization help but it is not obvious to me
under what circumstances they bare fruit.
Case in point. I have an ORC table with 100 Million rows created as follows:
CREATE TABLE `dummy`(
`id` int,
`clustered` int,
`scattered` int,
`randomised` int,
`random_string` varchar(50),
`small_vc` varchar(10),
`padding` varchar(10))
CLUSTERED BY (
id)
INTO 256 BUCKETS
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://rhes564:9000/user/hive/warehouse/test.db/dummy'
TBLPROPERTIES (
'COLUMN_STATS_ACCURATE'='true',
'numFiles'='35',
'numRows'='100000000',
'orc.bloom.filter.columns'='ID',
'orc.bloom.filter.fpp'='0.05',
'orc.compress'='SNAPPY',
'orc.create.index'='true',
'orc.row.index.stride'='10000',
'orc.stripe.size'='16777216',
'rawDataSize'='33800000000',
'totalSize'='5660813776',
'transient_lastDdlTime'='1454234981')
So I have hash partitioning on ID column.
I am trying to get the whole row for ID = 20. Pretty straight forward
hive> explain extended select * from dummy where id = 20;
OK
ABSTRACT SYNTAX TREE:
TOK_QUERY
TOK_FROM
TOK_TABREF
TOK_TABNAME
dummy
TOK_INSERT
TOK_DESTINATION
TOK_DIR
TOK_TMP_FILE
TOK_SELECT
TOK_SELEXPR
TOK_ALLCOLREF
TOK_WHERE
=
TOK_TABLE_OR_COL
id
20
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Spark
DagName: hduser_20160316161245_5f6000d0-cc23-4400-a6fd-42de3915f566:3
Vertices:
Map 1
Map Operator Tree:
TableScan
alias: dummy
Statistics: Num rows: 100000000 Data size: 33800000000
Basic stats: COMPLETE Column stats: NONE
GatherStats: false
Filter Operator
isSamplingPred: false
predicate: (id = 20) (type: boolean)
Statistics: Num rows: 50000000 Data size: 16900000000
Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: 20 (type: int), clustered (type: int),
scattered (type: int), randomised (type: int), random_string (type:
varchar(50)), small_vc (type: varchar(10)), padding (type: varchar(10))
outputColumnNames: _col0, _col1, _col2, _col3, _col4,
_col5, _col6
Statistics: Num rows: 50000000 Data size: 16900000000
Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
GlobalTableId: 0
directory:
hdfs://rhes564:9000/tmp/hive/hduser/9f78a143-4e82-4348-8d26-5dc8a6ee25ee/hive_2016-03-16_16-41-04_865_8057241516764344070-1/-mr-10001/.hive-staging_hive_2016-03-16_16-41-04_865_8057241516764344070-1/-ext-10002
NumFilesPerFileSink: 1
Statistics: Num rows: 50000000 Data size:
16900000000 Basic stats: COMPLETE Column stats: NONE
Stats Publishing Key Prefix:
hdfs://rhes564:9000/tmp/hive/hduser/9f78a143-4e82-4348-8d26-5dc8a6ee25ee/hive_2016-03-16_16-41-04_865_8057241516764344070-1/-mr-10001/.hive-staging_hive_2016-03-16_16-41-04_865_8057241516764344070-1/-ext-10002/
table:
input format:
org.apache.hadoop.mapred.TextInputFormat
output format:
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
properties:
columns
_col0,_col1,_col2,_col3,_col4,_col5,_col6
columns.types
int:int:int:int:varchar(50):varchar(10):varchar(10)
escape.delim \
hive.serialization.extend.additional.nesting.levels true
serialization.escape.crlf true
serialization.format 1
serialization.lib
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
serde:
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
TotalFiles: 1
GatherStats: false
MultiFileSpray: false
Path -> Alias:
hdfs://rhes564:9000/user/hive/warehouse/test.db/dummy [dummy]
Path -> Partition:
hdfs://rhes564:9000/user/hive/warehouse/test.db/dummy
Partition
base file name: dummy
input format:
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
output format:
org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
properties:
COLUMN_STATS_ACCURATE true
bucket_count 256
bucket_field_name id
columns
id,clustered,scattered,randomised,random_string,small_vc,padding
columns.comments
columns.types
int:int:int:int:varchar(50):varchar(10):varchar(10)
file.inputformat
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
file.outputformat
org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
location
hdfs://rhes564:9000/user/hive/warehouse/test.db/dummy
name test.dummy
numFiles 35
numRows 100000000
orc.bloom.filter.columns ID
orc.bloom.filter.fpp 0.05
orc.compress SNAPPY
orc.create.index true
orc.row.index.stride 10000
orc.stripe.size 16777216
rawDataSize 33800000000
serialization.ddl struct dummy { i32 id, i32 clustered,
i32 scattered, i32 randomised, varchar(50) random_string, varchar(10)
small_vc, varchar(10) padding}
serialization.format 1
serialization.lib
org.apache.hadoop.hive.ql.io.orc.OrcSerde
totalSize 5660813776
transient_lastDdlTime 1454234981
serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde
input format:
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
output format:
org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
properties:
COLUMN_STATS_ACCURATE true
bucket_count 256
bucket_field_name id
columns
id,clustered,scattered,randomised,random_string,small_vc,padding
columns.comments
columns.types
int:int:int:int:varchar(50):varchar(10):varchar(10)
file.inputformat
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
file.outputformat
org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
location
hdfs://rhes564:9000/user/hive/warehouse/test.db/dummy
name test.dummy
numFiles 35
numRows 100000000
orc.bloom.filter.columns ID
orc.bloom.filter.fpp 0.05
orc.compress SNAPPY
orc.create.index true
orc.row.index.stride 10000
orc.stripe.size 16777216
rawDataSize 33800000000
serialization.ddl struct dummy { i32 id, i32
clustered, i32 scattered, i32 randomised, varchar(50) random_string,
varchar(10) small_vc, varchar(10) padding}
serialization.format 1
serialization.lib
org.apache.hadoop.hive.ql.io.orc.OrcSerde
totalSize 5660813776
transient_lastDdlTime 1454234981
serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde
name: test.dummy
name: test.dummy
Truncated Path -> Alias:
/test.db/dummy [dummy]
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
Time taken: 0.081 seconds, Fetched: 140 row(s)
This line
TableScan
is very misleading as soon as I see it, it gives the impression that it is
going through the whole table.
Somehow it decided to divide the table into two chunks of 50 Million rows
each. I have set "set hive.optimize.ppd=true;" as well but still it does
not appear to me if it is using statistics. May be I am missing something
here.
It is possible that there are certain parameters that should be turned on
when using ORC table?
Thanks
Dr Mich Talebzadeh
LinkedIn *
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
http://talebzadehmich.wordpress.com
On 16 March 2016 at 14:18, Gopal Vijayaraghavan <[email protected]> wrote:
>
> > I have tried bloom filter ,but it makes no improvement。I know about
> > tez, but never use, I will try it later.
> ...
> > select count(*) from gprs where terminal_type=25080;
> > will not scan data
> > Time taken: 353.345 seconds
>
> CombineInputFormat does not do any split-elimination, so MapReduce does
> not get container speedups there.
>
> Most of your ~300s looks to be the fixed overheads of setting up each task.
>
> We could not fix this in MRv2 due to historical compatibility issues with
> merge-joins & schema evolution (see HiveSplitGenerator.java).
>
> This is not recommended for regular use (other than in Tez), but you can
> force split-elimination with
>
>
> set hive.input.format=${hive.tez.input.format};
>
> >>>> So, has anyone used ORC's build-in indexes before (especially in
> >>>>spark SQL)? What's my issue?
>
> We work on SparkSQL perf issues as well - this has to do with OrcRelation
>
> https://github.com/apache/spark/pull/10938
>
> +
> https://github.com/apache/spark/pull/10842
>
>
> Cheers,
> Gopal
>
>
>