I am glad to see these ORC table optimizations help, but it is not obvious to me under what circumstances they bear fruit.
Case in point: I have an ORC table with 100 million rows, created as follows:

CREATE TABLE `dummy`(
  `id` int,
  `clustered` int,
  `scattered` int,
  `randomised` int,
  `random_string` varchar(50),
  `small_vc` varchar(10),
  `padding` varchar(10))
CLUSTERED BY (id) INTO 256 BUCKETS
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION 'hdfs://rhes564:9000/user/hive/warehouse/test.db/dummy'
TBLPROPERTIES (
  'COLUMN_STATS_ACCURATE'='true',
  'numFiles'='35',
  'numRows'='100000000',
  'orc.bloom.filter.columns'='ID',
  'orc.bloom.filter.fpp'='0.05',
  'orc.compress'='SNAPPY',
  'orc.create.index'='true',
  'orc.row.index.stride'='10000',
  'orc.stripe.size'='16777216',
  'rawDataSize'='33800000000',
  'totalSize'='5660813776',
  'transient_lastDdlTime'='1454234981')

So I have hash bucketing on the id column (CLUSTERED BY, which is bucketing rather than partitioning). I am trying to get the whole row for ID = 20, which is pretty straightforward:

hive> explain extended select * from dummy where id = 20;
OK
ABSTRACT SYNTAX TREE:

TOK_QUERY
   TOK_FROM
      TOK_TABREF
         TOK_TABNAME
            dummy
   TOK_INSERT
      TOK_DESTINATION
         TOK_DIR
            TOK_TMP_FILE
      TOK_SELECT
         TOK_SELEXPR
            TOK_ALLCOLREF
      TOK_WHERE
         =
            TOK_TABLE_OR_COL
               id
            20

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Spark
      DagName: hduser_20160316161245_5f6000d0-cc23-4400-a6fd-42de3915f566:3
      Vertices:
        Map 1
            Map Operator Tree:
                TableScan
                  alias: dummy
                  Statistics: Num rows: 100000000 Data size: 33800000000 Basic stats: COMPLETE Column stats: NONE
                  GatherStats: false
                  Filter Operator
                    isSamplingPred: false
                    predicate: (id = 20) (type: boolean)
                    Statistics: Num rows: 50000000 Data size: 16900000000 Basic stats: COMPLETE Column stats: NONE
                    Select Operator
                      expressions: 20 (type: int), clustered (type: int), scattered (type: int), randomised (type: int), random_string (type: varchar(50)), small_vc (type: varchar(10)), padding (type: varchar(10))
                      outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6
                      Statistics: Num rows: 50000000 Data size: 16900000000 Basic stats: COMPLETE Column stats: NONE
                      File Output Operator
                        compressed: false
                        GlobalTableId: 0
                        directory: hdfs://rhes564:9000/tmp/hive/hduser/9f78a143-4e82-4348-8d26-5dc8a6ee25ee/hive_2016-03-16_16-41-04_865_8057241516764344070-1/-mr-10001/.hive-staging_hive_2016-03-16_16-41-04_865_8057241516764344070-1/-ext-10002
                        NumFilesPerFileSink: 1
                        Statistics: Num rows: 50000000 Data size: 16900000000 Basic stats: COMPLETE Column stats: NONE
                        Stats Publishing Key Prefix: hdfs://rhes564:9000/tmp/hive/hduser/9f78a143-4e82-4348-8d26-5dc8a6ee25ee/hive_2016-03-16_16-41-04_865_8057241516764344070-1/-mr-10001/.hive-staging_hive_2016-03-16_16-41-04_865_8057241516764344070-1/-ext-10002/
                        table:
                            input format: org.apache.hadoop.mapred.TextInputFormat
                            output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                            properties:
                              columns _col0,_col1,_col2,_col3,_col4,_col5,_col6
                              columns.types int:int:int:int:varchar(50):varchar(10):varchar(10)
                              escape.delim \
                              hive.serialization.extend.additional.nesting.levels true
                              serialization.escape.crlf true
                              serialization.format 1
                              serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                            serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                        TotalFiles: 1
                        GatherStats: false
                        MultiFileSpray: false
            Path -> Alias:
              hdfs://rhes564:9000/user/hive/warehouse/test.db/dummy [dummy]
            Path -> Partition:
              hdfs://rhes564:9000/user/hive/warehouse/test.db/dummy
                Partition
                  base file name: dummy
                  input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
                  output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
                  properties:
                    COLUMN_STATS_ACCURATE true
                    bucket_count 256
                    bucket_field_name id
                    columns id,clustered,scattered,randomised,random_string,small_vc,padding
                    columns.comments
                    columns.types int:int:int:int:varchar(50):varchar(10):varchar(10)
                    file.inputformat org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
                    file.outputformat org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
                    location hdfs://rhes564:9000/user/hive/warehouse/test.db/dummy
                    name test.dummy
                    numFiles 35
                    numRows 100000000
                    orc.bloom.filter.columns ID
                    orc.bloom.filter.fpp 0.05
                    orc.compress SNAPPY
                    orc.create.index true
                    orc.row.index.stride 10000
                    orc.stripe.size 16777216
                    rawDataSize 33800000000
                    serialization.ddl struct dummy { i32 id, i32 clustered, i32 scattered, i32 randomised, varchar(50) random_string, varchar(10) small_vc, varchar(10) padding}
                    serialization.format 1
                    serialization.lib org.apache.hadoop.hive.ql.io.orc.OrcSerde
                    totalSize 5660813776
                    transient_lastDdlTime 1454234981
                  serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde

                    input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
                    output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
                    properties:
                      COLUMN_STATS_ACCURATE true
                      bucket_count 256
                      bucket_field_name id
                      columns id,clustered,scattered,randomised,random_string,small_vc,padding
                      columns.comments
                      columns.types int:int:int:int:varchar(50):varchar(10):varchar(10)
                      file.inputformat org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
                      file.outputformat org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
                      location hdfs://rhes564:9000/user/hive/warehouse/test.db/dummy
                      name test.dummy
                      numFiles 35
                      numRows 100000000
                      orc.bloom.filter.columns ID
                      orc.bloom.filter.fpp 0.05
                      orc.compress SNAPPY
                      orc.create.index true
                      orc.row.index.stride 10000
                      orc.stripe.size 16777216
                      rawDataSize 33800000000
                      serialization.ddl struct dummy { i32 id, i32 clustered, i32 scattered, i32 randomised, varchar(50) random_string, varchar(10) small_vc, varchar(10) padding}
                      serialization.format 1
                      serialization.lib org.apache.hadoop.hive.ql.io.orc.OrcSerde
                      totalSize 5660813776
                      transient_lastDdlTime 1454234981
                    serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde
                    name: test.dummy
                  name: test.dummy
            Truncated Path -> Alias:
              /test.db/dummy [dummy]

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

Time taken: 0.081 seconds, Fetched: 140 row(s)

The TableScan line is very misleading: as soon as I see it, it gives the impression that the query is going through the whole table. The planner has also somehow decided that the filter will pass 50 million of the 100 million rows. I have set hive.optimize.ppd=true as well, but it still does not look to me as if the statistics are being used. Maybe I am missing something here. Are there certain parameters that should be turned on when using ORC tables?
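For reference, these are the knobs I have been experimenting with. My (possibly incomplete) understanding is that hive.optimize.index.filter is the one that actually hands the predicate to the ORC reader, and that column-level statistics have to be gathered explicitly, since the plan above shows "Column stats: NONE":

-- sketch of the relevant settings as I understand them; corrections welcome
set hive.optimize.ppd=true;           -- predicate pushdown in the optimizer
set hive.optimize.ppd.storage=true;   -- push predicates down to the storage layer
set hive.optimize.index.filter=true;  -- let the ORC reader use row indexes/bloom filters

-- gather per-column statistics for the planner
analyze table dummy compute statistics for columns;

Even with all of these on, I believe explain will still print a TableScan operator, because row-group elimination happens inside the ORC reader at run time rather than at plan time.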
Thanks

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com

On 16 March 2016 at 14:18, Gopal Vijayaraghavan <gop...@apache.org> wrote:

> > I have tried bloom filters, but they make no improvement. I know about
> > tez, but have never used it; I will try it later.
> ...
> > select count(*) from gprs where terminal_type=25080;
> > will not scan data
> > Time taken: 353.345 seconds
>
> CombineInputFormat does not do any split-elimination, so MapReduce does
> not get container speedups there.
>
> Most of your ~300s looks to be the fixed overheads of setting up each
> task.
>
> We could not fix this in MRv2 due to historical compatibility issues with
> merge-joins & schema evolution (see HiveSplitGenerator.java).
>
> This is not recommended for regular use (other than in Tez), but you can
> force split-elimination with
>
> set hive.input.format=${hive.tez.input.format};
>
> >>>> So, has anyone used ORC's built-in indexes before (especially in
> >>>> spark SQL)? What's my issue?
>
> We work on SparkSQL perf issues as well - this has to do with OrcRelation
>
> https://github.com/apache/spark/pull/10938
> +
> https://github.com/apache/spark/pull/10842
>
> Cheers,
> Gopal
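PS: if I read Gopal's suggestion correctly, ${hive.tez.input.format} resolves to plain HiveInputFormat by default, so the explicit equivalent of the workaround (my assumption, worth double-checking) would be:

-- force per-file split generation so ORC can eliminate splits up front
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
select count(*) from gprs where terminal_type=25080;

As I understand it, the difference from the MapReduce default (CombineHiveInputFormat) is that HiveInputFormat lets the ORC input format apply the search argument against stripe-level statistics while generating splits, so whole splits can be skipped before any task is launched.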