I am glad to see these ORC table optimizations help, but it is not obvious to me under what circumstances they bear fruit.
Case in point: I have an ORC table with 100 million rows, created as follows:

CREATE TABLE `dummy`(
  `id` int,
  `clustered` int,
  `scattered` int,
  `randomised` int,
  `random_string` varchar(50),
  `small_vc` varchar(10),
  `padding` varchar(10))
CLUSTERED BY (id) INTO 256 BUCKETS
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION 'hdfs://rhes564:9000/user/hive/warehouse/test.db/dummy'
TBLPROPERTIES (
  'COLUMN_STATS_ACCURATE'='true',
  'numFiles'='35',
  'numRows'='100000000',
  'orc.bloom.filter.columns'='ID',
  'orc.bloom.filter.fpp'='0.05',
  'orc.compress'='SNAPPY',
  'orc.create.index'='true',
  'orc.row.index.stride'='10000',
  'orc.stripe.size'='16777216',
  'rawDataSize'='33800000000',
  'totalSize'='5660813776',
  'transient_lastDdlTime'='1454234981')

So I have hash bucketing on the id column (CLUSTERED BY, which is bucketing rather than partitioning). I am trying to get the whole row for ID = 20, which is pretty straightforward:

hive> explain extended select * from dummy where id = 20;
OK
ABSTRACT SYNTAX TREE:

TOK_QUERY
   TOK_FROM
      TOK_TABREF
         TOK_TABNAME
            dummy
   TOK_INSERT
      TOK_DESTINATION
         TOK_DIR
            TOK_TMP_FILE
      TOK_SELECT
         TOK_SELEXPR
            TOK_ALLCOLREF
      TOK_WHERE
         =
            TOK_TABLE_OR_COL
               id
            20

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Spark
      DagName: hduser_20160316161245_5f6000d0-cc23-4400-a6fd-42de3915f566:3
      Vertices:
        Map 1
            Map Operator Tree:
                TableScan
                  alias: dummy
                  Statistics: Num rows: 100000000 Data size: 33800000000 Basic stats: COMPLETE Column stats: NONE
                  GatherStats: false
                  Filter Operator
                    isSamplingPred: false
                    predicate: (id = 20) (type: boolean)
                    Statistics: Num rows: 50000000 Data size: 16900000000 Basic stats: COMPLETE Column stats: NONE
                    Select Operator
                      expressions: 20 (type: int), clustered (type: int), scattered (type: int), randomised (type: int), random_string (type: varchar(50)), small_vc (type: varchar(10)), padding (type: varchar(10))
                      outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6
                      Statistics: Num rows: 50000000 Data size: 16900000000 Basic stats: COMPLETE Column stats: NONE
                      File Output Operator
                        compressed: false
                        GlobalTableId: 0
                        directory: hdfs://rhes564:9000/tmp/hive/hduser/9f78a143-4e82-4348-8d26-5dc8a6ee25ee/hive_2016-03-16_16-41-04_865_8057241516764344070-1/-mr-10001/.hive-staging_hive_2016-03-16_16-41-04_865_8057241516764344070-1/-ext-10002
                        NumFilesPerFileSink: 1
                        Statistics: Num rows: 50000000 Data size: 16900000000 Basic stats: COMPLETE Column stats: NONE
                        Stats Publishing Key Prefix: hdfs://rhes564:9000/tmp/hive/hduser/9f78a143-4e82-4348-8d26-5dc8a6ee25ee/hive_2016-03-16_16-41-04_865_8057241516764344070-1/-mr-10001/.hive-staging_hive_2016-03-16_16-41-04_865_8057241516764344070-1/-ext-10002/
                        table:
                            input format: org.apache.hadoop.mapred.TextInputFormat
                            output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                            properties:
                              columns _col0,_col1,_col2,_col3,_col4,_col5,_col6
                              columns.types int:int:int:int:varchar(50):varchar(10):varchar(10)
                              escape.delim \
                              hive.serialization.extend.additional.nesting.levels true
                              serialization.escape.crlf true
                              serialization.format 1
                              serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                            serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                        TotalFiles: 1
                        GatherStats: false
                        MultiFileSpray: false
            Path -> Alias:
              hdfs://rhes564:9000/user/hive/warehouse/test.db/dummy [dummy]
            Path -> Partition:
              hdfs://rhes564:9000/user/hive/warehouse/test.db/dummy
                Partition
                  base file name: dummy
                  input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
                  output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
                  properties:
                    COLUMN_STATS_ACCURATE true
                    bucket_count 256
                    bucket_field_name id
                    columns id,clustered,scattered,randomised,random_string,small_vc,padding
                    columns.comments
                    columns.types int:int:int:int:varchar(50):varchar(10):varchar(10)
                    file.inputformat org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
                    file.outputformat org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
                    location hdfs://rhes564:9000/user/hive/warehouse/test.db/dummy
                    name test.dummy
                    numFiles 35
                    numRows 100000000
                    orc.bloom.filter.columns ID
                    orc.bloom.filter.fpp 0.05
                    orc.compress SNAPPY
                    orc.create.index true
                    orc.row.index.stride 10000
                    orc.stripe.size 16777216
                    rawDataSize 33800000000
                    serialization.ddl struct dummy { i32 id, i32 clustered, i32 scattered, i32 randomised, varchar(50) random_string, varchar(10) small_vc, varchar(10) padding}
                    serialization.format 1
                    serialization.lib org.apache.hadoop.hive.ql.io.orc.OrcSerde
                    totalSize 5660813776
                    transient_lastDdlTime 1454234981
                  serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde

                    input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
                    output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
                    properties:
                      COLUMN_STATS_ACCURATE true
                      bucket_count 256
                      bucket_field_name id
                      columns id,clustered,scattered,randomised,random_string,small_vc,padding
                      columns.comments
                      columns.types int:int:int:int:varchar(50):varchar(10):varchar(10)
                      file.inputformat org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
                      file.outputformat org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
                      location hdfs://rhes564:9000/user/hive/warehouse/test.db/dummy
                      name test.dummy
                      numFiles 35
                      numRows 100000000
                      orc.bloom.filter.columns ID
                      orc.bloom.filter.fpp 0.05
                      orc.compress SNAPPY
                      orc.create.index true
                      orc.row.index.stride 10000
                      orc.stripe.size 16777216
                      rawDataSize 33800000000
                      serialization.ddl struct dummy { i32 id, i32 clustered, i32 scattered, i32 randomised, varchar(50) random_string, varchar(10) small_vc, varchar(10) padding}
                      serialization.format 1
                      serialization.lib org.apache.hadoop.hive.ql.io.orc.OrcSerde
                      totalSize 5660813776
                      transient_lastDdlTime 1454234981
                    serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde
                    name: test.dummy
                  name: test.dummy
            Truncated Path -> Alias:
              /test.db/dummy [dummy]

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

Time taken: 0.081 seconds, Fetched: 140 row(s)

The TableScan line is very misleading: as soon as I see it, it gives the impression that the query is going through the whole table. The planner has also somehow decided that the filter will pass 50 million of the 100 million rows. I have set hive.optimize.ppd=true as well, but it still does not look to me as if the statistics are being used. Maybe I am missing something here. Are there certain parameters that should be turned on when using ORC tables?
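For reference, these are the knobs I have been experimenting with. My (possibly incomplete) understanding is that hive.optimize.index.filter is the one that actually hands the predicate to the ORC reader, and that column-level statistics have to be gathered explicitly, since the plan above shows "Column stats: NONE":

-- sketch of the relevant settings as I understand them; corrections welcome
set hive.optimize.ppd=true;           -- predicate pushdown in the optimizer
set hive.optimize.ppd.storage=true;   -- push predicates down to the storage layer
set hive.optimize.index.filter=true;  -- let the ORC reader use row indexes/bloom filters

-- gather per-column statistics for the planner
analyze table dummy compute statistics for columns;

Even with all of these on, I believe explain will still print a TableScan operator, because row-group elimination happens inside the ORC reader at run time rather than at plan time.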
Thanks

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com

On 16 March 2016 at 14:18, Gopal Vijayaraghavan <gop...@apache.org> wrote:

> > I have tried bloom filters, but they make no improvement. I know about
> > tez, but have never used it; I will try it later.
> ...
> > select count(*) from gprs where terminal_type=25080;
> > will not scan data
> > Time taken: 353.345 seconds
>
> CombineInputFormat does not do any split-elimination, so MapReduce does
> not get container speedups there.
>
> Most of your ~300s looks to be the fixed overheads of setting up each
> task.
>
> We could not fix this in MRv2 due to historical compatibility issues with
> merge-joins & schema evolution (see HiveSplitGenerator.java).
>
> This is not recommended for regular use (other than in Tez), but you can
> force split-elimination with
>
> set hive.input.format=${hive.tez.input.format};
>
> >>>> So, has anyone used ORC's built-in indexes before (especially in
> >>>> spark SQL)? What's my issue?
>
> We work on SparkSQL perf issues as well - this has to do with OrcRelation
>
> https://github.com/apache/spark/pull/10938
> +
> https://github.com/apache/spark/pull/10842
>
> Cheers,
> Gopal
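PS: if I read Gopal's suggestion correctly, ${hive.tez.input.format} resolves to plain HiveInputFormat by default, so the explicit equivalent of the workaround (my assumption, worth double-checking) would be:

-- force per-file split generation so ORC can eliminate splits up front
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
select count(*) from gprs where terminal_type=25080;

As I understand it, the difference from the MapReduce default (CombineHiveInputFormat) is that HiveInputFormat lets the ORC input format apply the search argument against stripe-level statistics while generating splits, so whole splits can be skipped before any task is launched.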