Hi, I have inserted around 700 million records (about 55 GB of data) from a staging table into the ORC table below, and then ran small queries to fetch a few columns from the ORC table. However, the ORC query performance is slower than the staging table's query performance, and I am not sure where my mistake is. The plan shows a table scan, and the input/output format also looks like text. I have tried both with and without bucketing. How can I improve performance? My queries filter on SRC_IP = "xxxx", or on SRC_IP = "xxxx" AND DST_IP = "xxxx", for a given tag and a specific ts value. The data volume is expected to grow to TBs.
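For concreteness, the queries I run are roughly of this shape (the IP strings and the tag/ts constants below are just placeholder values):

```sql
-- point lookup on source IP within one partition/tag
SELECT src_ip, dst_ip, bytes
FROM raw_data_orc_test2
WHERE ts = 11 AND tag = 5 AND src_ip = 'xxxx';

-- same, but filtering on both endpoints
SELECT src_ip, dst_ip, bytes
FROM raw_data_orc_test2
WHERE ts = 11 AND tag = 5 AND src_ip = 'xxxx' AND dst_ip = 'xxxx';
```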
ORC table schema:

CREATE EXTERNAL TABLE RAW_DATA_ORC_TEST2 (
  TAG INT,
  SRC_IP STRING,
  DST_IP STRING,
  SRC_PORT INT,
  DST_PORT INT,
  BYTES BIGINT
)
PARTITIONED BY (TS BIGINT)
CLUSTERED BY (TAG) SORTED BY (TAG) INTO 100 BUCKETS
STORED AS ORC;

INSERT INTO TABLE raw_data_orc_test2 PARTITION (ts)
SELECT TAG, SRC_IP, DST_IP, SRC_PORT, DST_PORT, BYTES, TS
FROM raw_data_by_epoch_from_ny
WHERE ts = 11 AND PEER_SRC_IP != "PEER_SRC_IP"
SORT BY tag;

hive (default)> explain select TAG,PROTOCOL,TS from raw_data_orc_test2 where ts=11;
OK
Explain
ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME raw_data_orc_test2))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL TAG)) (TOK_SELEXPR (TOK_TABLE_OR_COL PROTOCOL)) (TOK_SELEXPR (TOK_TABLE_OR_COL TS))) (TOK_WHERE (= (TOK_TABLE_OR_COL ts) 11))))

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        raw_data_orc_test2
          TableScan
            alias: raw_data_orc_test2
            Select Operator
              expressions:
                    expr: tag
                    type: int
                    expr: protocol
                    type: string
                    expr: ts
                    type: string
              outputColumnNames: _col0, _col1, _col2
              File Output Operator
                compressed: true
                GlobalTableId: 0
                table:
                    input format: org.apache.hadoop.mapred.TextInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1

Thanks,
Chandra