Hi, I have inserted around 700 million records (about 55 GB of data) from a staging table into the ORC table below, and then ran small queries to fetch a few columns from the ORC table. However, the ORC query performance is slower than the staging table's query performance, and I am not sure where my mistake is. The plan shows a table scan, and the input/output format also looks like text. I have tried both with and without bucketing. How can I improve performance? My queries filter on SRC_IP = "xxxx", or on SRC_IP = "xxxx" AND DST_IP = "xxxx", for a given tag and a specific ts value. The data volume is expected to grow to TBs.
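For concreteness, the queries I run are roughly of this shape (the IP strings and the tag/ts constants below are just placeholder values):

```sql
-- point lookup on source IP within one partition/tag
SELECT src_ip, dst_ip, bytes
FROM raw_data_orc_test2
WHERE ts = 11 AND tag = 5 AND src_ip = 'xxxx';

-- same, but filtering on both endpoints
SELECT src_ip, dst_ip, bytes
FROM raw_data_orc_test2
WHERE ts = 11 AND tag = 5 AND src_ip = 'xxxx' AND dst_ip = 'xxxx';
```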
ORC table schema:

CREATE EXTERNAL TABLE RAW_DATA_ORC_TEST2 (
  TAG INT,
  SRC_IP STRING,
  DST_IP STRING,
  SRC_PORT INT,
  DST_PORT INT,
  BYTES BIGINT
)
PARTITIONED BY (TS BIGINT)
CLUSTERED BY (TAG) SORTED BY (TAG) INTO 100 BUCKETS
STORED AS ORC;

INSERT INTO TABLE raw_data_orc_test2 PARTITION (ts)
SELECT TAG, SRC_IP, DST_IP, SRC_PORT, DST_PORT, BYTES, TS
FROM raw_data_by_epoch_from_ny
WHERE ts = 11 AND PEER_SRC_IP != "PEER_SRC_IP"
SORT BY tag;

hive (default)> explain select TAG,PROTOCOL,TS from raw_data_orc_test2 where ts=11;
OK
Explain
ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME raw_data_orc_test2))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL TAG)) (TOK_SELEXPR (TOK_TABLE_OR_COL PROTOCOL)) (TOK_SELEXPR (TOK_TABLE_OR_COL TS))) (TOK_WHERE (= (TOK_TABLE_OR_COL ts) 11))))

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        raw_data_orc_test2
          TableScan
            alias: raw_data_orc_test2
            Select Operator
              expressions:
                    expr: tag
                    type: int
                    expr: protocol
                    type: string
                    expr: ts
                    type: string
              outputColumnNames: _col0, _col1, _col2
              File Output Operator
                compressed: true
                GlobalTableId: 0
                table:
                    input format: org.apache.hadoop.mapred.TextInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1

Thanks,
Chandra