I ran some tests on Hive running on MapReduce, to rule out any Spark effects.
On a partitioned ORC table, partition elimination with predicate pushdown
works, and the query is narrowed to the relevant partition; I can see that
from the number of rows read within that partition.
For example, see below.
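A minimal sketch of the kind of partition-pruned query being described; the table name, partition column, and literal values here are hypothetical, not from the original test:

```sql
-- Hypothetical partitioned ORC table (all names are illustrative).
CREATE TABLE sales (
  id     INT
, amount DOUBLE
)
PARTITIONED BY (sale_date STRING)
STORED AS ORC;

-- With partition elimination, only the sale_date = '2016-03-16'
-- partition is scanned, and the reported row counts come from
-- that single partition rather than the whole table.
SELECT COUNT(*)
FROM sales
WHERE sale_date = '2016-03-16';
```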
How much data are you querying? What is the query? How selective is it supposed
to be? What is the block size?
> On 16 Mar 2016, at 11:23, Joseph wrote:
>
> Hi all,
>
> I understand that ORC provides three levels of indexes within each file:
> file level, stripe level, and row level.
Regarding bloom filters,
https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-12417
From: Joseph
Sent: Wednesday, March 16, 2016 9:46:25 AM
To: user
Cc: user; user
Subject: Re: Re: The build-in
terminal_type = 0: 260,000,000 rows, covering almost half of the whole data.
terminal_type = 25066: just 3,800 rows.
I am not sure it should work. How many rows are affected? Is the data sorted?
Have you tried Tez? Tez provides summary statistics that tell you whether
predicate pushdown is being used. Maybe you need to use HiveContext.
Perhaps a bloom filter could make sense for you as well.
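For the highly selective predicate above (the terminal_type = 25066 case), an ORC bloom filter is declared at table creation time. The TBLPROPERTIES keys below are the real ORC ones; the table name, column choice, and FPP value are example assumptions:

```sql
-- Bloom filter on the lookup column (names and values are illustrative).
CREATE TABLE events (
  terminal_type INT
, payload       STRING
)
STORED AS ORC
TBLPROPERTIES (
  "orc.bloom.filter.columns" = "terminal_type"  -- columns to build filters for
, "orc.bloom.filter.fpp"     = "0.05"           -- target false-positive probability
);
```

A row group whose bloom filter shows terminal_type = 25066 cannot be present is skipped without reading its rows; a near-ubiquitous value like terminal_type = 0 gains nothing, since almost every row group will match anyway.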
> On 16 Mar 2016, at 12:45, Joseph wrote:
Hi,
The parameters that control the stripe size and the row-group size are
configurable via the ORC table creation script:
CREATE TABLE dummy (
ID INT
, CLUSTERED INT
, SCATTERED INT
, RANDOMISED INT
, RANDOM_STRING VARCHAR(50)
, SMALL_VC VARCHAR(10)
, PADDING VARCHAR(10)
)
CLUSTERED BY (ID) INTO 256 BUCKETS       -- bucket count is an example value
STORED AS ORC
TBLPROPERTIES (
"orc.compress" = "SNAPPY"                -- compression codec (example)
, "orc.stripe.size" = "268435456"        -- stripe size in bytes (example: 256 MB)
, "orc.row.index.stride" = "10000"       -- rows per row-group index entry (example)
)
Hi all,
I understand that ORC provides three levels of indexes within each file: file
level, stripe level, and row level.
The file- and stripe-level statistics are stored in the file footer, so they
are easy to access when deciding whether the rest of the file needs to be read
at all.
Row level indexes
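Row-group indexes store min/max statistics every orc.row.index.stride rows, and Hive only consults them when index filtering is switched on. The setting and property names below are real Hive/ORC ones; the values are example choices:

```sql
-- Enable predicate pushdown and ORC row-group index filtering for the session.
SET hive.optimize.ppd = true;
SET hive.optimize.index.filter = true;

-- The stride is fixed when data is written; changing it affects
-- subsequently written files only (10,000 rows is an example value).
ALTER TABLE dummy SET TBLPROPERTIES ("orc.row.index.stride" = "10000");
```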