Re: The build-in indexes in ORC file does not work.

2016-03-19 Thread Mich Talebzadeh
I did some tests on Hive running on MR to rule out any Spark effects. In a partitioned ORC table, partition elimination with predicate pushdown works, and the query is narrowed to the partition itself. I can see that from the number of rows within that partition. For example below

Re: The build-in indexes in ORC file does not work.

2016-03-19 Thread Jörn Franke
How much data are you querying? What is the query? How selective is it supposed to be? What is the block size?

> On 16 Mar 2016, at 11:23, Joseph wrote:
>
> Hi all,
>
> I have known that ORC provides three level of indexes within each file, file
> level, stripe level, and

RE: The build-in indexes in ORC file does not work.

2016-03-19 Thread Wietsma, Tristan A.
Regarding bloom filters, https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-12417

From: Joseph
Sent: Wednesday, March 16, 2016 9:46:25 AM
To: user
Cc: user; user
Subject: Re: Re: The build-in

Re: Re: The build-in indexes in ORC file does not work.

2016-03-19 Thread Joseph
terminal_type = 0: 260,000,000 rows, covering almost half of the whole data. terminal_type = 25066: just 3800 rows. orc

Re: The build-in indexes in ORC file does not work.

2016-03-18 Thread Jörn Franke
Not sure it should work. How many rows are affected? Is the data sorted? Have you tried with Tez? Tez has some summary statistics that tell you whether pushdown is used. Maybe you need to use HiveContext. Perhaps a bloom filter could make sense for you as well.

> On 16 Mar 2016, at 12:45, Joseph
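The question about sorting matters because row-group skipping relies on min/max ranges. The sketch below is a simplified illustration, not the ORC reader itself: `row_groups_to_read` and the two sample layouts are hypothetical names, and the stride of 100 is an arbitrary assumption chosen to keep the numbers small. It counts how many row groups a point lookup cannot skip when the same values are stored sorted versus scattered.

```python
def row_groups_to_read(values, target, stride):
    """Count row groups whose [min, max] range could contain `target`.

    A group can be skipped only when `target` lies outside its range,
    so this is the number of groups a reader would still have to scan.
    """
    groups = [values[i:i + stride] for i in range(0, len(values), stride)]
    return sum(1 for g in groups if min(g) <= target <= max(g))


# Assumed sample data: the same 1000 distinct values in two layouts.
sorted_layout = list(range(1000))
# 37 is coprime with 1000, so this is a permutation of 0..999
# in which every group of 100 rows spans nearly the whole value range.
scattered_layout = [(i * 37) % 1000 for i in range(1000)]

if __name__ == "__main__":
    print(row_groups_to_read(sorted_layout, 500, 100))
    print(row_groups_to_read(scattered_layout, 500, 100))
```

With the sorted layout only the one group covering the value overlaps it, while in the scattered layout every group's range overlaps, so nothing can be skipped even though the predicate is just as selective.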

Re: The build-in indexes in ORC file does not work.

2016-03-16 Thread Mich Talebzadeh
Hi,

The parameters that control the stripe and row group are configurable via the ORC creation script:

CREATE TABLE dummy (
    ID INT
  , CLUSTERED INT
  , SCATTERED INT
  , RANDOMISED INT
  , RANDOM_STRING VARCHAR(50)
  , SMALL_VC VARCHAR(10)
  , PADDING VARCHAR(10)
)
CLUSTERED BY (ID)

The build-in indexes in ORC file does not work.

2016-03-16 Thread Joseph
Hi all,

I understand that ORC provides three levels of indexes within each file: file level, stripe level, and row level. The file- and stripe-level statistics are in the file footer, so they are easy to access to determine whether the rest of the file needs to be read at all. Row level indexes
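As a rough illustration of the three levels described above, the following sketch models a file as a list of stripes, each holding row groups of integer values, and checks min/max statistics from the coarsest level down. It is an assumption-laden toy: `stats`, `may_contain`, and `scan` are hypothetical names, and the statistics are computed on the fly here, whereas a real ORC file stores them precomputed at write time.

```python
def stats(values):
    """Return (min, max) for a collection of values."""
    return (min(values), max(values))


def may_contain(st, value):
    """True if `value` falls inside the [min, max] range `st`."""
    lo, hi = st
    return lo <= value <= hi


def scan(stripes, value):
    """Equality lookup with file, stripe, and row-group pruning.

    `stripes` is a list of stripes; each stripe is a list of row groups;
    each row group is a list of ints.
    Returns (matching_rows, row_groups_actually_read).
    """
    # File level: skip the whole file if the value is out of range.
    file_values = [v for stripe in stripes for group in stripe for v in group]
    if not may_contain(stats(file_values), value):
        return 0, 0

    matches = 0
    groups_read = 0
    for stripe in stripes:
        # Stripe level: skip stripes whose range excludes the value.
        stripe_values = [v for group in stripe for v in group]
        if not may_contain(stats(stripe_values), value):
            continue
        for group in stripe:
            # Row level: skip row groups whose range excludes the value.
            if not may_contain(stats(group), value):
                continue
            groups_read += 1
            matches += group.count(value)
    return matches, groups_read
```

For example, with `stripes = [[[0, 0, 1], [2, 3, 4]], [[10, 11, 12], [25066, 25066, 30000]]]`, a lookup for 25066 skips the first stripe entirely and reads only one of the four row groups.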