How much data are you querying? What is the query? How selective is it supposed 
to be? What is the block size?

> On 16 Mar 2016, at 11:23, Joseph <wxy81...@sina.com> wrote:
> 
> Hi all,
> 
> I have learned that ORC provides three levels of indexes within each file: file 
> level, stripe level, and row level. 
> The file- and stripe-level statistics are in the file footer, so they are 
> easy to access to determine whether the rest of the file needs to be read at all. 
> Row-level indexes include both column statistics for each row group and the 
> position for seeking to the start of the row group. 
> 
> The following is my understanding:
> 1. The file- and stripe-level indexes are always generated; we cannot 
> control them.
> 2. The row-level indexes can be configured by "orc.create.index" (whether to 
> create row indexes) and "orc.row.index.stride" (the number of rows between index 
> entries).
> 3. Each index holds min/max statistics for each column, so sorting the data by 
> the filter column should bring better performance.
> 4. To use any of the three levels of indexes, we should enable predicate 
> push-down by setting spark.sql.orc.filterPushdown=true (in Spark SQL) or 
> hive.optimize.ppd=true (in Hive).
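[For reference, a hedged sketch of how the settings above are typically enabled. The paths and table name are placeholders; hive.optimize.index.filter is an additional Hive setting that, to my knowledge, controls whether ORC's built-in indexes are actually consulted at read time, and is off by default in Hive 1.2.]

```shell
# Spark 1.5.x: ORC predicate push-down is off by default, so enable it explicitly.
spark-shell --conf spark.sql.orc.filterPushdown=true

# Hive: enable predicate push-down and (assumption) storage-index filtering,
# and declare the row-index properties on the table. 10000 is the default stride.
hive -e "
SET hive.optimize.ppd=true;
SET hive.optimize.index.filter=true;

CREATE TABLE t (id BIGINT, val STRING)
STORED AS ORC
TBLPROPERTIES ('orc.create.index'='true', 'orc.row.index.stride'='10000');
"
```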
> 
> But I found that the built-in indexes in ORC files did not work in either Spark 
> 1.5.2 or Hive 1.2.1:
> First, when the query's where clause didn't match any record (the 
> filter column had a value beyond the range of the data), the performance with 
> predicate push-down enabled was almost the same as with predicate 
> push-down disabled. I think that when the filter column has a value beyond the 
> range of the data, none of the ORC files should be scanned if the file-level 
> indexes are used, so the performance should improve obviously.
> 
> Second, when "orc.create.index" was enabled, the data was sorted by the filter 
> column, and the where clause could only match a few records, the performance with 
> predicate push-down enabled was almost the same as with predicate 
> push-down disabled. 
> 
> Third, with predicate push-down and "orc.create.index" enabled, the 
> performance when the filter column had a value beyond the range of the data was 
> almost the same as when the filter column had a value covering almost the whole 
> data set. 
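[A hedged suggestion for narrowing this down: dump the ORC file metadata to confirm the statistics and row indexes were actually written. The file path is a placeholder, and the --rowindex flag requires a reasonably recent Hive.]

```shell
# Print ORC metadata: file- and stripe-level column statistics.
hive --orcfiledump /path/to/file.orc

# Also print the row-group indexes for columns 0 and 1 (Hive 1.1+).
hive --orcfiledump --rowindex 0,1 /path/to/file.orc
```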
> 
> So, has anyone used ORC's built-in indexes before (especially in Spark SQL)? 
> What is my issue?
> 
> Thanks!
> 
> Joseph
