How much data are you querying? What is the query? How selective it is supposed to be? What is the block size?
> On 16 Mar 2016, at 11:23, Joseph <wxy81...@sina.com> wrote: > > Hi all, > > I have known that ORC provides three level of indexes within each file, file > level, stripe level, and row level. > The file and stripe level statistics are in the file footer so that they are > easy to access to determine if the rest of the file needs to be read at all. > Row level indexes include both column statistics for each row group and > position for seeking to the start of the row group. > > The following is my understanding: > 1. The file and stripe level indexes are forcibly generated, we can not > control them. > 2. The row level indexes can be configured by "orc.create.index"(whether to > create row indexes) and "orc.row.index.stride"(number of rows between index > entries). > 3. Each Index has statistics of min, max for each column, so sort data by the > filter column will bring better performance. > 4. To use any one of the three level of indexes,we should enable predicate > push-down by setting spark.sql.orc.filterPushdown=true (in sparkSQL) or > hive.optimize.ppd=true (in hive). > > But I found the build-in indexes in ORC files did not work both in spark > 1.5.2 and hive 1.2.1: > First, when the query statement with where clause did't match any record (the > filter column had a value beyond the range of data), the performance when > enabled predicate push-down was almost the same with when disabled predicate > push-down. I think, when the filter column has a value beyond the range of > data, all of the orc files will not be scanned if use file level indexes, so > the performance should improve obviously. > > The second, when enabled "orc.create.index" and sorted data by filter column > and where clause can only match a few records, the performance when enabled > predicate push-down was almost the same with when disabled predicate > push-down. > > The third, when enabled predicate push-down and "orc.create.index", the > performance when filter column had a value beyond the range of data was > almost the same with when filter column had a value covering almost the whole > data. > > So, has anyone used ORC's build-in indexes before (especially in spark SQL)? > What's my issue? > > Thanks! > > Joseph