Hi,

The parameters that control the stripe and row group sizes are configurable
in the ORC table's creation script:

CREATE TABLE dummy (
     ID INT
   , CLUSTERED INT
   , SCATTERED INT
   , RANDOMISED INT
   , RANDOM_STRING VARCHAR(50)
   , SMALL_VC VARCHAR(10)
   , PADDING  VARCHAR(10)
)
CLUSTERED BY (ID) INTO 256 BUCKETS
STORED AS ORC
TBLPROPERTIES (
"orc.create.index"="true",
"orc.bloom.filter.columns"="ID",
"orc.bloom.filter.fpp"="0.05",
"orc.compress"="SNAPPY",
"orc.stripe.size"="16777216",
"orc.row.index.stride"="10000" )
;
Here I make the stripe quite small at 16 MB (as opposed to the default of
64 MB) and set orc.row.index.stride to 10000 rows.
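As a quick sanity check (a sketch, assuming the dummy table above), you can confirm that these properties actually took effect on the table:

```sql
-- Show all table properties, including the ORC ones set at creation time
SHOW TBLPROPERTIES dummy;

-- Or query a single property by name
SHOW TBLPROPERTIES dummy ("orc.stripe.size");
```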

You can inspect the row group statistics available for the various
columns (0, 1, 2, 3, ...) with something like:

hive --orcfiledump --rowindex 0,1,2,3,4,5,6
/user/hive/warehouse/test.db/dummy/000000_0

In reality I have found that the only occasions the stats are actually used
are when you bucket the ORC table or use partitioning. There are also
dependencies on the block size and on how many rows fit in each block. If
the whole table fits in a single block, I believe the stats are ignored (at
least this was the case in older versions of Hive; I use Hive 2).
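One way to probe whether the stats help (a sketch, assuming the dummy table above; hive.optimize.index.filter is the Hive setting that turns on ORC row-group filtering, and exact behaviour may vary by version) is to run a count with a predicate that falls entirely outside the data's range:

```sql
-- Enable predicate push-down and ORC row-group filtering
SET hive.optimize.ppd=true;
SET hive.optimize.index.filter=true;

-- A predicate outside the range of ID: with working indexes, row groups
-- (or whole stripes/files) should be skipped rather than scanned
SELECT COUNT(1) FROM dummy WHERE id = -1;
```

If the indexes are being used, this query should come back markedly faster than one whose predicate covers most of the data.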

Check the optimiser plan with:

explain extended <YOUR_CODE>
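For example (a hypothetical query against the dummy table above):

```sql
EXPLAIN EXTENDED
SELECT small_vc
FROM   dummy
WHERE  id BETWEEN 100 AND 200;
```

If push-down applied, the TableScan operator in the plan should show the predicate as a filterExpr rather than leaving all filtering to a later Filter operator.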

HTH

Dr Mich Talebzadeh



LinkedIn:
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 16 March 2016 at 10:23, Joseph <wxy81...@sina.com> wrote:

> Hi all,
>
> I have known that ORC provides three levels of indexes within each file:
> *file level, stripe level, and row level*.
>
> The file and stripe level statistics are in the file footer so that they are 
> easy to access to determine if the rest of the file needs to be read at all.
>
> Row level indexes include both column statistics for each row group and 
> position for seeking to the start of the row group.
>
> The following is my understanding:
>
> 1. The file and stripe level indexes are always generated; we cannot 
> control them.
>
> 2. The row level indexes can be configured by "orc.create.index" (whether 
> to create row indexes) and "orc.row.index.stride" (number of rows between 
> index entries).
>
> 3. Each index has min/max statistics for each column, so sorting the data 
> by the filter column will bring better performance.
>
> 4. To use any one of the three levels of indexes, we should enable predicate 
> push-down by setting
> *spark.sql.orc.filterPushdown=true* (in sparkSQL) or
> *hive.optimize.ppd=true* (in hive).
>
> *But I found the built-in indexes in ORC files
> did not work in either spark 1.5.2 or hive 1.2.1:*
> First, when the where clause didn't match any record (the filter column 
> was compared against a value beyond the range of the data), the 
> performance with predicate push-down enabled was almost the same as with 
> it disabled. I think that when the filter column is compared against a 
> value beyond the range of the data, none of the ORC files should be 
> scanned at all if the file level indexes are used, so the performance 
> should improve obviously.
>
> Second, when "orc.create.index" was enabled, the data was sorted by the 
> filter column, and the where clause matched only a few records, the 
> performance with predicate push-down enabled was almost the same as with 
> it disabled.
>
> Third, with predicate push-down and "orc.create.index" both enabled, the 
> performance when the filter column was compared against a value beyond 
> the range of the data was almost the same as when the value covered 
> almost the whole data.
>
> So, has anyone used ORC's built-in indexes before (especially in Spark
> SQL)?  What's my issue?
>
> Thanks!
>
> ------------------------------
> Joseph
>
