Hi,

When using Spark's HiveContext, we ended up reading all rows, where age
ranged from 0 to 100, even though we requested only rows where age was less
than 15. Such a full table scan is an expensive operation.

ORC avoids this type of overhead by using predicate push-down with three
levels of built-in indexes within each file: file level, stripe level, and
row level:

   - File and stripe level statistics are in the file footer, making it easy
     to determine if the rest of the file needs to be read.
   - Row level indexes include column statistics for each row group and
     position, for seeking to the start of the row group.

ORC uses these indexes to move the filter operation to the data-loading
phase, reading only the data that could potentially include the required rows.
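
For illustration, here is a minimal sketch of how this looks from Spark
(Scala, Spark 1.x API); the table name "people" and the column "age" are
hypothetical placeholders matching the example above:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("OrcPushdownDemo"))
val hiveContext = new HiveContext(sc)

// Enable ORC predicate push-down (not on by default in Spark 1.x).
hiveContext.setConf("spark.sql.orc.filterPushdown", "true")

// The age < 15 filter can be handed to the ORC reader, which consults
// file/stripe/row-group statistics and skips any stripe or row group
// whose min/max for age cannot contain a matching row.
val young = hiveContext.sql("SELECT * FROM people WHERE age < 15")
young.show()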


My doubt is: when we run a query through hiveContext on an ORC table in
Spark, with

sqlContext.setConf("spark.sql.orc.filterPushdown", "true")

how will it perform?

1. Will it fetch only those records from the ORC file that match the query, or

2. Will it load the whole ORC file into Spark and then run a Spark job, using
predicate push-down, to give you the records?

(I am aware that hiveContext gives Spark only the metadata and the location
of the data.)
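
For what it's worth, one way to check which of the two happens (a sketch,
again assuming the hypothetical "people" table) is to print the physical
plan. When push-down applies, the ORC scan node lists the filter under
PushedFilters (the exact wording varies across Spark versions); otherwise
the filter appears only as a separate Filter step executed in Spark after
the data has been loaded:

hiveContext.sql("SELECT * FROM people WHERE age < 15").explain(true)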


Thanks

Manish
