Hi,

Using sparkHiveContext, we observed that a read scanned all rows with age between 0 and 100, even though we only requested rows where age is less than 15. Such a full table scan is an expensive operation.
ORC avoids this type of overhead by using predicate push-down with three levels of built-in indexes within each file: file level, stripe level, and row level:
- File- and stripe-level statistics are stored in the file footer, making it easy to determine whether the rest of the file needs to be read.
- Row-level indexes include column statistics for each row group, plus the position for seeking to the start of the row group.

ORC uses these indexes to move the filter operation into the data-loading phase, reading only the data that could potentially contain the required rows.

My doubt is: when we run a query against an ORC table through hiveContext in Spark, with

sqlContext.setConf("spark.sql.orc.filterPushdown", "true")

how does it perform?
1. Does it fetch only the records matching the query from the ORC file, or
2. Does it load the ORC file into Spark and then run a Spark job that applies predicate push-down to return the records?

(I am aware that hiveContext gives Spark only the metadata and the location of the data.)

Thanks
Manish
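To make my understanding of the stripe-level pruning concrete, here is a small illustrative sketch (plain Python, not actual ORC reader code; the stripe layout and min/max values are made up). It shows how per-stripe min/max statistics let a reader skip stripes that cannot match a predicate like age < 15:

```python
# Hypothetical stripes, each carrying min/max statistics for the "age"
# column the way ORC stores per-stripe column statistics in the footer.
stripes = [
    {"age_min": 0,  "age_max": 10, "rows": [3, 7, 10]},
    {"age_min": 20, "age_max": 45, "rows": [20, 31, 45]},
    {"age_min": 5,  "age_max": 60, "rows": [5, 14, 60]},
]

def scan_with_pushdown(stripes, upper_bound):
    """Read only stripes whose statistics say they MAY contain rows with
    age < upper_bound, then filter the surviving stripes row by row."""
    matched, stripes_read = [], 0
    for stripe in stripes:
        if stripe["age_min"] >= upper_bound:
            continue  # statistics prove no row can match: stripe skipped
        stripes_read += 1
        matched.extend(a for a in stripe["rows"] if a < upper_bound)
    return matched, stripes_read

rows, read = scan_with_pushdown(stripes, 15)
print(rows, read)  # the middle stripe (min age 20) is never read
```

The row-level filter is still needed on the stripes that are read, because the statistics only prove that a stripe might contain matches, not that every row in it matches.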