Re: Apache Spark orc read performance when reading large number of small files

gpatcham Thu, 01 Nov 2018 10:42:33 -0700

When I run spark.read.orc("hdfs://test").filter("conv_date = 20181025").count
with "spark.sql.orc.filterPushdown=true", I see the lines below in the executor
logs, so predicate pushdown is happening:


18/11/01 17:31:17 INFO OrcInputFormat: ORC pushdown predicate: leaf-0 = (IS_NULL conv_date)
leaf-1 = (EQUALS conv_date 20181025)
expr = (and (not leaf-0) leaf-1)


But when I run the same filter as a Hive query in Spark, I see the logs below,
with no pushdown predicate line:

Hive table: Hive

spark.sql("select * from test where conv_date = 20181025").count

18/11/01 17:37:57 INFO HadoopRDD: Input split: hdfs://test/test1.orc:0+34568
18/11/01 17:37:57 INFO OrcRawRecordMerger: min key = null, max key = null
18/11/01 17:37:57 INFO ReaderImpl: Reading ORC rows from
hdfs://test/test1.orc with {include: [true, false, false, false, true,
false, false, false, false, false, false, false, false, false, false, false,
false, false, false, false, false, false, false, false, false, false, false,
false, false, false, false, false, false, false, false, false, false, false,
false, false, false, false, false, false, false, false, false, false, false,
false, false, false, false, false, false, false, false, false, false, false,
false, false, false], offset: 0, length: 9223372036854775807}
18/11/01 17:37:57 INFO Executor: Finished task 224.0 in stage 0.0 (TID 33). 1662 bytes result sent to driver
18/11/01 17:37:57 INFO CoarseGrainedExecutorBackend: Got assigned task 40
18/11/01 17:37:57 INFO Executor: Running task 956.0 in stage 0.0 (TID 40)
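
For anyone who wants to reproduce the comparison, here is a minimal sketch of the two read paths above run side by side, with explain() added so the physical plans can be compared. It assumes a Spark build with Hive support and the same path and table name as in my commands. The object name OrcPushdownCheck and the app name are placeholders, and spark.sql.hive.convertMetastoreOrc is a setting I did not use in the runs above; it asks Spark to read Hive metastore ORC tables with its own ORC reader, and I have not confirmed that it changes the behaviour for this table.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical driver object; the name is a placeholder.
object OrcPushdownCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("orc-pushdown-check")
      .enableHiveSupport()
      // Setting used in the runs quoted above.
      .config("spark.sql.orc.filterPushdown", "true")
      // Assumption: not set in the original runs. Asks Spark to use its
      // own ORC reader for Hive metastore ORC tables.
      .config("spark.sql.hive.convertMetastoreOrc", "true")
      .getOrCreate()

    // Path 1: read the ORC files directly.
    val direct = spark.read.orc("hdfs://test").filter("conv_date = 20181025")
    direct.explain() // prints the physical plan for the direct read
    println(direct.count())

    // Path 2: the same filter through the Hive table.
    val viaHive = spark.sql("select * from test where conv_date = 20181025")
    viaHive.explain() // compare this plan's scan node with the one above
    println(viaHive.count())

    spark.stop()
  }
}
```

If the scan node in the second plan still differs from the first, the Hive table is presumably still being read through the Hive SerDe path shown in the HadoopRDD log lines.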





