I am trying to join a small DataFrame (say, 100 records) with an ORC file containing 500 million records (about 25 bytes each; the volume can grow to 4-5 billion records) through Spark.
I used the Spark HiveContext API.

*ORC file creation code*

```java
// fsdtRdd is a JavaRDD<Row>, fsdtSchema is a StructType schema
DataFrame fsdtDf = hiveContext.createDataFrame(fsdtRdd, fsdtSchema);
fsdtDf.write().mode(SaveMode.Overwrite).orc("orcFileToRead");
```

*ORC file reading code*

```java
HiveContext hiveContext = new HiveContext(sparkContext);
DataFrame orcFileData = hiveContext.read().orc("orcFileToRead");
// allRecords is the small DataFrame (~100 records)
DataFrame processDf = allRecords.join(orcFileData,
        allRecords.col("id").equalTo(orcFileData.col("id")), "left_outer");
processDf.show();
```

When I read the ORC file, I see the following in my Spark logs (note the *min key = null, max key = null* part):

```
Input split: file:/C:/AOD_PID/PVP.vincir_frst_seen_tran_dt_ORC/part-r-00024-b708c946-0d49-4073-9cd1-5cc46bd5972b.orc:0+3163348
min key = null, max key = null
Reading ORC rows from file:/C:/AOD_PID/PVP.vincir_frst_seen_tran_dt_ORC/part-r-00024-b708c946-0d49-4073-9cd1-5cc46bd5972b.orc with {include: [true, true, true], offset: 0, length: 9223372036854775807}
Finished task 55.0 in stage 2.0 (TID 59). 2455 bytes result sent to driver
Starting task 56.0 in stage 2.0 (TID 60, localhost, partition 56, PROCESS_LOCAL, 2220 bytes)
Finished task 55.0 in stage 2.0 (TID 59) in 5846 ms on localhost (56/84)
Running task 56.0 in stage 2.0 (TID 60)
```

Although the Spark job completes successfully, it does not seem to be using the ORC index: every task appears to scan its entire block of ORC data before moving on, instead of skipping stripes based on the min/max statistics.

*Question*

- Is this normal behaviour, or do I have to set some configuration before saving the data in ORC format?
- If it is *normal*, what is the best way to join so that non-matching records are discarded at the disk level (ideally by loading only the index part of the ORC data)?
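For context, here are the two directions I have been considering (sketches only, untested; I am not sure either is the right approach). The first is to enable ORC predicate pushdown and pre-filter the big side by the ~100 join keys, so that the ORC reader can use its min/max indexes to skip stripes. This assumes `spark.sql.orc.filterPushdown` is off by default in Spark 1.x, and I am not certain an `isin()` filter is actually converted into an ORC SearchArgument:

```java
import java.util.List;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;

// Sketch, untested: enable ORC predicate pushdown
// (I believe it is disabled by default in Spark 1.x).
hiveContext.setConf("spark.sql.orc.filterPushdown", "true");

// Collect the ~100 join keys to the driver...
List<Row> keyRows = allRecords.select("id").collectAsList();
Object[] keys = new Object[keyRows.size()];
for (int i = 0; i < keyRows.size(); i++) {
    keys[i] = keyRows.get(i).get(0);
}

// ...and push them down as an isin() filter before the join, hoping the
// ORC reader turns it into a SearchArgument and skips non-matching stripes.
DataFrame prefiltered = orcFileData.filter(orcFileData.col("id").isin(keys));
DataFrame processDf = allRecords.join(prefiltered,
        allRecords.col("id").equalTo(prefiltered.col("id")), "left_outer");
```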
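The other direction is a broadcast join, since `allRecords` is tiny. As far as I can tell this avoids shuffling the 500-million-row side but does not by itself touch the ORC index, and I am not sure a left outer join that preserves the broadcast (small) side can run as a broadcast hash join in this Spark version, so the sketch uses an inner join:

```java
import static org.apache.spark.sql.functions.broadcast;

// Sketch, untested: hint Spark to broadcast the ~100-row DataFrame so the
// join runs as a map-side broadcast hash join instead of a shuffle join.
// Note: inner join here; I am not sure the left-outer variant (preserving
// the small side) can use a broadcast in Spark 1.x.
DataFrame processDf = orcFileData.join(broadcast(allRecords),
        orcFileData.col("id").equalTo(allRecords.col("id")));
```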