Right, your concern is that you expect the stored index in the ORC file to
help the optimizer.

Frankly I do not know what
write().mode(SaveMode.Overwrite).orc("orcFileToRead") actually does under
the bonnet. From my experience, in order for the ORC index to be used you
need to bucket the table. I have explained this before here
<https://www.linkedin.com/pulse/apache-hive-data-warehouse-proposal-improve-external-mich?trk=pulse_spock-articles>

Now it is also possible that you have not updated the statistics on the table.
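
If the data is registered as a Hive table (I will call it orc_lookup purely
for illustration), something along these lines, via HiveContext, should
refresh the statistics; if your Spark version does not accept these
statements, they can be run from the Hive CLI instead:

// Sketch only: the table name orc_lookup is hypothetical
hiveContext.sql("ANALYZE TABLE orc_lookup COMPUTE STATISTICS");
hiveContext.sql("ANALYZE TABLE orc_lookup COMPUTE STATISTICS FOR COLUMNS");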

Even with Spark I tend to create my ORC table explicitly through Spark SQL.

You stated that the join scans the entire underlying ORC table. Your "id"
column, I assume, is unique, so I would bucket the table on the id column,
along the lines of the sketch below.
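
As a rough sketch (table name, column names and bucket count are purely
illustrative, and I have not run this against your data):

// Sketch only: a bucketed ORC table created explicitly through Spark SQL.
// orc.create.index asks the writer to build the row-group indexes.
hiveContext.sql("CREATE TABLE orc_lookup (id STRING, payload STRING) "
    + "CLUSTERED BY (id) INTO 256 BUCKETS "
    + "STORED AS ORC TBLPROPERTIES ('orc.create.index' = 'true')");
// Bucketing is only honoured on insert when enforcement is on.
hiveContext.sql("SET hive.enforce.bucketing = true");
hiveContext.sql("INSERT OVERWRITE TABLE orc_lookup "
    + "SELECT id, payload FROM tmp");

Then point the join at orc_lookup rather than at the raw directory of ORC
part files.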


HTH




Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 20 June 2016 at 07:07, Mohanraj Ragupathiraj <mohanaug...@gmail.com>
wrote:

> Hi Mich,
>
> Thank you for your reply.
>
> Let me explain more clearly.
>
> A file with 100 records needs to be joined with a big lookup file created
> in ORC format (500 million records). The Spark process I wrote returns
> the matching records and works fine. My concern is that it loads the
> entire file (500 million records) and matches it against the 100 records,
> instead of loading only the stripes with matching keys. I read that the
> ORC format provides indexes (https://orc.apache.org/docs/indexes.html)
> and I assumed that when I join using DataFrames the indexes would be
> used, resulting in loading only the matching records/stripes for
> processing instead of the whole table.
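>
> As an aside, Spark does have an ORC filter pushdown flag
> (spark.sql.orc.filterPushdown, off by default in 1.x). A minimal sketch,
> with the literal keys "k1" and "k2" purely illustrative; it is not clear
> to me whether plain join keys are ever pushed down this way:
>
> // Sketch only: enables pushdown of filter predicates into the ORC reader
> hiveContext.setConf("spark.sql.orc.filterPushdown", "true");
> DataFrame pruned = orcFileData.filter(orcFileData.col("id").isin("k1", "k2"));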
>
> On Mon, Jun 20, 2016 at 1:00 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Hi,
>>
>> To start, when you store the data in the ORC file, can you verify that
>> the data is there?
>>
>> For example, register it as a temp table and count it:
>>
>> processDF.registerTempTable("tmp")
>> sql("select count(1) from tmp").show
>>
>> Also what do you mean by index file in ORC?
>>
>> HTH
>>
>>
>>
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 20 June 2016 at 05:01, Mohanraj Ragupathiraj <mohanaug...@gmail.com>
>> wrote:
>>
>>> I am trying to join a DataFrame (say 100 records) with an ORC file of
>>> 500 million records through Spark (this can grow to 4-5 billion records,
>>> 25 bytes each).
>>>
>>> I used the Spark HiveContext API.
>>>
>>> *ORC File Creation Code*
>>>
>>> // fsdtRdd is a JavaRDD<Row>, fsdtSchema is the StructType schema
>>> DataFrame fsdtDf = hiveContext.createDataFrame(fsdtRdd,fsdtSchema);
>>> fsdtDf.write().mode(SaveMode.Overwrite).orc("orcFileToRead");
>>>
>>> *ORC File Reading Code*
>>>
>>> HiveContext hiveContext = new HiveContext(sparkContext);
>>> DataFrame orcFileData= hiveContext.read().orc("orcFileToRead");
>>> // allRecords is dataframe
>>> DataFrame processDf = allRecords.join(
>>>     orcFileData,
>>>     allRecords.col("id").equalTo(orcFileData.col("id")),
>>>     "left_outer");
>>> processDf.show();
>>>
>>> When I read the ORC file, I get the following in my Spark logs:
>>>
>>> Input split: file:/C:/AOD_PID/PVP.vincir_frst_seen_tran_dt_ORC/part-r-00024-b708c946-0d49-4073-9cd1-5cc46bd5972b.orc:0+3163348
>>> min key = null, max key = null
>>> Reading ORC rows from file:/C:/AOD_PID/PVP.vincir_frst_seen_tran_dt_ORC/part-r-00024-b708c946-0d49-4073-9cd1-5cc46bd5972b.orc with {include: [true, true, true], offset: 0, length: 9223372036854775807}
>>> Finished task 55.0 in stage 2.0 (TID 59). 2455 bytes result sent to driver
>>> Starting task 56.0 in stage 2.0 (TID 60, localhost, partition 56, PROCESS_LOCAL, 2220 bytes)
>>> Finished task 55.0 in stage 2.0 (TID 59) in 5846 ms on localhost (56/84)
>>> Running task 56.0 in stage 2.0 (TID 60)
>>>
>>> Although the Spark job completes successfully, I think it is not able
>>> to utilize the ORC index capability and thus scans through the entire
>>> block of ORC data before moving on.
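>>>
>>> A minimal diagnostic (assuming Spark 1.5+): print the physical plan and
>>> look at the ORC scan node; if predicate pushdown applied, it would
>>> normally list the pushed filters there:
>>>
>>> // Sketch only: inspect the plan; an empty filter list on the scan
>>> // suggests a full read of the ORC data
>>> processDf.explain(true);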
>>>
>>> *Question*
>>>
>>> -- Is this normal behaviour, or do I have to set some configuration
>>> before saving the data in ORC format?
>>>
>>> -- If it is *NORMAL*, what is the best way to join so that we discard
>>> non-matching records at the disk level (maybe loading only the index
>>> portion of the ORC data)?
>>>
>>
>>
>
>
> --
> Thanks and Regards
> Mohan
> VISA Pte Limited, Singapore.
>
