The question remains, though, of why it is even accessing files for a
column family that should be excluded based on the Scan. And that column
family does NOT specify prefetch on open in its schema. Only the one we
want to read specifies prefetch on open, which is the setting we want to
override, if possible, for the Spark job.
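
One workaround I am considering, though I have NOT verified that it wins
over the family-level flag in 1.4.0, is to disable the client-side block
cache entirely in the job's HBase configuration, so the prefetcher has no
cache to load blocks into. A minimal sketch (hBaseConf is the same
Configuration we hand to the job):

    // Assumption: with hfile.block.cache.size at 0, CacheConfig should not
    // instantiate a block cache at all, making prefetch-on-open a no-op.
    // Not yet verified against the TableSnapshotInputFormat code path.
    hBaseConf.setFloat("hfile.block.cache.size", 0f);

If anyone knows whether this actually short-circuits the "hfile-prefetcher"
threads, I'd appreciate a pointer.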

----
Saad

On Sat, Mar 10, 2018 at 9:51 PM, Saad Mufti <saad.mu...@gmail.com> wrote:

> See below for more that I found on item 3.
>
> Cheers.
>
> ----
> Saad
>
> On Sat, Mar 10, 2018 at 7:17 PM, Saad Mufti <saad.mu...@gmail.com> wrote:
>
>> Hi,
>>
>> I am running a Spark job (Spark 2.2.1) on an EMR cluster in AWS. There is
>> no HBase installed on the cluster, only HBase libs linked into my Spark
>> app. We are reading snapshot info from an HBase folder in S3 using the
>> TableSnapshotInputFormat class from HBase 1.4.0, so that the Spark job
>> reads the snapshot directly from the S3-based filesystem instead of going
>> through any region server.
>>
>> I have observed a few behaviors while debugging performance that are
>> concerning; some we could mitigate, and others I am looking for clarity on:
>>
>> 1) The TableSnapshotInputFormatImpl code tries to gather locality
>> information for the region splits. For a snapshot with a large number of
>> files (over 350,000 in our case), this causes a single-threaded scan of
>> all the file listings in the driver. And it is useless, because there is
>> really no locality information to glean when all the files are in S3
>> rather than HDFS. So I was forced to make a copy of
>> TableSnapshotInputFormatImpl.java in our code and guard this behavior
>> with a config setting I made up (sketched below). That got rid of the
>> hours-long scan, so I am good with this part for now.
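>>
>> For reference, the guard I added in our copy looks roughly like this
>> (paraphrased from memory; the config key name is one I made up):
>>
>>     // in our copy of TableSnapshotInputFormatImpl#getSplits(), per region
>>     List<String> hosts = Collections.emptyList();
>>     if (conf.getBoolean("our.snapshot.locality.enabled", true)) {
>>       // walks every HFile to compute a block distribution -- single
>>       // threaded in the driver, and pointless on S3 where there is
>>       // no data locality to exploit
>>       hosts = getBestLocations(conf,
>>           HRegion.computeHDFSBlocksDistribution(conf, htd, hri, tableDir));
>>     }
>>     splits.add(new InputSplit(htd, hri, hosts, scan, restoreDir));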
>>
>> 2) I have added a single column family to the Scan, which I then set on
>> the HBase configuration via:
>>
>> scan.addFamily(str.getBytes())
>>
>> hBaseConf.set(TableInputFormat.SCAN, convertScanToString(scan))
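>>
>> For completeness, the whole read path looks roughly like this (Java,
>> simplified; the snapshot name, family name, restoreDir, and jsc -- our
>> JavaSparkContext -- are placeholders):
>>
>>     Scan scan = new Scan();
>>     scan.addFamily(Bytes.toBytes("cf_small"));  // the only family we want
>>     scan.setCacheBlocks(false);
>>
>>     Job job = Job.getInstance(hBaseConf);
>>     TableSnapshotInputFormat.setInput(job, "my_snapshot", restoreDir);
>>     job.getConfiguration().set(TableInputFormat.SCAN,
>>         TableMapReduceUtil.convertScanToString(scan));
>>
>>     JavaPairRDD<ImmutableBytesWritable, Result> rdd =
>>         jsc.newAPIHadoopRDD(job.getConfiguration(),
>>             TableSnapshotInputFormat.class,
>>             ImmutableBytesWritable.class, Result.class);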
>>
>>
>> But when this code executes under Spark and I observe the threads and
>> logs on the Spark executors, I see it reading S3 files for a column family
>> that was not included in the scan. This column family was intentionally
>> excluded because it is much larger than the others, and we wanted to
>> avoid that cost.
>>
>> Any advice on what I am doing wrong would be appreciated.
>>
>> 3) We also explicitly set block caching to false on the Scan, although I
>> see that TableSnapshotInputFormatImpl.java also sets it to false
>> internally. But when running the Spark job, some executors were taking
>> much longer than others, and when I observe their threads, I see periodic
>> messages about a few hundred MB of RAM used by the block cache, and the
>> thread is sitting there reading data from S3, occasionally blocked by a
>> couple of other threads with "hfile-prefetcher" in their names. Going
>> back to 2) above, they seem to be reading the wrong column family, but in
>> this item I am more concerned about why they appear to be prefetching and
>> caching blocks when the Scan object is set to not cache blocks at all.
>>
>
> I think I figured out item 3: the column family descriptor for the table
> in question has prefetch on open set in its schema. Now, for the Spark job
> I don't think this serves any useful purpose, does it? But I can't see any
> way to override it. If there is, I'd appreciate some advice.
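> (For reference, the flag in question is the one set on the column family
> descriptor when the table schema was defined, i.e. something like the
> following hypothetical snippet:
>
>     // hypothetical illustration of how the schema got this way
>     HColumnDescriptor cf = new HColumnDescriptor("cf_small");
>     cf.setPrefetchBlocksOnOpen(true);
>
> and as far as I can tell, nothing on the Scan or the job configuration
> turns it back off.)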
>
> Thanks.
>
>
>>
>> Thanks in advance for any insights anyone can provide.
>>
>> ----
>> Saad
>>
>>
>
>
