See below for more I found on item 3.

Cheers.

----
Saad

On Sat, Mar 10, 2018 at 7:17 PM, Saad Mufti <saad.mu...@gmail.com> wrote:

> Hi,
>
> I am running a Spark job (Spark 2.2.1) on an EMR cluster in AWS. There is
> no Hbase installed on the cluster, only HBase libs linked to my Spark app.
> We are reading the snapshot info from a HBase folder in S3 using
> TableSnapshotInputFormat class from HBase 1.4.0 to have the Spark job read
> snapshot info directly from the S3 based filesystem instead of going
> through any region server.
>
> I have observed a few behaviors while debugging performance that are
> concerning; some we could mitigate, and others I am looking for clarity on:
>
> 1) The TableSnapshotInputFormatImpl code tries to get locality information
> for the region splits. For a snapshot with a large number of files (over
> 350,000 in our case) this causes a single-threaded scan of all the file
> listings in the driver, and it is useless because there is no locality
> information to glean when all the files are in S3 rather than HDFS. So I
> was forced to make a copy of TableSnapshotInputFormatImpl.java in our code
> and control this with a config setting I made up. That got rid of the
> hours-long scan, so I am good with this part for now.
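> 
> For reference, this is roughly how the driver toggles it; the property name
> below is one we invented for our forked copy of TableSnapshotInputFormatImpl,
> not a stock HBase 1.4.0 setting:
> 
> import org.apache.hadoop.hbase.HBaseConfiguration
> 
> val hBaseConf = HBaseConfiguration.create()
> // Skip the per-HFile block-location lookups, which return nothing useful on S3
> hBaseConf.setBoolean("custom.snapshotinputformat.locality.enabled", false)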
>
> 2) I have set a single column family on the Scan that I set on the HBase
> configuration via
>
> scan.addFamily(str.getBytes())
>
> hBaseConf.set(TableInputFormat.SCAN, convertScanToString(scan))
>
>
> But when this code executes under Spark and I observe the threads and logs
> on the Spark executors, I see it reading S3 files for a column family that
> was not included in the scan. This column family was intentionally excluded
> because it is much larger than the others, and we wanted to avoid that
> cost.
>
> Any advice on what I am doing wrong would be appreciated.
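> 
> In case the full wiring matters, this is roughly what the driver-side setup
> looks like in Scala. The snapshot name, S3 paths, and the "cf_small" family
> are placeholders, and convertScanToString above is
> TableMapReduceUtil.convertScanToString (or an equivalent helper):
> 
> import org.apache.hadoop.fs.Path
> import org.apache.hadoop.hbase.HBaseConfiguration
> import org.apache.hadoop.hbase.client.{Result, Scan}
> import org.apache.hadoop.hbase.io.ImmutableBytesWritable
> import org.apache.hadoop.hbase.mapreduce.{TableInputFormat, TableMapReduceUtil, TableSnapshotInputFormat}
> import org.apache.hadoop.hbase.util.Bytes
> import org.apache.hadoop.mapreduce.Job
> import org.apache.spark.sql.SparkSession
> 
> val spark = SparkSession.builder().appName("snapshot-read").getOrCreate()
> 
> val hBaseConf = HBaseConfiguration.create()
> hBaseConf.set("hbase.rootdir", "s3://my-bucket/hbase")  // placeholder S3 root
> 
> val scan = new Scan()
> scan.addFamily(Bytes.toBytes("cf_small"))  // only the family we actually need
> scan.setCacheBlocks(false)                 // explicitly disable block caching
> hBaseConf.set(TableInputFormat.SCAN, TableMapReduceUtil.convertScanToString(scan))
> 
> val job = Job.getInstance(hBaseConf)
> // restoreDir is scratch space the input format uses for snapshot references
> TableSnapshotInputFormat.setInput(job, "my_snapshot",
>   new Path("s3://my-bucket/tmp/restore"))
> 
> val rdd = spark.sparkContext.newAPIHadoopRDD(
>   job.getConfiguration,
>   classOf[TableSnapshotInputFormat],
>   classOf[ImmutableBytesWritable],
>   classOf[Result])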
>
> 3) We also explicitly set block caching to false on the Scan, although I
> see that TableSnapshotInputFormatImpl.java sets it to false internally as
> well. But when running the Spark job, some executors were taking much
> longer than others, and when I observe their threads I see periodic
> messages about a few hundred MB of RAM used by the block cache, and the
> thread sits there reading data from S3 and is occasionally blocked by a
> couple of other threads that have "hfile-prefetcher" in their names. Going
> back to 2) above, they seem to be reading the wrong column family, but in
> this item I am more concerned about why they appear to be prefetching and
> caching blocks when the Scan object is set to not cache blocks at all.
>

I think I figured out item 3: the column family descriptor for the table in
question has prefetch-on-open set in its schema. For the Spark job, I don't
think this serves any useful purpose, does it? But I can't see any way to
override it. If there is one, I'd appreciate some advice.
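
For what it's worth, this is roughly how one could confirm (and, if it is
acceptable for the table, clear) that flag with the HBase 1.4 client API. The
table and family names are placeholders, it talks to the live cluster the
snapshot came from rather than the S3 copy, and modifying the descriptor
changes the table for everyone, so it is only a sketch:

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.ConnectionFactory
import org.apache.hadoop.hbase.util.Bytes

val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
val admin = conn.getAdmin
val table = TableName.valueOf("my_table")                 // placeholder name
val cf = admin.getTableDescriptor(table).getFamily(Bytes.toBytes("cf_big"))
println(s"prefetch blocks on open: ${cf.isPrefetchBlocksOnOpen}")
// Clearing it would affect the whole table, not just this Spark job:
// admin.modifyColumn(table, cf.setPrefetchBlocksOnOpen(false))
conn.close()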

Thanks.


>
> Thanks in advance for any insights anyone can provide.
>
> ----
> Saad
>
>
