Hi,

I am running a Spark job (Spark 2.2.1) on an EMR cluster in AWS. There is
no Hbase installed on the cluster, only HBase libs linked to my Spark app.
We are reading the snapshot info from a HBase folder in S3 using
TableSnapshotInputFormat class from HBase 1.4.0 to have the Spark job read
snapshot info directly from the S3 based filesystem instead of going
through any region server.

I have observed a few behaviors while debugging performance that are
concerning, some we could mitigate and other I am looking for clarity on:

1)  the TableSnapshotInputFormatImpl code is trying to get locality
information for the region splits, for a snapshots with a large number of
files (over 350000 in our case) this causing single threaded scan of all
the file listings in a single thread in the driver. And it was useless
because there is really no useful locality information to glean since all
the files are in S3 and not HDFS. So I was forced to make a copy of
TableSnapshotInputFormatImpl.java in our code and control this with a
config setting I made up. That got rid of the hours long scan, so I am good
with this part for now.

2) I have set a single column family in the Scan that I set on the hbase
configuration via

scan.addFamily(str.getBytes()))

hBaseConf.set(TableInputFormat.SCAN, convertScanToString(scan))


But when this code is executing under Spark and I observe the threads and
logs on Spark executors, I it is reading from S3 files for a column family
that was not included in the scan. This column family was intentionally
excluded because it is much larger than the others and so we wanted to
avoid the cost.

Any advice on what I am doing wrong would be appreciated.

3) We also explicitly set caching of blocks to false on the scan, although
I see that in TableSnapshotInputFormatImpl.java it is again set to false
internally also. But when running the Spark job, some executors were taking
much longer than others, and when I observe their threads, I see periodic
messages about a few hundred megs of RAM used by the block cache, and the
thread is sitting there reading data from S3, and is occasionally blocked a
couple of other threads that have the "hfile-prefetcher" name in them.
Going back to 2) above, they seem to be reading the wrong column family,
but in this item I am more concerned about why they appear to be
prefetching blocks and caching them, when the Scan object has a setting to
not cache blocks at all?

Thanks in advance for any insights anyone can provide.

----
Saad

Reply via email to