Hi, I am running a Spark job (Spark 2.2.1) on an EMR cluster in AWS. There is no Hbase installed on the cluster, only HBase libs linked to my Spark app. We are reading the snapshot info from a HBase folder in S3 using TableSnapshotInputFormat class from HBase 1.4.0 to have the Spark job read snapshot info directly from the S3 based filesystem instead of going through any region server.
I have observed a few behaviors while debugging performance that are concerning, some we could mitigate and other I am looking for clarity on: 1) the TableSnapshotInputFormatImpl code is trying to get locality information for the region splits, for a snapshots with a large number of files (over 350000 in our case) this causing single threaded scan of all the file listings in a single thread in the driver. And it was useless because there is really no useful locality information to glean since all the files are in S3 and not HDFS. So I was forced to make a copy of TableSnapshotInputFormatImpl.java in our code and control this with a config setting I made up. That got rid of the hours long scan, so I am good with this part for now. 2) I have set a single column family in the Scan that I set on the hbase configuration via scan.addFamily(str.getBytes())) hBaseConf.set(TableInputFormat.SCAN, convertScanToString(scan)) But when this code is executing under Spark and I observe the threads and logs on Spark executors, I it is reading from S3 files for a column family that was not included in the scan. This column family was intentionally excluded because it is much larger than the others and so we wanted to avoid the cost. Any advice on what I am doing wrong would be appreciated. 3) We also explicitly set caching of blocks to false on the scan, although I see that in TableSnapshotInputFormatImpl.java it is again set to false internally also. But when running the Spark job, some executors were taking much longer than others, and when I observe their threads, I see periodic messages about a few hundred megs of RAM used by the block cache, and the thread is sitting there reading data from S3, and is occasionally blocked a couple of other threads that have the "hfile-prefetcher" name in them. Going back to 2) above, they seem to be reading the wrong column family, but in this item I am more concerned about why they appear to be prefetching blocks and caching them, when the Scan object has a setting to not cache blocks at all? Thanks in advance for any insights anyone can provide. ---- Saad