Hi Guillermo,

Yes, you are missing something.

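TableInputFormat uses the Scan API just like Spark would, so reads still go through the RegionServers and you can still restrict column families and apply filters.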

Bypassing the RegionServers and reading from HFiles directly is accomplished by using TableSnapshotInputFormat. You can only read from HFiles directly when you are using a snapshot, as there are concurrency issues with respect to the lifecycle of HFiles managed by HBase. It is not safe to try to read HFiles underneath HBase on your own unless you are confident you understand all the edge cases in how HBase manages files.
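
To make the difference concrete, here is a minimal sketch of how the two input formats might be configured. It assumes an HBase 1.x/2.x client with the MapReduce classes on the classpath. The class name, the helper names, and any table, column family, snapshot, or restore directory passed in are hypothetical placeholders, not anything from your setup.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableSnapshotInputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.Job;

// Hypothetical helper class, for illustration only.
final class ScanVersusSnapshot {

    // A Scan restricted to a single column family; both input formats
    // below read their Scan from the same configuration key.
    static Scan oneFamilyScan(String family) {
        Scan scan = new Scan();
        scan.addFamily(Bytes.toBytes(family));
        return scan;
    }

    // TableInputFormat: an ordinary client Scan, served by the RegionServers.
    static Configuration regionServerScanConf(String table, String family)
            throws IOException {
        Configuration conf = HBaseConfiguration.create();
        conf.set(TableInputFormat.INPUT_TABLE, table);
        conf.set(TableInputFormat.SCAN,
                TableMapReduceUtil.convertScanToString(oneFamilyScan(family)));
        return conf;
    }

    // TableSnapshotInputFormat: reads the snapshot's HFiles from the
    // filesystem, bypassing the RegionServers. setInput() needs an existing
    // snapshot and access to hbase.rootdir; restoreDir is a scratch
    // directory that must not be inside hbase.rootdir.
    static Configuration snapshotConf(String snapshot, String family, Path restoreDir)
            throws IOException {
        Job job = Job.getInstance(HBaseConfiguration.create());
        TableSnapshotInputFormat.setInput(job, snapshot, restoreDir);
        Configuration conf = job.getConfiguration();
        conf.set(TableInputFormat.SCAN,
                TableMapReduceUtil.convertScanToString(oneFamilyScan(family)));
        return conf;
    }
}
```

In Spark, either Configuration could then be handed to JavaSparkContext.newAPIHadoopRDD together with the matching input format class (TableInputFormat.class or TableSnapshotInputFormat.class), ImmutableBytesWritable.class, and Result.class. The snapshot variant only works against a snapshot you have already taken.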

On 5/29/19 2:54 AM, Guillermo Ortiz Fernández wrote:
Just to be sure, if I execute Scan inside Spark, the execution is going
through RegionServers and I get all the features of HBase/Scan (filters and
so on), all the parallelization is in charge of the RegionServers (even if
I'm running the program with Spark).
If I use TableInputFormat I read all the column families (even if I don't
want to), with no prior filtering either; it just opens the files of an HBase
table and processes them completely. All the parallelization is in Spark and
it doesn't use HBase at all; it just reads from HDFS the files that HBase stored
for a specific table.

Am I missing something?
