Re: Scan vs TableInputFormat to process data

Josh Elser Sat, 01 Jun 2019 18:44:09 -0700

Hi Guillermo,

Yes, you are missing something.


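TableInputFormat uses the Scan API just like Spark would, so its reads still go through the RegionServers, and column family selection and filters still apply.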
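
As a rough illustration of that point, here is a minimal sketch of the job properties that TableInputFormat reads to build its Scan when no serialized Scan is supplied. The key strings are the values of the TableInputFormat constants named in the comments. The ScanProps class, the build method, and the table, family and row values are hypothetical names made up for this example. In a real job you would set these on the Hadoop Configuration passed to the input format rather than on a plain Map.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of HBase): collects the properties that
// TableInputFormat turns into a Scan restricted to one family and row range.
class ScanProps {
    static Map<String, String> build(String table, String family,
                                     String startRow, String stopRow) {
        Map<String, String> props = new LinkedHashMap<>();
        // TableInputFormat.INPUT_TABLE
        props.put("hbase.mapreduce.inputtable", table);
        // TableInputFormat.SCAN_COLUMN_FAMILY: only this family is scanned
        props.put("hbase.mapreduce.scan.column.family", family);
        // TableInputFormat.SCAN_ROW_START / SCAN_ROW_STOP: row range of the Scan
        props.put("hbase.mapreduce.scan.row.start", startRow);
        props.put("hbase.mapreduce.scan.row.stop", stopRow);
        return props;
    }

    public static void main(String[] args) {
        // Example values are assumptions; copy each pair with conf.set(key, value).
        build("events", "d", "row-000", "row-999")
            .forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

Because the input format turns these into an ordinary Scan, the restriction is applied by the RegionServers, not by Spark after the fact.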

Bypassing the RegionServer and reading from HFiles directly is accomplished by using the TableSnapshotInputFormat. You can only read from HFiles directly when you are using a Snapshot, as there are concurrency issues with respect to the lifecycle of HFiles managed by HBase. It is not safe to try to read HFiles underneath HBase on your own unless you are confident you understand all the edge cases in how HBase manages files.


On 5/29/19 2:54 AM, Guillermo Ortiz Fernández wrote:

Just to be sure, if I execute a Scan inside Spark, the execution goes
through the RegionServers and I get all the features of HBase/Scan (filters and
so on), and all the parallelization is handled by the RegionServers (even though
I'm running the program with Spark).
If I use TableInputFormat, I read all the column families (even if I don't
want to), with no filtering beforehand either; it just opens the files of an HBase
table and processes them completely. All the parallelization is in Spark and
it doesn't use HBase at all; it just reads from HDFS the files that HBase stored
for a specific table.

Am I missing something?
