Also, keep in mind that by bypassing the RegionServer you also bypass the security rules...
JMS

On Sat, Jun 1, 2019 at 21:43, Josh Elser <[email protected]> wrote:

> Hi Guillermo,
>
> Yes, you are missing something.
>
> TableInputFormat uses the Scan API just like Spark would.
>
> Bypassing the RegionServer and reading from HFiles directly is
> accomplished by using the TableSnapshotInputFormat. You can only read
> from HFiles directly when you are using a Snapshot, as there are
> concurrency issues WRT the lifecycle of HFiles managed by HBase. It is
> not safe to try to read HFiles underneath HBase on your own unless you
> are confident you understand all the edge cases in how HBase manages
> files.
>
> On 5/29/19 2:54 AM, Guillermo Ortiz Fernández wrote:
> > Just to be sure, if I execute a Scan inside Spark, the execution goes
> > through the RegionServers and I get all the features of HBase/Scan
> > (filters and so on); all the parallelization is in charge of the
> > RegionServers (even though I'm running the program with Spark).
> > If I use TableInputFormat I read all the column families (even if I
> > don't want to), with no prior filtering either; it just opens the
> > files of an HBase table and processes them completely. All the
> > parallelization is in Spark and it doesn't use HBase at all; it just
> > reads from HDFS the files that HBase stored for a specific table.
> >
> > Am I missing something?
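To make the distinction Josh describes concrete, here is a rough Java sketch of the two read paths from Spark. It is only an illustration under assumptions: the table name "my_table", the snapshot name "my_snapshot", the column family "cf", the restore directory "/tmp/snapshot_restore" and the class name HBaseReadPaths are all hypothetical, and it assumes an HBase 2.x client plus Spark on the classpath. In both cases the Scan is serialized into the job configuration, so column-family selection and filters still apply; what changes is whether the data is served by the RegionServers or read from the snapshot's HFiles.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableSnapshotInputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.Job;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class HBaseReadPaths {

    // Path 1: TableInputFormat. Each split runs a Scan against the
    // RegionServers, so HBase access control is enforced as usual.
    static JavaPairRDD<ImmutableBytesWritable, Result> viaRegionServers(
            JavaSparkContext sc) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        conf.set(TableInputFormat.INPUT_TABLE, "my_table");      // hypothetical table
        Scan scan = new Scan().addFamily(Bytes.toBytes("cf"));   // hypothetical family
        conf.set(TableInputFormat.SCAN,
                 TableMapReduceUtil.convertScanToString(scan));
        return sc.newAPIHadoopRDD(conf, TableInputFormat.class,
                                  ImmutableBytesWritable.class, Result.class);
    }

    // Path 2: TableSnapshotInputFormat. The job reads the snapshot's HFiles
    // from the filesystem directly, bypassing the RegionServers.
    static JavaPairRDD<ImmutableBytesWritable, Result> viaSnapshot(
            JavaSparkContext sc) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        Scan scan = new Scan().addFamily(Bytes.toBytes("cf"));   // hypothetical family
        conf.set(TableInputFormat.SCAN,
                 TableMapReduceUtil.convertScanToString(scan));
        Job job = Job.getInstance(conf);
        // The restore directory is a scratch location (hypothetical path here)
        // and should not sit under the HBase root directory.
        TableSnapshotInputFormat.setInput(job, "my_snapshot",
                                          new Path("/tmp/snapshot_restore"));
        return sc.newAPIHadoopRDD(job.getConfiguration(),
                                  TableSnapshotInputFormat.class,
                                  ImmutableBytesWritable.class, Result.class);
    }
}
```

With the second path, the user running the job needs filesystem-level read access to the snapshot's files, and HBase-level permissions are not checked, which is the security caveat mentioned at the top of this message.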
