Re: Scan vs TableInputFormat to process data

Jean-Marc Spaggiari Mon, 03 Jun 2019 06:17:09 -0700

Also, keep in mind that by bypassing the RegionServer you also bypass the
security rules...


JMS

On Sat, Jun 1, 2019 at 21:43, Josh Elser <[email protected]> wrote:

> Hi Guillermo,
>
> Yes, you are missing something.
>
> TableInputFormat uses the Scan API just like Spark would.
>
> Bypassing the RegionServer and reading from HFiles directly is
> accomplished by using the TableSnapshotInputFormat. You can only read
> from HFiles directly when you are using a Snapshot, as there are
> concurrency issues WRT the lifecycle of HFiles managed by HBase. It is
> not safe to try to read HFiles underneath HBase on your own unless you are
> confident you understand all the edge cases in how HBase manages files.
>
> On 5/29/19 2:54 AM, Guillermo Ortiz Fernández wrote:
> > Just to be sure: if I execute a Scan inside Spark, the execution goes
> > through the RegionServers and I get all the features of HBase/Scan
> > (filters and so on), and all the parallelization is handled by the
> > RegionServers (even though I'm running the program with Spark).
> > If I use TableInputFormat, I read all the column families (even if I
> > don't want to), with no filtering beforehand; it just opens the files of
> > an HBase table and processes them completely. All the parallelization is
> > in Spark and HBase is not used at all; it just reads from HDFS the files
> > that HBase stored for a specific table.
> >
> > Am I missing something?
> >
>
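
As a minimal illustration of the point that TableInputFormat goes through the Scan API: the input format builds its Scan from job configuration properties, so the column family and row range can be restricted before any data reaches Spark. The sketch below only assembles those properties in a plain Java map. The property keys are the ones TableInputFormat reads; the class name, the helper method, and the table, family and row values are made up for illustration. In a real job the entries would be set on the Hadoop Configuration handed to the input format.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class TableInputFormatScanProps {
    // Property keys read by org.apache.hadoop.hbase.mapreduce.TableInputFormat.
    static final String INPUT_TABLE = "hbase.mapreduce.inputtable";
    static final String SCAN_COLUMN_FAMILY = "hbase.mapreduce.scan.column.family";
    static final String SCAN_ROW_START = "hbase.mapreduce.scan.row.start";
    static final String SCAN_ROW_STOP = "hbase.mapreduce.scan.row.stop";

    // Hypothetical helper: collects the settings that narrow the Scan
    // to one column family and one row-key range.
    static Map<String, String> scanProps(String table, String family,
                                         String startRow, String stopRow) {
        Map<String, String> props = new LinkedHashMap<>();
        props.put(INPUT_TABLE, table);
        props.put(SCAN_COLUMN_FAMILY, family);
        props.put(SCAN_ROW_START, startRow);
        props.put(SCAN_ROW_STOP, stopRow);
        return props;
    }

    public static void main(String[] args) {
        // Example values only; each entry would be applied with
        // Configuration.set(key, value) in an actual job.
        scanProps("events", "d", "row-0100", "row-0200")
            .forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

Because these settings end up in a regular Scan served by the RegionServers, the restriction is applied on the HBase side rather than by reading whole files in Spark.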
