TableInputFormat doesn't read memstore.

bq. I am inserting 10-20 entries only

You can query JMX and check the values for the following:

flushedCellsCount
flushedCellsSize
FlushMemstoreSize_num_ops
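If it helps, here is a quick sketch of pulling those counters off the region server's /jmx HTTP endpoint. The host name is a placeholder, 16030 is assumed to be the default region server info port, and the exact metric key casing can vary by HBase version, so adjust for your deployment:

import scala.io.Source

// Placeholder host; 16030 is the default region server info port.
val url = "http://regionserver-host:16030/jmx"

// Counters of interest; matched case-insensitively since key casing
// can differ between HBase versions.
val wanted = Seq("flushedcellscount", "flushedcellssize",
  "flushmemstoresize_num_ops")

// The servlet returns JSON; print only the lines carrying the flush counters.
Source.fromURL(url).getLines()
  .filter(line => wanted.exists(w => line.toLowerCase.contains(w)))
  .foreach(println)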
For Q2, there is no client-side support for knowing where the data comes from.

On Wed, Jun 28, 2017 at 8:15 PM, Sachin Jain <sachinjain...@gmail.com> wrote:

> Hi,
>
> I have used TableInputFormat and newAPIHadoopRDD defined on sparkContext
> to do a full table scan and get an RDD from it.
>
> A partial piece of the code looks like this:
>
> sparkContext.newAPIHadoopRDD(
>   HBaseConfigurationUtil.hbaseConfigurationForReading(
>     table.getName.getNameWithNamespaceInclAsString,
>     hbaseQuorum, hBaseFilter, versionOpt, zNodeParentOpt),
>   classOf[TableInputFormat],
>   classOf[ImmutableBytesWritable],
>   classOf[Result]
> )
>
> As per my understanding this full table scan works fast because we are
> reading HFiles directly.
>
> *Q1. Does that mean we are skipping memstores?* If yes, then we should
> have missed some data which is present in the memstore, because that data
> has not been persisted to disk yet and hence is not available via an HFile.
>
> *In my local setup, I always get all the data.* Since I am inserting only
> 10-20 entries, I am assuming they are present in the memstore when I issue
> the full table scan Spark job.
>
> Q2. When I issue a get command, is there a way to know whether the record
> is served from the block cache, memstore, or an HFile?
>
> Thanks
> -Sachin
>
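By the way, the quoted snippet goes through a custom HBaseConfigurationUtil helper. In case it is useful to others on the list, a rough self-contained equivalent using only the stock TableInputFormat configuration keys could look like the following; the quorum and table name are placeholders:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.spark.SparkContext

def fullTableScan(sc: SparkContext) = {
  val conf = HBaseConfiguration.create()
  conf.set("hbase.zookeeper.quorum", "zk-host")            // placeholder quorum
  conf.set(TableInputFormat.INPUT_TABLE, "my_ns:my_table") // placeholder table

  // Standard MapReduce-over-HBase scan exposed as an RDD of (row key, Result).
  sc.newAPIHadoopRDD(
    conf,
    classOf[TableInputFormat],
    classOf[ImmutableBytesWritable],
    classOf[Result]
  )
}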