Re: HFileInputFormat for MapReduce

Tim Robertson Fri, 10 Feb 2012 03:22:16 -0800

> Is HIVE involved?  Or is it just raw scan compared to TFIF?

No Hive



> Is this a MR scan or just a shell serial scan (or is it still PE?)?

We are using PE scan to try and "standardize" as much as possible.

> You want to get this scan speed up only?  You are not interested in figuring
> how to get the throughput up? (More regionservers and mappers?)

At this point, we are trying to determine if HBase could serve full
scans (with more hardware), which is one of our primary access
patterns. Our random write needs are low but random read is also
important.  Scan is the only concern at the moment.

> + You doing any scan caching (See over in this section
> http://hbase.apache.org/book.html#perf.configurations)

Now added, as PE was using the default of 1 - see below though
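
For context on why the caching value should matter: scanner caching sets
how many rows each next() round trip to the region server may return, so
the number of RPCs falls roughly in proportion to the cache size. The
sketch below is only a back-of-envelope model, not HBase code. The class
and method names are hypothetical, and the rows-per-client figure assumes
the ROWS total reported later in this message is split evenly across the
10 PE clients.

```java
public class ScanRpcEstimate {

    // Approximate number of next() round trips needed to stream `rows` rows
    // when each round trip returns at most `caching` rows (ceiling division).
    static long rpcCalls(long rows, int caching) {
        if (caching < 1) {
            throw new IllegalArgumentException("caching must be >= 1");
        }
        return (rows + caching - 1) / caching;
    }

    public static void main(String[] args) {
        // Assumption: 10485700 total rows / 10 clients = 1048570 rows each.
        long rowsPerClient = 1_048_570L;
        // The cache sizes tried in the PE runs described in this message.
        int[] cacheSizes = {1, 10, 30, 100, 1000, 10000};
        for (int caching : cacheSizes) {
            System.out.printf("caching=%d -> ~%d round trips per client%n",
                    caching, rpcCalls(rowsPerClient, caching));
        }
    }
}
```

If the scan were dominated by per-RPC overhead, going from a cache size of
1 to 1000 should cut the round trips by about three orders of magnitude and
show up clearly in the elapsed time.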

> + Do you set no cache when you are scanning blocks? (setCacheBlocks in
> API so you avoid putting blocks into cache causing extra GC churn).

Yes

> + Can you go to u3?

Doing that as I write

> + You seem to be letting hbase do its major compaction once a day.  Is
> this running when you are scanning (Suggestion in manual is to manage
> these yourself).

No, but noted - thanks

> + Up your xceivers

Done, now 4096 on the cluster

> + IIRC you have compression on (good)

Not for PE, but we use Snappy for our real tables


> When the scan is running how are the machines doing?  What do you see?
>  (Did you get ganglia going?  Does that show you anything
> interesting?)  If you look in UI, whats it claim to be doing
> generally?  How many regions in your table?

Since the CDH3u3 upgrade is ongoing as I type, I can't check the exact
region count (fewer than 50 regions on 3 RS with the PE TestTable).

I have not digested the Ganglia output yet, but I just ran the PE
"scan 10" with scanner cache sizes of 1, 10, 30, 100, 1000 and 10000.
Worryingly, the performance was the same regardless of the cache size.

$HADOOP_HOME/bin/hadoop org.apache.hadoop.hbase.PerformanceEvaluation scan 10
12/02/10 11:01:21 INFO mapred.JobClient:     ROWS=10485700
12/02/10 11:01:21 INFO mapred.JobClient:     ELAPSED_TIME=1611746
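
To turn the two counters above into a rate, here is a small sketch. It
rests on an assumption that is not confirmed anywhere in this thread:
that ELAPSED_TIME is in milliseconds and is summed over the 10 map tasks
rather than being wall-clock time for the whole job. The class and method
names are hypothetical.

```java
public class PeScanThroughput {

    // Rows per second for a given row count and elapsed time in milliseconds.
    static double rowsPerSecond(long rows, long elapsedMillis) {
        if (elapsedMillis <= 0) {
            throw new IllegalArgumentException("elapsedMillis must be > 0");
        }
        return rows / (elapsedMillis / 1000.0);
    }

    public static void main(String[] args) {
        long rows = 10_485_700L;         // ROWS counter from the job output
        long elapsedMillis = 1_611_746L; // ELAPSED_TIME counter (assumed ms)
        int clients = 10;                // from "scan 10"

        // If ELAPSED_TIME is summed over mappers, this is the per-mapper rate.
        double perMapper = rowsPerSecond(rows, elapsedMillis);
        // Assumption: all mappers ran concurrently for similar durations.
        double aggregate = perMapper * clients;

        System.out.printf("per-mapper: %.0f rows/s%n", perMapper);
        System.out.printf("aggregate (if concurrent): %.0f rows/s%n", aggregate);
    }
}
```

If ELAPSED_TIME is instead wall-clock time for the job, the first figure is
already the cluster-wide rate and the second does not apply.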

I captured the full ganglia for an RS during this if anyone can spot
anything obvious (I am about to try and understand this myself):
  http://dl.dropbox.com/u/608155/cacheSize-RS.png

> If you major compact before, do you see much difference (or just look
> in the filesystem and see how many files there are generally under the
> table you are scanning?  Lots per column family or generally one -- if
> the latter its mostly major compacted).

Can't check right now (upgrading), but I did compact before tests.
