Re: HFileInputFormat for MapReduce

Stack Thu, 09 Feb 2012 22:55:16 -0800

On Thu, Feb 9, 2012 at 3:00 PM, Tim Robertson <[email protected]> wrote:
> Hey Stack,
>
> We see the difference between a scan and TextFileInputFormat of the
> same data as csv being 10x slower.  This is what prompted me to look
> at MR using an HFIF just out of curiosity.
>


Is Hive involved?  Or is it just a raw scan compared to TFIF?  Is this an
MR scan or just a serial scan from the shell (or is it still PE)?  Do you
only want to get this scan speed up?  Are you not interested in figuring
out how to get the throughput up (more regionservers and mappers)?

Looking at your published configs (sorry if I'm repeating myself):

+ Are you doing any scan caching? (See this section:
http://hbase.apache.org/book.html#perf.configurations)
+ Do you turn off block caching when you scan? (setCacheBlocks in the
API, so you avoid putting blocks into the cache and causing extra GC churn.)
+ Can you go to u3?  It has improvements all around (including the
go-local short-circuit feature -- see section 3.2.5 here:
http://hbase.apache.org/book.html#upgrade0.92)
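
To make the scan caching point concrete, here is a rough back-of-envelope
sketch of how many scanner next() round trips a full scan needs at
different caching values (the value set via Scan.setCaching or
hbase.client.scanner.caching).  It is a simplified model, not a
measurement: it ignores scanner open/close calls and region boundaries,
and the row count and the class and method names are made up for
illustration.

```java
// Simplified model: each scanner next() RPC returns up to `caching` rows.
// Ignores scanner open/close RPCs and region boundaries (assumption).
class ScanRpcEstimate {

    /** Approximate number of next() round trips to read `rows` rows. */
    static long estimateNextCalls(long rows, int caching) {
        if (rows < 0 || caching < 1) {
            throw new IllegalArgumentException("rows must be >= 0 and caching >= 1");
        }
        // Ceiling division: a final, partially filled batch still costs one RPC.
        return (rows + caching - 1) / caching;
    }

    public static void main(String[] args) {
        long rows = 10_000_000L; // hypothetical table size, for illustration only
        for (int caching : new int[] {1, 100, 1000}) {
            System.out.println("caching=" + caching + " -> ~"
                    + estimateNextCalls(rows, caching) + " next() RPCs");
        }
    }
}
```

Higher caching means fewer round trips but more memory held per batch on
both client and regionserver, so it is worth trying a few values rather
than just picking the largest.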

Miscellaneous:

+ You seem to be letting HBase do its major compaction once a day.  Is
this running while you are scanning? (The suggestion in the manual is to
manage these yourself.)
+ Up your xceivers.
+ IIRC you have compression on (good).

When the scan is running, how are the machines doing?  What do you see?
(Did you get Ganglia going?  Does that show you anything
interesting?)  If you look in the UI, what does it claim to be doing
generally?  How many regions are in your table?

If you major compact beforehand, do you see much difference?  (Or just look
in the filesystem and see how many files there are generally under the
table you are scanning.  Lots per column family, or generally one?  If
the latter, it's mostly major compacted.)

St.Ack
