On Thu, Feb 9, 2012 at 3:00 PM, Tim Robertson <[email protected]> wrote:
> Hey Stack,
>
> We see the difference between a scan and TextFileInputFormat of the
> same data as csv being 10x slower. This is what prompted me to look
> at MR using an HFIF just out of curiosity.
>

Is Hive involved, or is it just a raw scan compared to TFIF? Is this an MR scan or just a serial scan from the shell (or is it still PE)?

Do you only want to get this scan speed up? Are you not interested in figuring out how to get the overall throughput up (more regionservers and mappers)?

Looking at your published configs (sorry if I'm repeating myself):

+ Are you doing any scan caching? (See this section: http://hbase.apache.org/book.html#perf.configurations)
+ Do you turn off block caching when you are scanning? (setCacheBlocks in the API, so you avoid putting blocks into the cache and causing extra GC churn.)
+ Can you go to u3? It has improvements all around (including the go-local short-circuit feature -- see section 3.2.5 here: http://hbase.apache.org/book.html#upgrade0.92).

Miscellaneous:

+ You seem to be letting hbase do its major compaction once a day. Is this running when you are scanning? (The suggestion in the manual is to manage these yourself.)
+ Up your xceivers.
+ IIRC you have compression on (good).

When the scan is running, how are the machines doing? What do you see? (Did you get ganglia going? Does that show you anything interesting?) If you look in the UI, what does it claim to be doing generally?

How many regions are in your table? If you major compact beforehand, do you see much difference? (Or just look in the filesystem and see how many files there are under the table you are scanning: lots per column family, or generally one? If the latter, it's mostly major compacted.)

St.Ack
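
P.S. To put a rough number on the scan caching question: the sketch below estimates how many next() round trips a full scan makes for a given caching value. It is a simplified model, not a measurement. It assumes each round trip returns up to `caching` rows and ignores region boundaries and scanner open/close calls. The function name `scan_round_trips` and the 10-million-row table are hypothetical, chosen only for illustration.

```python
def scan_round_trips(total_rows: int, caching: int) -> int:
    """Estimate next() round trips for a full table scan.

    Simplified model (an assumption, not HBase's exact behaviour):
    every round trip returns up to `caching` rows; region boundaries
    and scanner open/close calls are ignored.
    """
    if total_rows < 0:
        raise ValueError("total_rows must be non-negative")
    if caching < 1:
        raise ValueError("caching must be at least 1")
    # Integer ceiling division: a partially filled last batch still costs a trip.
    return (total_rows + caching - 1) // caching


if __name__ == "__main__":
    rows = 10_000_000  # hypothetical table size
    for caching in (1, 100, 1000):
        trips = scan_round_trips(rows, caching)
        print(f"caching={caching:>5}  round trips={trips}")
```

Under this model, going from a caching value of 1 to 1000 cuts the round trips for the hypothetical table from ten million to ten thousand, which is why the setting is worth checking before anything else. Larger values cost more client memory per batch, so the right number depends on your row size.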
