There are a few reasons to use map/reduce, or just map-only or
reduce-only jobs.
1) You want to run parallel algorithms where data from multiple machines
has to be cross-checked. Map/Reduce allows this.
2) You want to run several instances of a job. Hadoop does this reliably
by monitoring all instances, restarting failed ones, etc.
3) You have far too much data to fit on one computer. Hadoop handles
this the same way as #2.
You might not need Hadoop if you can run your programs without it.
Lance
On 10/14/2013 08:02 PM, Xuri Nagarin wrote:
Yes, I tested with smaller data sets and the MR job correctly
reads/matches one line at a time.
On Fri, Oct 11, 2013 at 4:48 AM, DSuiter RDX <dsui...@rdx.com> wrote:
So, perhaps this has been thought of, but perhaps not.
It is my understanding that grep usually matches things one line at a
time. As I am currently experimenting with Avro, I am finding that
local grep does not handle it well at all: an Avro file is essentially
one long line, so grep does poorly at pattern matching and just
returns the whole file as a match. It also takes a long time to view
in vi, since there are no EOL markers.
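One workaround I have found for looking at Avro locally is to dump it
to line-delimited JSON first, which grep and vi handle fine (the jar
version and file name here are just examples):

    java -jar avro-tools-1.7.4.jar tojson part-00000.avro | grep 'pattern'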
If you have modified the job for sequence files, does the sequence
file you are reading contain newline characters? If not, perhaps the
whole file is being read as one line, causing some unexpected effects.
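For what it's worth, a quick way to check what the records actually
look like is to dump them with SequenceFile.Reader. A minimal sketch
(the class name is mine; pass the file path as the argument) -- note
that SequenceFile records are length-prefixed key/value pairs, so
record boundaries do not depend on newlines at all:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.util.ReflectionUtils;

    public class SeqDump {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            SequenceFile.Reader reader =
                new SequenceFile.Reader(FileSystem.get(conf), new Path(args[0]), conf);
            // Instantiate whatever key/value classes the file header declares.
            Writable key = (Writable)
                ReflectionUtils.newInstance(reader.getKeyClass(), conf);
            Writable value = (Writable)
                ReflectionUtils.newInstance(reader.getValueClass(), conf);
            // next() advances one record at a time -- no newlines involved.
            while (reader.next(key, value)) {
                System.out.println(key + "\t" + value);
            }
            reader.close();
        }
    }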
Thanks,
*Devin Suiter*
Jr. Data Solutions Software Engineer
100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
Google Voice: 412-256-8556 | www.rdx.com
On Thu, Oct 10, 2013 at 4:50 PM, Xuri Nagarin <secs...@gmail.com> wrote:
On Thu, Oct 10, 2013 at 1:27 PM, Pradeep Gollakota <pradeep...@gmail.com> wrote:
I don't think it necessarily means that the job is a bad
candidate for MR. It's a different type of workload.
Hortonworks has a great article on the different types of
workloads you might see and how that affects your
provisioning choices at
http://docs.hortonworks.com/HDPDocuments/HDP1/HDP-1.3.2/bk_cluster-planning-guide/content/ch_hardware-recommendations.html
One statement that stood out to me in the link above is "For
these reasons, Hortonworks recommends that you either use the
Balanced workload configuration or invest in a pilot Hadoop
cluster and plan to evolve as you analyze the workload
patterns in your environment."
Now, this is not a critique/concern of Hortonworks but rather of
Hadoop. What if my workloads can be both CPU and IO intensive? Do I
take the approach of throw-enough-excess-hardware-just-in-case?
I have not looked at the Grep code, so I'm not sure why it's behaving
the way it is. It's still curious that streaming has higher IO
throughput and lower CPU usage. It may have to do with the fact that
/bin/grep is a native implementation while Grep (Hadoop) is probably
using the Java Pattern/Matcher API.
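I haven't seen the example's source, but a typical Java regex mapper
would be something like this sketch (the class name and config key
are made up; the point is that every input line goes through the
JVM's regex engine):

    import java.io.IOException;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class GrepLikeMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {
        private Pattern pattern;

        @Override
        protected void setup(Context context) {
            // "grep.pattern" is a made-up key for this sketch.
            pattern = Pattern.compile(
                context.getConfiguration().get("grep.pattern"));
        }

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // One Matcher per line; emit each match with a count of 1.
            Matcher m = pattern.matcher(line.toString());
            while (m.find()) {
                context.write(new Text(m.group()), new LongWritable(1));
            }
        }
    }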
The Grep code is from the bundled examples in CDH. I made a one-line
modification for it to read sequence files. The streaming job
probably does not have lower CPU utilization, but I see that it does
even out the CPU utilization among all the available processors. I
guess the native grep binary threads better than the Java MR job?
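(For reference, the one-line change was essentially swapping the
input format -- something like this with the new mapreduce API; the
old mapred API spells it conf.setInputFormat(...):)

    // Read SequenceFile records instead of newline-delimited text.
    // Needs org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.
    job.setInputFormatClass(SequenceFileInputFormat.class);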
Which brings me to ask: if you have the mapper/reducer functionality
built into a platform-specific binary, won't it always be more
efficient than a Java MR job? And, in such cases, am I better off
with streaming than with Java MR?
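(The streaming job I ran was along these lines -- the jar path and
input/output paths are from memory:)

    hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
        -D mapred.reduce.tasks=0 \
        -input /data/in \
        -output /data/grep-out \
        -mapper '/bin/grep pattern'
    # Caveat: /bin/grep exits non-zero on splits with no matches, which
    # streaming treats as task failure unless you also pass
    # -D stream.non.zero.exit.is.failure=false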
Thanks for your responses.
On Thu, Oct 10, 2013 at 12:29 PM, Xuri Nagarin <secs...@gmail.com> wrote:
Thanks Pradeep. Does it mean this job is a bad
candidate for MR?
Interestingly, running the cmdline '/bin/grep' under a streaming job
provides (1) much better disk throughput and (2) CPU load that is
spread almost evenly across all cores/threads (no CPU gets pegged at
100%).
On Thu, Oct 10, 2013 at 11:15 AM, Pradeep Gollakota <pradeep...@gmail.com> wrote:
Actually... I believe that is expected behavior. Since your CPU is
pegged at 100%, you're not going to be IO bound. Typically, jobs tend
to be either CPU bound or IO bound. If you're CPU bound, you expect
to see low IO throughput. If you're IO bound, you expect to see low
CPU usage.
On Thu, Oct 10, 2013 at 11:05 AM, Xuri Nagarin <secs...@gmail.com> wrote:
Hi,
I have a simple Grep job (from the bundled examples) that I am
running on an 11-node cluster. Each node has 2x 8-core Intel Xeons
(32 CPUs with HT on), 64GB RAM and 8 x 1TB disks. I have mappers set
to 20 per node.
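(That cap is set in mapred-site.xml -- this is MRv1, so the property
name below is per Hadoop 1.x:)

    <!-- Maximum concurrent map tasks per TaskTracker -->
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>20</value>
    </property>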
When I run the Grep job, I notice that the CPU gets pegged at 100% on
multiple cores, but disk throughput remains a dismal 1-2 Mbytes/sec
on a single disk on each node. So I would guess the cluster is
performing poorly in terms of disk IO. Running Terasort, I see each
disk put out 25-35 Mbytes/sec, with a total cluster throughput above
1.5 Gbytes/sec.
How do I go about re-configuring or re-writing
the job to utilize maximum disk IO?
TIA,
Xuri