Yes, I tested with smaller data sets and the MR job correctly reads/matches
one line at a time.
On Fri, Oct 11, 2013 at 4:48 AM, DSuiter RDX dsui...@rdx.com wrote:
So, perhaps this has been thought of, but perhaps not.
It is my understanding that grep is usually sorting things one line at
There are a few reasons to use map/reduce, or just map-only or
reduce-only jobs.
1) You want to do parallel algorithms where data from multiple machines
have to be cross-checked. Map/Reduce allows this.
2) You want to run several instances of a job. Hadoop does this reliably
by monitoring all
Yep, I have several tens of terabytes of data that will easily be over a couple
of hundred TB in a year. And it isn't as if I have just one or two use cases to
run on these data sets. I need to run everything from simple aggregations like
counting and averaging to more advanced analytics. I also need to be able to search
So, perhaps this has been thought of, but perhaps not.
It is my understanding that grep is usually sorting things one line at a
time. As I am currently experimenting with Avro, I am finding that the
local grep function does not handle it well at all, because it is one long
line essentially, so
Hi,
I have a simple Grep job (from the bundled examples) that I am running on an
11-node cluster. Each node has 2x 8-core Intel Xeons (shows 32 CPUs with HT
on), 64GB RAM and 8 x 1TB disks. I have mappers set to 20 per node.
When I run the Grep job, I notice that CPU gets pegged to 100% on multiple
Actually... I believe that is expected behavior. Since your CPU is pegged
at 100% you're not going to be IO bound. Typically jobs tend to be CPU
bound or IO bound. If you're CPU bound you expect to see low IO throughput.
If you're IO bound, you expect to see low CPU usage.
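One rough way to see which side of that line a piece of work falls on is to compare the CPU time it consumes with the wall-clock time it takes. The sketch below is only an illustration of that idea on a single process, not something taken from Hadoop: the function names and the 0.5 threshold are assumptions made up for the example.

```python
import time


def cpu_fraction(task):
    """Return (CPU time used) / (wall-clock time elapsed) for one call of task()."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    task()
    cpu_used = time.process_time() - cpu_start
    wall_used = time.perf_counter() - wall_start
    return cpu_used / wall_used if wall_used > 0 else 0.0


def classify(task, threshold=0.5):
    """Label a task by how much of its elapsed time was spent on the CPU.

    The threshold is an arbitrary illustrative cut-off, not a standard value.
    """
    if cpu_fraction(task) >= threshold:
        return "CPU bound"
    return "IO bound (or otherwise waiting)"


def busy():
    # Pure computation: nearly all elapsed time is CPU time.
    total = 0
    for i in range(2_000_000):
        total += i * i
    return total


def waiting():
    # Sleeping stands in for blocking on disk or network.
    time.sleep(0.2)


if __name__ == "__main__":
    print("busy:", classify(busy))
    print("waiting:", classify(waiting))
```

On a cluster you would look at per-node CPU and disk statistics instead, but the reasoning is the same: a task that spends most of its elapsed time computing has little time left to drive the disks.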
On Thu, Oct 10, 2013
Thanks Pradeep. Does it mean this job is a bad candidate for MR?
Interestingly, running the cmdline '/bin/grep' under a streaming job
provides (1) Much better disk throughput and, (2) CPU load is almost evenly
spread across all cores/threads (no CPU gets pegged to 100%).
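For comparison, a streaming mapper does not have to be '/bin/grep' itself. A minimal sketch of a line-at-a-time matcher that could be used in the same role is below. It is an assumption-laden illustration, not the code from this thread: the function names and the sample pattern are hypothetical, and it only relies on the streaming convention that a mapper reads records from stdin and writes results to stdout.

```python
import re
import sys


def grep_lines(lines, pattern):
    """Yield each input line that contains a match for the regular expression."""
    regex = re.compile(pattern)
    for line in lines:
        if regex.search(line):
            yield line


def run_mapper(stream_in, stream_out, pattern):
    """Copy matching lines from stream_in to stream_out, one line at a time."""
    for line in grep_lines(stream_in, pattern):
        stream_out.write(line)


# As a streaming mapper this would be invoked as, for example:
#   run_mapper(sys.stdin, sys.stdout, "ERROR")
# where "ERROR" is a placeholder pattern.
```

Like the command-line grep, this handles one line at a time and keeps no state between lines, so each map task can stream through its input split independently.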
On Thu, Oct 10, 2013
I don't think it necessarily means that the job is a bad candidate for MR.
It's a different type of a workload. Hortonworks has a great article on the
different types of workloads you might see and how that affects your
provisioning choices at
On Thu, Oct 10, 2013 at 1:27 PM, Pradeep Gollakota pradeep...@gmail.com wrote:
I don't think it necessarily means that the job is a bad candidate for MR.
It's a different type of a workload. Hortonworks has a great article on the
different types of workloads you might see and how that affects