Re: Improving MR job disk IO

2013-10-14 Thread Xuri Nagarin
Yes, I tested with smaller data sets and the MR job correctly reads/matches one line at a time. On Fri, Oct 11, 2013 at 4:48 AM, DSuiter RDX dsui...@rdx.com wrote: So, perhaps this has been thought of, but perhaps not. It is my understanding that grep is usually sorting things one line at

Re: Improving MR job disk IO

2013-10-14 Thread Lance Norskog
There are a few reasons to use map/reduce, or just map-only or reduce-only jobs. 1) You want to do parallel algorithms where data from multiple machines have to be cross-checked. Map/Reduce allows this. 2) You want to run several instances of a job. Hadoop does this reliably by monitoring all

Re: Improving MR job disk IO

2013-10-14 Thread Xuri Nagarin
Yep, have several tens of terabytes of data that will easily be over a couple of hundred TB in a year. Now it isn't as if I have one or two use cases to run on these data sets. I need to run everything from simple aggregations like counting and averaging to more advanced analytics. I also need to be able to search

Re: Improving MR job disk IO

2013-10-11 Thread DSuiter RDX
So, perhaps this has been thought of, but perhaps not. It is my understanding that grep is usually sorting things one line at a time. As I am currently experimenting with Avro, I am finding that the local grep function does not handle it well at all, because it is one long line essentially, so

Improving MR job disk IO

2013-10-10 Thread Xuri Nagarin
Hi, I have a simple Grep job (from the bundled examples) that I am running on an 11-node cluster. Each node has 2x8-core Intel Xeons (shows 32 CPUs with HT on), 64GB RAM and 8 x 1TB disks. I have mappers set to 20 per node. When I run the Grep job, I notice that CPU gets pegged to 100% on multiple

Re: Improving MR job disk IO

2013-10-10 Thread Pradeep Gollakota
Actually... I believe that is expected behavior. Since your CPU is pegged at 100% you're not going to be IO bound. Typically jobs tend to be CPU bound or IO bound. If you're CPU bound you expect to see low IO throughput. If you're IO bound, you expect to see low CPU usage. On Thu, Oct 10, 2013
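
The CPU-bound versus IO-bound rule of thumb above can be sketched as a small check. This is only an illustration: the function name and the 0.9 / 0.5 thresholds are assumptions made up for the example, not values from Hadoop or from this thread, and the inputs are assumed to be utilization fractions you have already measured per node.

```python
def classify_bottleneck(cpu_util: float, disk_util: float,
                        high: float = 0.9, low: float = 0.5) -> str:
    """Roughly label a job from measured utilization fractions (0.0 to 1.0).

    The thresholds are hypothetical defaults for illustration only.
    """
    if cpu_util >= high and disk_util < low:
        # CPU is saturated while the disks sit mostly idle.
        return "cpu-bound"
    if disk_util >= high and cpu_util < low:
        # Disks are saturated while the CPU waits on them.
        return "io-bound"
    # Anything else needs a closer look before drawing a conclusion.
    return "unclear"


if __name__ == "__main__":
    # A job with CPUs near 100% and low disk throughput.
    print(classify_bottleneck(cpu_util=1.0, disk_util=0.1))
```

A result of "unclear" is deliberate: mixed readings should not be forced into either bucket.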

Re: Improving MR job disk IO

2013-10-10 Thread Xuri Nagarin
Thanks Pradeep. Does it mean this job is a bad candidate for MR? Interestingly, running the cmdline '/bin/grep' under a streaming job gives (1) much better disk throughput and (2) a CPU load that is almost evenly spread across all cores/threads (no CPU gets pegged to 100%). On Thu, Oct 10, 2013
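
For readers unfamiliar with streaming jobs: a streaming mapper is just a program that reads input lines on stdin and writes output lines to stdout, which is why '/bin/grep' can be used directly. A minimal sketch of an equivalent line-at-a-time mapper follows. The names `grep_lines` and `run_mapper` are hypothetical, and this is not the code of the bundled Grep example.

```python
import re
import sys
from typing import Iterable, Iterator, TextIO


def grep_lines(lines: Iterable[str], pattern: str) -> Iterator[str]:
    """Yield each input line (without its newline) that matches the regex."""
    regex = re.compile(pattern)
    for line in lines:
        if regex.search(line):
            yield line.rstrip("\n")


def run_mapper(stdin: TextIO, stdout: TextIO, pattern: str) -> None:
    """Stream matches from stdin to stdout, one line at a time."""
    for match in grep_lines(stdin, pattern):
        stdout.write(match + "\n")


# To use this as a streaming mapper script, one would call, for example:
#     run_mapper(sys.stdin, sys.stdout, sys.argv[1])
```

Because matching happens per line, memory use stays flat regardless of input size, as long as the input really is line-delimited text.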

Re: Improving MR job disk IO

2013-10-10 Thread Pradeep Gollakota
I don't think it necessarily means that the job is a bad candidate for MR. It's a different type of workload. Hortonworks has a great article on the different types of workloads you might see and how that affects your provisioning choices at

Re: Improving MR job disk IO

2013-10-10 Thread Xuri Nagarin
On Thu, Oct 10, 2013 at 1:27 PM, Pradeep Gollakota pradeep...@gmail.com wrote: I don't think it necessarily means that the job is a bad candidate for MR. It's a different type of a workload. Hortonworks has a great article on the different types of workloads you might see and how that affects