There are a few reasons to use map/reduce, or just map-only or reduce-only jobs.

1) You want to run parallel algorithms where data from multiple machines have to be cross-checked. Map/Reduce allows this.
2) You want to run several instances of a job. Hadoop does this reliably by monitoring all instances, restarting failed ones, etc.
3) You have way too much data to fit on one computer. Same as #2.
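
As a rough illustration of reason 1), here is a minimal in-process sketch of a grep-style map/reduce in Python. It is not Hadoop code and not the bundled Grep example; the function names, the two sample "splits", and the pattern are all made up for illustration. The map step runs independently per split, and the reduce step is where results from different splits are brought together by key.

```python
import re
from collections import defaultdict

def map_phase(split_lines, pattern):
    """Emit (matched_text, 1) for every regex match in one input split."""
    regex = re.compile(pattern)
    for line in split_lines:
        for match in regex.findall(line):
            yield match, 1

def reduce_phase(pairs):
    """Sum the counts for each key across all splits."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)

# Two hypothetical splits, standing in for data held on two machines.
splits = [
    ["error: disk full", "ok", "error: timeout"],
    ["warn: slow", "error: disk full"],
]

# Each split is mapped on its own; only the reduce step sees all pairs.
pairs = [pair for split in splits for pair in map_phase(split, r"error|warn")]
totals = reduce_phase(pairs)
print(totals)  # {'error': 3, 'warn': 1}
```

On a real cluster the framework, not a list comprehension, moves the pairs between the two phases and restarts failed tasks.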

You might not need Hadoop if you can run your programs without it.


Yes, I tested with smaller data sets and the MR job correctly reads/matches one line at a time.
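
To make the line-at-a-time point concrete, here is a small Python sketch. It is an assumption-laden stand-in for a line-oriented grep, not the MR job or /bin/grep itself, and `grep_lines` and the sample strings are hypothetical. It shows that the same content matches one record when newlines are present and the whole blob when they are not.

```python
import re

def grep_lines(data, pattern):
    """Return the records of `data` that contain a match for `pattern`,
    splitting records on newlines the way a line-oriented grep does."""
    regex = re.compile(pattern)
    return [line for line in data.split("\n") if regex.search(line)]

with_newlines = "alpha 1\nbeta 2\nalpha 3"
without_newlines = "alpha 1 beta 2 alpha 3"  # same content, no EOL markers

print(grep_lines(with_newlines, r"beta"))     # ['beta 2']
print(grep_lines(without_newlines, r"beta"))  # ['alpha 1 beta 2 alpha 3']
```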

    So, perhaps this has been thought of, but perhaps not.

    It is my understanding that grep usually matches things one
    line at a time. As I am currently experimenting with Avro, I am
    finding that the local grep command does not handle it well at
    all, because the file is essentially one long line. Working from
    a local Avro file, grep does not do well at pattern matching: it
    just returns the whole file as a match. The file also takes a
    long time to view in the vi editor, since there are no EOL
    markers.

    If you have modified the job to read sequence files, are you
    reading a sequence file that has newline characters? If not,
    perhaps the file is being read as one whole line, causing some
    unexpected effects.

            I don't think it necessarily means that the job is a bad
            candidate for MR. It's a different type of a workload.
            Hortonworks has a great article on the different types of
            workloads you might see and how that affects your
            provisioning choices at

        One statement that stood out to me in the link above is "For
        these reasons, Hortonworks recommends that you either use the
        Balanced workload configuration or invest in a pilot Hadoop
        cluster and plan to evolve as you analyze the workload
        patterns in your environment."

        Now, this is not a critique/concern of HW but rather of
        Hadoop. Well, what if my workloads can be both CPU and IO
        intensive? Do I take the approach of

            I have not looked at the Grep code so I'm not sure why
            it's behaving the way it is. Still curious that streaming
            has a higher IO throughput and lower CPU usage. It may
            have to do with the fact that /bin/grep is a native
            implementation and Grep (Hadoop) is probably using the
            Java Pattern/Matcher API.

        The Grep code is from the bundled examples in CDH. I made one
        line modification for it to read Sequence files. The streaming
        job probably does not have lower CPU utilization but I see
        that it does even out the CPU utilization among all the
        available processors. I guess the native grep binary threads
        better than the Java MR job?

        Which brings me to ask: if you have the mapper/reducer
        functionality built into a platform-specific binary, then
        won't it always be more efficient than a Java MR job? And, in
        such cases, am I better off with streaming than Java MR?

        Thanks for your responses.

                Thanks Pradeep. Does it mean this job is a bad
                candidate for MR?

                Interestingly, running the cmdline '/bin/grep' under a
                streaming job gives (1) much better disk throughput
                and (2) a CPU load that is almost evenly spread across
                all cores/threads (no CPU gets pegged to 100%).

                    Actually... I believe that is expected behavior.
                    Since your CPU is pegged at 100% you're not going
                    to be IO bound. Typically jobs tend to be CPU
                    bound or IO bound. If you're CPU bound you expect
                    to see low IO throughput. If you're IO bound, you
                    expect to see low CPU usage.

                        I have a simple Grep job (from bundled
                        examples) that I am running on an 11-node
                        cluster. Each node is 2x8-core Intel Xeons
                        (shows 32 CPUs with HT on), 64GB RAM and 8 x
                        1TB disks. I have mappers set to 20 per node.

                        When I run the Grep job, I notice that CPU
                        gets pegged to 100% on multiple cores but disk
                        throughput remains a dismal 1-2 Mbytes/sec on
                        a single disk on each node. So I guess the
                        cluster is performing poorly in terms of disk
                        IO. Running Terasort, I see each disk puts out
                        25-35 Mbytes/sec with a total cluster
                        throughput of above 1.5 Gbytes/sec.

                        How do I go about re-configuring or re-writing
                        the job to utilize maximum disk IO?
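
For scale, the per-disk figures quoted in the original question can be turned into rough cluster-wide numbers. This is only back-of-the-envelope arithmetic under two assumptions not stated in the thread: that all 8 disks on all 11 nodes run at the quoted per-disk rate at the same time, and that 1 Gbyte = 1000 Mbytes. The helper name is hypothetical.

```python
NODES = 11            # from the original question
DISKS_PER_NODE = 8    # from the original question

def cluster_throughput_mb(per_disk_mb_s):
    """Aggregate Mbytes/sec if every disk sustains `per_disk_mb_s` at once
    (an assumption; real jobs rarely keep all disks equally busy)."""
    return NODES * DISKS_PER_NODE * per_disk_mb_s

grep_low, grep_high = cluster_throughput_mb(1), cluster_throughput_mb(2)
tera_low, tera_high = cluster_throughput_mb(25), cluster_throughput_mb(35)

print(f"Grep:     {grep_low}-{grep_high} Mbytes/sec")      # 88-176
print(f"Terasort: {tera_low}-{tera_high} Mbytes/sec")      # 2200-3080
print(f"Terasort upper bound in Gbytes/sec: {tera_low / 1000}-{tera_high / 1000}")
```

Under these assumptions the Terasort upper bound of roughly 2.2-3.1 Gbytes/sec is consistent with the reported total of above 1.5 Gbytes/sec, while the Grep job would move well under 0.2 Gbytes/sec across the whole cluster.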



