John, here are the two programs. One is from the Definitive Guide, chapter 4, with slight modifications; the other is in-house but similar to Hadoop in Action, chapter 3.
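As an aside, the io.file.buffer.size override that the first program sets in code can also be applied cluster-wide via core-site.xml. This is a sketch, not taken from either program; the value mirrors the 128k setting discussed in this thread:

```xml
<!-- core-site.xml: read/write buffer size used by SequenceFile and stream I/O -->
<property>
  <name>io.file.buffer.size</name>
  <value>131072</value>
</property>
```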
package sequencefileprocessor;

// cc SequenceFileReadDemo Reading a SequenceFile
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

// vv SequenceFileReadDemo
public class SequenceFileProcessor {

    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        conf.set("mapred.map.child.java.opts", "-Xmx256m");
        conf.set("mapred.reduce.child.java.opts", "-Xmx256m");
        //conf.set("io.file.buffer.size", "65536"); // 10mb/sec improvement, jumped from 26mb/sec to 36mb/sec
        conf.set("io.file.buffer.size", "131072");  // 15mb/sec improvement, jumped from 26mb/sec to 39mb/sec

        int totalCount = 0;
        int count = 0;
        long start = System.currentTimeMillis();

        for (String uri : args) {
            FileSystem fs = FileSystem.get(URI.create(uri), conf);
            Path path = new Path(uri);
            SequenceFile.Reader reader = null;
            try {
                reader = new SequenceFile.Reader(fs, path, conf);
                Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
                Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
                long position = reader.getPosition();
                while (reader.next(key, value)) {
                    String syncSeen = reader.syncSeen() ? "*" : "";
                    //System.out.printf("[%s%s]\t%s\t%s\n", position, syncSeen, key, value);
                    position = reader.getPosition(); // beginning of next record
                    count += 1;
                    if ((count % 1000000) == 0) {
                        System.out.println("processed " + count + " records");
                    }
                }
            } finally {
                IOUtils.closeStream(reader);
            }
        }
        totalCount += count;
        System.out.println("Total count: " + totalCount);
        System.out.println("Elapsed time: " + ((System.currentTimeMillis() - start) / 1000) + " seconds");
    }
}
// ^^ SequenceFileReadDemo

package hdfsspeedtest;

import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.Date;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HDFSSpeedTest {

    public static void main(String[] args) throws Exception {
        System.out.println(new Date().toString());
        Path pt = new Path(args[0]);
        try {
            // Buffer used for reading the data.
            byte[] buffer = new byte[32 * 1024 * 1024];
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            FileStatus[] inputFiles = fs.listStatus(pt);
            long total = 0;
            for (int i = 0; i < inputFiles.length; i++) {
                if (inputFiles[i].getPath().getName().startsWith("part")) {
                    System.out.println(inputFiles[i].getPath().getName());
                    FSDataInputStream inputStream = fs.open(inputFiles[i].getPath());
                    // read() fills the buffer and returns the number of bytes read
                    // (which may be less than the buffer size, but never more).
                    int nRead = 0;
                    while ((nRead = inputStream.read(buffer)) != -1) {
                        total += nRead;
                    }
                    // Always close files.
                    inputStream.close();
                }
            }
            System.out.println("Read " + total + " bytes");
            System.out.println(new Date().toString());
        } catch (FileNotFoundException ex) {
            System.out.println("Unable to open file '" + pt + "'");
        } catch (IOException ex) {
            System.out.println("Error reading file '" + pt + "'");
        }
    }
}

From: John Lilley [mailto:john.lil...@redpoint.net]
Sent: Thursday, January 03, 2013 9:04 PM
To: user@hadoop.apache.org
Subject: RE: Hadoop throughput question

Perhaps if Artem posted the presumably simple code, we could get other users to benchmark other 4-node systems and compare.
--John Lilley

Artem Ervits <are9...@nyp.org> wrote:
Setting the property to 64k made the throughput jump to 36mb/sec, and to 39mb/sec for 128k. Thank you for the tip.

From: Michael Katzenellenbogen [mailto:mich...@cloudera.com]
Sent: Thursday, January 03, 2013 7:28 PM
To: user@hadoop.apache.org
Subject: Re: Hadoop throughput question

What is the value of the io.file.buffer.size property? Try tuning it up to 64k or 128k and see if this improves performance when reading SequenceFiles.
-Michael

On Jan 3, 2013, at 7:00 PM, Artem Ervits <are9...@nyp.org> wrote:
I will certainly follow up on that, thank you for the information. Further investigation showed that counting SequenceFile records runs at about 26mb/sec. If I simply read bytes on the same cluster against the same file, the speed is 70mb/sec. Is there a configuration for optimizing SequenceFile processing? Thank you.

From: John Lilley [mailto:john.lil...@redpoint.net]
Sent: Thursday, January 03, 2013 6:09 PM
To: user@hadoop.apache.org
Subject: RE: Hadoop throughput question

Unless the Hadoop processing and the OneFS storage are co-located, MapReduce can't schedule tasks to take advantage of data locality.
You would basically be doing a distributed computation against a separate NAS, so throughput would be limited by the performance characteristics of the Isilon NAS and the network switch architecture. Still, 26MB/sec in aggregate is far worse than what I'd expect Isilon to deliver, even over a single 1GbE connection.
--John

From: Artem Ervits [mailto:are9...@nyp.org]
Sent: Thursday, January 03, 2013 4:02 PM
To: user@hadoop.apache.org
Subject: RE: Hadoop throughput question

Hadoop is using OneFS, not HDFS, in our configuration. The Isilon NAS and the Hadoop nodes are in the same datacenter, but I cannot tell their rack locations.

From: John Lilley [mailto:john.lil...@redpoint.net]
Sent: Thursday, January 03, 2013 5:15 PM
To: user@hadoop.apache.org
Subject: RE: Hadoop throughput question

Let's suppose you are doing a read-intensive job such as counting records. This will be limited by disk bandwidth. On a 4-node cluster with 2 local SATA drives per node, you should easily read 400MB/sec in aggregate. When you run the Hadoop cluster, is the Hadoop processing co-located with the Isilon nodes? Is Hadoop configured to use OneFS or HDFS?
John

From: Artem Ervits [mailto:are9...@nyp.org]
Sent: Thursday, January 03, 2013 3:00 PM
To: user@hadoop.apache.org
Subject: Hadoop throughput question

Hello all, I'd like to pick the community's brain on average throughput for a moderately specced 4-node Hadoop cluster with 1GigE networking. Is it reasonable to expect sustained average speeds of 150-200mb/sec on such a setup? Forgive me if the question is loaded, but we're running a Hadoop cluster with HDFS served via EMC Isilon storage. We're getting about 30mb/sec with our machines, and we see no difference in job speed between a 2-node and a 4-node cluster. Thank you.
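The rates quoted in this thread (26mb/sec counting SequenceFile records vs. 70mb/sec reading raw bytes) come from dividing bytes read by elapsed time; neither program prints the rate directly. A minimal, hypothetical helper (not part of either program above) showing that arithmetic:

```java
public class Throughput {
    // Throughput in MB/sec given total bytes read and elapsed wall-clock millis,
    // matching the "Read N bytes" and elapsed-time outputs of the programs above.
    public static double mbPerSec(long bytes, long elapsedMillis) {
        return (bytes / (1024.0 * 1024.0)) / (elapsedMillis / 1000.0);
    }

    public static void main(String[] args) {
        // e.g. 70 MB read in one second, as in the raw-byte test
        System.out.println(mbPerSec(70L * 1024 * 1024, 1000)); // prints 70.0
    }
}
```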
--------------------
This electronic message is intended to be for the use only of the named recipient, and may contain information that is confidential or privileged. If you are not the intended recipient, you are hereby notified that any disclosure, copying, distribution or use of the contents of this message is strictly prohibited. If you have received this message in error or are not the named recipient, please notify us immediately by contacting the sender at the electronic mail address noted above, and delete and destroy all copies of this message. Thank you.
--------------------
Confidential Information subject to NYP's (and its affiliates') information management and security policies (http://infonet.nyp.org/QA/HospManual/).