Edward Capriolo wrote:
On Wed, Aug 19, 2009 at 11:11 AM, Edward Capriolo <edlinuxg...@gmail.com>wrote:

It would be as fast as underlying filesystem goes.
I would not agree with that statement. There is overhead.

You might be misinterpreting my comment. There is of course some overhead (at the least, the procedure calls). Depending on your underlying filesystem, there could be extra buffer copies and CRC overhead. But none of that explains a transfer as slow as 1 MBps (if my interpretation of the results is correct).

Raghu.

In some testing I did, writing a small file can
take 30-300 ms. So if you have 9000 small files (like I did) and you
are single-threaded, this takes a long time.
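Those numbers multiply out quickly; a back-of-the-envelope check (using the 30-300 ms per file and 9000 files from above):

```java
public class SmallFileMath {
    // Sequential small-file writes: total time = file count * per-file latency.
    static double totalMinutes(long files, long msPerFile) {
        return files * msPerFile / 60000.0;
    }

    public static void main(String[] args) {
        // 9000 files at 30 ms each vs. 300 ms each (numbers from the post above)
        System.out.println(totalMinutes(9000, 30) + " to "
                + totalMinutes(9000, 300) + " minutes");
        // → 4.5 to 45.0 minutes of wall-clock time, single threaded
    }
}
```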

If you orchestrate your task to use FSDataInputStream and FSDataOutputStream in
the map or reduce phase, then each mapper or reducer is writing a file
at a time. Now that is fast.

Ananth, are you doing your reads/writes inside a map/reduce job, or are you just
using the FileSystem API in a standalone program?



On Wed, Aug 19, 2009 at 1:26 AM, Raghu Angadi<rang...@yahoo-inc.com>
wrote:
Ananth T. Sarathy wrote:
I am trying to download binary files stored in Hadoop, but there is about a 2
minute wait on a 20mb file when I try to execute the in.read(buf).

What does this mean: 2 min to pipe 20mb, or one of your in.read()
calls took 2 minutes? Your code actually measures time for read and write.
There is nothing in FSInputStream to cause this slowdown. Do you think
anyone would use Hadoop otherwise? It would be as fast as the underlying
filesystem goes.

Raghu.

is there a better way to be doing this?

    private void pipe(InputStream in, OutputStream out) throws IOException
    {
        System.out.println(System.currentTimeMillis() + " Starting to Pipe Data");
        byte[] buf = new byte[1024];
        int read = 0;
        while ((read = in.read(buf)) >= 0)
        {
            out.write(buf, 0, read);
            System.out.println(System.currentTimeMillis() + " Piping Data");
        }
        out.flush();
        System.out.println(System.currentTimeMillis() + " Finished Piping Data");
    }
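One likely contributor to the slowness in a loop like the one above is the println on every 1 KB iteration plus the small buffer. A sketch of a leaner copy loop in plain java.io (the name pipeFast is hypothetical), which returns the byte count so the caller can log once at the end:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class StreamPipe {
    // Copy in -> out with a 64 KB buffer and no per-iteration logging.
    // Returns the total number of bytes copied.
    static long pipeFast(InputStream in, OutputStream out) throws IOException {
        byte[] buf = new byte[64 * 1024];
        long total = 0;
        int read;
        while ((read = in.read(buf)) != -1) {
            out.write(buf, 0, read);
            total += read;
        }
        out.flush();
        return total;
    }
}
```

Hadoop also ships org.apache.hadoop.io.IOUtils.copyBytes, which does essentially this, so you may not need to hand-roll the loop at all.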

    public void readFile(String fileToRead, OutputStream out)
            throws IOException
    {
        System.out.println(System.currentTimeMillis() + " Start Read File");
        Path inFile = new Path(fileToRead);
        System.out.println(System.currentTimeMillis() + " Set Path");
        // Validate the input path before reading.
        if (!fs.exists(inFile))
        {
            throw new HadoopFileException("Specified file " + fileToRead
                    + " not found.");
        }
        if (!fs.isFile(inFile))
        {
            throw new HadoopFileException("Specified path " + fileToRead
                    + " is not a file.");
        }
        // Open inFile for reading.
        System.out.println(System.currentTimeMillis() + " Opening Data Stream");
        FSDataInputStream in = fs.open(inFile);
        System.out.println(System.currentTimeMillis() + " Opened Data Stream");

        // Read from the input stream and write to the output stream until EOF.
        pipe(in, out);

        // Close the streams when done.
        out.close();
        in.close();
    }
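To answer Raghu's question (2 minutes total, or one read() call stalling for 2 minutes?), a FilterInputStream that records the slowest single read would separate the two cases. A sketch (the class name SlowestReadInputStream is hypothetical); wrapping the stream returned by fs.open(inFile) in this before piping would show whether one call dominates:

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

public class SlowestReadInputStream extends FilterInputStream {
    long slowestNanos = 0;  // duration of the single slowest read() call

    public SlowestReadInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        long start = System.nanoTime();
        int n = super.read(b, off, len);
        long elapsed = System.nanoTime() - start;
        if (elapsed > slowestNanos) {
            slowestNanos = elapsed;
        }
        return n;
    }
}
```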
Ananth T Sarathy


