Also, I just want to be clear... the delay seems to be at the initial

(read = in.read(buf))

after the first time through the loop it flies...

Ananth T Sarathy
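
For reference, a minimal sketch that times fs.open(), the first read, and the remaining reads separately, to confirm the cost really is in that initial read (for HDFS, the first read is typically where the connection to the DataNode gets set up). The class name, the args[0] path, and the default-HDFS assumption are just placeholders for illustration:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Sketch: time open(), the first read, and the rest of the reads separately.
    // FirstReadTimer is a made-up name; args[0] is the path of the file to test.
    public class FirstReadTimer {
        public static void main(String[] args) throws IOException {
            FileSystem fs = FileSystem.get(new Configuration()); // assumes default fs is HDFS
            Path p = new Path(args[0]);

            long t0 = System.currentTimeMillis();
            FSDataInputStream in = fs.open(p);
            long t1 = System.currentTimeMillis();

            byte[] buf = new byte[64 * 1024];
            int n = in.read(buf); // first read: this is where the pipeline gets set up
            long t2 = System.currentTimeMillis();

            long total = Math.max(n, 0);
            while ((n = in.read(buf)) > 0) {
                total += n;
            }
            long t3 = System.currentTimeMillis();
            in.close();

            System.out.println("open: " + (t1 - t0) + " ms, first read: " + (t2 - t1)
                    + " ms, remaining " + total + " bytes: " + (t3 - t2) + " ms");
        }
    }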


On Wed, Aug 19, 2009 at 1:58 PM, Raghu Angadi <rang...@yahoo-inc.com> wrote:

> Edward Capriolo wrote:
>
>> On Wed, Aug 19, 2009 at 11:11 AM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
>>>
>>>>> It would be as fast as the underlying filesystem goes.
>>>>
>>>> I would not agree with that statement. There is overhead.
>>>
> You might be misinterpreting my comment. There is of course some overhead
> (at the least, the procedure calls). Depending on your underlying filesystem,
> there could be extra buffer copies and CRC overhead. But none of that
> explains a transfer as slow as 1 MBps (if my interpretation of the results is
> correct).
>
> Raghu.
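
One rough way to test the CRC part of that hypothesis is to read the same file with checksum verification on and off and compare timings. A sketch, assuming the FileSystem implementation in use honors setVerifyChecksum(); the class name and path argument are placeholders:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Sketch: read the same file twice, once with client-side checksum
    // verification enabled and once with it disabled, and compare the times.
    public class CrcReadTest {
        static long timeRead(FileSystem fs, Path p) throws IOException {
            byte[] buf = new byte[64 * 1024];
            long start = System.currentTimeMillis();
            FSDataInputStream in = fs.open(p);
            while (in.read(buf) > 0) { /* discard the data */ }
            in.close();
            return System.currentTimeMillis() - start;
        }

        public static void main(String[] args) throws IOException {
            FileSystem fs = FileSystem.get(new Configuration());
            Path p = new Path(args[0]); // path of the file to test
            fs.setVerifyChecksum(true);
            System.out.println("with CRC:    " + timeRead(fs, p) + " ms");
            fs.setVerifyChecksum(false);
            System.out.println("without CRC: " + timeRead(fs, p) + " ms");
        }
    }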
>
>
>>>> In some testing I did, writing a small file can
>>>> take 30-300 ms. So if you have 9000 small files (like I did) and you
>>>> are single-threaded, this takes a long time.
>>>>
>>>> If you orchestrate your task to use FSDataInputStream and
>>>> FSDataOutputStream in the map or reduce phase, then each mapper or
>>>> reducer writes a file at a time. Now that is fast.
>>>>
>>>> Ananth, are you doing your r/w inside a map/reduce job or are you just
>>>> using FS* in a top down program?
>>>>
>>>>
>>>>
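
A sketch of the pattern Edward describes, with each map task writing its own file through FSDataOutputStream so the small files go out in parallel rather than one after another from a single client. It uses the old mapred API, and the class name and output directory are made up for illustration:

    import java.io.IOException;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Sketch only: each map task writes a small side file per record, so many
    // files are written concurrently across tasks. "/out/side-files" is a
    // placeholder path, not anything from the thread.
    public class SideFileMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {

        private FileSystem fs;
        private String taskId;

        public void configure(JobConf job) {
            taskId = job.get("mapred.task.id", "task");
            try {
                fs = FileSystem.get(job);
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, Text> out, Reporter reporter)
                throws IOException {
            // one small file per input record, written by this task
            Path p = new Path("/out/side-files/" + taskId + "-" + key.get());
            FSDataOutputStream os = fs.create(p);
            os.write(value.getBytes(), 0, value.getLength());
            os.close();
            out.collect(new Text(p.toString()), value);
        }
    }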
>>>> On Wed, Aug 19, 2009 at 1:26 AM, Raghu Angadi <rang...@yahoo-inc.com> wrote:
>>>>
>>>>> Ananth T. Sarathy wrote:
>>>>>
>>>>>> I am trying to download binary files stored in Hadoop, but there is
>>>>>> like a 2 minute wait on a 20 MB file when I try to execute the
>>>>>> in.read(buf).
>>>>>>
>>>>> What does this mean: 2 min to pipe 20 MB, or one of your in.read()
>>>>> calls took 2 minutes? Your code actually measures time for read and
>>>>> write.
>>>>>
>>>>> There is nothing in FSInputStream to cause this slowdown. Do you think
>>>>> anyone would use Hadoop otherwise? It would be as fast as the
>>>>> underlying filesystem goes.
>>>>>
>>>>> Raghu.
>>>>>
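
To separate the two, here is a sketch of a pipe() variant that accumulates read time and write time independently (the original pipe() is quoted just below; this assumes the same java.io imports and class context):

    // Sketch: same copy loop as the quoted pipe(), but read and write times
    // are totalled separately so the slow side is obvious.
    private void pipeTimed(InputStream in, OutputStream out) throws IOException {
        byte[] buf = new byte[64 * 1024]; // larger buffer than the original 1 KB
        long readMs = 0, writeMs = 0, bytes = 0;
        while (true) {
            long t = System.currentTimeMillis();
            int n = in.read(buf);
            readMs += System.currentTimeMillis() - t;
            if (n < 0) {
                break;
            }
            t = System.currentTimeMillis();
            out.write(buf, 0, n);
            writeMs += System.currentTimeMillis() - t;
            bytes += n;
        }
        out.flush();
        System.out.println(bytes + " bytes, read: " + readMs
                + " ms, write: " + writeMs + " ms");
    }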
>>>>>> is there a better way to be doing this?
>>>>>>
>>>>>>   private void pipe(InputStream in, OutputStream out) throws IOException
>>>>>>   {
>>>>>>       System.out.println(System.currentTimeMillis() + " Starting to Pipe Data");
>>>>>>       byte[] buf = new byte[1024];
>>>>>>       int read = 0;
>>>>>>       while ((read = in.read(buf)) >= 0)
>>>>>>       {
>>>>>>           out.write(buf, 0, read);
>>>>>>           System.out.println(System.currentTimeMillis() + " Piping Data");
>>>>>>       }
>>>>>>       out.flush();
>>>>>>       System.out.println(System.currentTimeMillis() + " Finished Piping Data");
>>>>>>   }
>>>>>>
>>>>>>   public void readFile(String fileToRead, OutputStream out)
>>>>>>           throws IOException
>>>>>>   {
>>>>>>       System.out.println(System.currentTimeMillis() + " Start Read File");
>>>>>>       Path inFile = new Path(fileToRead);
>>>>>>       System.out.println(System.currentTimeMillis() + " Set Path");
>>>>>>       // Validate the input path before reading.
>>>>>>       if (!fs.exists(inFile))
>>>>>>       {
>>>>>>           throw new HadoopFileException("Specified file " + fileToRead
>>>>>>                   + " not found.");
>>>>>>       }
>>>>>>       if (!fs.isFile(inFile))
>>>>>>       {
>>>>>>           throw new HadoopFileException("Specified path " + fileToRead
>>>>>>                   + " is not a file.");
>>>>>>       }
>>>>>>       // Open inFile for reading.
>>>>>>       System.out.println(System.currentTimeMillis() + " Opening Data Stream");
>>>>>>       FSDataInputStream in = fs.open(inFile);
>>>>>>       System.out.println(System.currentTimeMillis() + " Opened Data Stream");
>>>>>>
>>>>>>       // Read from input stream and write to output stream until EOF.
>>>>>>       pipe(in, out);
>>>>>>
>>>>>>       // Close the streams when done.
>>>>>>       out.close();
>>>>>>       in.close();
>>>>>>   }
>>>>>> Ananth T Sarathy
>>>>>>
>>>>>>
>>>>>
>
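
On the "is there a better way" question in the quoted code: for a plain copy, Hadoop's own org.apache.hadoop.io.IOUtils.copyBytes does the same loop with a configurable buffer and without logging every pass (with a 1 KB buffer and a println per iteration, a 20 MB file produces roughly 20,000 log lines, which adds noticeable overhead of its own). A sketch of readFile() rewritten that way; it assumes the same pre-initialized "fs" field as the quoted code, and the 64 KB buffer size is an arbitrary choice:

    import java.io.IOException;
    import java.io.OutputStream;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    // Sketch only: same validation as the original readFile(), then a copy
    // via IOUtils.copyBytes with a 64 KB buffer and no per-iteration logging.
    // "fs" is assumed to be the already-initialized FileSystem field used
    // in the quoted code.
    public void readFile(String fileToRead, OutputStream out) throws IOException {
        Path inFile = new Path(fileToRead);
        if (!fs.exists(inFile) || !fs.isFile(inFile)) {
            throw new IOException("Specified file " + fileToRead
                    + " not found or not a regular file.");
        }
        FSDataInputStream in = fs.open(inFile);
        try {
            IOUtils.copyBytes(in, out, 64 * 1024, false); // false: leave streams open
            out.flush();
        } finally {
            in.close();
        }
    }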
