FSDataInputStream

2016-07-29 Thread Kristoffer Sjögren
Hi

We're seeing exceptions when closing an FSDataInputStream. I'm not sure
how to interpret the exception. Is there anything that can be done to
avoid it?

Cheers,
-Kristoffer

[2016-07-29 09:28:20,162] ERROR Error closing
hdfs://hdpcluster/tmp/kafka-connect/logs/sting_actions_inscreen/83/log.
(io.confluent.connect.hdfs.TopicPartitionWriter:328)
org.apache.kafka.connect.errors.ConnectException: Error closing
hdfs://hdpcluster/tmp/kafka-connect/logs/sting_actions_inscreen/83/log
at io.confluent.connect.hdfs.wal.FSWAL.close(FSWAL.java:156)
at 
io.confluent.connect.hdfs.TopicPartitionWriter.close(TopicPartitionWriter.java:326)
at io.confluent.connect.hdfs.DataWriter.close(DataWriter.java:296)
at io.confluent.connect.hdfs.HdfsSinkTask.close(HdfsSinkTask.java:109)
at 
org.apache.kafka.connect.runtime.WorkerSinkTask.commitOffsets(WorkerSinkTask.java:290)
at 
org.apache.kafka.connect.runtime.WorkerSinkTask.closePartitions(WorkerSinkTask.java:421)
at 
org.apache.kafka.connect.runtime.WorkerSinkTask.access$1100(WorkerSinkTask.java:54)
at 
org.apache.kafka.connect.runtime.WorkerSinkTask$HandleRebalance.onPartitionsRevoked(WorkerSinkTask.java:465)
at 
org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.onJoinPrepare(ConsumerCoordinator.java:283)
at 
org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureActiveGroup(AbstractCoordinator.java:212)
at 
org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.ensurePartitionAssignment(ConsumerCoordinator.java:345)
at 
org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:977)
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:937)
at 
org.apache.kafka.connect.runtime.WorkerSinkTask.pollConsumer(WorkerSinkTask.java:305)
at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:222)
at 
org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:170)
at 
org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:142)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:140)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:175)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.ipc.RemoteException(java.io.IOException):
BP-141202528-10.3.138.26-1448020478061:blk_1098384937_24779008 does
not exist or is not under Constructionnull
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkUCBlock(FSNamesystem.java:6344)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.updateBlockForPipeline(FSNamesystem.java:6411)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.updateBlockForPipeline(NameNodeRpcServer.java:870)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.updateBlockForPipeline(ClientNamenodeProtocolServerSideTranslatorPB.java:955)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145)

at org.apache.hadoop.ipc.Client.call(Client.java:1468)
at org.apache.hadoop.ipc.Client.call(Client.java:1399)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
at com.sun.proxy.$Proxy48.updateBlockForPipeline(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.updateBlockForPipeline(ClientNamenodeProtocolTranslatorPB.java:877)
at sun.reflect.GeneratedMethodAccessor14.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy49.updateBlockForPipeline(Unknown Source)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1266)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)

manipulate DFSInputStream in FSDataInputStream?

2013-12-14 Thread Nan Zhu
Hi, all

I'm modifying FSDataInputStream for a project.

I would like to directly manipulate the "in" object in my implementation.

Since a DFSInputStream is passed in the constructor, I cast "in" from
InputStream to DFSInputStream with

import org.apache.hadoop.hdfs.DFSClient;

DFSClient.DFSInputStream dins = (DFSClient.DFSInputStream) in;
dins.somemethod(…)


When I compile my code with ant, it says:

[javac] /Users/zhunan/codes/SDNBigData/hadoop-1.2.1/src/core/org/apache/hadoop/fs/FSDataInputStream.java:20: error: package org.apache.hadoop.hdfs does not exist
[javac] import org.apache.hadoop.hdfs.DFSClient;

What does this mean?

Does it mean that core is compiled before hdfs, so I cannot do this?

Thank you very much!

Best,

--  
Nan Zhu
School of Computer Science,
McGill University



Re: manipulate DFSInputStream in FSDataInputStream?

2013-12-14 Thread Nan Zhu
Solved by declaring an empty somemethod() in FSInputStream and overriding it in
DFSInputStream.
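
In outline, that pattern looks like the sketch below (not from the original mails;
BaseStream and DfsStream are hypothetical stand-ins for FSInputStream and
DFSInputStream, and somemethod() is whatever hook the project needs), so the
calling layer never has to import the hdfs package:

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Stand-in for org.apache.hadoop.fs.FSInputStream, which lives in core.
abstract class BaseStream extends InputStream {
    // Empty hook declared on the base class; core code can call it
    // without knowing anything about hdfs.
    public void somemethod() throws IOException {
        // default: no-op
    }
}

// Stand-in for DFSClient.DFSInputStream, which lives in hdfs.
class DfsStream extends BaseStream {
    private final InputStream delegate = new ByteArrayInputStream(new byte[] {42});

    @Override
    public int read() throws IOException {
        return delegate.read();
    }

    // hdfs overrides the hook with its real behaviour.
    @Override
    public void somemethod() throws IOException {
        System.out.println("DFS-specific work here");
    }
}

public class HookSketch {
    public static void main(String[] args) throws IOException {
        InputStream in = new DfsStream();
        // Mirrors what FSDataInputStream can do: depend only on the base type.
        if (in instanceof BaseStream) {
            ((BaseStream) in).somemethod();
        }
    }
}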



--  
Nan Zhu
School of Computer Science,
McGill University





Two questions about FSDataInputStream

2013-11-12 Thread John Lilley
First, this documentation: 
http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FSDataInputStream.html
claims that FSDataInputStream has a seek() method, but javap doesn't show one:
$ javap -classpath [haddoopjars]  org.apache.hadoop.fs.FSDataInputStream
Compiled from "FSDataInputStream.java"
public class org.apache.hadoop.fs.FSDataInputStream extends java.io.DataInputStream implements org.apache.hadoop.fs.Seekable,org.apache.hadoop.fs.PositionedReadable,java.io.Closeable,org.apache.hadoop.fs.ByteBufferReadable,org.apache.hadoop.fs.HasFileDescriptor,org.apache.hadoop.fs.CanSetDropBehind,org.apache.hadoop.fs.CanSetReadahead {
  public org.apache.hadoop.fs.FSDataInputStream(java.io.InputStream) throws java.io.IOException;
  public synchronized void seek(long) throws java.io.IOException;
  public long getPos() throws java.io.IOException;
  public int read(long, byte[], int, int) throws java.io.IOException;
  public void readFully(long, byte[], int, int) throws java.io.IOException;
  public void readFully(long, byte[]) throws java.io.IOException;
  public boolean seekToNewSource(long) throws java.io.IOException;
  public java.io.InputStream getWrappedStream();
  public int read(java.nio.ByteBuffer) throws java.io.IOException;
  public java.io.FileDescriptor getFileDescriptor() throws java.io.IOException;
  public void setReadahead(java.lang.Long) throws java.io.IOException, java.lang.UnsupportedOperationException;
  public void setDropBehind(java.lang.Boolean) throws java.io.IOException, java.lang.UnsupportedOperationException;
}

Second, after every call to inputStream.read(position, byteArray, 0, size), the 
getPos() call returns the same answer.  Should it change?

Given the lack of all these things, how is one supposed to call 
read(ByteBuffer) for random I/O?

john



Re: Two questions about FSDataInputStream

2013-11-12 Thread Ted Yu
For #1, I see the following in output of javap:

  public synchronized void seek(long) throws java.io.IOException;

which is described here:
http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FSDataInputStream.html#seek(long)
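
For the second question: a positioned read(long, byte[], int, int) does not move
the stream's current offset, so getPos() is not expected to change after it; random
I/O with read(ByteBuffer) goes through seek() first. A minimal sketch, not from the
thread (the path is a placeholder and it assumes an HDFS-backed stream, where
read(ByteBuffer) is supported):

import java.nio.ByteBuffer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RandomReadSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        FSDataInputStream in = fs.open(new Path("/tmp/example.dat")); // placeholder path

        byte[] buf = new byte[4096];
        // Positioned read: reads at offset 1024 but leaves the stream pointer alone.
        in.readFully(1024L, buf, 0, buf.length);
        System.out.println("getPos() after positioned read: " + in.getPos()); // still 0

        // Random I/O with read(ByteBuffer): seek to the offset first.
        in.seek(1024L);
        ByteBuffer bb = ByteBuffer.allocate(4096);
        int n = in.read(bb); // reads at the current position and advances getPos()
        System.out.println("read " + n + " bytes, getPos() now " + in.getPos());

        in.close();
    }
}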




Re: Faster alternative to FSDataInputStream

2009-08-21 Thread Jason Venner
It may be some kind of hostname or reverse lookup delay, either on the
origination or destination side.

On Thu, Aug 20, 2009 at 10:43 AM, Raghu Angadi rang...@yahoo-inc.comwrote:

 Ananth T. Sarathy wrote:

 it's on s3. and it always happens.


 I have no experience with S3. You might want to check out S3 forums. It
 can't be normal for S3 either.. there must be something missing
 (configuration, ACLs... ).

 Raghu.


  Ananth T Sarathy


 On Wed, Aug 19, 2009 at 4:35 PM, Raghu Angadi rang...@yahoo-inc.com
 wrote:

  Ananth T. Sarathy wrote:

  Also, I just want to be clear... the delay seems to be at the initial

 (read = in.read(buf))

  Is the file on HDFS (over S3) or S3?

 Does it always happen?

 Raghu.


  after the first time into the loop it flies...

 Ananth T Sarathy


 On Wed, Aug 19, 2009 at 1:58 PM, Raghu Angadi rang...@yahoo-inc.com
 wrote:

  Edward Capriolo wrote:

  On Wed, Aug 19, 2009 at 11:11 AM, Edward Capriolo 

 edlinuxg...@gmail.com

  wrote:
  It would be as fast as underlying filesystem goes.

   I would not agree with that statement. There is overhead.
  You might be misinterpreting my comment. There is of course some overhead
  (at the least the procedure calls).. depending on your underlying filesystem,
  there could be extra buffer copies and CRC overhead. But none of that
  explains transfer as slow as 1 MBps (if my interpretation of the results is
  correct).

 Raghu.








-- 
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.amazon.com/dp/1430219424?tag=jewlerymall
www.prohadoopbook.com a community for Hadoop Professionals


Re: Faster alternative to FSDataInputStream

2009-08-20 Thread Scott Carey
If it always takes a very long time to start transferring data, get a few
stack dumps (jstack or kill -e) during this period to see what it is doing
during this time.

Most likely, the client is doing nothing but waiting on the remote side.


On 8/20/09 8:02 AM, Ananth T. Sarathy ananth.t.sara...@gmail.com wrote:

 it's not really 1 MBps so much as it takes 2 minutes to start doing the
 reads.
 
 Ananth T Sarathy
 
 
 On Wed, Aug 19, 2009 at 4:30 PM, Scott Carey sc...@richrelevance.comwrote:
 
 
 On 8/19/09 10:58 AM, Raghu Angadi rang...@yahoo-inc.com wrote:
 
 Edward Capriolo wrote:
 On Wed, Aug 19, 2009 at 11:11 AM, Edward Capriolo
 edlinuxg...@gmail.comwrote:
 
 It would be as fast as underlying filesystem goes.
 I would not agree with that statement. There is overhead.
 
  You might be misinterpreting my comment. There is of course some overhead
  (at the least the procedure calls).. depending on your underlying
  filesystem, there could be extra buffer copies and CRC overhead. But
  none of that explains transfer as slow as 1 MBps (if my interpretation
  of the results is correct).
 
 Raghu.
 
 
 Yes, there is nothing about distributing work for parallel execution that
 is
 going to make a single 20MB file transfer faster.   That is very slow, and
 should be on the order of a second or so, not multiple minutes.
  Something else is wrong.
 
 
 
 



Re: Faster alternative to FSDataInputStream

2009-08-20 Thread Scott Carey

On 8/20/09 9:48 AM, Ananth T. Sarathy ananth.t.sara...@gmail.com wrote:

 ok.. it seems that's the case.  That seems kind of self-defeating though.
 
 Ananth T Sarathy

Then something is wrong with S3.  It may be misconfigured, or just poor
performance.  I have no experience with S3 but 20 seconds to connect
(authenticate?) and open a file seems very slow for any file system.

 
 



Re: Faster alternative to FSDataInputStream

2009-08-19 Thread Ananth T. Sarathy
I am not saying there is a slowdown caused by hadoop. I was wondering if
there were any other techniques that optimize speed (i.e. reading a little at a
time and writing to the local disk).
Ananth T Sarathy


On Wed, Aug 19, 2009 at 1:26 AM, Raghu Angadi rang...@yahoo-inc.com wrote:

 Ananth T. Sarathy wrote:

 I am trying to download binary files stored in Hadoop but there is like a
 2
 minute wait on a 20mb file when I try to execute the in.read(buf).


  What does this mean: 2 min to pipe 20mb, or one of your in.read() calls
  took 2 minutes? Your code actually measures time for read
  and write.

 There is nothing in FSInputstream to cause this slow down. Do you think
 anyone would use Hadoop otherwise? It would be as fast as underlying
 filesystem goes.

 Raghu.


  is there a better way to be doing this?

 private void pipe(InputStream in, OutputStream out) throws IOException
 {
     System.out.println(System.currentTimeMillis() + " Starting to Pipe Data");
     byte[] buf = new byte[1024];
     int read = 0;
     while ((read = in.read(buf)) >= 0)
     {
         out.write(buf, 0, read);
         System.out.println(System.currentTimeMillis() + " Piping Data");
     }
     out.flush();
     System.out.println(System.currentTimeMillis() + " Finished Piping Data");
 }

 public void readFile(String fileToRead, OutputStream out)
         throws IOException
 {
     System.out.println(System.currentTimeMillis() + " Start Read File");
     Path inFile = new Path(fileToRead);
     System.out.println(System.currentTimeMillis() + " Set Path");
     // Validate the input/output paths before reading/writing.
     if (!fs.exists(inFile))
     {
         throw new HadoopFileException("Specified file " + fileToRead
                 + " not found.");
     }
     if (!fs.isFile(inFile))
     {
         throw new HadoopFileException("Specified file " + fileToRead
                 + " not found.");
     }
     // Open inFile for reading.
     System.out.println(System.currentTimeMillis() + " Opening Data Stream");
     FSDataInputStream in = fs.open(inFile);
     System.out.println(System.currentTimeMillis() + " Opened Data Stream");
     // Open outFile for writing.

     // Read from input stream and write to output stream until EOF.
     pipe(in, out);

     // Close the streams when done.
     out.close();
     in.close();
 }
 Ananth T Sarathy





Re: Faster alternative to FSDataInputStream

2009-08-19 Thread Edward Capriolo
Ananth,

That is your issue really.

For example, I have 20 web servers and I wish to download all the
weblogs from all of them into hadoop.

If you write a top-down program that uses FSDataOutputStream, you are using
hadoop half way: you are using the distributed file system, but you
are not doing any distributed processing.

Better is to specify all the servers/files you wish to download as
your input file. Tell hadoop to use NLineInputFormat. Move your code
inside a map function. Now, since hadoop can run multiple mappers,
using -Dmapred.map.tasks=6 will cause 6 fetchers to run in parallel.
You can turn this up as high as you are comfortable with.

Also, now that you are using m/r you don't have to write files with
FSDataOutputStream; you can use output.collect() to make a sequence
file.

In my case I am using commons-FTP and FSDataOutputStream (not using
output.collect()), as I do not want a big sequence file; I want the
actual files as they exist on the web server and will merge them down
the line in my process. This works very well. I could turn the number
of mappers higher, but I don't want to beat up my web servers and
network any more. (Hint: turn off speculative execution.)

Now you know all my secrets. Good luck :)
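
A rough sketch of that layout, using the old mapred API of that era — the class
names, the /ingest target directory, and the plain-URL fetch are placeholders
(Edward uses commons-FTP instead), so treat it as an outline rather than working
production code:

import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.lib.NLineInputFormat;

public class ParallelFetch {

    // Each input line names one remote file; each map() call fetches that file
    // and streams it straight into HDFS with FSDataOutputStream.
    public static class FetchMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, NullWritable, Text> {
        private JobConf conf;

        @Override
        public void configure(JobConf job) {
            this.conf = job;
        }

        public void map(LongWritable key, Text value,
                        OutputCollector<NullWritable, Text> out, Reporter reporter)
                throws IOException {
            String src = value.toString().trim();       // e.g. http://web01/logs/access.log
            String name = src.substring(src.lastIndexOf('/') + 1);
            Path dst = new Path("/ingest/" + name);     // placeholder target directory
            FileSystem fs = FileSystem.get(conf);
            InputStream in = new URL(src).openStream();
            FSDataOutputStream hdfsOut = fs.create(dst);
            IOUtils.copyBytes(in, hdfsOut, conf, true); // copies and closes both streams
            out.collect(NullWritable.get(), new Text(dst.toString()));
        }
    }

    public static void main(String[] args) throws IOException {
        JobConf job = new JobConf(ParallelFetch.class);
        job.setJobName("parallel-fetch");
        job.setInputFormat(NLineInputFormat.class);     // one (or few) lines per map task
        job.setInt("mapred.line.input.format.linespermap", 1);
        job.setNumReduceTasks(0);                       // map-only job
        job.setSpeculativeExecution(false);             // don't fetch the same file twice
        job.setMapperClass(FetchMapper.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));  // file listing the URLs
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        JobClient.runJob(job);
    }
}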


On Wed, Aug 19, 2009 at 11:45 AM, Ananth T.
Sarathyananth.t.sara...@gmail.com wrote:
 Right now just in a top down program. I am still learning this, so if I need to
 put this in a map and reduce to optimize speed, I will. Right now I am just
 testing certain things, and getting a skeleton to write and pull files from
 the s3 storage. Actual implementation is still being engineered.


 Ananth T Sarathy


 On Wed, Aug 19, 2009 at 11:11 AM, Edward Capriolo 
 edlinuxg...@gmail.comwrote:

 It would be as fast as underlying filesystem goes.
 I would not agree with that statement. There is overhead. If you have
 a single threaded process writing many small files you do not get the
 parallel write speed. In some testing I did writing a small file can
 take 30-300 ms. So if you have 9000 small files (like I did) and you
 are single threaded this takes a long time.

 If you orchestrate your task to use FSDataInput and FSDataOutput in
 the map or reduce phase then each mapper or reducer is writing a file
 at a time. Now that is fast.

 Ananth, are you doing your r/w inside a map/reduce job or are you just
 using FS* in a top down program?







Re: Faster alternative to FSDataInputStream

2009-08-19 Thread Scott Carey

On 8/19/09 10:58 AM, Raghu Angadi rang...@yahoo-inc.com wrote:

 Edward Capriolo wrote:
 On Wed, Aug 19, 2009 at 11:11 AM, Edward Capriolo
 edlinuxg...@gmail.comwrote:
 
 It would be as fast as underlying filesystem goes.
 I would not agree with that statement. There is overhead.
 
  You might be misinterpreting my comment. There is of course some overhead
  (at the least the procedure calls).. depending on your underlying
  filesystem, there could be extra buffer copies and CRC overhead. But
  none of that explains transfer as slow as 1 MBps (if my interpretation
  of the results is correct).
 
 Raghu.


Yes, there is nothing about distributing work for parallel execution that is
going to make a single 20MB file transfer faster.   That is very slow, and
should be on the order of a second or so, not multiple minutes.
 Something else is wrong.




Faster alternative to FSDataInputStream

2009-08-18 Thread Ananth T. Sarathy
I am trying to download binary files stored in Hadoop but there is like a 2
minute wait on a 20mb file when I try to execute the in.read(buf).

is there a better way to be doing this?

private void pipe(InputStream in, OutputStream out) throws IOException
{
    System.out.println(System.currentTimeMillis() + " Starting to Pipe Data");
    byte[] buf = new byte[1024];
    int read = 0;
    while ((read = in.read(buf)) >= 0)
    {
        out.write(buf, 0, read);
        System.out.println(System.currentTimeMillis() + " Piping Data");
    }
    out.flush();
    System.out.println(System.currentTimeMillis() + " Finished Piping Data");
}

public void readFile(String fileToRead, OutputStream out)
        throws IOException
{
    System.out.println(System.currentTimeMillis() + " Start Read File");
    Path inFile = new Path(fileToRead);
    System.out.println(System.currentTimeMillis() + " Set Path");
    // Validate the input/output paths before reading/writing.
    if (!fs.exists(inFile))
    {
        throw new HadoopFileException("Specified file " + fileToRead
                + " not found.");
    }
    if (!fs.isFile(inFile))
    {
        throw new HadoopFileException("Specified file " + fileToRead
                + " not found.");
    }
    // Open inFile for reading.
    System.out.println(System.currentTimeMillis() + " Opening Data Stream");
    FSDataInputStream in = fs.open(inFile);
    System.out.println(System.currentTimeMillis() + " Opened Data Stream");
    // Open outFile for writing.

    // Read from input stream and write to output stream until EOF.
    pipe(in, out);

    // Close the streams when done.
    out.close();
    in.close();
}
Ananth T Sarathy


Re: Faster alternative to FSDataInputStream

2009-08-18 Thread Qin Gao
Is the code called in a Mapper/Reducer? If so, DistributedCache is probably a
better solution.
--Q
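
A minimal sketch of that idea (not from the thread; the file path and class names
are placeholders), using the old mapred API: register the file at submission time,
then read the local copy inside the task instead of re-opening it over HDFS/S3 on
every call:

import java.io.File;
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;

public class CacheSketch {

    // At job-submission time: register an HDFS file with the cache.
    public static void addToCache(JobConf job) throws IOException {
        DistributedCache.addCacheFile(URI.create("/data/binary-blob.dat"), job);
    }

    // Inside the mapper/reducer: the framework has already copied the file
    // to the task's local disk, so it can be read like any local file.
    public static class MyTask extends MapReduceBase {
        private File localCopy;

        @Override
        public void configure(JobConf job) {
            try {
                Path[] cached = DistributedCache.getLocalCacheFiles(job);
                localCopy = new File(cached[0].toString());
            } catch (IOException e) {
                throw new RuntimeException("cache file not available", e);
            }
        }
    }
}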




Re: Faster alternative to FSDataInputStream

2009-08-18 Thread George Porter
You could try changing your buffer size from 1KB to a much higher
number (like 64MB).

i.e.,

  byte[] buf = new byte[1024*1024*64];

-George
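
Along those lines, a sketch of the pipe() loop with a larger buffer (the 8 MB size
is an arbitrary example, not a recommendation from the thread), plus hadoop's own
IOUtils helper as an alternative to a hand-rolled copy loop:

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.hadoop.io.IOUtils;

public class PipeSketch {

    // The original pipe() loop, just with a larger buffer.
    static void pipe(InputStream in, OutputStream out) throws IOException {
        byte[] buf = new byte[8 * 1024 * 1024]; // arbitrary example size
        int read;
        while ((read = in.read(buf)) >= 0) {
            out.write(buf, 0, read);
        }
        out.flush();
    }

    // Alternative: let hadoop's helper do the copy with its own buffer.
    static void pipeWithIOUtils(InputStream in, OutputStream out) throws IOException {
        IOUtils.copyBytes(in, out, 64 * 1024, false); // false = leave the streams open
    }
}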
