FSDataInputStream
Hi,

We're seeing exceptions when closing an FSDataInputStream. I'm not sure how to interpret the exception. Is there anything that can be done to avoid it?

Cheers,
-Kristoffer

[2016-07-29 09:28:20,162] ERROR Error closing hdfs://hdpcluster/tmp/kafka-connect/logs/sting_actions_inscreen/83/log. (io.confluent.connect.hdfs.TopicPartitionWriter:328)
org.apache.kafka.connect.errors.ConnectException: Error closing hdfs://hdpcluster/tmp/kafka-connect/logs/sting_actions_inscreen/83/log
    at io.confluent.connect.hdfs.wal.FSWAL.close(FSWAL.java:156)
    at io.confluent.connect.hdfs.TopicPartitionWriter.close(TopicPartitionWriter.java:326)
    at io.confluent.connect.hdfs.DataWriter.close(DataWriter.java:296)
    at io.confluent.connect.hdfs.HdfsSinkTask.close(HdfsSinkTask.java:109)
    at org.apache.kafka.connect.runtime.WorkerSinkTask.commitOffsets(WorkerSinkTask.java:290)
    at org.apache.kafka.connect.runtime.WorkerSinkTask.closePartitions(WorkerSinkTask.java:421)
    at org.apache.kafka.connect.runtime.WorkerSinkTask.access$1100(WorkerSinkTask.java:54)
    at org.apache.kafka.connect.runtime.WorkerSinkTask$HandleRebalance.onPartitionsRevoked(WorkerSinkTask.java:465)
    at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.onJoinPrepare(ConsumerCoordinator.java:283)
    at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureActiveGroup(AbstractCoordinator.java:212)
    at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.ensurePartitionAssignment(ConsumerCoordinator.java:345)
    at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:977)
    at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:937)
    at org.apache.kafka.connect.runtime.WorkerSinkTask.pollConsumer(WorkerSinkTask.java:305)
    at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:222)
    at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:170)
    at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:142)
    at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:140)
    at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:175)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.ipc.RemoteException(java.io.IOException): BP-141202528-10.3.138.26-1448020478061:blk_1098384937_24779008 does not exist or is not under Constructionnull
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkUCBlock(FSNamesystem.java:6344)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.updateBlockForPipeline(FSNamesystem.java:6411)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.updateBlockForPipeline(NameNodeRpcServer.java:870)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.updateBlockForPipeline(ClientNamenodeProtocolServerSideTranslatorPB.java:955)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145)
    at org.apache.hadoop.ipc.Client.call(Client.java:1468)
    at org.apache.hadoop.ipc.Client.call(Client.java:1399)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
    at com.sun.proxy.$Proxy48.updateBlockForPipeline(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.updateBlockForPipeline(ClientNamenodeProtocolTranslatorPB.java:877)
    at sun.reflect.GeneratedMethodAccessor14.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
    at com.sun.proxy.$Proxy49.updateBlockForPipeline(Unknown Source)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1266)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:594)
manipulate DFSInputStream in FSDataInputStream?
Hi, all

I'm modifying FSDataInputStream for a project, and I would like to directly manipulate the "in" object in my implementation. Since a DFSInputStream is passed in via the constructor, I cast "in" from InputStream to DFSInputStream with:

import org.apache.hadoop.hdfs.DFSClient;

DFSClient.DFSInputStream dins = (DFSClient.DFSInputStream) in;
dins.somemethod(…);

When I compile my code with ant it says:

[javac] /Users/zhunan/codes/SDNBigData/hadoop-1.2.1/src/core/org/apache/hadoop/fs/FSDataInputStream.java:20: error: package org.apache.hadoop.hdfs does not exist
[javac] import org.apache.hadoop.hdfs.DFSClient;

What does this mean? Does it mean that core is compiled before hdfs, so I cannot do this?

Thank you very much!

Best,
-- Nan Zhu
School of Computer Science, McGill University
Re: manipulate DFSInputStream in FSDataInputStream?
Solved by declaring an empty somemethod() in FSInputStream and overriding it in DFSInputStream.

-- Nan Zhu
School of Computer Science, McGill University
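In case the shape of that fix is useful to anyone else: it works because the core fs layer only ever sees the method declared on FSInputStream, and virtual dispatch finds the HDFS override at runtime, so core no longer needs to import anything from org.apache.hadoop.hdfs. A minimal self-contained sketch of the pattern (class names here are illustrative stand-ins, not the real Hadoop classes, and somemethod() is just the placeholder name from the thread):

    import java.io.IOException;
    import java.io.InputStream;

    // Stand-in for FSInputStream: lives in the "core" layer and knows
    // nothing about HDFS.
    abstract class CoreInputStream extends InputStream {
        // Empty default so core code can invoke the hook without
        // importing any HDFS class.
        public void somemethod() {
            // no-op; filesystem-specific subclasses override this
        }
    }

    // Stand-in for DFSInputStream: lives in the "hdfs" layer.
    class HdfsInputStream extends CoreInputStream {
        @Override
        public int read() throws IOException {
            return -1; // stub; the real class reads from datanodes
        }

        @Override
        public void somemethod() {
            // HDFS-specific behavior goes here
        }
    }

    // Stand-in for FSDataInputStream: holds the wrapped stream as the
    // abstract core type, so no cast (and no hdfs import) is needed.
    class WrapperStream {
        private final CoreInputStream in;

        WrapperStream(CoreInputStream in) {
            this.in = in;
        }

        void useHook() {
            in.somemethod(); // dispatches to HdfsInputStream at runtime
        }
    }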
Two questions about FSDataInputStream
First, this documentation: http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FSDataInputStream.html claims that FSDataInputStream has a seek() method, but javap doesn't show one:

$ javap -classpath [hadoop jars] org.apache.hadoop.fs.FSDataInputStream
Compiled from "FSDataInputStream.java"
public class org.apache.hadoop.fs.FSDataInputStream extends java.io.DataInputStream implements org.apache.hadoop.fs.Seekable, org.apache.hadoop.fs.PositionedReadable, java.io.Closeable, org.apache.hadoop.fs.ByteBufferReadable, org.apache.hadoop.fs.HasFileDescriptor, org.apache.hadoop.fs.CanSetDropBehind, org.apache.hadoop.fs.CanSetReadahead {
    public org.apache.hadoop.fs.FSDataInputStream(java.io.InputStream) throws java.io.IOException;
    public synchronized void seek(long) throws java.io.IOException;
    public long getPos() throws java.io.IOException;
    public int read(long, byte[], int, int) throws java.io.IOException;
    public void readFully(long, byte[], int, int) throws java.io.IOException;
    public void readFully(long, byte[]) throws java.io.IOException;
    public boolean seekToNewSource(long) throws java.io.IOException;
    public java.io.InputStream getWrappedStream();
    public int read(java.nio.ByteBuffer) throws java.io.IOException;
    public java.io.FileDescriptor getFileDescriptor() throws java.io.IOException;
    public void setReadahead(java.lang.Long) throws java.io.IOException, java.lang.UnsupportedOperationException;
    public void setDropBehind(java.lang.Boolean) throws java.io.IOException, java.lang.UnsupportedOperationException;
}

Second, after every call to inputStream.read(position, byteArray, 0, size), the getPos() call returns the same answer. Should it change? Given the lack of all these things, how is one supposed to call read(ByteBuffer) for random I/O?

john
Re: Two questions about FSDataInputStream
For #1, I see the following in the output of javap:

public synchronized void seek(long) throws java.io.IOException;

which is described here: http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FSDataInputStream.html#seek(long)
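On #2: the positional read(long, byte[], int, int) comes from PositionedReadable and is deliberately stateless, so it does not move the stream's file offset; that is why getPos() keeps returning the same answer. read(ByteBuffer) reads at (and advances) the current offset, so for random I/O you seek() first. A minimal sketch using only the methods from the javap output above (the path is hypothetical, and read(ByteBuffer) works only when the wrapped stream supports it, as DFSInputStream does):

    import java.nio.ByteBuffer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class RandomReadDemo {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FSDataInputStream in = fs.open(new Path("/tmp/example.dat")); // hypothetical file

            // Positioned read (PositionedReadable): stateless by design,
            // so the stream's current offset is untouched.
            byte[] buf = new byte[128];
            in.read(4096L, buf, 0, buf.length);
            System.out.println("getPos() after positioned read: " + in.getPos()); // still 0

            // Random I/O with read(ByteBuffer): this read uses and
            // advances the current offset, so seek() to the target first.
            in.seek(4096L);
            ByteBuffer bb = ByteBuffer.allocate(128);
            int n = in.read(bb); // may throw UnsupportedOperationException if the
                                 // wrapped stream is not ByteBufferReadable
            System.out.println("read " + n + " bytes, getPos() now: " + in.getPos());

            in.close();
        }
    }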
Re: Faster alternative to FSDataInputStream
It may be some kind of hostname or reverse-lookup delay, either on the origination or destination side.

On Thu, Aug 20, 2009 at 10:43 AM, Raghu Angadi rang...@yahoo-inc.com wrote:
Ananth T. Sarathy wrote: it's on S3, and it always happens.
I have no experience with S3. You might want to check out the S3 forums. It can't be normal for S3 either.. there must be something missing (configuration, ACLs...). Raghu.

On Wed, Aug 19, 2009 at 4:35 PM, Raghu Angadi rang...@yahoo-inc.com wrote:
Ananth T. Sarathy wrote: Also, I just want to be clear... the delay seems to be at the initial (read = in.read(buf)); after the first time into the loop it flies...
Is the file on HDFS (over S3) or S3? Does it always happen? Raghu.
Re: Faster alternative to FSDataInputStream
If it always takes a very long time to start transferring data, get a few stack dumps (jstack or kill -3) during this period to see what it is doing during this time. Most likely, the client is doing nothing but waiting on the remote side.

On 8/20/09 8:02 AM, Ananth T. Sarathy ananth.t.sara...@gmail.com wrote:
it's not really 1 MBps so much as it takes 2 minutes to start doing the reads. Ananth T Sarathy
Re: Faster alternative to FSDataInputStream
On 8/20/09 9:48 AM, Ananth T. Sarathy ananth.t.sara...@gmail.com wrote:
ok.. it seems that's the case. that seems kind of self-defeating though. Ananth T Sarathy

Then something is wrong with S3. It may be misconfigured, or just performing poorly. I have no experience with S3, but 20 seconds to connect (authenticate?) and open a file seems very slow for any file system.
Re: Faster alternative to FSDataInputStream
I am not saying there is a slowdown caused by Hadoop. I was wondering if there were any other techniques that optimize speed (i.e., reading a little at a time and writing to the local disk). Ananth T Sarathy

On Wed, Aug 19, 2009 at 1:26 AM, Raghu Angadi rang...@yahoo-inc.com wrote:
Ananth T. Sarathy wrote: I am trying to download binary files stored in Hadoop but there is like a 2 minute wait on a 20mb file when I try to execute the in.read(buf).
What does this mean: 2 min to pipe 20mb, or one of your in.read() calls took 2 minutes? Your code actually measures time for both read and write. There is nothing in FSInputStream to cause this slowdown. Do you think anyone would use Hadoop otherwise? It would be as fast as the underlying filesystem goes. Raghu.
Re: Faster alternative to FSDataInputStream
Ananth,

That is your issue really. For example, I have 20 web servers and I wish to download all the weblogs from all of them into Hadoop. If you write a top-down program that uses FSDataOutput, you are using Hadoop half way: you are using the distributed file system, but you are not doing any distributed processing. Better is to list all the servers/files you wish to download in your input file, tell Hadoop to use NLineInputFormat, and move your code inside a map function. Since Hadoop can run multiple mappers, using -Dmapred.map.tasks=6 will cause 6 fetchers to run in parallel. You can up this as high as you are comfortable with. Also, now that you are using m/r, you don't have to write files with FSDataOutputStream; you can use output.collect() to make a sequence file. In my case I am using commons-FTP and FSDataOutputStream (not output.collect()) because I do not want one big sequence file; I want the actual files as they exist on the web server, and I will merge them down the line in my process. This works very well. I could turn the number of mappers higher, but I don't want to beat up my web servers and network any more. (Hint: turn off speculative execution.) Now you know all my secrets. Good luck :)

On Wed, Aug 19, 2009 at 11:45 AM, Ananth T. Sarathy ananth.t.sara...@gmail.com wrote:
Right now it's just a top-down program. I am still learning this, so if I need to put this in a map and reduce to optimize speed, I will. Right now I am just testing certain things and getting a skeleton to write and pull files from the S3 storage. Actual implementation is still being engineered. Ananth T Sarathy

On Wed, Aug 19, 2009 at 11:11 AM, Edward Capriolo edlinuxg...@gmail.com wrote:
"It would be as fast as the underlying filesystem goes." I would not agree with that statement. There is overhead. If you have a single-threaded process writing many small files, you do not get the parallel write speed. In some testing I did, writing a small file can take 30-300 ms. So if you have 9000 small files (like I did) and you are single-threaded, this takes a long time. If you orchestrate your task to use FSDataInput and FSDataOutput in the map or reduce phase, then each mapper or reducer is writing a file at a time. Now that is fast. Ananth, are you doing your r/w inside a map/reduce job or are you just using FS* in a top-down program?
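A rough, self-contained sketch of the shape Edward describes, using the old org.apache.hadoop.mapred API of that era; the one-URL-per-line input format, the /logs output layout, and the buffer size are my assumptions, not anything from the thread:

    import java.io.IOException;
    import java.io.InputStream;
    import java.net.URL;

    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Each input line is one URL to fetch. Configure the job with
    // conf.setInputFormat(org.apache.hadoop.mapred.lib.NLineInputFormat.class)
    // so each mapper gets N lines and fetches run in parallel across tasks.
    public class FetchMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, NullWritable, NullWritable> {

        private FileSystem fs;

        @Override
        public void configure(JobConf conf) {
            try {
                fs = FileSystem.get(conf);
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }

        @Override
        public void map(LongWritable key, Text value,
                OutputCollector<NullWritable, NullWritable> output, Reporter reporter)
                throws IOException {
            String url = value.toString().trim();
            // Hypothetical layout: one HDFS file per fetched URL under /logs.
            Path dst = new Path("/logs/" + url.replaceAll("[^A-Za-z0-9._-]", "_"));
            InputStream in = new URL(url).openStream();
            FSDataOutputStream out = fs.create(dst);
            try {
                byte[] buf = new byte[64 * 1024];
                int n;
                while ((n = in.read(buf)) > 0) {
                    out.write(buf, 0, n);
                    reporter.progress(); // keep the task alive during a long fetch
                }
            } finally {
                out.close();
                in.close();
            }
        }
    }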
Re: Faster alternative to FSDataInputStream
On 8/19/09 10:58 AM, Raghu Angadi rang...@yahoo-inc.com wrote:
Edward Capriolo wrote: "It would be as fast as the underlying filesystem goes." I would not agree with that statement. There is overhead.
You might be misinterpreting my comment. There is of course some overhead (at the least the procedure calls).. depending on your underlying filesystem, there could be extra buffer copies and CRC overhead. But none of that explains a transfer as slow as 1 MBps (if my interpretation of the results is correct). Raghu.

Yes, there is nothing about distributing work for parallel execution that is going to make a single 20MB file transfer faster. That is very slow, and should be on the order of a second or so, not multiple minutes. Something else is wrong.
Faster alternative to FSDataInputStream
I am trying to download binary files stored in Hadoop, but there is like a 2 minute wait on a 20mb file when I try to execute in.read(buf). Is there a better way to be doing this?

private void pipe(InputStream in, OutputStream out) throws IOException {
    System.out.println(System.currentTimeMillis() + " Starting to Pipe Data");
    byte[] buf = new byte[1024];
    int read = 0;
    while ((read = in.read(buf)) >= 0) {
        out.write(buf, 0, read);
        System.out.println(System.currentTimeMillis() + " Piping Data");
    }
    out.flush();
    System.out.println(System.currentTimeMillis() + " Finished Piping Data");
}

public void readFile(String fileToRead, OutputStream out) throws IOException {
    System.out.println(System.currentTimeMillis() + " Start Read File");
    Path inFile = new Path(fileToRead);
    System.out.println(System.currentTimeMillis() + " Set Path");
    // Validate the input path before reading.
    if (!fs.exists(inFile)) {
        throw new HadoopFileException("Specified file " + fileToRead + " not found.");
    }
    if (!fs.isFile(inFile)) {
        throw new HadoopFileException("Specified file " + fileToRead + " is not a file.");
    }
    // Open inFile for reading.
    System.out.println(System.currentTimeMillis() + " Opening Data Stream");
    FSDataInputStream in = fs.open(inFile);
    System.out.println(System.currentTimeMillis() + " Opened Data Stream");
    // Read from input stream and write to output stream until EOF.
    pipe(in, out);
    // Close the streams when done.
    out.close();
    in.close();
}

Ananth T Sarathy
Re: Faster alternative to FSDataInputStream
Is the code called in a Mapper/Reducer? If so, DistributedCache is probably a better solution.

--Q

On Tue, Aug 18, 2009 at 12:00 PM, Ananth T. Sarathy ananth.t.sara...@gmail.com wrote:
I am trying to download binary files stored in Hadoop, but there is like a 2 minute wait on a 20mb file when I try to execute in.read(buf). Is there a better way to be doing this?
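For the record, a minimal sketch of the DistributedCache route Q mentions, using the old API (the HDFS path is hypothetical): the framework copies the file to each task node's local disk once, and every task on that node reads the local copy instead of pulling it over the network on each read.

    import java.net.URI;

    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;

    public class CacheExample {
        // At job-submission time: register an HDFS file for distribution.
        public static void setup(JobConf conf) throws Exception {
            DistributedCache.addCacheFile(new URI("/data/binary-blob.dat"), conf);
        }

        // Inside a mapper's configure(): locate the node-local copies.
        public static Path firstLocalCopy(JobConf conf) throws Exception {
            Path[] local = DistributedCache.getLocalCacheFiles(conf);
            return local[0]; // read this with ordinary local file I/O
        }
    }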
Re: Faster alternative to FSDataInputStream
You could try changing your buffer size from 1 KB to a much higher number (like 64 MB), i.e., byte[] buf = new byte[1024*1024*64];

-George

On Tue, Aug 18, 2009 at 9:00 AM, Ananth T. Sarathy ananth.t.sara...@gmail.com wrote:
I am trying to download binary files stored in Hadoop, but there is like a 2 minute wait on a 20mb file when I try to execute in.read(buf). Is there a better way to be doing this?
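Along the same lines: Hadoop ships a copy helper, so the hand-rolled loop isn't strictly needed. A minimal sketch of readFile() rewritten around org.apache.hadoop.io.IOUtils.copyBytes() (the 64 KB buffer is an arbitrary choice, and the fs parameter stands in for the field used in the original code):

    import java.io.IOException;
    import java.io.OutputStream;

    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsCopy {
        // Copies one HDFS file to 'out' using Hadoop's own copy loop
        // instead of the 1 KB hand-rolled pipe().
        public static void readFile(FileSystem fs, String fileToRead, OutputStream out)
                throws IOException {
            FSDataInputStream in = fs.open(new Path(fileToRead));
            try {
                // 'false' leaves both streams open so we control closing below.
                IOUtils.copyBytes(in, out, 64 * 1024, false);
                out.flush();
            } finally {
                IOUtils.closeStream(in);
            }
        }
    }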