Thanks Chris, your answer helps me a lot! And I got another idea. If I launch an additional thread that uses short-circuit local reads to read the data stored on the local machine's DataNode, which does not take up network bandwidth, the combined read might perform better when the amount of local data is comparable to the remote data. Does this make sense?
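To make the idea concrete, below is a rough, untested sketch of the combined read, following Chris's suggestion of one stream per thread. It assumes short-circuit reads are enabled (dfs.client.read.shortcircuit) so the thread reading the local block bypasses the NIC, and that each block's length fits in an int-sized buffer; the path and class name are just examples:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ParallelBlockRead {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path path = new Path("/data/bigfile");  // example path
        FileStatus status = fs.getFileStatus(path);
        // One entry per block, giving each block's offset and length.
        BlockLocation[] blocks =
            fs.getFileBlockLocations(status, 0, status.getLen());

        ExecutorService pool = Executors.newFixedThreadPool(blocks.length);
        List<Future<byte[]>> parts = new ArrayList<>();
        for (BlockLocation block : blocks) {
            final long offset = block.getOffset();
            final int length = (int) block.getLength();  // assumes block fits in an int
            parts.add(pool.submit(() -> {
                // Separate stream per thread; seek is cheap, so each
                // thread jumps straight to its own block boundary.
                byte[] buf = new byte[length];
                try (FSDataInputStream in = fs.open(path)) {
                    in.seek(offset);
                    in.readFully(buf);
                }
                return buf;
            }));
        }
        for (Future<byte[]> part : parts) {
            part.get();  // the thread whose block is local reads short-circuited
        }
        pool.shutdown();
    }
}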
Tenghuan He

On Sun, Jan 3, 2016 at 3:00 PM, Chris Nauroth <cnaur...@hortonworks.com> wrote:

> I think you can achieve something close to this with just public APIs by
> launching multiple threads, calling FileSystem#open to get a separate
> input stream in each one, and then calling seek to position each stream
> at a different block boundary. Seek is a cheap operation, basically just
> updating internal offsets. Seeking forward does not require reading
> through the earlier data byte-by-byte, so you won't pay the cost of
> transferring that part of the data.
>
> Whether or not this strategy would really improve performance is subject
> to a lot of other factors. If the application's single-threaded reading
> already saturates the network bandwidth of the NIC, then starting
> multiple threads is unlikely to improve performance. Those threads will
> just run into contention with each other over the scarce network
> bandwidth. If instead the application reads data gradually and performs
> some CPU-intensive processing as it reads, then perhaps the NIC is not
> saturated, and multi-threading could help.
>
> As usual with performance work, the actual outcomes are going to be
> highly situational.
>
> I hope this helps.
>
> --Chris Nauroth
>
> From: Tenghuan He <tenghua...@gmail.com>
> Date: Thursday, December 31, 2015 at 5:17 PM
> To: Chris Nauroth <cnaur...@hortonworks.com>
> Cc: "user@hadoop.apache.org" <user@hadoop.apache.org>
> Subject: Re: Directly reading from datanode using JAVA API got
> socketTimeoutException
>
> The following is what I want to do.
> When reading a big file that spans multiple blocks, I want to read
> different blocks from different nodes in parallel and thus make reading
> the big file faster. Is that possible?
>
> Thanks
>
> On Thu, Dec 31, 2015 at 2:34 AM, Chris Nauroth <cnaur...@hortonworks.com>
> wrote:
>
>> Your code has connected to a DataNode's TCP port, and the DataNode
>> server side is likely blocked expecting the client to send some kind of
>> request defined in the Data Transfer Protocol. The client code here
>> does not write a request, so the DataNode server doesn't know what to
>> do. Instead, the client immediately goes into a blocking read. Since
>> the DataNode server side doesn't know what to do, it's never going to
>> write any bytes back to the socket connection, and therefore the client
>> eventually times out on the read.
>>
>> Stepping back, please be aware that what you are trying to do is
>> unsupported. Relying on private implementation details like this is
>> likely to be brittle and buggy. As the HDFS code evolves in the future,
>> there is no guarantee that what you do here will work the same way in
>> future versions. There might not even be a connectToDN method in future
>> versions if we decide to do some internal refactoring.
>>
>> If you can give a high-level description of what you want to achieve,
>> then perhaps we can suggest a way to do it through the public API.
>>
>> --Chris Nauroth
>>
>> From: Tenghuan He <tenghua...@gmail.com>
>> Date: Wednesday, December 30, 2015 at 9:29 AM
>> To: "user@hadoop.apache.org" <user@hadoop.apache.org>
>> Subject: Directly reading from datanode using JAVA API got
>> socketTimeoutException
>>
>> Hello,
>>
>> I want to directly read from datanode blocks using the Java API as in
>> the following code, but I got a SocketTimeoutException.
>>
>> I use reflection to call the DFSClient private method connectToDN(...)
>> and get an IOStreamPair of in and out, where in is used to read bytes
>> from the datanode.
>> The workhorse code is:
>>
>> try {
>>     // Reflectively obtain DFSClient's private connectToDN method.
>>     Class<?>[] paraList = {DatanodeInfo.class, int.class, LocatedBlock.class};
>>     Method connectToDN =
>>         dfsClient.getClass().getDeclaredMethod("connectToDN", paraList);
>>     connectToDN.setAccessible(true);
>>     // Open raw streams to the DataNode and try to read from them.
>>     IOStreamPair pair =
>>         (IOStreamPair) connectToDN.invoke(dfsClient, datanode, timeout, lb);
>>     in = new DataInputStream(pair.in);
>>     System.out.println(in.getClass());
>>     byte[] b = new byte[10000];
>>     in.readFully(b);
>> } catch (Exception e) {
>>     e.printStackTrace();
>> }
>>
>> and the exception is:
>>
>> java.net.SocketTimeoutException: 11000 millis timeout while waiting for
>> channel to be ready for read. ch :
>> java.nio.channels.SocketChannel[connected local=/192.168.179.1:53765
>> remote=/192.168.179.135:50010]
>>     at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
>>     at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
>>     at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
>>     at java.io.FilterInputStream.read(FilterInputStream.java:133)
>>     at java.io.DataInputStream.readFully(DataInputStream.java:195)
>>     at java.io.DataInputStream.readFully(DataInputStream.java:169)
>>     at BlocksList.main(BlocksList.java:69)
>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>     at java.lang.reflect.Method.invoke(Method.java:497)
>>     at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)
>>
>> Could anyone tell me where the problem is?
>>
>> Thanks & Regards
>>
>> Tenghuan He
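Following up on Chris's point about going through the public API: the same bytes can be fetched without reflection, since the client then performs the Data Transfer Protocol handshake internally. A minimal sketch (the path and offset are placeholders, not from the original thread):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PublicApiRead {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        byte[] b = new byte[10000];
        try (FSDataInputStream in = fs.open(new Path("/data/bigfile"))) {
            // Positioned read: the client locates the block containing
            // this offset, connects to a DataNode holding it, and issues
            // a proper read request, so no raw-socket handling is needed.
            in.readFully(0L, b);
        }
        System.out.println("read " + b.length + " bytes");
    }
}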