Maybe contact Oracle support?

Have you perhaps accidentally configured some firewall rules? Or are there routing issues?
Maybe on only one of the nodes...
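
If it helps, here is a rough sketch of how one could check this from each node: it just tries to open a TCP connection to the datanode transfer port (50010, taken from the log line quoted below) and reports which hosts fail. Apart from node24.ib, the host names are placeholders I made up, and the function names are my own, so adjust them to your setup. A successful connect only shows the port is reachable at that moment; it won't catch intermittent problems unless you run it repeatedly.

```python
import socket

# Hypothetical host list: node24.ib appears in the quoted log,
# node01.ib is a made-up placeholder. Replace with the real datanodes.
DATANODES = ["node01.ib", "node24.ib"]
DATA_PORT = 50010  # datanode transfer port seen in the quoted log


def can_connect(host, port, timeout=3.0):
    """Return (True, "") if a TCP connection succeeds, else (False, reason)."""
    try:
        # create_connection resolves the name and connects; closing the
        # socket right away is enough for a reachability check.
        with socket.create_connection((host, port), timeout=timeout):
            return True, ""
    except OSError as exc:
        # Covers DNS failures, timeouts, "Network is unreachable",
        # "Connection refused", and similar errors.
        return False, str(exc)


def probe(hosts, port=DATA_PORT, timeout=3.0):
    """Map each host name to the (ok, reason) result of can_connect()."""
    return {host: can_connect(host, port, timeout) for host in hosts}
```

Calling `probe(DATANODES)` on every node (for instance via ansible) and comparing the results should show whether only one node, or only one direction, is affected.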

> On 9. Nov 2017, at 20:04, Jan-Hendrik Zab <> wrote:
> Hello!
> This might not be the perfect list for this issue, but I previously
> tried user@ with the same issue (and a bit less information), to no
> avail. So I'm hoping someone here can point me in the right direction.
> We're using Spark 2.2 on CDH 5.13 (Hadoop 2.6 with patches) and a lot of
> our jobs fail, even when the jobs are super simple. For instance: [0]
> We get two kinds of "errors", one where a task is actually marked as
> failed in the web ui [1]. Basically:
> org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block:
> BP-1464899749-
> file=/data/ia/derivatives/de/links/TA/part-68879.gz
> See link for the stack trace.
> When I check the block via "hdfs fsck -blockId blk_1084837359" all is
> well, I can also `-cat' the data into `wc'. It's a valid GZIP file.
> The other kind of "error" we are getting is [2]:
> DFSClient: DFS chooseDataNode: got # 1 IOException, will wait for 
> 2648.1420920453265 msec.
> BlockReaderFactory: I/O error constructing remote block reader.
> Network is unreachable
> DFSClient: Failed to connect to / for block, add to
> deadNodes and continue. Network is unreachable
> These are logged in the stderr of _some_ of the executors.
> I know that both things look (at least to me) more like a problem with
> HDFS and/or CDH. But we tried reading data via mapred jobs that
> essentially just manually opened the GZIP files, read them and printed
> some status info and those didn't produce any kind of error. The only
> thing we noticed was that sometimes the read() call apparently stalled
> for several minutes. But we couldn't identify a cause so far. And we
> also didn't see any errors in the CDH logs except maybe the following
> informational messages:
> Likely the client has stopped reading, disconnecting it 
> (node24.ib:50010:DataXceiver error processing READ_BLOCK operation  src: 
> / dst: /; 
> 120004 millis timeout while waiting for channel to be ready for write. ch : 
> java.nio.channels.SocketChannel[connected local=/ 
> remote=/]
> All the systems (masters and nodes) can reach each other on the
> (infiniband) network. The systems communicate only over that one network
> (i.e., datanodes only bind to one IP). The /etc/hosts files are also the
> same on all systems and were distributed via ansible. We also have a
> central DNS with the same data (also used for PTR resolution) that all
> systems are using.
> The cluster has 37 nodes and 2 masters.
> Suggestions are very welcome. :-)
> [0] -
> [1] -
> [2] -
> Best,
>        -jhz
> ---------------------------------------------------------------------
> To unsubscribe e-mail:
