Maybe contact Oracle support? Have you perhaps accidentally configured some firewall rules? Routing issues? Maybe only one of the nodes...
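To rule out the routing angle, one quick test would be to probe every datanode's transfer port from inside the executors themselves. The following is only a rough sketch: the datanode list, the port (50010 is the CDH 5 default for dfs.datanode.address), and the partition count are all assumptions you would have to adapt to your cluster.

import java.net.{InetAddress, InetSocketAddress, Socket}
import org.apache.spark.sql.SparkSession

object DatanodeConnectivityCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("dn-connectivity-check").getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical datanode list -- replace with all 37 node addresses.
    val datanodes = Seq("10.12.1.20", "10.12.1.24", "10.12.1.26")
    val port = 50010  // CDH 5 default dfs.datanode.address port (assumption)

    // Spread the probe over many partitions so it runs on every executor.
    val failures = sc.parallelize(1 to 200, 200).mapPartitions { _ =>
      val src = InetAddress.getLocalHost.getHostName
      datanodes.iterator.map { dn =>
        val s = new Socket()
        val ok =
          try { s.connect(new InetSocketAddress(dn, port), 5000); true }
          catch { case _: Exception => false }
          finally { s.close() }
        (src, dn, ok)
      }
    }.filter(!_._3).distinct().collect()

    failures.foreach { case (src, dn, _) => println(s"FAILED: $src -> $dn:$port") }
    spark.stop()
  }
}

Any (source, destination) pair printed as FAILED would point at a host-specific firewall or routing problem rather than at HDFS itself.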
> On 9. Nov 2017, at 20:04, Jan-Hendrik Zab <z...@l3s.de> wrote:
>
> Hello!
>
> This might not be the perfect list for the issue, but I tried user@
> previously with the same issue, with a bit less information, to no
> avail. So I'm hoping someone here can point me in the right direction.
>
> We're using Spark 2.2 on CDH 5.13 (Hadoop 2.6 with patches) and a lot of
> our jobs fail, even when the jobs are super simple. For instance: [0]
>
> We get two kinds of "errors". In the first, a task is actually marked as
> failed in the web UI [1]. Basically:
>
> org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block:
> BP-1464899749-10.10.0.1-1378382027840:blk_1084837359_1099539407729
> file=/data/ia/derivatives/de/links/TA/part-68879.gz
>
> See the link for the stack trace.
>
> When I check the block via "hdfs fsck -blockId blk_1084837359", all is
> well; I can also `-cat' the data into `wc'. It's a valid GZIP file.
>
> The other kind of "error" we are getting is [2]:
>
> DFSClient: DFS chooseDataNode: got # 1 IOException, will wait for
> 2648.1420920453265 msec.
> BlockReaderFactory: I/O error constructing remote block reader.
> java.net.SocketException: Network is unreachable
> DFSClient: Failed to connect to /10.12.1.26:50010 for block, add to
> deadNodes and continue. java.net.SocketException: Network is unreachable
>
> These are logged in the stderr of _some_ of the executors.
>
> I know that both things (at least to me) look more like a problem with
> HDFS and/or CDH. But we tried reading the data via mapred jobs that
> essentially just opened the GZIP files manually, read them, and printed
> some status info, and those didn't produce any kind of error. The only
> thing we noticed was that the read() call sometimes apparently stalled
> for several minutes, but we couldn't identify a cause so far. We also
> didn't see any errors in the CDH logs, except maybe the following
> informational messages:
>
> Likely the client has stopped reading, disconnecting it
> (node24.ib:50010:DataXceiver error processing READ_BLOCK operation src:
> /10.12.1.20:46518 dst: /10.12.1.24:50010); java.net.SocketTimeoutException:
> 120004 millis timeout while waiting for channel to be ready for write. ch :
> java.nio.channels.SocketChannel[connected local=/10.12.1.24:50010
> remote=/10.12.1.20:46518]
>
> All the systems (masters and nodes) can reach each other on the
> (InfiniBand) network. The systems communicate only over that one network
> (i.e. the datanodes only bind to one IP). The /etc/hosts files are also
> the same on all systems and were distributed via Ansible. We also have a
> central DNS with the same data (and for PTR resolution) that all systems
> are using.
>
> The cluster has 37 nodes and 2 masters.
>
> Suggestions are very welcome. :-)
>
> [0] - http://www.l3s.de/~zab/link_converter.scala
> [1] - http://www.l3s.de/~zab/spark-errors-2.txt
> [2] - http://www.l3s.de/~zab/spark-errors.txt
>
> Best,
> -jhz
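PS: a standalone read check along the lines of the mapred test described in the quoted message could also be run directly against the file from the report, outside of Spark. Again only a sketch: the 10-second stall threshold and buffer size are arbitrary, and it assumes core-site.xml/hdfs-site.xml are on the classpath.

import java.util.zip.GZIPInputStream
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HdfsGzipReadCheck {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()  // picks up core-site.xml / hdfs-site.xml from the classpath
    val fs = FileSystem.get(conf)
    val path = new Path("/data/ia/derivatives/de/links/TA/part-68879.gz")

    val in = new GZIPInputStream(fs.open(path))
    val buf = new Array[Byte](1 << 20)
    var total = 0L
    var done = false
    while (!done) {
      val t0 = System.nanoTime()
      val n = in.read(buf)
      val secs = (System.nanoTime() - t0) / 1e9
      if (secs > 10)  // arbitrary stall threshold, matching the "stalled read()" observation
        println(f"read stalled for $secs%.1f s after $total bytes")
      if (n < 0) done = true else total += n
    }
    in.close()
    println(s"read $total uncompressed bytes without errors")
  }
}

If this stalls or fails when run from some hosts but not others, that again points at the network path rather than at the block itself.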