Maybe contact Oracle support?

Have you maybe accidentally configured some firewall rules? Routing issues?
Maybe only on one of the nodes...
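
One quick way to narrow that down is to probe the datanode transfer port from
inside the executors themselves, so you see exactly which executor/datanode
pairs fail. Rough, untested sketch for spark-shell (the datanode IPs and port
50010 are taken from your error messages; extend the list to all your nodes):

    import java.net.{InetSocketAddress, Socket}

    // Datanodes to probe; fill in all 37 node addresses.
    val datanodes = Seq("10.12.1.24", "10.12.1.26")
    val port = 50010  // DataNode transfer port from the error messages

    // Spread many small tasks across the cluster so that (hopefully) every
    // executor host runs the probe at least once, then report failing pairs.
    val results = sc.parallelize(1 to 200, 200).mapPartitions { _ =>
      val from = java.net.InetAddress.getLocalHost.getHostName
      datanodes.iterator.map { dn =>
        val ok = try {
          val s = new Socket()
          s.connect(new InetSocketAddress(dn, port), 2000)
          s.close()
          true
        } catch { case _: Exception => false }
        (from, dn, ok)
      }
    }.collect()

    results.filter(r => !r._3).distinct.foreach { case (from, to, _) =>
      println(s"unreachable: $from -> $to:$port")
    }

If only some executor hosts show up there, that points at a per-node firewall
or routing problem rather than at HDFS itself.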
> On 9. Nov 2017, at 20:04, Jan-Hendrik Zab <z...@l3s.de> wrote:
> 
> 
> Hello!
> 
> This might not be the perfect list for this issue, but I previously tried
> user@ with the same question (and a bit less information) to no avail.
> 
> So I'm hoping someone here can point me in the right direction.
> 
> We're using Spark 2.2 on CDH 5.13 (Hadoop 2.6 with patches) and a lot of
> our jobs fail, even when the jobs are super simple. For instance: [0]
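> 
> (The actual job is at [0]; purely for illustration, the failing jobs are of
> roughly this shape: read the gzipped text from HDFS and do something
> trivial with it.)
> 
>   // Illustrative sketch only, not the code from [0]: read the gzipped
>   // derivative files and count the records.
>   val lines = spark.read.textFile("/data/ia/derivatives/de/links/TA/*.gz")
>   println(lines.count())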
> 
> We get two kinds of "errors". In the first, a task is actually marked as
> failed in the web UI [1]. Basically:
> 
> org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block:
> BP-1464899749-10.10.0.1-1378382027840:blk_1084837359_1099539407729
> file=/data/ia/derivatives/de/links/TA/part-68879.gz
> 
> See link for the stack trace.
> 
> When I check the block via "hdfs fsck -blockId blk_1084837359", all is
> well; I can also `-cat' the data into `wc', and it is a valid GZIP file.
> 
> The other kind of "error" we are getting is [2]:
> 
> DFSClient: DFS chooseDataNode: got # 1 IOException, will wait for 
> 2648.1420920453265 msec.
> BlockReaderFactory: I/O error constructing remote block reader.
> java.net.SocketException: Network is unreachable
> DFSClient: Failed to connect to /10.12.1.26:50010 for block, add to
> deadNodes and continue. java.net.SocketException: Network is unreachable
> 
> These are logged in the stderr of _some_ of the executors.
> 
> I know that both things (at least to me) look more like a problem with
> HDFS and/or CDH. But we tried reading the data via mapred jobs that
> essentially just opened the GZIP files manually, read them, and printed
> some status info, and those didn't produce any kind of error (a minimal
> sketch of such a direct read is included after the log excerpt below).
> The only thing we noticed was that the read() call sometimes apparently
> stalled for several minutes, but we couldn't identify a cause so far. We
> also didn't see any errors in the CDH logs, except maybe the following
> informational messages:
> 
> Likely the client has stopped reading, disconnecting it 
> (node24.ib:50010:DataXceiver error processing READ_BLOCK operation  src: 
> /10.12.1.20:46518 dst: /10.12.1.24:50010); java.net.SocketTimeoutException: 
> 120004 millis timeout while waiting for channel to be ready for write. ch : 
> java.nio.channels.SocketChannel[connected local=/10.12.1.24:50010 
> remote=/10.12.1.20:46518]
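> 
> (For illustration, such a direct read boils down to something like the
> following sketch using the Hadoop FileSystem and compression-codec APIs;
> the actual test job is a bit more involved.)
> 
>   import org.apache.hadoop.conf.Configuration
>   import org.apache.hadoop.fs.{FileSystem, Path}
>   import org.apache.hadoop.io.compress.CompressionCodecFactory
>   import scala.io.Source
> 
>   // One of the files from the failing tasks.
>   val path = new Path("/data/ia/derivatives/de/links/TA/part-68879.gz")
> 
>   val conf = new Configuration()
>   val fs = FileSystem.get(conf)
>   val codec = new CompressionCodecFactory(conf).getCodec(path) // GzipCodec via the .gz suffix
>   val in = codec.createInputStream(fs.open(path))
> 
>   // Count the lines, roughly what `-cat | wc -l' does, but through the DFSClient.
>   println(Source.fromInputStream(in, "UTF-8").getLines().size)
>   in.close()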
> 
> All the systems (masters and nodes) can reach each other on the
> (InfiniBand) network. The systems communicate only over that one network
> (i.e. datanodes only bind to one IP). The /etc/hosts files are identical
> on all systems and were distributed via Ansible, and we also have a
> central DNS with the same data (used for PTR resolution as well) that all
> systems use.
> 
> The cluster has 37 nodes and 2 masters.
> 
> Suggestions are very welcome. :-)
> 
> [0] - http://www.l3s.de/~zab/link_converter.scala
> [1] - http://www.l3s.de/~zab/spark-errors-2.txt
> [2] - http://www.l3s.de/~zab/spark-errors.txt
> 
> Best,
>        -jhz
> 

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
