Hi, I've just read your message. Have you resolved the problem? If not, what are the contents of /etc/hosts?
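For reference, the report and logs quoted below show the datanode registering as 127.0.0.1 while the block pool ID embeds 192.168.2.211, which is the kind of mismatch a hosts-file entry can cause. A sane single-host layout looks roughly like this (the hostname and address here are placeholders, not taken from your setup):

    127.0.0.1       localhost
    192.168.2.211   yourhost.example.com    yourhost

If the machine's own hostname is mapped to 127.0.0.1 instead of its real address, daemons can end up binding and advertising the loopback interface.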
On Mon, Dec 19, 2016 at 10:09 PM, Michael Stratton <michael.strat...@komodohealth.com> wrote:

I don't think the issue is an empty partition, but it may not hurt to try a repartition prior to writing, just to rule it out, given the premature EOF exception.

On Mon, Dec 19, 2016 at 1:53 PM, Joseph Naegele <jnaeg...@grierforensics.com> wrote:

Thanks Michael, hdfs dfsadmin -report tells me:

Configured Capacity: 7999424823296 (7.28 TB)
Present Capacity: 7997657774971 (7.27 TB)
DFS Remaining: 7959091768187 (7.24 TB)
DFS Used: 38566006784 (35.92 GB)
DFS Used%: 0.48%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
Missing blocks (with replication factor 1): 0

-------------------------------------------------
Live datanodes (1):

Name: 127.0.0.1:50010 (localhost)
Hostname: XXX.XXX.XXX
Decommission Status : Normal
Configured Capacity: 7999424823296 (7.28 TB)
DFS Used: 38566006784 (35.92 GB)
Non DFS Used: 1767048325 (1.65 GB)
DFS Remaining: 7959091768187 (7.24 TB)
DFS Used%: 0.48%
DFS Remaining%: 99.50%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 17
Last contact: Mon Dec 19 13:00:06 EST 2016

The Hadoop exception occurs because it times out after 60 seconds in a "select" call on a java.nio.channels.SocketChannel while waiting to read from the socket. This implies the client writer isn't writing to the socket as expected, but shouldn't this all be handled by the Hadoop library within Spark?

It looks like a few similar, but rare, cases have been reported before, e.g. https://issues.apache.org/jira/browse/HDFS-770, which is *very* old.

If you're pretty sure Spark couldn't be responsible for issues at this level, I'll stick to the Hadoop mailing list.

Thanks
---
Joe Naegele
Grier Forensics

From: Michael Stratton [mailto:michael.strat...@komodohealth.com]
Sent: Monday, December 19, 2016 10:00 AM
To: Joseph Naegele <jnaeg...@grierforensics.com>
Cc: user <user@spark.apache.org>
Subject: Re: [Spark SQL] Task failed while writing rows

It seems like an issue w/ Hadoop. What do you get when you run hdfs dfsadmin -report?

Anecdotally (and w/o specifics, as it has been a while), I've generally used Parquet instead of ORC, as I've hit a bunch of random problems reading and writing ORC w/ Spark... but given ORC performs a lot better w/ Hive, it can be a pain.

On Sun, Dec 18, 2016 at 5:49 PM, Joseph Naegele <jnaeg...@grierforensics.com> wrote:

Hi all,

I'm having trouble with a relatively simple Spark SQL job. I'm using Spark 1.6.3. I have a dataset of around 500M rows (average 128 bytes per record). Its current compressed size is around 13 GB, but my problem started when it was much smaller, maybe 5 GB. This dataset is generated by performing a query on an existing ORC dataset in HDFS, selecting a subset of the existing data (i.e. removing duplicates).
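A minimal Spark 1.6 sketch of that kind of job, for concreteness (the paths are hypothetical, the dedup step is reduced to a plain dropDuplicates since the real query isn't shown in this thread, and ORC support in 1.6 comes via the Hive module, hence the HiveContext):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("dedup-orc"))
    val sqlContext = new HiveContext(sc)

    // Read the existing ORC dataset, drop duplicate rows, write the result back as ORC.
    val df = sqlContext.read.orc("hdfs:///data/input_orc")         // hypothetical input path
    val deduped = df.dropDuplicates()                               // stand-in for the real dedup query
    deduped.write.mode("overwrite").orc("hdfs:///data/output_orc") // hypothetical output path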
When I write this dataset to HDFS using ORC I get the following exceptions in the driver:

org.apache.spark.SparkException: Task failed while writing rows
Caused by: java.lang.RuntimeException: Failed to commit task
Suppressed: java.lang.IllegalArgumentException: Column has wrong number of index entries found: 0 expected: 32
Caused by: java.io.IOException: All datanodes 127.0.0.1:50010 are bad. Aborting...

This happens multiple times. The executors tell me the following a few times before the same exceptions as above:

2016-12-09 02:38:12.193 INFO DefaultWriterContainer: Using output committer class org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
2016-12-09 02:41:04.679 WARN DFSClient: DFSOutputStream ResponseProcessor exception for block BP-1695049761-192.168.2.211-1479228275669:blk_1073862425_121642
java.io.EOFException: Premature EOF: no length prefix available
    at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2203)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:176)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:867)

My HDFS datanode says:

2016-12-09 02:39:24,783 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /127.0.0.1:57836, dest: /127.0.0.1:50010, bytes: 14808395, op: HDFS_WRITE, cliID: DFSClient_attempt_201612090102_0000_m_000025_0_956624542_193, offset: 0, srvID: 1003b822-200c-4b93-9f88-f474c0b6ce4a, blockid: BP-1695049761-192.168.2.211-1479228275669:blk_1073862420_121637, duration: 93026972
2016-12-09 02:39:24,783 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-1695049761-192.168.2.211-1479228275669:blk_1073862420_121637, type=LAST_IN_PIPELINE, downstreams=0:[] terminating
2016-12-09 02:39:49,262 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: XXX.XXX.XXX.XXX:50010:DataXceiver error processing WRITE_BLOCK operation src: /127.0.0.1:57790 dst: /127.0.0.1:50010
java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/127.0.0.1:50010 remote=/127.0.0.1:57790]

It looks like the datanode is receiving the block on multiple ports (threads?) and one of the sending connections terminates early.

I was originally running 6 executors with 6 cores and 24 GB RAM each (total: 36 cores, 144 GB) and experienced many of these issues; occasionally my job would fail altogether. Lowering the number of cores appears to reduce the frequency of these errors. However, I'm now down to 4 executors with 2 cores each (total: 8 cores), which is significantly less, and I still see approximately 1-3 task failures.

Details:
- Spark 1.6.3 - Standalone
- RDD compression enabled
- HDFS replication disabled
- Everything running on the same host
- Otherwise vanilla configs for Hadoop and Spark

Does anybody have any ideas or hints? I can't imagine the problem is solely related to the number of executor cores.

Thanks,
Joe Naegele
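Two low-cost experiments follow from the thread: repartitioning before the write, as Michael suggested, and raising the HDFS socket timeouts that show up as the 60000 ms limit in the datanode log. A sketch, continuing the hypothetical names from the snippet above (the timeout values and partition count are guesses, not tested recommendations):

    // Client-side HDFS timeouts, in milliseconds. Note the datanode reads its own copy of
    // these keys from its hdfs-site.xml, so setting them here only covers the client half.
    sc.hadoopConfiguration.set("dfs.client.socket-timeout", "120000")
    sc.hadoopConfiguration.set("dfs.datanode.socket.write.timeout", "120000")

    // Repartition before writing to rule out skewed or empty partitions.
    deduped.repartition(64)
      .write.mode("overwrite")
      .orc("hdfs:///data/output_orc")   // hypothetical output path, as above

Since everything runs on one host, fewer and larger write tasks also mean fewer concurrent DFSClient pipelines hitting the single datanode, which would fit the observation that lowering executor cores reduces the error frequency.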