Hi maqy,

The exception is caused by the connection being closed; one possible reason is a datanode-side timeout, given that we have not found any problem on the Spark side before the exception. So we could try to find more clues in the datanode log.

Best wishes,
Jinxin

xiaoxingstack
Email: xiaoxingst...@gmail.com
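If the datanode log does point at a timeout, one option (a sketch only, not something verified in this thread) is to raise the HDFS socket timeouts from the Spark side before the job touches HDFS. The property names are standard Hadoop 2.x settings; the values are illustrative, and `spark` is assumed to be an existing SparkSession:

```scala
// Sketch: raise HDFS client/datanode socket timeouts via Spark's Hadoop
// configuration. Values below are examples, not recommendations.
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("dfs.client.socket-timeout", "300000")          // client read timeout, ms
hadoopConf.set("dfs.datanode.socket.write.timeout", "600000")  // datanode write timeout, ms
```

The same properties can also be set cluster-wide in hdfs-site.xml if the timeout turns out to affect more than this one job.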
On 2020-04-22 23:40, maqy wrote:

Hi Jinxin,

The Spark web UI shows that all tasks completed successfully; this error appears in the shell:
java.io.EOFException: Premature EOF: no length prefix available
    at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:244)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:244)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:733)
More information can be seen here:
https://stackoverflow.com/questions/61202566/spark-sql-datasetrow-collect-to-driver-throw-java-io-eofexception-premature-e
I suspect there is a problem with deserialization, because after the web UI shows that the collect() tasks have completed, the memory occupied by the spark-submit process keeps increasing. After a few minutes the memory usage stops growing, and a few minutes after that the shell reports this error.

Best regards,
maqy

From: Tang Jinxin
Sent: 2020-04-22 23:16
To: maqy
Cc: user@spark.apache.org
Subject: Re: [Spark SQL] [Beginner] Dataset[Row] collect to driver throws java.io.EOFException: Premature EOF: no length prefix available

Maybe the datanode stopped the data transfer due to a timeout. Could you please provide the exception stack?

xiaoxingstack
Email: xiaoxingst...@gmail.com

On 2020-04-22 19:53, maqy wrote:

Today I met the same problem using
rdd.collect(); the RDD's element type is Tuple2[Int, Int]. The problem appears when the amount of data reaches about 100 GB. I guess there may be something wrong with deserialization. Has anyone else encountered this problem?

Best regards,
maqy

From: maqy1...@outlook.com
Sent: 2020-04-20 10:33
To: user@spark.apache.org
Subject: [Spark SQL] [Beginner] Dataset[Row] collect to driver throws java.io.EOFException: Premature EOF: no length prefix available

Hi all,
I get a Dataset[Row] through the following code:

    val df: Dataset[Row] = spark.read.format("csv").schema(schema).load("hdfs://master:9000/mydata")

After that I want to collect it to the driver:

    val df_rows: Array[Row] = df.collect()

The Spark web UI shows that all tasks have run successfully, but the application does not stop. After more than ten minutes, this error is generated in the shell:

    java.io.EOFException: Premature EOF: no length prefix available

Environment: Spark 2.4.3, Hadoop 2.7.7. Total rows of data: about 800,000,000 (12 GB).

More detailed information can be seen here:
https://stackoverflow.com/questions/61202566/spark-sql-datasetrow-collect-to-driver-throw-java-io-eofexception-premature-e
Does anyone know the reason?

Best regards,
maqy
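Not part of the original thread, but since the failure happens while the collected result is being transferred back to the driver, a common way to sidestep it is to avoid materializing all ~800 million rows in driver memory at once. `Dataset.toLocalIterator()` is a standard Spark API that fetches one partition at a time; the sketch below assumes the same `df` as above:

```scala
import org.apache.spark.sql.Row
import scala.collection.JavaConverters._

// Sketch: stream the result partition by partition instead of collect(),
// so the driver never holds the full 12 GB result at once.
val rows: Iterator[Row] = df.toLocalIterator().asScala
rows.foreach { row =>
  // process each Row here
}
```

This trades one large transfer for many small ones; whether it also avoids the datanode-timeout path discussed above would need to be tested.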