Hello all,
I'm hoping someone can point me in the right direction on this issue. I'm trying
to write data out from Spark on a Hortonworks (now Cloudera) HDP cluster. I ssh
directly to the first datanode and launch PySpark with the following command;
however, the job always fails no matter how much memory I give the YARN
containers and YARN queues. Any suggestions?
pyspark --queue default --executor-memory 24G
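For completeness, my understanding is the same settings can also be set as Spark
conf keys from inside a session. A minimal sketch of what I believe is the
equivalent form (the memoryOverhead key and the 4g value are my guesses, not
something taken from our cluster config):

    from pyspark.sql import SparkSession

    # Sketch, assuming Spark 2.3+ conf key names; adjust values to the cluster.
    spark = (SparkSession.builder
             .appName("convert_parquet")                      # hypothetical app name
             .config("spark.yarn.queue", "default")
             .config("spark.executor.memory", "24g")
             .config("spark.executor.memoryOverhead", "4g")   # off-heap headroom YARN also counts
             .getOrCreate())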
--
HDFS_RAW = "/HDFS/Data/Test/Original/MyData_data/"
#HDFS_OUT = "/HDFS/Data/Test/Processed/Convert_parquet/Output"
HDFS_OUT = "/tmp"
ENCODING = "utf-16"

fileList1 = [
    'Test _2003.txt'
]

from pyspark.sql.functions import regexp_replace, col

for f in fileList1:
    fname = f
    fname_noext = fname.split('.')[0]
    # Pipe-delimited UTF-16 file; multiLine/wholeFile because records contain embedded newlines
    df = (spark.read
          .option("delimiter", "|")
          .option("encoding", ENCODING)
          .option("multiLine", True)
          .option("wholeFile", "true")
          .csv('{}/{}'.format(HDFS_RAW, fname), header=True))
    lastcol = df.columns[-1]
    print('showing {}'.format(fname))
    # The last header carries a trailing \r; strip it from the column name and the data
    if '\r' in lastcol:
        lastcol = lastcol.replace('\r', '')
        df = df.withColumn(lastcol,
                           regexp_replace(col("{}\r".format(lastcol)), "[\r]", "")
                          ).drop('{}\r'.format(lastcol))
    df.write.format('parquet').mode('overwrite').save("{}/{}".format(HDFS_OUT, fname_noext))
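One thing I wondered while writing this up: my understanding is that with
multiLine/wholeFile enabled, the CSV reader cannot split the file on newlines,
so a single task has to parse the entire file, and no container size will help
if one file is larger than what an executor can hold. A quick sanity check I
could run from the same session (a sketch; this goes through Spark's private
_jvm handle, so treat it as illustrative only):

    # Check how big the file a single task must hold actually is
    hadoop_conf = spark._jsc.hadoopConfiguration()
    fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(hadoop_conf)
    p = spark._jvm.org.apache.hadoop.fs.Path('{}/{}'.format(HDFS_RAW, fileList1[0]))
    size_gib = fs.getFileStatus(p).getLen() / float(1024 ** 3)
    print('{} is {:.2f} GiB'.format(fileList1[0], size_gib))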
Every run ends with the same failure:

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage
1.0 (TID 4, DataNode01.mydomain.com, executor 5): ExecutorLostFailure (executor
5 exited caused by one of the running tasks) Reason: Container marked as
failed: container_e331_1621375512548_0021_01_000006 on host:
DataNode01.mydomain.com. Exit status: 143. Diagnostics: [2021-05-19
18:09:06.392]Container killed on request. Exit code is 143
[2021-05-19 18:09:06.413]Container exited with a non-zero exit code 143.
[2021-05-19 18:09:06.414]Killed by external signal
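My reading of exit code 143 is that YARN itself terminated the container
(SIGTERM, matching the "Killed by external signal" line above), which usually
means the container went over its memory limit rather than my code crashing.
To rule out the settings silently not applying, I print what the running
session actually got (a sketch; on older Spark the overhead key was
spark.yarn.executor.memoryOverhead, so the second lookup may come back unset):

    # Verify the memory settings the running session actually has
    conf = spark.sparkContext.getConf()
    print(conf.get("spark.executor.memory", "not set"))
    print(conf.get("spark.executor.memoryOverhead", "not set"))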
Thanks!
Clay