Hello,

I am trying to run a Spark job that writes data to a bucket behind a custom
S3 endpoint, but the job is stuck at this line of output and does not move
forward at all -

20/04/29 16:03:59 INFO SharedState: Setting
hive.metastore.warehouse.dir ('null') to the value of
spark.sql.warehouse.dir
('file:/Users/abc/IdeaProjects/qct-air-detection/spark-warehouse/').
20/04/29 16:03:59 INFO SharedState: Warehouse path is
'file:/Users/abc/IdeaProjects/qct-air-detection/spark-warehouse/'.
20/04/29 16:04:01 WARN MetricsConfig: Cannot locate configuration:
tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
20/04/29 16:04:02 INFO MetricsSystemImpl: Scheduled Metric snapshot
period at 10 second(s).
20/04/29 16:04:02 INFO MetricsSystemImpl: s3a-file-system metrics system started

After a long wait it shows this -

org.apache.hadoop.fs.s3a.AWSClientIOException: doesBucketExist on
test-bucket: com.amazonaws.SdkClientException: Unable to execute HTTP
request: Connect to s3-region0.mycloud.com:443
[s3-region0.mycloud.com/10.10.3.72] failed: Connection refused
(Connection refused): Unable to execute HTTP request: Connect to
s3-region0.mycloud.com:443 [s3-region0.mycloud.com/10.10.3.72] failed:
Connection refused (Connection refused)

However, I am able to access this bucket with the AWS CLI from the same
machine. "Connection refused" is a TCP-level failure, before any S3 request
is even made, so I don't understand why the JVM cannot open the connection
when the CLI can.
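
A minimal way to check raw reachability from the same JVM context is a plain
socket connect (a sketch; the host and port are copied from the exception
above):

import java.net.{InetSocketAddress, Socket}

// Try a bare TCP connect to the endpoint, bypassing the S3A client entirely.
val socket = new Socket()
try {
  socket.connect(new InetSocketAddress("s3-region0.mycloud.com", 443), 5000)
  println("TCP connect OK")
} finally {
  socket.close()
}

If this also fails, the problem is networking (DNS resolving to 10.10.3.72, a
firewall, or a proxy the CLI uses but the JVM does not) rather than the Spark
code.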

I am using -

spark               3.0.0-preview2
hadoop-aws          3.2.0
aws-java-sdk-bundle 1.11.375
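
As far as I can tell, hadoop-aws 3.2.0 was built against aws-java-sdk-bundle
1.11.375, so these versions should be compatible. For reference, the
dependencies are declared roughly like this (an sbt sketch using the standard
Maven coordinates) -

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql" % "3.0.0-preview2" % "provided",
  "org.apache.hadoop" % "hadoop-aws" % "3.2.0",
  "com.amazonaws" % "aws-java-sdk-bundle" % "1.11.375"
)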

My Spark code sets the following properties on the Hadoop configuration -

spark.sparkContext.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.endpoint", ENDPOINT)
spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", ACCESS_KEY)
spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", SECRET_KEY)
spark.sparkContext.hadoopConfiguration.set("fs.s3a.path.style.access", "true")
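
The only other S3A settings I can think of are the proxy and SSL options. One
thing I have not ruled out is that the AWS CLI picks up a proxy from the
environment (e.g. HTTPS_PROXY) while the JVM does not. A sketch of what that
would look like - proxy.example.com is a placeholder, not my real setup:

// Hypothetical proxy settings; only relevant if the endpoint sits behind a proxy.
spark.sparkContext.hadoopConfiguration.set("fs.s3a.proxy.host", "proxy.example.com")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.proxy.port", "8080")
// SSL is on by default in S3A; "false" would only apply to a plain-HTTP endpoint.
spark.sparkContext.hadoopConfiguration.set("fs.s3a.connection.ssl.enabled", "true")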

Can someone help me understand what is wrong here? Is there anything else I
need to configure? The custom S3 endpoint and its keys are valid and work
from an AWS CLI profile. What is wrong with the Scala code here?

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{current_date, dayofmonth, month, year}
import org.apache.spark.sql.streaming.{DataStreamWriter, StreamingQuery, Trigger}

val dataStreamWriter: DataStreamWriter[Row] = PM25quality
  .select(
    dayofmonth(current_date()) as "day",
    month(current_date()) as "month",
    year(current_date()) as "year")
  .writeStream
  .format("parquet")
  .option("checkpointLocation", "/Users/abc/Desktop/qct-checkpoint/")
  .outputMode("append")
  .trigger(Trigger.ProcessingTime("15 seconds"))
  .partitionBy("year", "month", "day")
  .option("path", "s3a://test-bucket")

val streamingQuery: StreamingQuery = dataStreamWriter.start()
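
Not shown above is the usual call that keeps the driver alive until the query
stops -

// Block until the streaming query terminates or fails.
streamingQuery.awaitTermination()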

Aniruddha
-----------