Hi, I think that we should stop using S3A and use S3 instead.
Please read about EMRFS and the advantages it provides :)

Regards,
Gourav Sengupta

On Thu, Apr 30, 2020 at 12:54 AM Aniruddha P Tekade <ateka...@binghamton.edu> wrote:

> Hello,
>
> I am trying to run a Spark job that writes data into a bucket on a
> custom S3 endpoint, but it is stuck at this line of output and the job
> does not move forward at all:
>
> 20/04/29 16:03:59 INFO SharedState: Setting hive.metastore.warehouse.dir
> ('null') to the value of spark.sql.warehouse.dir
> ('file:/Users/abc/IdeaProjects/qct-air-detection/spark-warehouse/').
> 20/04/29 16:03:59 INFO SharedState: Warehouse path is
> 'file:/Users/abc/IdeaProjects/qct-air-detection/spark-warehouse/'.
> 20/04/29 16:04:01 WARN MetricsConfig: Cannot locate configuration: tried
> hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
> 20/04/29 16:04:02 INFO MetricsSystemImpl: Scheduled Metric snapshot period at
> 10 second(s).
> 20/04/29 16:04:02 INFO MetricsSystemImpl: s3a-file-system metrics system
> started
>
> After a long wait it shows this:
>
> org.apache.hadoop.fs.s3a.AWSClientIOException: doesBucketExist on
> test-bucket: com.amazonaws.SdkClientException: Unable to execute HTTP
> request: Connect to s3-region0.mycloud.com:443
> [s3-region0.mycloud.com/10.10.3.72] failed: Connection refused (Connection
> refused): Unable to execute HTTP request: Connect to
> s3-region0.mycloud.com:443 [s3-region0.mycloud.com/10.10.3.72] failed:
> Connection refused (Connection refused)
>
> However, I am able to access this bucket from the AWS CLI on the same
> machine, so I don't understand why it says it is unable to execute the
> HTTP request.
> I am using:
>
> spark 3.0.0-preview2
> hadoop-aws 3.2.0
> aws-java-sdk-bundle 1.11.375
>
> My Spark code sets the following properties on the Hadoop configuration:
>
> spark.sparkContext.hadoopConfiguration.set("fs.s3a.impl",
>   "org.apache.hadoop.fs.s3a.S3AFileSystem")
> spark.sparkContext.hadoopConfiguration.set("fs.s3a.endpoint", ENDPOINT)
> spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", ACCESS_KEY)
> spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", SECRET_KEY)
> spark.sparkContext.hadoopConfiguration.set("fs.s3a.path.style.access", "true")
>
> Can someone help me understand what is wrong here? Is there anything
> else I need to configure? The custom S3 endpoint and its keys are valid
> and work from an AWS CLI profile. What is wrong with the Scala code here?
>
> val dataStreamWriter: DataStreamWriter[Row] = PM25quality
>   .select(
>     dayofmonth(current_date()) as "day",
>     month(current_date()) as "month",
>     year(current_date()) as "year")
>   .writeStream
>   .format("parquet")
>   .option("checkpointLocation", "/Users/abc/Desktop/qct-checkpoint/")
>   .outputMode("append")
>   .trigger(Trigger.ProcessingTime("15 seconds"))
>   .partitionBy("year", "month", "day")
>   .option("path", "s3a://test-bucket")
>
> val streamingQuery: StreamingQuery = dataStreamWriter.start()
>
> Aniruddha
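For reference, the properties quoted above are the standard S3A ones, and the "Connection refused" on port 443 suggests the client and the store disagree about transport rather than credentials: the AWS CLI may be talking plain HTTP or going through a proxy that S3A does not pick up. Below is a minimal sketch collecting the S3A settings commonly needed for a private, S3-compatible endpoint. The property keys are real hadoop-aws configuration names; the specific values (SSL off, the example host) are assumptions to illustrate the idea, not a statement about this particular store:

```scala
// Sketch: S3A settings for a private S3-compatible endpoint.
// The keys are genuine hadoop-aws properties; the values are illustrative.
object S3ASettings {
  def forCustomEndpoint(endpoint: String,
                        accessKey: String,
                        secretKey: String,
                        useSsl: Boolean): Map[String, String] = Map(
    "fs.s3a.impl"                   -> "org.apache.hadoop.fs.s3a.S3AFileSystem",
    "fs.s3a.endpoint"               -> endpoint,        // host[:port], no scheme prefix
    "fs.s3a.access.key"             -> accessKey,
    "fs.s3a.secret.key"             -> secretKey,
    "fs.s3a.path.style.access"      -> "true",          // most non-AWS stores need this
    "fs.s3a.connection.ssl.enabled" -> useSsl.toString  // "false" if the store serves plain HTTP
  )
}

// Applying them to an existing SparkSession `spark` would look like:
// S3ASettings.forCustomEndpoint("s3-region0.mycloud.com", ACCESS_KEY, SECRET_KEY, useSsl = false)
//   .foreach { case (k, v) => spark.sparkContext.hadoopConfiguration.set(k, v) }
```

If the CLI works only because of a proxy (e.g. HTTP_PROXY in the shell), S3A will not inherit it; it has to be told explicitly via `fs.s3a.proxy.host` and `fs.s3a.proxy.port`. Comparing the CLI profile's endpoint/protocol against what S3A is actually dialing is usually the quickest way to explain a refused connection like the one above.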