Re: Spark job stuck at s3a-file-system metrics system started

2020-05-01 Thread Gourav Sengupta
Hi,

I think that we should stop using s3a and use s3 instead.

Please have a read about EMRFS and the fantastic advantages it provides :)
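
Something like this, for example (a rough sketch assuming the job runs on an
EMR cluster, where the s3:// scheme is served by EMRFS and credentials come
from the instance profile; the bucket paths are hypothetical):

// On EMR, EMRFS backs the s3:// scheme, so none of the fs.s3a.*
// endpoint/credential settings are needed.
PM25quality
  .writeStream
  .format("parquet")
  .option("checkpointLocation", "s3://test-bucket/checkpoints/")  // hypothetical path
  .option("path", "s3://test-bucket/data/")                       // s3://, not s3a://
  .start()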


Regards,
Gourav Sengupta

On Thu, Apr 30, 2020 at 12:54 AM Aniruddha P Tekade wrote:

> Hello,
>
> I am trying to run a Spark job that writes data to a bucket behind a
> custom s3 endpoint, but it is stuck at this line of output and the job is
> not moving forward at all -
>
> [...]


Re: Spark job stuck at s3a-file-system metrics system started

2020-04-30 Thread Abhisheks
Hi there,

Read your question and I do believe you are on the right path. But one thing
worth checking is whether you are able to connect to the s3 bucket from your
worker nodes.

I did read that you are able to do it from your machine, but since the write
happens at the worker end, it might be worth checking the connection from
there.
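
If helpful, here is a rough sketch (endpoint and port taken from your stack
trace) that runs the connection attempt inside Spark tasks, so it executes on
the executors rather than on the driver:

import java.net.{InetSocketAddress, Socket}

// Probe TCP connectivity to the S3 endpoint from the executors.
val endpoint = "s3-region0.mycloud.com"  // from the stack trace
val results = spark.sparkContext
  .parallelize(1 to 100, numSlices = 20)
  .map { _ =>
    val host = java.net.InetAddress.getLocalHost.getHostName
    try {
      val socket = new Socket()
      socket.connect(new InetSocketAddress(endpoint, 443), 5000)  // 5s timeout
      socket.close()
      s"$host: reachable"
    } catch {
      case e: Exception => s"$host: ${e.getClass.getSimpleName}: ${e.getMessage}"
    }
  }
  .distinct()
  .collect()

results.foreach(println)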

Best,
Shobhit G






Spark job stuck at s3a-file-system metrics system started

2020-04-29 Thread Aniruddha P Tekade
Hello,

I am trying to run a Spark job that writes data to a bucket behind a
custom s3 endpoint, but it is stuck at this line of output and the job is
not moving forward at all -

20/04/29 16:03:59 INFO SharedState: Setting
hive.metastore.warehouse.dir ('null') to the value of
spark.sql.warehouse.dir
('file:/Users/abc/IdeaProjects/qct-air-detection/spark-warehouse/').
20/04/29 16:03:59 INFO SharedState: Warehouse path is
'file:/Users/abc/IdeaProjects/qct-air-detection/spark-warehouse/'.
20/04/29 16:04:01 WARN MetricsConfig: Cannot locate configuration:
tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
20/04/29 16:04:02 INFO MetricsSystemImpl: Scheduled Metric snapshot
period at 10 second(s).
20/04/29 16:04:02 INFO MetricsSystemImpl: s3a-file-system metrics system started

After a long wait it shows this -

org.apache.hadoop.fs.s3a.AWSClientIOException: doesBucketExist on
test-bucket: com.amazonaws.SdkClientException: Unable to execute HTTP
request: Connect to s3-region0.mycloud.com:443
[s3-region0.mycloud.com/10.10.3.72] failed: Connection refused
(Connection refused): Unable to execute HTTP request: Connect to
s3-region0.mycloud.com:443 [s3-region0.mycloud.com/10.10.3.72] failed:
Connection refused (Connection refused)

However, I am able to access this bucket with the aws cli from the same
machine, so I don't understand why it says it is unable to execute the
HTTP request.

I am using -

spark   3.0.0-preview2
hadoop-aws  3.2.0
aws-java-sdk-bundle 1.11.375
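
For reference, the corresponding build wiring looks roughly like this (a
sketch; hadoop-aws 3.2.0 is built against aws-java-sdk-bundle 1.11.375, so
those two versions need to match):

// build.sbt (sketch)
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"          % "3.0.0-preview2" % "provided",
  "org.apache.hadoop" % "hadoop-aws"          % "3.2.0",
  "com.amazonaws"     % "aws-java-sdk-bundle" % "1.11.375"
)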

My spark code has the following properties set for the hadoop configuration -

spark.sparkContext.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.endpoint", ENDPOINT)
spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", ACCESS_KEY)
spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", SECRET_KEY)
spark.sparkContext.hadoopConfiguration.set("fs.s3a.path.style.access", "true")

Can someone help me understand what is wrong here? Is there anything
else I need to configure? The custom s3 endpoint and its keys are valid and
working from the aws cli profile. What is wrong with the scala code here?

val dataStreamWriter: DataStreamWriter[Row] = PM25quality
  .select(dayofmonth(current_date()) as "day",
          month(current_date()) as "month",
          year(current_date()) as "year")
  .writeStream
  .format("parquet")
  .option("checkpointLocation", "/Users/abc/Desktop/qct-checkpoint/")
  .outputMode("append")
  .trigger(Trigger.ProcessingTime("15 seconds"))
  .partitionBy("year", "month", "day")
  .option("path", "s3a://test-bucket")

val streamingQuery: StreamingQuery = dataStreamWriter.start()

Aniruddha
---