Re: Spark job stuck at s3a-file-system metrics system started
Hi,

I think you should stop using s3a and use s3 instead. Please read up on EMRFS and the advantages it provides.

Regards,
Gourav Sengupta

On Thu, Apr 30, 2020 at 12:54 AM Aniruddha P Tekade wrote:
> [original message quoted in full]
Re: Spark job stuck at s3a-file-system metrics system started
Hi there,

I read your question and I believe you are on the right path. One thing worth checking: are you able to connect to the s3 bucket from your worker nodes? You mentioned that you can reach it from your own machine, but since the write happens on the workers, it is worth verifying the connection from there as well.

Best,
Shobhit G
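To test reachability from the executors, one option is a small TCP probe run inside a Spark task. This is a hypothetical sketch (the helper name canConnect and the host/port values are assumptions, not from this thread); it uses plain java.net, so the same check also works from the driver:

```scala
import java.net.{InetSocketAddress, Socket}

// Returns true if a TCP connection to host:port succeeds within timeoutMs.
// A "Connection refused" endpoint will return false here too.
def canConnect(host: String, port: Int, timeoutMs: Int): Boolean = {
  val socket = new Socket()
  try {
    socket.connect(new InetSocketAddress(host, port), timeoutMs)
    true
  } catch {
    case _: Exception => false
  } finally {
    socket.close()
  }
}

// From the driver:
//   canConnect("s3-region0.mycloud.com", 443, 5000)
//
// From the executors (assumes `spark` is an active SparkSession):
//   val results = spark.sparkContext
//     .parallelize(1 to 10)
//     .map(_ => canConnect("s3-region0.mycloud.com", 443, 5000))
//     .collect()
//   println(s"tasks that reached the endpoint: ${results.count(identity)}/10")
```

If the driver can connect but the executor tasks cannot, the problem is network routing or firewall rules on the worker hosts rather than the Spark configuration.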
Spark job stuck at s3a-file-system metrics system started
Hello,

I am trying to run a Spark job that writes data to a bucket behind a custom s3 endpoint, but the job is stuck at this line of output and does not move forward at all:

    20/04/29 16:03:59 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/Users/abc/IdeaProjects/qct-air-detection/spark-warehouse/').
    20/04/29 16:03:59 INFO SharedState: Warehouse path is 'file:/Users/abc/IdeaProjects/qct-air-detection/spark-warehouse/'.
    20/04/29 16:04:01 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
    20/04/29 16:04:02 INFO MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
    20/04/29 16:04:02 INFO MetricsSystemImpl: s3a-file-system metrics system started

After a long wait it shows this:

    org.apache.hadoop.fs.s3a.AWSClientIOException: doesBucketExist on test-bucket:
    com.amazonaws.SdkClientException: Unable to execute HTTP request:
    Connect to s3-region0.mycloud.com:443 [s3-region0.mycloud.com/10.10.3.72]
    failed: Connection refused (Connection refused)

However, I am able to access this bucket from the aws cli on the same machine, so I don't understand why it says it is unable to execute the HTTP request.
I am using:

    spark 3.0.0-preview2
    hadoop-aws 3.2.0
    aws-java-sdk-bundle 1.11.375

My Spark code sets the following properties on the Hadoop configuration:

    spark.sparkContext.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    spark.sparkContext.hadoopConfiguration.set("fs.s3a.endpoint", ENDPOINT)
    spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", ACCESS_KEY)
    spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", SECRET_KEY)
    spark.sparkContext.hadoopConfiguration.set("fs.s3a.path.style.access", "true")

Can someone help me understand what is wrong here? Is there anything else I need to configure? The custom s3 endpoint and its keys are valid and work from an aws cli profile. What is wrong with the Scala code here?

    val dataStreamWriter: DataStreamWriter[Row] = PM25quality
      .select(
        dayofmonth(current_date()) as "day",
        month(current_date()) as "month",
        year(current_date()) as "year")
      .writeStream
      .format("parquet")
      .option("checkpointLocation", "/Users/abc/Desktop/qct-checkpoint/")
      .outputMode("append")
      .trigger(Trigger.ProcessingTime("15 seconds"))
      .partitionBy("year", "month", "day")
      .option("path", "s3a://test-bucket")

    val streamingQuery: StreamingQuery = dataStreamWriter.start()

Aniruddha
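"Connection refused" means the TCP connection itself is rejected (wrong port, https vs. http, proxy, or firewall) rather than bad credentials, which can explain why the aws cli, with its own endpoint and proxy settings, still works. If the custom endpoint serves plain HTTP, or traffic must go through a proxy, the following s3a properties may be relevant. This is a sketch assuming Hadoop 3.2 property names; all values are placeholders:

```
# spark-defaults.conf (equivalently via hadoopConfiguration.set,
# dropping the "spark.hadoop." prefix) -- values are placeholders
spark.hadoop.fs.s3a.endpoint                 s3-region0.mycloud.com
spark.hadoop.fs.s3a.path.style.access        true
# if the endpoint serves plain HTTP instead of HTTPS:
spark.hadoop.fs.s3a.connection.ssl.enabled   false
# if traffic must go through a proxy:
spark.hadoop.fs.s3a.proxy.host               proxy.example.com
spark.hadoop.fs.s3a.proxy.port               8080
# fail faster while debugging instead of retrying for minutes:
spark.hadoop.fs.s3a.connection.timeout       5000
spark.hadoop.fs.s3a.attempts.maximum         1
```

Lowering the retry count while debugging makes the underlying connection error surface quickly instead of the job appearing to hang after "s3a-file-system metrics system started".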