Hi, I think that we should stop using S3A and use S3 instead.
Please read about EMRFS and the advantages it provides :)

Regards,
Gourav Sengupta

On Thu, Apr 30, 2020 at 12:54 AM Aniruddha P Tekade <ateka...@binghamton.edu> wrote:

> Hello,
>
> I am trying to run a Spark job that writes data into a bucket on a
> custom S3 endpoint, but it is stuck at this line of output and the job
> does not move forward at all:
>
> 20/04/29 16:03:59 INFO SharedState: Setting hive.metastore.warehouse.dir
> ('null') to the value of spark.sql.warehouse.dir
> ('file:/Users/abc/IdeaProjects/qct-air-detection/spark-warehouse/').
> 20/04/29 16:03:59 INFO SharedState: Warehouse path is
> 'file:/Users/abc/IdeaProjects/qct-air-detection/spark-warehouse/'.
> 20/04/29 16:04:01 WARN MetricsConfig: Cannot locate configuration: tried
> hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
> 20/04/29 16:04:02 INFO MetricsSystemImpl: Scheduled Metric snapshot period at
> 10 second(s).
> 20/04/29 16:04:02 INFO MetricsSystemImpl: s3a-file-system metrics system
> started
>
> After a long wait it shows this:
>
> org.apache.hadoop.fs.s3a.AWSClientIOException: doesBucketExist on
> test-bucket: com.amazonaws.SdkClientException: Unable to execute HTTP
> request: Connect to s3-region0.mycloud.com:443
> [s3-region0.mycloud.com/10.10.3.72] failed: Connection refused (Connection
> refused): Unable to execute HTTP request: Connect to
> s3-region0.mycloud.com:443 [s3-region0.mycloud.com/10.10.3.72] failed:
> Connection refused (Connection refused)
>
> However, I am able to access this bucket from the AWS CLI on the same
> machine, so I don't understand why it says it is unable to execute the
> HTTP request.
> I am using:
>
> spark 3.0.0-preview2
> hadoop-aws 3.2.0
> aws-java-sdk-bundle 1.11.375
>
> My Spark code sets the following properties on the Hadoop configuration:
>
> spark.sparkContext.hadoopConfiguration.set("fs.s3a.impl",
>   "org.apache.hadoop.fs.s3a.S3AFileSystem")
> spark.sparkContext.hadoopConfiguration.set("fs.s3a.endpoint", ENDPOINT)
> spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", ACCESS_KEY)
> spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", SECRET_KEY)
> spark.sparkContext.hadoopConfiguration.set("fs.s3a.path.style.access", "true")
>
> Can someone help me understand what is wrong here? Is there anything
> else I need to configure? The custom S3 endpoint and its keys are valid
> and work from an AWS CLI profile. What is wrong with the Scala code here?
>
> val dataStreamWriter: DataStreamWriter[Row] = PM25quality
>   .select(
>     dayofmonth(current_date()) as "day",
>     month(current_date()) as "month",
>     year(current_date()) as "year")
>   .writeStream
>   .format("parquet")
>   .option("checkpointLocation", "/Users/abc/Desktop/qct-checkpoint/")
>   .outputMode("append")
>   .trigger(Trigger.ProcessingTime("15 seconds"))
>   .partitionBy("year", "month", "day")
>   .option("path", "s3a://test-bucket")
>
> val streamingQuery: StreamingQuery = dataStreamWriter.start()
>
> Aniruddha
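For reference, the properties quoted above are the standard S3A ones, and the "Connection refused" on port 443 suggests the client and the store disagree about transport rather than credentials: the AWS CLI may be talking plain HTTP or going through a proxy that S3A does not pick up. Below is a minimal sketch collecting the S3A settings commonly needed for a private, S3-compatible endpoint. The property keys are real hadoop-aws configuration names; the specific values (SSL off, the example host) are assumptions to illustrate the idea, not a statement about this particular store:

```scala
// Sketch: S3A settings for a private S3-compatible endpoint.
// The keys are genuine hadoop-aws properties; the values are illustrative.
object S3ASettings {
  def forCustomEndpoint(endpoint: String,
                        accessKey: String,
                        secretKey: String,
                        useSsl: Boolean): Map[String, String] = Map(
    "fs.s3a.impl"                   -> "org.apache.hadoop.fs.s3a.S3AFileSystem",
    "fs.s3a.endpoint"               -> endpoint,        // host[:port], no scheme prefix
    "fs.s3a.access.key"             -> accessKey,
    "fs.s3a.secret.key"             -> secretKey,
    "fs.s3a.path.style.access"      -> "true",          // most non-AWS stores need this
    "fs.s3a.connection.ssl.enabled" -> useSsl.toString  // "false" if the store serves plain HTTP
  )
}

// Applying them to an existing SparkSession `spark` would look like:
// S3ASettings.forCustomEndpoint("s3-region0.mycloud.com", ACCESS_KEY, SECRET_KEY, useSsl = false)
//   .foreach { case (k, v) => spark.sparkContext.hadoopConfiguration.set(k, v) }
```

If the CLI works only because of a proxy (e.g. HTTP_PROXY in the shell), S3A will not inherit it; it has to be told explicitly via `fs.s3a.proxy.host` and `fs.s3a.proxy.port`. Comparing the CLI profile's endpoint/protocol against what S3A is actually dialing is usually the quickest way to explain a refused connection like the one above.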