[ https://issues.apache.org/jira/browse/HADOOP-18523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17631642#comment-17631642 ]
Sébastien Burton commented on HADOOP-18523:
-------------------------------------------

Hello [~ste...@apache.org], thank you for your answer! :)

We'll discuss that internally and see where to go from there.

Regards,
Sébastien

> Allow to retrieve an object from MinIO (S3 API) with a very restrictive policy
> ------------------------------------------------------------------------------
>
>                 Key: HADOOP-18523
>                 URL: https://issues.apache.org/jira/browse/HADOOP-18523
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs/s3
>            Reporter: Sébastien Burton
>            Priority: Major
>
> Hello,
>
> We're using Spark ({{"org.apache.spark:spark-[catalyst|core|sql]_2.12:3.2.2"}}) and Hadoop ({{"org.apache.hadoop:hadoop-common:3.3.3"}}) and want to retrieve an object stored in a MinIO bucket (MinIO implements the S3 API). Spark relies on Hadoop for this operation.
>
> The MinIO bucket (which we don't manage) is configured with a very restrictive policy that only allows us to retrieve the object (and nothing else), something like:
> {code:java}
> {
>   "statement": [
>     {
>       "effect": "Allow",
>       "action": [ "s3:GetObject" ],
>       "resource": [ "arn:aws:s3:::minio-bucket/object" ]
>     }
>   ]
> }{code}
> Using the AWS CLI, we can retrieve the object without any problem.
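> For reference, a plain {{GetObject}} call issued directly with the AWS SDK for Java (the same SDK that S3A uses underneath) should behave like the AWS CLI and need only the {{s3:GetObject}} permission. The snippet below is only a minimal sketch, not the code we actually run; the endpoint URL, region and credentials are placeholders:
> {code:java}
> // Minimal sketch (illustrative only): fetch the object with a single GET on the
> // exact key, so no bucket listing is involved. Endpoint, region and credentials
> // are placeholders.
> import com.amazonaws.auth.AWSStaticCredentialsProvider;
> import com.amazonaws.auth.BasicAWSCredentials;
> import com.amazonaws.client.builder.AwsClientBuilder;
> import com.amazonaws.services.s3.AmazonS3;
> import com.amazonaws.services.s3.AmazonS3ClientBuilder;
> import com.amazonaws.services.s3.model.S3Object;
>
> public class GetObjectOnly {
>     public static void main(String[] args) throws Exception {
>         AmazonS3 s3 = AmazonS3ClientBuilder.standard()
>                 .withEndpointConfiguration(
>                         new AwsClientBuilder.EndpointConfiguration("https://minio.example.com", "us-east-1"))
>                 .withCredentials(new AWSStaticCredentialsProvider(
>                         new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY")))
>                 .withPathStyleAccessEnabled(true)
>                 .build();
>
>         // One GetObject request on the exact key covered by the policy.
>         try (S3Object object = s3.getObject("minio-bucket", "object")) {
>             object.getObjectContent().transferTo(System.out);
>         }
>     }
> }
> {code}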
> When we try with Spark's {{DataFrameReader}}, we receive an HTTP 403 response (access denied) from MinIO:
> {code:java}
> java.nio.file.AccessDeniedException: s3a://minio-bucket/object: getFileStatus on s3a://minio-bucket/object: com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied. (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; ...
>   at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:255)
>   at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:175)
>   at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:3858)
>   at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:3688)
>   at org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$isDirectory$35(S3AFileSystem.java:4724)
>   at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.lambda$trackDurationOfOperation$5(IOStatisticsBinding.java:499)
>   at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.trackDuration(IOStatisticsBinding.java:444)
>   at org.apache.hadoop.fs.s3a.S3AFileSystem.trackDurationAndSpan(S3AFileSystem.java:2337)
>   at org.apache.hadoop.fs.s3a.S3AFileSystem.trackDurationAndSpan(S3AFileSystem.java:2356)
>   at org.apache.hadoop.fs.s3a.S3AFileSystem.isDirectory(S3AFileSystem.java:4722)
>   at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:54)
>   at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:370)
>   at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:274)
>   at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:245)
>   at scala.Option.getOrElse(Option.scala:189)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:245)
>   at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:571)
>   at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:481)
>   at com.soprabanking.dxp.pure.bf.dataaccess.S3Storage.loadDataset(S3Storage.java:55)
>   at com.soprabanking.dxp.pure.bf.business.step.DatasetLoader.lambda$doLoad$3(DatasetLoader.java:148)
>   at reactor.core.publisher.MonoFlatMap$FlatMapMain.onNext(MonoFlatMap.java:125)
>   at reactor.core.publisher.Operators$MonoSubscriber.complete(Operators.java:1816)
>   at reactor.core.publisher.MonoFlatMap$FlatMapMain.onNext(MonoFlatMap.java:151)
>   at reactor.core.publisher.Operators$MonoSubscriber.complete(Operators.java:1816)
>   at reactor.core.publisher.MonoFlatMap$FlatMapInner.onNext(MonoFlatMap.java:249)
>   at reactor.core.publisher.Operators$MonoSubscriber.complete(Operators.java:1816)
>   at reactor.core.publisher.MonoZip$ZipCoordinator.signal(MonoZip.java:251)
>   at reactor.core.publisher.MonoZip$ZipInner.onNext(MonoZip.java:336)
>   at reactor.core.publisher.Operators$ScalarSubscription.request(Operators.java:2398)
>   at reactor.core.publisher.MonoZip$ZipInner.onSubscribe(MonoZip.java:325)
>   at reactor.core.publisher.MonoJust.subscribe(MonoJust.java:55)
>   at reactor.core.publisher.Mono.subscribe(Mono.java:4400)
>   at reactor.core.publisher.MonoZip.subscribe(MonoZip.java:128)
>   at reactor.core.publisher.MonoFlatMap$FlatMapMain.onNext(MonoFlatMap.java:157)
>   at reactor.core.publisher.FluxSwitchIfEmpty$SwitchIfEmptySubscriber.onNext(FluxSwitchIfEmpty.java:74)
>   at reactor.core.publisher.FluxFilter$FilterSubscriber.onNext(FluxFilter.java:113)
>   at reactor.core.publisher.FluxSwitchIfEmpty$SwitchIfEmptySubscriber.onNext(FluxSwitchIfEmpty.java:74)
>   at reactor.core.publisher.FluxFilterFuseable$FilterFuseableSubscriber.onNext(FluxFilterFuseable.java:118)
>   at reactor.core.publisher.MonoPeekTerminal$MonoTerminalPeekSubscriber.onNext(MonoPeekTerminal.java:180)
>   at reactor.core.publisher.FluxPeekFuseable$PeekFuseableConditionalSubscriber.onNext(FluxPeekFuseable.java:503)
>   at reactor.core.publisher.FluxHide$SuppressFuseableSubscriber.onNext(FluxHide.java:137)
>   at reactor.core.publisher.Operators$MonoInnerProducerBase.complete(Operators.java:2664)
>   at reactor.core.publisher.MonoSingle$SingleSubscriber.onComplete(MonoSingle.java:180)
>   at com.jakewharton.retrofit2.adapter.reactor.BodyFlux$BodySubscriber.onComplete(BodyFlux.java:80)
>   at reactor.core.publisher.StrictSubscriber.onComplete(StrictSubscriber.java:123)
>   at reactor.core.publisher.FluxCreate$BaseSink.complete(FluxCreate.java:439)
>   at reactor.core.publisher.FluxCreate$LatestAsyncSink.drain(FluxCreate.java:945)
>   at reactor.core.publisher.FluxCreate$LatestAsyncSink.complete(FluxCreate.java:892)
>   at reactor.core.publisher.FluxCreate$SerializedFluxSink.drainLoop(FluxCreate.java:240)
>   at reactor.core.publisher.FluxCreate$SerializedFluxSink.drain(FluxCreate.java:206)
>   at reactor.core.publisher.FluxCreate$SerializedFluxSink.complete(FluxCreate.java:197)
>   at com.jakewharton.retrofit2.adapter.reactor.EnqueueSinkConsumer$DisposableCallback.onResponse(EnqueueSinkConsumer.java:52)
>   at retrofit2.OkHttpCall$1.onResponse(OkHttpCall.java:161)
>   at brave.okhttp3.TraceContextCall$TraceContextCallback.onResponse(TraceContextCall.java:95)
>   at okhttp3.internal.connection.RealCall$AsyncCall.run(RealCall.kt:519)
>   at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
>   at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>   at java.base/java.lang.Thread.run(Unknown Source){code}
> The credentials are set correctly, but under the hood Hadoop calls MinIO to check whether the object is a directory (which we don't want), and that check is what fails.
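> For completeness, here is a simplified sketch of how we point Spark/Hadoop at MinIO and trigger the read (our real code lives in {{S3Storage.loadDataset}}; the class name, endpoint and credential values below are placeholders, and the {{fs.s3a.*}} keys are the standard S3A settings):
> {code:java}
> // Simplified, illustrative reproduction of the failing read path.
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.SparkSession;
>
> public class MinioReadRepro {
>     public static void main(String[] args) {
>         SparkSession spark = SparkSession.builder()
>                 .appName("minio-read")
>                 .master("local[*]")
>                 // Standard S3A settings, passed through Spark's spark.hadoop.* prefix.
>                 // Endpoint and credentials are placeholders.
>                 .config("spark.hadoop.fs.s3a.endpoint", "https://minio.example.com")
>                 .config("spark.hadoop.fs.s3a.path.style.access", "true")
>                 .config("spark.hadoop.fs.s3a.access.key", "ACCESS_KEY")
>                 .config("spark.hadoop.fs.s3a.secret.key", "SECRET_KEY")
>                 .getOrCreate();
>
>         // DataFrameReader.csv() goes through FileStreamSink.hasMetadata(), which calls
>         // S3AFileSystem.isDirectory() -> getFileStatus() and fails with the 403 above.
>         Dataset<Row> dataset = spark.read().csv("s3a://minio-bucket/object");
>         dataset.show();
>     }
> }
> {code}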
> We can retrieve the object by changing MinIO's policy to something like the following - but changing the policy isn't an option for us:
> {code:java}
> {
>   "statement": [
>     {
>       "effect": "Allow",
>       "action": [ "s3:GetObject" ],
>       "resource": [ "arn:aws:s3:::minio-bucket/object" ]
>     },
>     {
>       "effect": "Allow",
>       "action": [ "s3:ListBucket" ],
>       "resource": [ "arn:aws:s3:::minio-bucket/" ],
>       "condition": {
>         "StringLike": {
>           "s3:prefix": [ "object", "object/" ]
>         }
>       }
>     }
>   ]
> }{code}
> We couldn't find any way to configure Hadoop so that it just attempts to retrieve the object. Reading HADOOP-17454, it feels like it could be possible to provide options to fine-tune Hadoop's behaviour.
> Are there such options? If not, would this be a reasonable behaviour to add?
>
> Regards,
> Sébastien
>
> Please note this is my first time here: I hope I picked the right project, issue type and priority (I tried my best looking around). If not, I'm very sorry about that.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org