cj-zhukov commented on PR #18747:
URL: https://github.com/apache/datafusion/pull/18747#issuecomment-3569072227

   > I tried it out locally
   > 
   > ```shell
   > (venv) andrewlamb@Andrews-MacBook-Pro-3:~/Software/datafusion$ cargo run --example external_dependency -- dataframe_to_s3
   >     Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.33s
   >      Running `target/debug/examples/external_dependency dataframe_to_s3`
   > 
   > thread 'main' (45830553) panicked at datafusion-examples/examples/external_dependency/dataframe_to_s3.rs:51:59:
   > called `Result::unwrap()` on an `Err` value: NotPresent
   > note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
   > (venv) andrewlamb@Andrews-MacBook-Pro-3:~/Software/datafusion$ cargo run --example external_dependency -- query_aws_s3
   >     Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.19s
   >      Running `target/debug/examples/external_dependency query_aws_s3`
   > 
   > thread 'main' (45831058) panicked at datafusion-examples/examples/external_dependency/query_aws_s3.rs:45:59:
   > called `Result::unwrap()` on an `Err` value: NotPresent
   > note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
   > ```
   > 
   > I think the failures are due to the fact I don't have AWS set up
   
   Good catch - and thank you for testing this.
   
   After digging into this further (including the discussion in 
awslabs/open-data-registry#1418), it turns out the `s3://nyc-tlc` bucket no 
longer allows anonymous access. Both `ListBucket` and `GetObject` now require 
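credentials. For example, this unsigned listing now fails with `AccessDenied`:
   ```bash
   aws s3 ls s3://nyc-tlc/ --no-sign-request
   ```
   That matches what we are both seeing:
   - Your local run panics with `NotPresent` (a `std::env::VarError`, meaning 
a required environment variable is not set), since no AWS credentials are 
configured
   - I see the same behavior on my machine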
   
   I’m also not fully sure why the example used to work - the best guess is 
that CI was running it with temporary AWS credentials in the environment, which 
made the requests signed, and AWS allowed them at the time. But since the 
bucket now rejects anonymous access entirely, relying on it is no longer viable.
   
   Even if it starts working again, it’s outside our control. If it changes 
permissions again (as it just did), we’ll silently break the example for users. 
So I suggest we stop depending on nyc-tlc altogether and instead:
   - Use a user-controlled bucket for the example (as implemented in this PR). 
This avoids relying on external datasets whose access policies can change. 
Also update the docs and inline comments to state clearly that users must 
provide their own S3 bucket and Parquet file, and that the example expects 
valid AWS credentials to be configured.
   - Add a comment + README note explaining why the example can’t use NYC TLC 
anymore and link to the GitHub issue for context.
   - Optionally, we could switch to another public dataset - but personally I 
think that’s risky, since we can't guarantee its permissions won’t change in 
the future.
   
   This way, the example will always behave predictably and won’t require users 
to debug `AccessDenied` errors caused by external policy changes.
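
As a rough illustration of the "explain clearly" part, here is a minimal 
sketch of how the example could report a missing setting by name instead of 
panicking on `unwrap()`. It assumes the example reads its settings from the 
standard AWS environment variables (`AWS_REGION`, `AWS_ACCESS_KEY_ID`, 
`AWS_SECRET_ACCESS_KEY`). The names `require_env`, `S3Settings`, and 
`load_s3_settings` are hypothetical and are not code from this PR.

```rust
use std::env;

/// Reads a required environment variable and returns a descriptive error
/// message instead of panicking with a bare `NotPresent`.
/// (Hypothetical helper, for illustration only.)
fn require_env(name: &str) -> Result<String, String> {
    env::var(name).map_err(|err| match err {
        env::VarError::NotPresent => format!(
            "environment variable `{name}` is not set; this example needs \
             valid AWS credentials and an S3 bucket you control"
        ),
        env::VarError::NotUnicode(_) => {
            format!("environment variable `{name}` is not valid Unicode")
        }
    })
}

/// Hypothetical container for the settings the example would need.
#[derive(Debug)]
struct S3Settings {
    region: String,
    access_key_id: String,
    secret_access_key: String,
}

/// Collects all settings up front, so the first missing variable is
/// reported before any S3 request is attempted.
/// Assumption: the example uses the standard AWS environment variables.
fn load_s3_settings() -> Result<S3Settings, String> {
    Ok(S3Settings {
        region: require_env("AWS_REGION")?,
        access_key_id: require_env("AWS_ACCESS_KEY_ID")?,
        secret_access_key: require_env("AWS_SECRET_ACCESS_KEY")?,
    })
}
```

Calling `load_s3_settings()` at the top of the example and printing the error 
would tell users which variable to set, rather than leaving them to decode 
`NotPresent` from a panic message.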
   
   Let me know what you think - happy to refine it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

