Hi folks, I'm working on an S3 filesystem for the Python SDK. It already handles the happy path for both reading and writing, but I feel like there are quite a few edge cases that I'm likely missing.
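
To give a sense of the shape of what I have, here's a heavily trimmed sketch of the class. The method names come from my reading of apache_beam.io.filesystem.FileSystem and may not line up with the base class exactly, and the boto3 calls are just illustrative, so please treat this as a sketch rather than the real thing:

import boto3
from botocore.exceptions import ClientError

from apache_beam.io.filesystem import FileSystem


class S3FileSystem(FileSystem):
  """S3 implementation of the Beam FileSystem interface (sketch only)."""

  @classmethod
  def scheme(cls):
    return 's3'

  @staticmethod
  def _parse(path):
    # 's3://bucket/key' -> ('bucket', 'key')
    bucket, _, key = path[len('s3://'):].partition('/')
    return bucket, key

  def exists(self, path):
    bucket, key = self._parse(path)
    # Credentials currently come from the environment, which is what
    # question 3 below is about.
    try:
      boto3.client('s3').head_object(Bucket=bucket, Key=key)
      return True
    except ClientError:
      return False

  def size(self, path):
    bucket, key = self._parse(path)
    return boto3.client('s3').head_object(
        Bucket=bucket, Key=key)['ContentLength']

  # join/split/mkdirs/match/create/open/copy/rename/delete are omitted here;
  # they mirror gcsfilesystem.py + gcsio.py and would need to be implemented
  # before the class can actually be instantiated.
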
So far, my approach has been: "look at the generic FileSystem implementation, look at how gcsio.py and gcsfilesystem.py are written, and copy their approach as much as possible, at least to get to a proof of concept". That said, I'd like to know a few things:

1. Are there any official or unofficial guidelines or docs on writing filesystems? Even Java-specific ones could be really useful.

2. Are there any existing generic test suites that every filesystem is supposed to pass? Again, even if they only exist in the Java world, I'd still be down for trying to adapt them to the Python SDK.

3. Are there any established ideas about how to pass AWS credentials to Beam so that the S3 filesystem actually works? I currently rely on the standard environment variables, which boto picks up automatically, but it sounds like setting those up on runners like Dataflow or Spark would be troublesome. I've seen this discussed a couple of times on the list, but couldn't tell whether any conclusion was reached. My personal preference would be to have AWS settings passed in some global context (pipeline options, perhaps?), though there may be exceptions to that (say, people wanting to use different credentials for different AWS operations). I've put a rough sketch of what I'm picturing in a P.S. below.

Thanks!

--
Best regards,
Dmitry Demeshchuk.
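
P.S. To make question 3 a bit more concrete, here's roughly what I'm picturing for the "global context" route. The option names (--s3_access_key_id, --s3_secret_access_key) are made up, and I'm not sure yet how (or whether) a FileSystem instance would actually get hold of the pipeline options at runtime, so this is only a sketch of the idea:

import boto3
from apache_beam.options.pipeline_options import PipelineOptions


class S3Options(PipelineOptions):
  """Hypothetical AWS credential options carried in the pipeline options."""

  @classmethod
  def _add_argparse_args(cls, parser):
    parser.add_argument('--s3_access_key_id', default=None,
                        help='AWS access key id (made-up option name).')
    parser.add_argument('--s3_secret_access_key', default=None,
                        help='AWS secret access key (made-up option name).')


def make_s3_client(pipeline_options):
  """Builds a boto3 client from pipeline options, falling back to boto defaults."""
  s3_options = pipeline_options.view_as(S3Options)
  if s3_options.s3_access_key_id and s3_options.s3_secret_access_key:
    return boto3.client(
        's3',
        aws_access_key_id=s3_options.s3_access_key_id,
        aws_secret_access_key=s3_options.s3_secret_access_key)
  # No explicit credentials: fall back to whatever boto can discover
  # (environment variables, ~/.aws/credentials, instance profiles, ...).
  return boto3.client('s3')

The filesystem would then call something like make_s3_client(...) wherever it needs to talk to S3; whether different operations could still use different credentials under this scheme is exactly the kind of exception I mentioned above.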
