Hi folks, I'm working on an S3 filesystem for the Python SDK. It already handles the happy path for both reading and writing, but I feel like there are quite a few edge cases that I'm likely missing.
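
To give a sense of the shape of what I have, here's a heavily trimmed sketch of the class. The method names come from my reading of apache_beam.io.filesystem.FileSystem and may not line up with the base class exactly, and the boto3 calls are just illustrative, so please treat this as a sketch rather than the real thing:

import boto3
from botocore.exceptions import ClientError

from apache_beam.io.filesystem import FileSystem


class S3FileSystem(FileSystem):
  """S3 implementation of the Beam FileSystem interface (sketch only)."""

  @classmethod
  def scheme(cls):
    return 's3'

  @staticmethod
  def _parse(path):
    # 's3://bucket/key' -> ('bucket', 'key')
    bucket, _, key = path[len('s3://'):].partition('/')
    return bucket, key

  def exists(self, path):
    bucket, key = self._parse(path)
    # Credentials currently come from the environment, which is what
    # question 3 below is about.
    try:
      boto3.client('s3').head_object(Bucket=bucket, Key=key)
      return True
    except ClientError:
      return False

  def size(self, path):
    bucket, key = self._parse(path)
    return boto3.client('s3').head_object(
        Bucket=bucket, Key=key)['ContentLength']

  # join/split/mkdirs/match/create/open/copy/rename/delete are omitted here;
  # they mirror gcsfilesystem.py + gcsio.py and would need to be implemented
  # before the class can actually be instantiated.
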
So far, my approach has been: "look at the generic FileSystem implementation, look at how gcsio.py and gcsfilesystem.py are written, and copy their approach as much as possible, at least to get to a proof of concept". That said, I'd like to know a few things:

1. Are there any official or unofficial guidelines or docs on writing filesystems? Even Java-specific ones could be really useful.

2. Are there any existing generic test suites that every filesystem is supposed to pass? Again, even if they only exist in the Java world, I'd still be down for trying to adapt them to the Python SDK.

3. Are there any established ideas about how to pass AWS credentials to Beam so that the S3 filesystem actually works? I currently rely on the standard environment variables, which boto picks up automatically, but it sounds like setting those up on runners like Dataflow or Spark would be troublesome. I've seen this discussed a couple of times on the list, but couldn't tell whether any conclusion was reached. My personal preference would be to have AWS settings passed in some global context (pipeline options, perhaps?), though there may be exceptions to that (say, people wanting to use different credentials for different AWS operations). I've put a rough sketch of what I'm picturing in a P.S. below.

Thanks!

--
Best regards,
Dmitry Demeshchuk.
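
P.S. To make question 3 a bit more concrete, here's roughly what I'm picturing for the "global context" route. The option names (--s3_access_key_id, --s3_secret_access_key) are made up, and I'm not sure yet how (or whether) a FileSystem instance would actually get hold of the pipeline options at runtime, so this is only a sketch of the idea:

import boto3
from apache_beam.options.pipeline_options import PipelineOptions


class S3Options(PipelineOptions):
  """Hypothetical AWS credential options carried in the pipeline options."""

  @classmethod
  def _add_argparse_args(cls, parser):
    parser.add_argument('--s3_access_key_id', default=None,
                        help='AWS access key id (made-up option name).')
    parser.add_argument('--s3_secret_access_key', default=None,
                        help='AWS secret access key (made-up option name).')


def make_s3_client(pipeline_options):
  """Builds a boto3 client from pipeline options, falling back to boto defaults."""
  s3_options = pipeline_options.view_as(S3Options)
  if s3_options.s3_access_key_id and s3_options.s3_secret_access_key:
    return boto3.client(
        's3',
        aws_access_key_id=s3_options.s3_access_key_id,
        aws_secret_access_key=s3_options.s3_secret_access_key)
  # No explicit credentials: fall back to whatever boto can discover
  # (environment variables, ~/.aws/credentials, instance profiles, ...).
  return boto3.client('s3')

The filesystem would then call something like make_s3_client(...) wherever it needs to talk to S3; whether different operations could still use different credentials under this scheme is exactly the kind of exception I mentioned above.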
