[ https://issues.apache.org/jira/browse/BEAM-2572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16081374#comment-16081374 ]
Dmitry Demeshchuk commented on BEAM-2572:
-----------------------------------------

How about the following plan, then?

1. Add the ability to hide pipeline options. For example, extend {{_BeamArgumentParser}} by overloading the {{add_argument}} method and adding a {{hidden=False}} parameter to it.
2. Add an {{AWSOptions}} class that inherits from {{PipelineOptions}} and provides the hidden options {{aws_access_key_id}}, {{aws_secret_access_key}}, and {{aws_default_region}}.
3. Add an AWS extras package to {{apache_beam}} (similar to {{apache_beam[gcp]}}) that depends on boto and contains all the AWS-related code.
4. Add the ability for filesystems to be aware of pipeline options.
5. Add the actual S3 filesystem.

I can create the corresponding tickets and start working on them.

> Implement an S3 filesystem for Python SDK
> -----------------------------------------
>
>                 Key: BEAM-2572
>                 URL: https://issues.apache.org/jira/browse/BEAM-2572
>             Project: Beam
>          Issue Type: Task
>          Components: sdk-py
>            Reporter: Dmitry Demeshchuk
>            Assignee: Ahmet Altay
>            Priority: Minor
>
> There are two paths worth exploring, to my understanding:
> 1. Sticking to the HDFS-based approach (as is done in Java).
> 2. Using boto/boto3 to access S3 through its common API endpoints.
> I personally prefer the second approach, for a few reasons:
> 1. In real life, HDFS and S3 have different consistency guarantees, so their behaviors may contradict each other in some edge cases (say, we write something to S3, but it's not immediately readable from the other end).
> 2. There are other AWS-based sources and sinks we may want to create in the future: DynamoDB, Kinesis, SQS, etc.
> 3. boto3 already provides reasonably good logic for basics like retrying.
> Whatever path we choose, there's another problem related to this: we currently cannot pass any global settings (say, pipeline options, or just an arbitrary kwarg) to a filesystem.
> Because of that, we'd have to set up the runner nodes with AWS keys in the environment, which is neither trivial to achieve nor particularly clean (I'd rather see one single place for configuring the runner options).
> Also, it's worth mentioning that I already have a janky S3 filesystem implementation that only supports DirectRunner at the moment (because of the previous paragraph). I'm perfectly fine finishing it myself, with some guidance from the maintainers.
> Where should I move on from here, and whose input should I be looking for?
> Thanks!

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
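Steps 1 and 2 of the plan in the comment above could be sketched roughly as follows. This is a standalone illustration, not Beam code: {{HiddenArgumentParser}} is a hypothetical stand-in for Beam's internal {{_BeamArgumentParser}}, and the {{AWSOptions}} class is shown without the real {{PipelineOptions}} base class (Beam's options subclasses register their flags through a {{_add_argparse_args}} classmethod, which is mimicked here). Hiding relies on stock {{argparse.SUPPRESS}} behavior.

```python
import argparse


class HiddenArgumentParser(argparse.ArgumentParser):
    # Hypothetical stand-in for Beam's _BeamArgumentParser (step 1):
    # add_argument grows a hidden= flag. A hidden option is still parsed
    # normally, but argparse.SUPPRESS drops it from --help output.
    def add_argument(self, *args, hidden=False, **kwargs):
        if hidden:
            kwargs['help'] = argparse.SUPPRESS
        return super().add_argument(*args, **kwargs)


class AWSOptions:
    # Sketch of step 2. In Beam this would subclass PipelineOptions;
    # here only the _add_argparse_args registration hook is mimicked.
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_argument('--aws_access_key_id', hidden=True)
        parser.add_argument('--aws_secret_access_key', hidden=True)
        parser.add_argument('--aws_default_region',
                            default='us-east-1', hidden=True)


parser = HiddenArgumentParser(prog='pipeline')
AWSOptions._add_argparse_args(parser)
parser.add_argument('--runner', default='DirectRunner')

opts = parser.parse_args(['--aws_access_key_id', 'AKIA-EXAMPLE'])
print(opts.aws_access_key_id)                         # AKIA-EXAMPLE
print(opts.aws_default_region)                        # us-east-1
print('--aws_access_key_id' in parser.format_help())  # False: hidden
print('--runner' in parser.format_help())             # True: visible
```

Step 4 would then let a filesystem constructor receive these parsed options, so credentials flow through one place instead of per-node environment setup.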