Proposal: Generalize S3FileSystem
Hi,

This is a quick sketch of a proposal - I wanted to get a sense of whether there's general support for this idea before fleshing it out further, getting internal approvals, etc.

I'm working with multiple storage systems that speak the S3 API. I would like to support FileIO operations for these storage systems, but S3FileSystem hardcodes the s3 scheme (the various systems use different URI schemes), and it is in any case impossible to instantiate more than one in the current design.

I'd like to refactor the code in org.apache.beam.sdk.io.aws.s3 (and maybe ...aws.options) somewhat to enable this use case. I haven't worked out the details yet, but it will take some thought to make this work in a non-hacky way.

Thanks
Matt Rudary
Re: Proposal: Generalize S3FileSystem
Thanks for the comments, all. I forgot to subscribe to dev before I sent out the email, so this response isn't threaded properly.

My proposed design is to do the following (for both the aws and aws2 packages):

1. Add a public class, S3FileSystemConfiguration, that mostly maps to S3Options, plus a Scheme field.
2. Add a public interface, S3FileSystemSchemeRegistrar, designed for use with AutoService. It will have a method that takes a PipelineOptions and returns an Iterable of S3FileSystemConfiguration. This will be the way that users register their S3 URI schemes with the system.
3. Add an implementation of S3FileSystemSchemeRegistrar for the s3 scheme that uses the S3Options from PipelineOptions to populate its S3FileSystemConfiguration, maintaining the current behavior by default.
4. Modify S3FileSystem's constructor to take an S3FileSystemConfiguration object instead of an S3Options, and make the relevant changes.
5. Modify S3FileSystemRegistrar to load all the AutoService'd file system configurations, raising an exception if multiple scheme registrars attempt to register the same scheme.

I considered alternative methods of configuration, in particular using a configuration file as HadoopFileSystemOptions does. In the end, I decided that the AutoService approach was better. First, it seems more common to do things this way within Beam. Second, unlike with Hadoop, there is no commonly used configuration format for these types of file systems already in use, and it's not clear what the best choice would be (YAML? JSON? Java Properties? XML?). Finally, I think the story for composing multiple registrars is better than the story for composing multiple configuration files; composition matters here because, for example, you may be dealing with multiple storage vendors at once.
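To make the design above concrete, here is a minimal sketch of the proposed registrar pattern. The names (S3FileSystemConfiguration, S3FileSystemSchemeRegistrar) follow the proposal, but all types here are simplified stand-ins - e.g. PipelineOptions is replaced by a plain Object, and the configuration carries only a scheme and an endpoint - so this is an illustration of the idea, not shipped Beam code. In a real implementation, each registrar would be annotated with @AutoService(S3FileSystemSchemeRegistrar.class) and discovered via java.util.ServiceLoader.

```java
import java.util.*;

// Per-scheme configuration: stands in for "mostly S3Options, plus a Scheme field".
final class S3FileSystemConfiguration {
  final String scheme;
  final String endpoint; // stand-in for the S3Options-derived settings

  S3FileSystemConfiguration(String scheme, String endpoint) {
    this.scheme = scheme;
    this.endpoint = endpoint;
  }
}

// The proposed registrar interface. In the real design this would take a
// PipelineOptions and be loaded via AutoService/ServiceLoader.
interface S3FileSystemSchemeRegistrar {
  Iterable<S3FileSystemConfiguration> fromOptions(Object pipelineOptions);
}

final class SchemeRegistry {
  // Collects configurations from all registrars (step 5 of the proposal),
  // raising an exception if two registrars claim the same scheme.
  static Map<String, S3FileSystemConfiguration> collect(
      Iterable<S3FileSystemSchemeRegistrar> registrars) {
    Map<String, S3FileSystemConfiguration> byScheme = new HashMap<>();
    for (S3FileSystemSchemeRegistrar registrar : registrars) {
      for (S3FileSystemConfiguration config : registrar.fromOptions(null)) {
        if (byScheme.putIfAbsent(config.scheme, config) != null) {
          throw new IllegalStateException("Duplicate scheme: " + config.scheme);
        }
      }
    }
    return byScheme;
  }
}
```

With this shape, a vendor-specific registrar is just another AutoService implementation returning a configuration for its own scheme, which is what makes the multi-vendor composition story straightforward.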
Matt

On 2021/05/19 13:27:16, Matt Rudary <m...@twosigma.com> wrote:
> [original proposal quoted in full; snipped]
RE: Proposal: Generalize S3FileSystem
I've filed https://issues.apache.org/jira/browse/BEAM-12435 to track this improvement.

From: Matt Rudary
Sent: Monday, May 24, 2021 4:49 PM
To: dev@beam.apache.org
Subject: Re: Proposal: Generalize S3FileSystem

[quoted text of the design proposal and original message; snipped]
Modifying serializable classes
My general question is: what responsibility do we have to maintain forward and backward compatibility for the serialization of objects in the SDK?

My specific question is about org.apache.beam.sdk.io.aws.s3.S3ResourceId - how can I tell whether ResourceIds are serialized anywhere that would require stable serialization across Beam SDK updates?

Thanks
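For context on why the question matters: with Java serialization, compatibility across SDK versions hinges on serialVersionUID and field-level compatibility. The following is an illustrative sketch with a hypothetical ResourceIdLike class (not the actual S3ResourceId) showing the round trip that would have to keep working across versions if such objects are ever persisted or shuffled between workers on different SDK builds.

```java
import java.io.*;

// Illustrative stand-in for a resource-id-like value class. Pinning
// serialVersionUID means streams written by an older build can still be
// read after serialization-compatible field changes; without the pin, any
// class change regenerates the default UID and deserialization fails with
// InvalidClassException.
class ResourceIdLike implements Serializable {
  private static final long serialVersionUID = 1L; // explicit pin
  final String bucket;
  final String key;

  ResourceIdLike(String bucket, String key) {
    this.bucket = bucket;
    this.key = key;
  }
}

final class SerDe {
  // Serialize to a byte array, as a runner might when persisting state.
  static byte[] toBytes(Serializable value) throws IOException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    try (ObjectOutputStream out = new ObjectOutputStream(bos)) {
      out.writeObject(value);
    }
    return bos.toByteArray();
  }

  // Deserialize, possibly in a process running a different SDK version.
  static Object fromBytes(byte[] bytes) throws IOException, ClassNotFoundException {
    try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
      return in.readObject();
    }
  }
}
```

The hard part of the original question remains, of course: auditing whether any runner or transform actually writes such bytes somewhere that outlives a single pipeline run.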
RE: Synchronization of RestrictionTrackers
In practice, restriction tracker methods are called via a RestrictionTrackerObserver, which synchronizes (https://github.com/apache/beam/blob/master/sdks/java/fn-execution/src/main/java/org/apache/beam/sdk/fn/splittabledofn/RestrictionTrackers.java).

-----Original Message-----
From: Jan Lukavský
Sent: Thursday, July 29, 2021 1:59 PM
To: dev@beam.apache.org
Subject: Synchronization of RestrictionTrackers

Hi,

I have come across something that looks like a bug to me, but I'm not sure of that. If I understand it correctly, RestrictionTracker.trySplit() and RestrictionTracker.tryClaim() are necessarily called from different threads. That implies that modifying some fields inside these methods might require synchronization. Looking here [1], I didn't find anything that would ensure atomicity and consistency of these methods. If nothing else, I'd expect lastClaimedOffset and lastAttemptedOffset to be volatile. But probably the problem is deeper. Is this a bug, or am I missing something?

Jan

[1] https://github.com/apache/beam/blob/939fa99ce943a30da46cb3d67c924d524fbf1be4/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/splittabledofn/OffsetRangeTracker.java#L44
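The synchronization pattern described above can be sketched as follows. These are simplified stand-in classes, not the actual Beam API: the point is only that every call into the tracker goes through a wrapper whose methods are synchronized on the same monitor, so the tracker's plain (non-volatile) fields get the necessary happens-before edges between the processing thread calling tryClaim() and the runner thread calling trySplit().

```java
// Simplified stand-in for an offset-based tracker with plain fields.
// On its own it is not thread-safe; it relies on the wrapper's lock.
class OffsetRangeTrackerSketch {
  private long lastAttemptedOffset = -1; // plain field: safe only behind the wrapper's lock
  private final long endOffset;

  OffsetRangeTrackerSketch(long endOffset) {
    this.endOffset = endOffset;
  }

  boolean tryClaim(long offset) {
    lastAttemptedOffset = offset;
    return offset < endOffset;
  }

  long lastAttemptedOffset() {
    return lastAttemptedOffset;
  }
}

// Stand-in for the observing wrapper: the runner never touches the
// delegate directly, so all cross-thread access is serialized here.
final class ThreadSafeTracker {
  private final OffsetRangeTrackerSketch delegate;

  ThreadSafeTracker(OffsetRangeTrackerSketch delegate) {
    this.delegate = delegate;
  }

  // Called on the element-processing thread.
  synchronized boolean tryClaim(long offset) {
    return delegate.tryClaim(offset);
  }

  // Called on a different thread by the runner; same monitor, so it
  // observes every write made inside tryClaim().
  synchronized long currentProgress() {
    return delegate.lastAttemptedOffset();
  }
}
```

This is why the fields in OffsetRangeTracker can stay non-volatile: correctness depends on the invariant that trackers are only ever driven through the synchronizing wrapper.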
Developing on an M1 Mac
Does anyone do Beam development on an M1 Mac? Any tips for getting things up and running? Alternatively, does anyone have a good "workstation in the cloud" setup?

Thanks
Matt