On Fri, Mar 9, 2018 at 9:24 AM Lukasz Cwik <lc...@google.com> wrote:
> Note that TextIO/... internally use FileSystems (Java and Python).
>
> Based upon the current design, where FileSystems is a global concept
> (decoupled from PTransforms), having PipelineOptions configure it is a
> good and valid strategy.
>
> Earlier work by Pei He and Daniel Halperin was towards having different
> file system configurations fall under different URI schemes (see
> https://issues.apache.org/jira/browse/BEAM-59 for links to tasks and
> design docs). So if you wanted to have two different S3 configurations,
> you should register the S3FileSystem object with different configurations
> under different schemes like s3a://, s3b://, and s3c://.
>
> Ismaël, you're correct in pointing out that HadoopFileSystemOptions
> supports multiple schemes, and this is because Hadoop already does what
> is described above: you can have arbitrary configurations under
> different schemes. So the implementation within the HadoopFileSystem
> merges nicely into what was the original plan for FileSystems.
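To make the scheme-per-configuration idea above concrete, here is a rough
sketch of such a registrar. FileSystemRegistrar and @AutoService are the
real Beam extension points, but RegionAwareS3FileSystem is hypothetical
(Beam's S3FileSystem is not public and is hard-wired to the s3 scheme), so
treat this as the shape of the idea rather than working code:

    import com.google.auto.service.AutoService;
    import com.google.common.collect.ImmutableList;
    import org.apache.beam.sdk.io.FileSystem;
    import org.apache.beam.sdk.io.FileSystemRegistrar;
    import org.apache.beam.sdk.io.aws.options.S3Options;
    import org.apache.beam.sdk.options.PipelineOptions;

    /** Sketch: exposes two S3 configurations under their own URI schemes. */
    @AutoService(FileSystemRegistrar.class)
    public class MultiRegionS3Registrar implements FileSystemRegistrar {
      @Override
      public Iterable<FileSystem> fromOptions(PipelineOptions options) {
        S3Options s3Options = options.as(S3Options.class);
        return ImmutableList.of(
            // RegionAwareS3FileSystem is hypothetical: a FileSystem whose
            // getScheme() returns the given scheme and whose S3 client is
            // pinned to the given region.
            new RegionAwareS3FileSystem("s3a", "us-east-1", s3Options),
            new RegionAwareS3FileSystem("s3b", "eu-west-1", s3Options));
      }
    }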
I like this approach since it keeps FileSystem configurations decoupled
from the transforms that use FileSystems, and keeps them global. This is
not an issue in general for GCS, since we only need one global
configuration to connect to it, but other file systems such as HDFS may
require multiple sets of configurations (for example, if a given pipeline
needs to connect to two different HDFS clusters). The downside is that
pipeline users will have to use the proper scheme when accessing files,
but this is probably OK since users should be aware of the FileSystem
they are connecting to.

> There is still uncompleted work to make it such that users would be able
> to programmatically add additional file system configurations, and to
> make it so that the user's FileSystems configuration is replicated to
> remote workers. Currently only PipelineOptions is replicated to remote
> workers in all runners, which means that you can only use FileSystems
> configuration via PipelineOptions.
>
> On Fri, Mar 9, 2018 at 8:39 AM, Ismaël Mejía <ieme...@gmail.com> wrote:
>> File-based transforms are a little bit different because part of the
>> configuration lives in the file transform (TextIO.read().foo(),
>> TextIO.write().bar()) and another part in filesystem-specific options.
>>
>> In the example, TextIO.read().from("...") does not have a way to do
>> something like '.withRegion' (because it probably does not make sense
>> to put it there), so you parameterize this in the FileSystem options to
>> make it work. If possible I would have created two independent
>> S3Options objects and passed those to the Read and Write transforms,
>> but this is not supported and not the expected PipelineOptions
>> behavior, as Robert mentioned.
>>
>> Thinking about other cases, I am not sure how you can configure a Read
>> from one HDFS cluster (Configuration) and a Write to a different HDFS
>> one. (Note that HadoopFileSystemOptions supports multiple
>> configurations, but I have the impression this is to support multiple
>> schemes, not the same goal.) I am even wondering if this same issue is
>> valid for GcsOptions and its zones, or if I would want to define a
>> different storage class for the written results (extra question: do we
>> have an option to do this in GCS?).
>>
>> Eugene, I am not sure I follow the idea with CreateOptions. Do you
>> intend to use it to parameterize the specific transform? Or how can I
>> pass a specific CreateOptions to TextIO.read/write so it ends up being
>> used by the FileSystem methods that receive a CreateOptions object?
>>
>> On Fri, Mar 9, 2018 at 6:52 AM, Robert Bradshaw <rober...@google.com>
>> wrote:
>> > On Thu, Mar 8, 2018 at 9:38 PM Eugene Kirpichov
>> > <kirpic...@google.com> wrote:
>> >>
>> >> I think it may have been an API design mistake to put the S3 region
>> >> into PipelineOptions.
>> >
>> > +1, IMHO it's generally a mistake to put any transform configuration
>> > into PipelineOptions for exactly this reason.
>> >
>> >> PipelineOptions are global per pipeline, whereas it's totally
>> >> reasonable to access S3 files in different regions even from the
>> >> code of a single DoFn running on a single element. The same applies
>> >> to "setS3StorageClass".
>> >>
>> >> Jacob: what do you think? Why is it necessary to specify the S3
>> >> region at all - can AWS infer it automatically? Per
>> >> https://github.com/aws/aws-sdk-java/issues/1107 it seems that this
>> >> is possible via a setting on the client, so that the specified
>> >> region is used as the default, but if the bucket is in a different
>> >> region things still work.
>> >>
>> >> As for the storage class: so far nobody complained ;) but it should
>> >> probably be specified via
>> >> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/fs/CreateOptions.java
>> >> instead of a pipeline option.
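For what Eugene's CreateOptions suggestion might look like: Beam today
only ships CreateOptions.StandardCreateOptions (which carries a MIME
type), but an S3-specific subclass holding a storage class could follow
the same AutoValue pattern. The S3CreateOptions class below is
hypothetical:

    import com.google.auto.value.AutoValue;
    import org.apache.beam.sdk.io.fs.CreateOptions;

    /** Hypothetical S3-specific CreateOptions carrying a storage class. */
    @AutoValue
    public abstract class S3CreateOptions extends CreateOptions {
      public abstract String storageClass();

      public static Builder builder() {
        return new AutoValue_S3CreateOptions.Builder();
      }

      @AutoValue.Builder
      public abstract static class Builder extends CreateOptions.Builder<Builder> {
        public abstract Builder setStorageClass(String storageClass);
        public abstract S3CreateOptions build();
      }
    }

Since FileSystems.create(ResourceId, CreateOptions) already accepts any
CreateOptions subclass, the S3 filesystem could read the storage class
from the options it receives per write instead of from a pipeline-wide
option; the open question in this thread is how TextIO would let users
supply it.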
>> >> On Thu, Mar 8, 2018 at 9:16 PM Romain Manni-Bucau
>> >> <rmannibu...@gmail.com> wrote:
>> >>>
>> >>> The "hint" would probably be to use hints :) - indeed, this joke
>> >>> refers to the hint thread.
>> >>>
>> >>> Long story short, with hints you should be able to say "use that
>> >>> specialized config here".
>> >>>
>> >>> Now, personally, I'd like to see a way to specialize config per
>> >>> transform. With a hint, an easy way is to use a prefix: --s3-region
>> >>> would become --prefix_transform1-s3-region. But to implement it I
>> >>> have https://github.com/apache/beam/pull/4683 which needs to be
>> >>> merged before ;).
>> >>>
>> >>> On Mar 8, 2018 at 23:03, "Ismaël Mejía" <ieme...@gmail.com> wrote:
>> >>>>
>> >>>> I was trying to create a really simple pipeline that reads from a
>> >>>> bucket in a filesystem (S3) and writes to a different bucket in
>> >>>> the same filesystem.
>> >>>>
>> >>>> S3Options options =
>> >>>>     PipelineOptionsFactory.fromArgs(args).create().as(S3Options.class);
>> >>>> Pipeline pipeline = Pipeline.create(options);
>> >>>> pipeline
>> >>>>     .apply("ReadLines", TextIO.read().from("s3://src-bucket/*"))
>> >>>>     // .apply("AllOtherMagic", ...)
>> >>>>     .apply("WriteCounts", TextIO.write().to("s3://dst-bucket/"));
>> >>>> pipeline.run().waitUntilFinish();
>> >>>>
>> >>>> I discovered that my original bucket was in a different region, so
>> >>>> I needed to pass a different S3Options object to the Write
>> >>>> ('options.setAwsRegion("dst-region")'), but I could not find a way
>> >>>> to do it. Can somebody give me a hint on how to do this?
>> >>>>
>> >>>> Since file-based IOs use the configuration implied by the
>> >>>> FileSystem, I was wondering if this is even possible. With
>> >>>> non-file-based IOs all the configuration details are explicit in
>> >>>> each specific transform, but this is not the case for these
>> >>>> file-based transforms.
>> >>>>
>> >>>> Note: I know this question probably belongs more to user@, but
>> >>>> since I couldn't find an easy way to do it, I was wondering if
>> >>>> this is an issue we should consider at dev@ from an API point of
>> >>>> view.
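For illustration, here is how the original snippet would look under the
scheme-based workaround discussed above, assuming a second S3
configuration had been registered under a made-up s3b:// scheme bound to
the destination region (per the hypothetical registrar sketched earlier
in the thread; "src-region" is a placeholder):

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.io.aws.options.S3Options;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    S3Options options =
        PipelineOptionsFactory.fromArgs(args).create().as(S3Options.class);
    options.setAwsRegion("src-region"); // default region, used by plain s3://

    Pipeline pipeline = Pipeline.create(options);
    pipeline
        .apply("ReadLines", TextIO.read().from("s3://src-bucket/*"))
        // s3b:// would resolve to the second, destination-region configuration.
        .apply("WriteCounts", TextIO.write().to("s3b://dst-bucket/"));
    pipeline.run().waitUntilFinish();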