File-based transforms are a little different because part of the
configuration lives in the file transform (TextIO.read().foo(),
TextIO.write().bar()) and the other part lives in filesystem-specific
options.

In the example, TextIO.read().from("...") has no way to do something
like '.withRegion' (probably because it does not make sense to put it
there), so you parameterize this in the FileSystem options to make it
work. If possible, I would have created two independent S3Options
objects and passed them to the Read and Write transforms, but this is
not supported and is not the expected PipelineOptions behavior, as
Robert mentioned.

Thinking about other cases, I am not sure how you can configure a Read
from one HDFS cluster (Configuration) and a Write to a different HDFS
cluster. (Note that HadoopFileSystemOptions supports multiple
configurations, but I have the impression this is to support multiple
schemes, which is not the same goal.) I am even wondering whether the
same issue applies to GcsOptions and its zones, or to the case where I
want to define a different storage class for the written results (extra
question: do we have an option to do this in GCS?).

Eugene, I am not sure I follow the idea with CreateOptions. Do you
intend to use it to parameterize the specific transform? Or how can I
pass a specific CreateOptions to TextIO.read/write so that it ends up
being used by the FileSystem methods that receive a CreateOptions
object?
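
For what it is worth, here is how I read the per-transform prefix idea
that Romain describes in the quoted message below, as a plain-Java
sketch with no Beam dependency. Everything in it is an assumption made
for illustration: the PrefixedOptions class, the resolve method, and
the literal "prefix_<transform>-<key>" flag shape are hypothetical and
are not taken from the pull request or from the Beam API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical helper, not part of Beam: per-transform option lookup with a global fallback. */
final class PrefixedOptions {
  private final Map<String, String> values = new HashMap<>();

  /** Parses arguments of the form --name=value; anything else is ignored. */
  PrefixedOptions(String... args) {
    for (String arg : args) {
      int eq = arg.indexOf('=');
      if (arg.startsWith("--") && eq > 2) {
        values.put(arg.substring(2, eq), arg.substring(eq + 1));
      }
    }
  }

  /** Prefers prefix_<transform>-<key> (assumed flag shape), then falls back to the global <key>. */
  Optional<String> resolve(String transform, String key) {
    String scoped = values.get("prefix_" + transform + "-" + key);
    return Optional.ofNullable(scoped != null ? scoped : values.get(key));
  }

  public static void main(String[] args) {
    PrefixedOptions options = new PrefixedOptions(
        "--s3-region=us-east-1",
        "--prefix_WriteCounts-s3-region=eu-west-1");
    // The read has no override, so it sees the global region.
    System.out.println(options.resolve("ReadLines", "s3-region").orElse("unset"));
    // The write has a scoped override, which wins over the global value.
    System.out.println(options.resolve("WriteCounts", "s3-region").orElse("unset"));
  }
}
```

Run as-is, main prints us-east-1 for ReadLines and eu-west-1 for
WriteCounts. The open question from my side stays the same: something
would still have to hand the resolved value to the FileSystem used by
that one transform.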

On Fri, Mar 9, 2018 at 6:52 AM, Robert Bradshaw <rober...@google.com> wrote:
> On Thu, Mar 8, 2018 at 9:38 PM Eugene Kirpichov <kirpic...@google.com>
> wrote:
>>
>> I think it may have been an API design mistake to put the S3 region into
>> PipelineOptions.
>
>
> +1, IMHO it's generally a mistake to put any transform configuration into
> PipelineOptions for exactly this reason.
>
>>
>> PipelineOptions are global per pipeline, whereas it's totally reasonable
>> to access S3 files in different regions even from the code of a single DoFn
>> running on a single element. The same applies to "setS3StorageClass".
>>
>> Jacob: what do you think? Why is it necessary to specify the S3 region at
>> all - can AWS infer it automatically? Per
>> https://github.com/aws/aws-sdk-java/issues/1107 it seems that this is
>> possible via a setting on the client, so that the specified region is used
>> as the default but if the bucket is in a different region things still work.
>>
>> As for the storage class: so far nobody complained ;) but it should
>> probably be specified via
>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/fs/CreateOptions.java
>> instead of a pipeline option.
>>
>> On Thu, Mar 8, 2018 at 9:16 PM Romain Manni-Bucau <rmannibu...@gmail.com>
>> wrote:
>>>
>>> The "hint" would probably be to use hints :) - indeed, this joke refers to
>>> the hint thread.
>>>
>>> Long story short, with hints you should be able to say "use that
>>> specialized config here".
>>>
>>> Now, personally, I'd like to see a way to specialize config per
>>> transform. With a hint, an easy way is to use a prefix: --s3-region would
>>> become --prefix_transform1-s3-region. But to implement it I have
>>> https://github.com/apache/beam/pull/4683 which needs to be merged first ;).
>>>
>>> On Mar 8, 2018 at 23:03, "Ismaël Mejía" <ieme...@gmail.com> wrote:
>>>>
>>>> I was trying to create a really simple pipeline that reads from a
>>>> bucket in a filesystem (S3) and writes to a different bucket in the
>>>> same filesystem.
>>>>
>>>>     S3Options options =
>>>>         PipelineOptionsFactory.fromArgs(args).create().as(S3Options.class);
>>>>     Pipeline pipeline = Pipeline.create(options);
>>>>     pipeline
>>>>       .apply("ReadLines", TextIO.read().from("s3://src-bucket/*"))
>>>>       // .apply("AllOtherMagic", ...)
>>>>       .apply("WriteCounts", TextIO.write().to("s3://dst-bucket/"));
>>>>     pipeline.run().waitUntilFinish();
>>>>
>>>> I discovered that my original bucket was in a different region, so I
>>>> needed to pass a different S3Options object to the Write
>>>> (options.setAwsRegion("dst-region")), but I could not find a way to do
>>>> it. Can somebody give me a hint on how to do this?
>>>>
>>>> I was wondering whether this is possible at all, since file-based IOs
>>>> use the configuration implied by the FileSystem. With non-file-based
>>>> IOs all the configuration details are explicit in each specific
>>>> transform, but this is not the case for these file-based transforms.
>>>>
>>>> Note. I know this question probably belongs more to user@ but since I
>>>> couldn’t find an easy way to do it I was wondering if this is an issue
>>>> we should consider at dev@ from an API point of view.
