There is still a lot of work before we support cross-language transforms
and thereby get access to filesystems written in different languages, but
how the options are passed through from one language to the other will
need to be well understood. It would also be best if the way a user
defines these filesystems is the same in all languages, because it would
be annoying to provide the same configuration (in slightly different ways)
for Java, Python, Go, ...
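
Whatever design we land on, the prefix/namespace idea discussed further
down this thread could look roughly like the following sketch. To be
clear, this is not an existing Beam API - all names here (ScopedOptions,
the `--<transformId>.<key>=<value>` convention, the example transform ids)
are hypothetical, just to make the shape of the proposal concrete:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: route "--<transformId>.<key>=<value>" flags to a
// per-transform option map, while unprefixed "--<key>=<value>" flags stay
// global. Per-transform values override the global ones on lookup.
public class ScopedOptions {
  final Map<String, String> globals = new HashMap<>();
  final Map<String, Map<String, String>> perTransform = new HashMap<>();

  static ScopedOptions parse(String[] args) {
    ScopedOptions o = new ScopedOptions();
    for (String arg : args) {
      if (!arg.startsWith("--")) {
        continue;
      }
      int eq = arg.indexOf('=');
      if (eq < 0) {
        continue;
      }
      String key = arg.substring(2, eq);
      String value = arg.substring(eq + 1);
      int dot = key.indexOf('.');
      if (dot >= 0) {
        // Prefixed flag, scoped to one transform,
        // e.g. --writeDst.awsRegion=eu-west-1
        String transform = key.substring(0, dot);
        o.perTransform
            .computeIfAbsent(transform, t -> new HashMap<>())
            .put(key.substring(dot + 1), value);
      } else {
        // Unprefixed flag, global default for all transforms.
        o.globals.put(key, value);
      }
    }
    return o;
  }

  // Look up a key for a transform, falling back to the global value.
  String get(String transformId, String key) {
    Map<String, String> scoped = perTransform.get(transformId);
    if (scoped != null && scoped.containsKey(key)) {
      return scoped.get(key);
    }
    return globals.get(key);
  }
}
```

With something of this shape, the two-region pipeline below in this
thread could be run with "--awsRegion=us-east-1
--writeDst.awsRegion=eu-west-1" and only the sink would see the override,
without touching the pipeline code.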

On Fri, Mar 9, 2018 at 2:01 PM, Romain Manni-Bucau <rmannibu...@gmail.com>
wrote:

>
>
> On Mar 9, 2018 21:35, "Lukasz Cwik" <lc...@google.com> wrote:
>
> The blocker is to get someone to follow through on the original design or
> to get a new design (with feedback) and have it implemented.
>
>
> If the PipelineOptionsFactory-related PRs are merged, I can do a
> PR/proposal based on this thread's draft this month.
>
>
> Note that this impacts more than just Java as it also exists in Python and
> Go as well.
>
>
> Clearly outside my knowledge, but since it is mainly Java-backed it
> should be almost transparent, no? If not, should it be part of the
> portable API on top of runners?
>
>
> On Fri, Mar 9, 2018 at 12:18 PM, Romain Manni-Bucau
> <rmannibu...@gmail.com> wrote:
>
>> Hmm, it doesn't solve the issue that Beam doesn't let you configure a
>> transform from its "config" (say, the CLI).
>>
>> So if I have a generic pipeline taking one file as input and another as
>> output, must I register 2 filesystems in all cases? If the pipeline is
>> dynamic, must I make that dynamic too?
>>
>> Sounds pretty bad for end users and not generic - every transform hits
>> this issue since Beam can't assume the implementation. Using a prefix (a
>> namespace, implicit or not) is simple, straightforward, and lets all
>> cases be handled smoothly for end users.
>>
>> What is the blocker to fixing this design issue? I kind of fail to see
>> why we end up with workarounds for a few particular cases right now :s.
>>
>> On Mar 9, 2018 19:00, "Jacob Marble" <jacobmar...@gmail.com> wrote:
>>
>>> I think when I wrote the S3 code, I couldn't see how to set the storage
>>> class per bucket, so I put it in a flag. It's easy to imagine a use
>>> case where the storage class differs per filespec, not only per bucket.
>>>
>>> Jacob
>>>
>>> On Fri, Mar 9, 2018 at 9:51 AM, Jacob Marble <jacobmar...@gmail.com>
>>> wrote:
>>>
>>>> Yes, I agree with all of this.
>>>>
>>>> Jacob
>>>>
>>>> On Thu, Mar 8, 2018 at 9:52 PM, Robert Bradshaw <rober...@google.com>
>>>> wrote:
>>>>
>>>>> On Thu, Mar 8, 2018 at 9:38 PM Eugene Kirpichov <kirpic...@google.com>
>>>>> wrote:
>>>>>
>>>>>> I think it may have been an API design mistake to put the S3 region
>>>>>> into PipelineOptions.
>>>>>>
>>>>>
>>>>> +1, IMHO it's generally a mistake to put any transform configuration
>>>>> into PipelineOptions for exactly this reason.
>>>>>
>>>>>
>>>>>> PipelineOptions are global per pipeline, whereas it's totally
>>>>>> reasonable to access S3 files in different regions even from the
>>>>>> code of a single DoFn running on a single element. The same applies
>>>>>> to "setS3StorageClass".
>>>>>>
>>>>>> Jacob: what do you think? Why is it necessary to specify the S3
>>>>>> region at all - can AWS infer it automatically? Per
>>>>>> https://github.com/aws/aws-sdk-java/issues/1107 it seems that this
>>>>>> is possible via a setting on the client, so that the specified
>>>>>> region is used as the default but if the bucket is in a different
>>>>>> region things still work.
>>>>>>
>>>>>> As for the storage class: so far nobody complained ;) but it should
>>>>>> probably be specified via
>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/fs/CreateOptions.java
>>>>>> instead of a pipeline option.
>>>>>>
>>>>>> On Thu, Mar 8, 2018 at 9:16 PM Romain Manni-Bucau <
>>>>>> rmannibu...@gmail.com> wrote:
>>>>>>
>>>>>>> The "hint" would probably be to use hints :) - indeed, this joke
>>>>>>> refers to the hint thread.
>>>>>>>
>>>>>>> Long story short: with hints you should be able to say "use that
>>>>>>> specialized config here".
>>>>>>>
>>>>>>> Now, personally, I'd like to see a way to specialize config per
>>>>>>> transform. With a hint, an easy way is to use a prefix: --s3-region
>>>>>>> would become --prefix_transform1-s3-region. But to implement it I
>>>>>>> have https://github.com/apache/beam/pull/4683 which needs to be
>>>>>>> merged first ;).
>>>>>>>
>>>>>>> On Mar 8, 2018 23:03, "Ismaël Mejía" <ieme...@gmail.com> wrote:
>>>>>>>
>>>>>>>> I was trying to create a really simple pipeline that read from a
>>>>>>>> bucket in a filesystem (s3) and writes to a different bucket in the
>>>>>>>> same filesystem.
>>>>>>>>
>>>>>>>>     S3Options options =
>>>>>>>>         PipelineOptionsFactory.fromArgs(args).create().as(S3Options.class);
>>>>>>>>     Pipeline pipeline = Pipeline.create(options);
>>>>>>>>     pipeline
>>>>>>>>       .apply("ReadLines", TextIO.read().from("s3://src-bucket/*"))
>>>>>>>>       // .apply("AllOtherMagic", ...)
>>>>>>>>       .apply("WriteCounts", TextIO.write().to("s3://dst-bucket/"));
>>>>>>>>     pipeline.run().waitUntilFinish();
>>>>>>>>
>>>>>>>> I discovered that my original bucket was in a different region, so
>>>>>>>> I needed to pass a different S3Options object to the Write
>>>>>>>> ‘options.setAwsRegion(“dst-region”)’, but I could not find a way
>>>>>>>> to do it. Can somebody give me a hint on how to do this?
>>>>>>>>
>>>>>>>> Since file-based IOs use the configuration implied by the
>>>>>>>> FileSystem, I was wondering if this was possible at all. With
>>>>>>>> non-file-based IOs all the configuration details are explicit in
>>>>>>>> each specific transform, but this is not the case for these
>>>>>>>> file-based transforms.
>>>>>>>>
>>>>>>>> Note: I know this question probably belongs more to user@ but
>>>>>>>> since I couldn't find an easy way to do it, I was wondering if
>>>>>>>> this is an issue we should consider at dev@ from an API point of
>>>>>>>> view.
>>>>>>>>
>>>>>>>
>>>>
>>>
>
>
