I think a way to have transform-specific options could be useful, regardless of this use case.
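To make that a bit more concrete, here is a minimal sketch of the name-scoped lookup proposed in the quoted message below: a transform named in apply() first looks for "name::option" and falls back to the global "option" if nothing scoped is set. Everything in it is hypothetical and not part of Beam: the ScopedOptions class, its parse/resolve methods, and the "--name::option=value" flag syntax are assumptions made for illustration only. The option name awsRegion and the transform names are borrowed from the example at the bottom of the thread.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical helper, NOT a Beam API: resolves options scoped by transform name. */
final class ScopedOptions {
  // Assumed separator between the transform name and the option name.
  private static final String SEPARATOR = "::";

  private ScopedOptions() {}

  /** Parses "--key=value" arguments into a flat map; a key may be "name::option". */
  static Map<String, String> parse(String[] args) {
    Map<String, String> options = new HashMap<>();
    for (String arg : args) {
      int eq = arg.indexOf('=');
      if (!arg.startsWith("--") || eq < 0) {
        continue; // ignore anything that is not of the form --key=value
      }
      options.put(arg.substring(2, eq), arg.substring(eq + 1));
    }
    return options;
  }

  /** Returns the scoped value if present, otherwise the global one (inheritance). */
  static Optional<String> resolve(
      Map<String, String> options, String transformName, String option) {
    String scoped = options.get(transformName + SEPARATOR + option);
    if (scoped != null) {
      return Optional.of(scoped);
    }
    return Optional.ofNullable(options.get(option));
  }

  public static void main(String[] args) {
    Map<String, String> options =
        parse(new String[] {"--awsRegion=us-east-1", "--WriteCounts::awsRegion=eu-west-1"});
    // The write step sees its own region; the read step inherits the global one.
    System.out.println(resolve(options, "WriteCounts", "awsRegion").orElse("unset"));
    System.out.println(resolve(options, "ReadLines", "awsRegion").orElse("unset"));
  }
}
```

The fallback keeps existing global flags working unchanged, so a pipeline would only need scoped flags for the transforms that actually differ.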

On Mon, Mar 12, 2018 at 12:42 PM Romain Manni-Bucau <[email protected]> wrote:

> On Mar 12, 2018 at 18:56, "Chamikara Jayalath" <[email protected]> wrote:
>
> Agree. We need file-system abstractions in all languages since (1) users
> may need to directly access file-systems from DoFns (2) common file-based
> sources/sinks will probably be available in multiple languages even
> with the portability API and cross-language IO (these are usually the first
> sources/sinks that get implemented in an SDK and serve as reference
> implementations for other sources/sinks).
>
>
> Can you detail (1)? It shouldn't be needed, otherwise the abstraction is
> not that great.
>
> Side note on this one: VFS uses static utilities to set up custom configs
> for a specific impl; Beam can do the same and hide the unwrapping.
>
> However, I'm a bit sad we always end up on the FS whereas the issue is
> generic: how to bind some specific config to a transform?
>
> Ex:
> - parallelism by source in the direct runner and not globally
> - different pool config for jdbc.read1 and jdbc.write1 because concurrency
>   is different
> - different JMS provider for input1 and output2
> - etc...
>
> As soon as:
> - the pipeline is "generic" (it takes a JMS source/output or a file path)
> Or
> - an IO uses an impl that is configurable but not exposed through the
>   main API
>
> Then you need a config per transform of the pipeline. The filesystem is no
> different from dozens of other cases, so it would be great to fix it once
> and for all, no?
>
> Thinking a bit more, and without using hints, we can just use the name if
> set in apply:
>
> p.apply("myprefix", ...) would accept -Dmyprefix::myOption=foo for the
> scope of the fn and, if not set, it would try to use myOption (inheritance
> from the global context). This way we are simple and generic, and our config
> becomes scalable and stays nicely storable.
>
>
> - Cham
>
>
> On Mon, Mar 12, 2018 at 10:48 AM Lukasz Cwik <[email protected]> wrote:
>
>> There is still a lot of work before we get to supporting cross language
>> transforms and hence get access to filesystems written in different
>> languages but how the options are passed through from one to the other will
>> need to be well understood and it would be best if the way a user defines
>> these filesystems is the same in all languages because it would be annoying
>> to provide the same configuration (in slightly different ways) for Java,
>> Python, Go, ...
>>
>> On Fri, Mar 9, 2018 at 2:01 PM, Romain Manni-Bucau <[email protected]>
>> wrote:
>>
>>> On Mar 9, 2018 at 21:35, "Lukasz Cwik" <[email protected]> wrote:
>>>
>>> The blocker is to get someone to follow through on the original design
>>> or to get a new design (with feedback) and have it implemented.
>>>
>>>
>>> If the PipelineOptionsFactory-related PRs are merged, I can do a
>>> PR/proposal based on this thread's draft this month.
>>>
>>>
>>> Note that this impacts more than just Java as it also exists in Python
>>> and Go as well.
>>>
>>>
>>> Clearly outside my knowledge, but since it is mainly Java backed it
>>> should be almost transparent, no? If not, should it be part of the portable
>>> API on top of runners?
>>>
>>>
>>> On Fri, Mar 9, 2018 at 12:18 PM, Romain Manni-Bucau <
>>> [email protected]> wrote:
>>>
>>>> Hmm, it doesn't solve the issue that Beam doesn't enable configuring a
>>>> transform from its "config" (let's say the CLI).
>>>>
>>>> So if I have a generic pipeline taking a file as input and another as
>>>> output then I must register 2 filesystems in all cases? If the pipeline is
>>>> dynamic I must make it dynamic too?
>>>>
>>>> Sounds pretty bad for end users and not generic - all transforms hit
>>>> this issue since Beam can't assume the impl.
>>>> Using a prefix (a namespace, which can be implicit or not) is simple,
>>>> straightforward and enables all cases to be handled smoothly for end users.
>>>>
>>>> What is the blocker to fix this design issue? I kind of fail to see why
>>>> we end up on a few particular cases with workarounds right now :s.
>>>>
>>>> On Mar 9, 2018 at 19:00, "Jacob Marble" <[email protected]> wrote:
>>>>
>>>>> I think when I wrote the S3 code, I couldn't see how to set storage
>>>>> class per-bucket, so put it in a flag. It's easy to imagine a use case
>>>>> where storage class differs per filespec, not only per bucket.
>>>>>
>>>>> Jacob
>>>>>
>>>>> On Fri, Mar 9, 2018 at 9:51 AM, Jacob Marble <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Yes, I agree with all of this.
>>>>>>
>>>>>> Jacob
>>>>>>
>>>>>> On Thu, Mar 8, 2018 at 9:52 PM, Robert Bradshaw <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> On Thu, Mar 8, 2018 at 9:38 PM Eugene Kirpichov <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> I think it may have been an API design mistake to put the S3 region
>>>>>>>> into PipelineOptions.
>>>>>>>>
>>>>>>>
>>>>>>> +1, IMHO it's generally a mistake to put any transform configuration
>>>>>>> into PipelineOptions for exactly this reason.
>>>>>>>
>>>>>>>
>>>>>>>> PipelineOptions are global per pipeline, whereas it's totally
>>>>>>>> reasonable to access S3 files in different regions even from the code
>>>>>>>> of a single DoFn running on a single element. The same applies to
>>>>>>>> "setS3StorageClass".
>>>>>>>>
>>>>>>>> Jacob: what do you think? Why is it necessary to specify the S3
>>>>>>>> region at all - can AWS infer it automatically? Per
>>>>>>>> https://github.com/aws/aws-sdk-java/issues/1107 it seems that this
>>>>>>>> is possible via a setting on the client, so that the specified region
>>>>>>>> is used as the default but if the bucket is in a different region
>>>>>>>> things still work.
>>>>>>>>
>>>>>>>> As for the storage class: so far nobody complained ;) but it should
>>>>>>>> probably be specified via
>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/fs/CreateOptions.java
>>>>>>>> instead of a pipeline option.
>>>>>>>>
>>>>>>>> On Thu, Mar 8, 2018 at 9:16 PM Romain Manni-Bucau <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> The "hint" would probably be to use hints :) - indeed, this joke
>>>>>>>>> refers to the hint thread.
>>>>>>>>>
>>>>>>>>> Long story short, with hints you should be able to say "use that
>>>>>>>>> specialized config here".
>>>>>>>>>
>>>>>>>>> Now, personally, I'd like to see a way to specialize config per
>>>>>>>>> transform. With a hint, an easy way is to use a prefix: --s3-region
>>>>>>>>> would become --prefix_transform1-s3-region. But to implement it I have
>>>>>>>>> https://github.com/apache/beam/pull/4683 which needs to be merged
>>>>>>>>> before ;).
>>>>>>>>>
>>>>>>>>> On Mar 8, 2018 at 23:03, "Ismaël Mejía" <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> I was trying to create a really simple pipeline that reads from a
>>>>>>>>>> bucket in a filesystem (s3) and writes to a different bucket in the
>>>>>>>>>> same filesystem.
>>>>>>>>>>
>>>>>>>>>> S3Options options =
>>>>>>>>>>     PipelineOptionsFactory.fromArgs(args).create().as(S3Options.class);
>>>>>>>>>> Pipeline pipeline = Pipeline.create(options);
>>>>>>>>>> pipeline
>>>>>>>>>>     .apply("ReadLines", TextIO.read().from("s3://src-bucket/*"))
>>>>>>>>>>     // .apply("AllOtherMagic", ...)
>>>>>>>>>>     .apply("WriteCounts", TextIO.write().to("s3://dst-bucket/"));
>>>>>>>>>> pipeline.run().waitUntilFinish();
>>>>>>>>>>
>>>>>>>>>> I discovered that my original bucket was in a different region so I
>>>>>>>>>> needed to pass a different S3Options object to the Write
>>>>>>>>>> 'options.setAwsRegion("dst-region")', but I could not find a way to
>>>>>>>>>> do it. Can somebody give me a hint on how to do this?
>>>>>>>>>>
>>>>>>>>>> I was wondering, since file-based IOs use the configuration
>>>>>>>>>> implied by the Filesystem, if this was possible. With non-file-based
>>>>>>>>>> IOs all the configuration details are explicit in each specific
>>>>>>>>>> transform, but this is not the case for these file-based transforms.
>>>>>>>>>>
>>>>>>>>>> Note: I know this question probably belongs more to user@ but since I
>>>>>>>>>> couldn't find an easy way to do it I was wondering if this is an
>>>>>>>>>> issue we should consider at dev@ from an API point of view.
