> I think the process should be similar to other code/design reviews for
large contributions. I don't think you need PMC involvement here.

I think it does require PMC involvement to create the actual repo once we
have public consensus. I tried the flow at
https://infra.apache.org/version-control.html#create but it seems like it's
PMC-only. It's unclear to me whether consensus has been achieved; maybe a
dedicated voting thread with implied lazy consensus would help here.

> Sure, we could definitely include things as a submodule for stuff like
testing multi-language, though I think there's actually a cleaner way just
using the Swift package manager's test facilities to access the swift sdk
repo.

+1 on avoiding submodules. If needed we could also use a multi-repo checkout
with GitHub Actions. My biggest question, though, is what we'd actually be
enforcing. In general, I'd expect the normal update flow to be

1) Update the Beam protos and/or multi-lang components (though the set of
things that needs to be updated for multi-lang is unclear to me)
2) Mirror those changes to the Swift SDK.

The step most likely to be forgotten is the second one, and that is hard to
enforce with automation: automation attached to the first step has nothing
to check, and automation running on a schedule in the Swift repo is less
likely to be visible. I'm a little worried we wouldn't notice breakages
until release time.

I wonder how much happens outside the proto directory that needs to be
mirrored. Could we just create scheduled automation that copies changes in
the proto directory, plus version changes for the multi-lang components,
verbatim to the Swift SDK repo?

---------------------------------------------------------------------

Regardless, I'm +1 on a dedicated repo; I'd rather we take on some
organizational weirdness than push that pain to users.

Thanks,
Danny

On Wed, Sep 20, 2023 at 1:38 PM Byron Ellis via user <user@beam.apache.org>
wrote:

> Sure, we could definitely include things as a submodule for stuff like
> testing multi-language, though I think there's actually a cleaner way just
> using the Swift package manager's test facilities to access the swift sdk
> repo.
>
>  That would also be consistent with the user-side experience and let us
> test things like build-time integrations with multi-language as well (which
> is possible in Swift through compiler plugins) in the same way as a
> pipeline author would. You also maybe get backwards compatibility testing
> as a side effect in that case as well.
>
> On Wed, Sep 20, 2023 at 10:20 AM Chamikara Jayalath <chamik...@google.com>
> wrote:
>
>>
>>
>>
>> On Wed, Sep 20, 2023 at 9:54 AM Byron Ellis <byronel...@google.com>
>> wrote:
>>
>>> Hi all,
>>>
>>> I've chatted with a couple of people offline about this and my
>>> impression is that folks are generally amenable to a separate repo to match
>>> the target community? I have no idea what the next steps would be though
>>> other than guessing that there's probably some sort of PMC thing involved?
>>> Should I write something up somewhere?
>>>
>>
>> I think the process should be similar to other code/design reviews for
>> large contributions. I don't think you need PMC involvement here.
>>
>>
>>>
>>> Best,
>>> B
>>>
>>> On Thu, Sep 14, 2023 at 9:00 AM Byron Ellis <byronel...@google.com>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I've been on vacation, but mostly working on getting External Transform
>>>> support going (which in turn basically requires Schema support as well). It
>>>> also looks like macros landed in Swift 5.9 for Linux so we'll be able to
>>>> use those to do some compile-time automation. In particular, this lets us
>>>> do something similar to what Java does with ByteBuddy for generating schema
>>>> coders though it has to be ahead of time so not quite the same. (As far as
>>>> I can tell this is a reason why macros got added to the language in the
>>>> first place---Apple's SwiftData library makes heavy use of the feature).
>>>>
>>>> I do have one question for the group though: should the Swift SDK
>>>> distribution take on Beam community properties or Swift community
>>>> properties? Specifically, in the Swift world the Swift SDK would live in
>>>> its own repo (beam-swift for example), which allows it to be most easily
>>>> consumed and keeps the checkout size under control for users. "Releases" in
>>>> the Swift world (much like Go) are just repo tags. The downside here is
>>>> that there's overhead in setting up the various github actions and other
>>>> CI/CD bits and bobs.
>>>>
>>>>
>>
>>> The alternative would be to keep it in the beam repo itself like it is
>>>> now, but we'd probably want to move Package.swift to the root since for
>>>> whatever reason the Swift community (much to some people's annoyance) has
>>>> chosen to have packages only really able to live at the top of a repo. This
>>>> has less overhead from a CI/CD perspective, but lots of overhead for users
>>>> as they'd be checking out the entire Beam repo to use the SDK, which
>>>> happens a lot.
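For concreteness, the "package must live at the repo root" constraint means
a standalone beam-swift repo would carry a top-level manifest roughly like
the following sketch; the target and dependency names are illustrative
guesses, not the SDK's actual layout:

```swift
// swift-tools-version:5.9
// Hypothetical root manifest for a standalone beam-swift repo.
import PackageDescription

let package = Package(
    name: "ApacheBeam",
    platforms: [.macOS(.v13)],
    products: [
        .library(name: "ApacheBeam", targets: ["ApacheBeam"])
    ],
    dependencies: [
        // grpc-swift (and swift-protobuf via it) would likely appear here.
        .package(url: "https://github.com/grpc/grpc-swift.git", from: "1.0.0")
    ],
    targets: [
        .target(name: "ApacheBeam", dependencies: [
            .product(name: "GRPC", package: "grpc-swift")
        ]),
        .testTarget(name: "ApacheBeamTests", dependencies: ["ApacheBeam"])
    ]
)
```

Because `swift package` resolves dependencies by cloning the whole repo at a
tag, this manifest position is also why repo checkout size matters to users.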
>>>>
>>>> There's a third option which is basically "do both" but honestly that
>>>> just seems like the worst of both worlds as it would require constant
>>>> syncing if we wanted to make it possible for Swift users to target
>>>> unreleased SDKs for development and testing.
>>>>
>>>> Personally, I would lean towards the former option (and would volunteer
>>>> to set up & document the various automations) as it is lighter for the
>>>> actual users of the SDK and more consistent with the community experience
>>>> they expect. The CI/CD stuff is mostly a "do it once" whereas checking out
>>>> the entire repo with many updates the user doesn't care about is something
>>>> they will be doing all the time. FWIW some of our dependencies also chose
>>>> this route---most notably GRPC which started with the latter approach and
>>>> has moved to the former.
>>>>
>>>
>> I believe existing SDKs benefit from living in the same repo. For
>> example, it's easier to keep them consistent with any model/proto changes
>> and it's easier to manage distributions/tags. Also it's easier to keep
>> components consistent for multi-lang. If we add Swift to a separate repo,
>> we'll probably have to add tooling/scripts to keep things consistent.
>> Is it possible to create a separate repo, but also add a reference (and
>> Gradle tasks) under "beam/sdks/swift" so that we can add Beam tests to make
>> sure that things stay consistent?
>>
>> Thanks,
>> Cham
>>
>>
>>>
>>>> Interested to hear any feedback on the subject since I'm guessing it
>>>> probably came up with the Go SDK back in the day?
>>>>
>>>> Best,
>>>> B
>>>>
>>>>
>>>>
>>>> On Tue, Aug 29, 2023 at 7:59 AM Byron Ellis <byronel...@google.com>
>>>> wrote:
>>>>
>>>>> After a couple of iterations (thanks rebo!) we've also gotten the
>>>>> Swift SDK working with the new Prism runner. The fact that it doesn't do
>>>>> fusion caught a couple of configuration bugs (e.g. that the grpc message
>>>>> receiver buffer should be fairly large). It would seem that at the moment
>>>>> Prism and the Flink runner have similar orders of strictness when
>>>>> interpreting the pipeline graph while the Python portable runner is far
>>>>> more forgiving.
>>>>>
>>>>> Also added support for bounded vs unbounded pcollections through the
>>>>> "type" parameter when adding a pardo. Impulse is a bounded pcollection I
>>>>> believe?
>>>>>
>>>>> On Fri, Aug 25, 2023 at 2:04 PM Byron Ellis <byronel...@google.com>
>>>>> wrote:
>>>>>
>>>>>> Okay, after a brief detour through "get this working in the Flink
>>>>>> Portable Runner" I think I have something pretty workable.
>>>>>>
>>>>>> PInput and POutput can actually be structs rather than protocols,
>>>>>> which simplifies things quite a bit. It also allows us to use them with
>>>>>> property wrappers for a SwiftUI-like experience if we want when defining
>>>>>> DoFns (which is what I was originally intending to use them for). That also
>>>>>> means the function signature you use for closures would match full-fledged
>>>>>> DoFn definitions for the most part, which is satisfying.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Aug 24, 2023 at 5:55 PM Byron Ellis <byronel...@google.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Okay, I tried a couple of different things.
>>>>>>>
>>>>>>> Implicitly passing the timestamp and window during iteration did not
>>>>>>> go well. While physically possible it introduces an invisible side effect
>>>>>>> into loop iteration which confused me when I tried to use it and I
>>>>>>> implemented it. Also, I'm pretty sure there'd end up being some sort of
>>>>>>> race condition nightmare continuing down that path.
>>>>>>>
>>>>>>> What I decided to do instead was the following:
>>>>>>>
>>>>>>> 1. Rename the existing "pardo" functions to "pstream" and require
>>>>>>> that they always emit a window and timestamp along with their value. This
>>>>>>> eliminates the side effect but lets us keep iteration in a bundle where
>>>>>>> that might be convenient. For example, in my cheesy GCS implementation it
>>>>>>> means that I can keep an OAuth token around for the lifetime of the bundle
>>>>>>> as a local variable, which is convenient. It's a bit more typing for users
>>>>>>> of pstream, but the expectation here is that if you're using pstream
>>>>>>> functions You Know What You Are Doing and most people won't be using it
>>>>>>> directly.
>>>>>>>
>>>>>>> 2. Introduce a new set of pardo functions (I didn't do all of them
>>>>>>> yet, but enough to test the functionality and decide I liked it) which take
>>>>>>> a function signature of (any PInput<InputType>, any POutput<OutputType>).
>>>>>>> PInput takes the (InputType,Date,Window) tuple and converts it into a
>>>>>>> struct with friendlier names. Not strictly necessary, but makes the code
>>>>>>> nicer to read I think. POutput introduces emit functions that optionally
>>>>>>> allow you to specify a timestamp and a window. If you don't for either one
>>>>>>> it will take the timestamp and/or window of the input.
>>>>>>>
>>>>>>> That was pretty pleasant to use, so I think we should
>>>>>>> continue down that path. If you'd like to see it in use, I reimplemented
>>>>>>> map() and flatMap() in terms of this new pardo functionality.
>>>>>>>
>>>>>>> Code has been pushed to the branch/PR if you're interested in taking
>>>>>>> a look.
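[As a non-authoritative illustration of the element-wise pardo shape
described above: the PInput/POutput names and the emit defaulting behavior
are paraphrased from this thread, not taken from the actual branch, so a
map() built on it might reduce to something like the following sketch.]

```swift
// Hypothetical sketch only: signatures are inferred from the discussion,
// not copied from the real SDK source.
extension PCollection {
    func map<Out>(name: String = "map",
                  _ fn: @escaping (Of) -> Out) -> PCollection<Out> {
        // The 1:1 input/output correlation is preserved because emit()
        // defaults to the input element's timestamp and window.
        pardo(name: name) { (input: any PInput<Of>, output: any POutput<Out>) in
            output.emit(fn(input.value))
        }
    }
}
```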
>>>>>>>
>>>>>>> On Thu, Aug 24, 2023 at 2:15 PM Byron Ellis <byronel...@google.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Gotcha, I think there's a fairly easy solution to link input and
>>>>>>>> output streams.... Let me try it out... might even be possible to have both
>>>>>>>> element and stream-wise closure pardos. Definitely possible to have that at
>>>>>>>> the DoFn level (called SerializableFn in the SDK because I want to
>>>>>>>> use @DoFn as a macro)
>>>>>>>>
>>>>>>>> On Thu, Aug 24, 2023 at 1:09 PM Robert Bradshaw <
>>>>>>>> rober...@google.com> wrote:
>>>>>>>>
>>>>>>>>> On Thu, Aug 24, 2023 at 12:58 PM Chamikara Jayalath <
>>>>>>>>> chamik...@google.com> wrote:
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, Aug 24, 2023 at 12:27 PM Robert Bradshaw <
>>>>>>>>>> rober...@google.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> I would like to figure out a way to get the stream-y interface
>>>>>>>>>>> to work, as I think it's more natural overall.
>>>>>>>>>>>
>>>>>>>>>>> One hypothesis is that if any elements are carried over loop
>>>>>>>>>>> iterations, there will likely be some that are carried over beyond the loop
>>>>>>>>>>> (after all the callee doesn't know when the loop is supposed to end). We
>>>>>>>>>>> could reject "plain" elements that are emitted after this point, requiring
>>>>>>>>>>> one to emit timestamp-windowed-values.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Are you assuming that the same stream (or overlapping sets of
>>>>>>>>>> data) are pushed to multiple workers? I thought that the set of data
>>>>>>>>>> streamed here are the data that belong to the current bundle (hence already
>>>>>>>>>> assigned to the current worker) so any output from the current bundle
>>>>>>>>>> invocation would be a valid output of that bundle.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>> Yes, the content of the stream is exactly the contents of the
>>>>>>>>> bundle. The question is how to do the input_element:output_element
>>>>>>>>> correlation for automatically propagating metadata.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Related to this, we could enforce that the only (user-accessible)
>>>>>>>>>>> way to get such a timestamped value is to start with one, e.g. a
>>>>>>>>>>> WindowedValue<T>.withValue(O) produces a WindowedValue<O> with the same
>>>>>>>>>>> metadata but a new value. Thus a user wanting to do anything "fancy" would
>>>>>>>>>>> have to explicitly request iteration over these windowed values rather than
>>>>>>>>>>> over the raw elements. (This is also forward compatible with expanding the
>>>>>>>>>>> metadata that can get attached, e.g. pane infos, and makes the right thing
>>>>>>>>>>> the easiest/most natural.)
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Aug 24, 2023 at 12:10 PM Byron Ellis <
>>>>>>>>>>> byronel...@google.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Ah, that is a good point—being element-wise would make managing
>>>>>>>>>>>> windows and time stamps easier for the user. Fortunately it’s a fairly easy
>>>>>>>>>>>> change to make and maybe even less typing for the user. I was originally
>>>>>>>>>>>> thinking side inputs and metrics would happen outside the loop, but I think
>>>>>>>>>>>> you want a class and not a closure at that point for sanity.
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Aug 24, 2023 at 12:02 PM Robert Bradshaw <
>>>>>>>>>>>> rober...@google.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Ah, I see.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Yeah, I've thought about using an iterable for the whole
>>>>>>>>>>>>> bundle rather than start/finish bundle callbacks, but one of the questions
>>>>>>>>>>>>> is how that would impact implicit passing of the timestamp (and other)
>>>>>>>>>>>>> metadata from input elements to output elements. (You can of course attach
>>>>>>>>>>>>> the metadata to any output that happens in the loop body, but it's very
>>>>>>>>>>>>> easy to implicitly break the 1:1 relationship here (e.g. by doing
>>>>>>>>>>>>> buffering or otherwise modifying local state) and this would be hard to
>>>>>>>>>>>>> detect.) (I suppose trying to output after the loop finishes could require
>>>>>>>>>>>>> something more explicit.)
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Aug 23, 2023 at 6:56 PM Byron Ellis <
>>>>>>>>>>>>> byronel...@google.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Oh, I also forgot to mention that I included element-wise
>>>>>>>>>>>>>> collection operations like "map" that eliminate the need for pardo in many
>>>>>>>>>>>>>> cases. The groupBy command is actually a map + groupByKey under the hood.
>>>>>>>>>>>>>> That was to be more consistent with Swift's collection protocol (and is
>>>>>>>>>>>>>> also why PCollection and PCollectionStream are different types...
>>>>>>>>>>>>>> PCollection implements map and friends as pipeline construction operations
>>>>>>>>>>>>>> whereas PCollectionStream is an actual stream)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I just happened to push some "IO primitives" that uses map
>>>>>>>>>>>>>> rather than pardo in a couple of places to do a true wordcount using good
>>>>>>>>>>>>>> ol' Shakespeare and very, very primitive GCS IO.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>> B
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Aug 23, 2023 at 6:08 PM Byron Ellis <
>>>>>>>>>>>>>> byronel...@google.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Indeed :-) Yeah, I went back and forth on the pardo syntax
>>>>>>>>>>>>>>> quite a bit before settling on where I ended up. Ultimately I decided to go
>>>>>>>>>>>>>>> with something that felt more Swift-y than anything else, which means that
>>>>>>>>>>>>>>> rather than dealing with a single element like you do in the other SDKs
>>>>>>>>>>>>>>> you're dealing with a stream of elements (which of course will often be of
>>>>>>>>>>>>>>> size 1). That's a really natural paradigm in the Swift world especially
>>>>>>>>>>>>>>> with the async / await structures. So when you see something like:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> pardo(name:"Read Files") { filenames,output,errors in
>>>>>>>>>>>>>>>   for try await (filename,_,_) in filenames {
>>>>>>>>>>>>>>>     ...
>>>>>>>>>>>>>>>     output.emit(data)
>>>>>>>>>>>>>>>   }
>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> filenames is the input stream and then output and errors are
>>>>>>>>>>>>>>> both output streams. In theory you can have as many output streams as you
>>>>>>>>>>>>>>> like, though at the moment there's a compiler bug in the new type pack
>>>>>>>>>>>>>>> feature that limits it to "as many as I felt like supporting". Presumably
>>>>>>>>>>>>>>> this will get fixed before the official 5.9 release, which will probably be
>>>>>>>>>>>>>>> in the October timeframe if history is any guide.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> If you had parameterization you wanted to send that would
>>>>>>>>>>>>>>> look like pardo("Parameter") { param,filenames,output,error in ... } where
>>>>>>>>>>>>>>> "param" would take on the value of "Parameter." All of this is being
>>>>>>>>>>>>>>> typechecked at compile time BTW.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The (filename,_,_) is a tuple spreading construct like you
>>>>>>>>>>>>>>> have in ES6 and other things where "_" is Swift for "ignore." In this case
>>>>>>>>>>>>>>> PCollectionStreams have an element signature of (Of,Date,Window) so you can
>>>>>>>>>>>>>>> optionally extract the timestamp and the window if you want to manipulate
>>>>>>>>>>>>>>> it somehow.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> That said it would also be natural to provide elementwise
>>>>>>>>>>>>>>> pardos--- that would probably mean having explicit type signatures in the
>>>>>>>>>>>>>>> closure. I had that at one point, but it felt less natural the more I used
>>>>>>>>>>>>>>> it. I'm also slowly working towards adding a more "traditional" DoFn
>>>>>>>>>>>>>>> implementation approach where you implement the DoFn as an object type. In
>>>>>>>>>>>>>>> that case it would be very, very easy to support both by having a default
>>>>>>>>>>>>>>> stream implementation call the equivalent of processElement. To make that
>>>>>>>>>>>>>>> performant I need to implement an @DoFn macro and I just haven't gotten to
>>>>>>>>>>>>>>> it yet.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> It's a bit more work and I've been prioritizing implementing
>>>>>>>>>>>>>>> composite and external transforms for the reasons you suggest. :-) I've got
>>>>>>>>>>>>>>> the basics of a composite transform (there's an equivalent wordcount
>>>>>>>>>>>>>>> example) and am hooking it into the pipeline generation, which should also
>>>>>>>>>>>>>>> give me everything I need to successfully hook in external transforms as
>>>>>>>>>>>>>>> well. That will give me the jump on IOs as you say. I can also treat the
>>>>>>>>>>>>>>> pipeline itself as a composite transform which lets me get rid of the
>>>>>>>>>>>>>>> Pipeline { pipeline in ... } and just instead have things attach themselves
>>>>>>>>>>>>>>> to the pipeline implicitly.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> That said, there are some interesting IO possibilities that
>>>>>>>>>>>>>>> would be Swift native. In particular, I've been looking at the native
>>>>>>>>>>>>>>> Swift binding for DuckDB (which is C++ based). DuckDB is SQL based but not
>>>>>>>>>>>>>>> distributed in the same way as, say, Beam SQL... but it would allow for SQL
>>>>>>>>>>>>>>> statements on individual files with projection pushdown supported for
>>>>>>>>>>>>>>> things like Parquet which could have some cool and performant data lake
>>>>>>>>>>>>>>> applications. I'll probably do a couple of the simpler IOs as
>>>>>>>>>>>>>>> well---there's a Swift AWS SDK binding that's pretty good that would give
>>>>>>>>>>>>>>> me S3 and there's a Cloud auth library as well that makes it pretty easy to
>>>>>>>>>>>>>>> work with GCS.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> In any case, I'm updating the branch as I find a minute here
>>>>>>>>>>>>>>> and there.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>> B
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Aug 23, 2023 at 5:02 PM Robert Bradshaw <
>>>>>>>>>>>>>>> rober...@google.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Neat.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Nothing like writing an SDK to actually understand how the
>>>>>>>>>>>>>>>> FnAPI works :). I like the use of groupBy. I have to admit I'm a bit
>>>>>>>>>>>>>>>> mystified by the syntax for parDo (I don't know Swift at all, which is
>>>>>>>>>>>>>>>> probably tripping me up). The addition of external (cross-language)
>>>>>>>>>>>>>>>> transforms could let you steal everything (e.g. IOs) pretty quickly from
>>>>>>>>>>>>>>>> other SDKs.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Fri, Aug 18, 2023 at 7:55 AM Byron Ellis via user <
>>>>>>>>>>>>>>>> user@beam.apache.org> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> For everyone who is interested, here's the draft PR:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> https://github.com/apache/beam/pull/28062
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I haven't had a chance to test it on my M1 machine yet
>>>>>>>>>>>>>>>>> though (there's a good chance there are a few places that need to properly
>>>>>>>>>>>>>>>>> address endianness, specifically timestamps in windowed values and length
>>>>>>>>>>>>>>>>> in iterable coders, as those both use big-endian representations).
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Thu, Aug 17, 2023 at 8:57 PM Byron Ellis <
>>>>>>>>>>>>>>>>> byronel...@google.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks Cham,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Definitely happy to open a draft PR so folks can
>>>>>>>>>>>>>>>>>> comment---there's not as much code as it looks like since most of the LOC
>>>>>>>>>>>>>>>>>> is just generated protobuf. As for the support, I definitely want to add
>>>>>>>>>>>>>>>>>> external transforms and may actually add that support before adding the
>>>>>>>>>>>>>>>>>> ability to make composites in the language itself. With the way the SDK is
>>>>>>>>>>>>>>>>>> laid out, adding composites to the pipeline graph is a separate operation
>>>>>>>>>>>>>>>>>> from defining a composite.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Thu, Aug 17, 2023 at 4:28 PM Chamikara Jayalath <
>>>>>>>>>>>>>>>>>> chamik...@google.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks Byron. This sounds great. I wonder if there is
>>>>>>>>>>>>>>>>>>> interest in Swift SDK from folks currently subscribed to the
>>>>>>>>>>>>>>>>>>> +user <user@beam.apache.org> list.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Wed, Aug 16, 2023 at 6:53 PM Byron Ellis via dev <
>>>>>>>>>>>>>>>>>>> d...@beam.apache.org> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hello everyone,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> A couple of months ago I decided that I wanted to
>>>>>>>>>>>>>>>>>>>> really understand how the Beam FnApi works and how it interacts with the
>>>>>>>>>>>>>>>>>>>> Portable Runner. For me at least that usually means I need to write some
>>>>>>>>>>>>>>>>>>>> code so I can see things happening in a debugger, and to really prove to
>>>>>>>>>>>>>>>>>>>> myself I understood what was going on I decided I couldn't use an existing
>>>>>>>>>>>>>>>>>>>> SDK language to do it since there would be the temptation to read some code
>>>>>>>>>>>>>>>>>>>> and convince myself that I actually understood what was going on.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> One thing led to another and it turns out that to get a
>>>>>>>>>>>>>>>>>>>> minimal FnApi integration going you end up writing a fair bit of an SDK. So
>>>>>>>>>>>>>>>>>>>> I decided to take things to a point where I had an SDK that could execute a
>>>>>>>>>>>>>>>>>>>> word count example via a portable runner backend. I've now reached that
>>>>>>>>>>>>>>>>>>>> point and would like to submit my prototype SDK to the list for feedback.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> It's currently living in a branch on my fork here:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> https://github.com/byronellis/beam/tree/swift-sdk/sdks/swift
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> At the moment it runs via the most recent XCode Beta
>>>>>>>>>>>>>>>>>>>> using Swift 5.9 on Intel Macs, but should also work using beta builds of
>>>>>>>>>>>>>>>>>>>> 5.9 for Linux running on Intel hardware. I haven't had a chance to try it
>>>>>>>>>>>>>>>>>>>> on ARM hardware and make sure all of the endian checks are complete. The
>>>>>>>>>>>>>>>>>>>> "IntegrationTests.swift" file contains a word count example that reads some
>>>>>>>>>>>>>>>>>>>> local files (as well as a missing file to exercise DLQ functionality) and
>>>>>>>>>>>>>>>>>>>> outputs counts through two separate group-by operations to get it past the
>>>>>>>>>>>>>>>>>>>> "map reduce" size of pipeline. I've tested it against the Python Portable
>>>>>>>>>>>>>>>>>>>> Runner. Since my goal was to learn FnApi there is no Direct Runner at this
>>>>>>>>>>>>>>>>>>>> time.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I've shown it to a couple of folks already and
>>>>>>>>>>>>>>>>>>>> incorporated some of that feedback (for example pardo was
>>>>>>>>>>>>>>>>>>>> originally called dofn when defining pipelines). In general I've tried to
>>>>>>>>>>>>>>>>>>>> make the API as "Swift-y" as possible, hence the heavy reliance on closures,
>>>>>>>>>>>>>>>>>>>> and while there aren't yet composite PTransforms there's the beginnings of
>>>>>>>>>>>>>>>>>>>> what would be needed for a SwiftUI-like declarative API for creating them.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> There are of course a ton of missing bits still to be
>>>>>>>>>>>>>>>>>>>> implemented, like counters, metrics, windowing, state, 
>>>>>>>>>>>>>>>>>>>> timers, etc.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> This should be fine and we can get the code documented
>>>>>>>>>>>>>>>>>>> without these features. I think support for composites and adding an
>>>>>>>>>>>>>>>>>>> external transform (see, Java
>>>>>>>>>>>>>>>>>>> <https://github.com/apache/beam/blob/master/runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/External.java>,
>>>>>>>>>>>>>>>>>>> Python
>>>>>>>>>>>>>>>>>>> <https://github.com/apache/beam/blob/c7b7921185686da573f76ce7320817c32375c7d0/sdks/python/apache_beam/transforms/external.py#L556>,
>>>>>>>>>>>>>>>>>>> Go
>>>>>>>>>>>>>>>>>>> <https://github.com/apache/beam/blob/c7b7921185686da573f76ce7320817c32375c7d0/sdks/go/pkg/beam/xlang.go#L155>,
>>>>>>>>>>>>>>>>>>> TypeScript
>>>>>>>>>>>>>>>>>>> <https://github.com/apache/beam/blob/master/sdks/typescript/src/apache_beam/transforms/external.ts>)
>>>>>>>>>>>>>>>>>>> to add support for multi-lang will bring in a lot of features (for example,
>>>>>>>>>>>>>>>>>>> I/O connectors) for free.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Any and all feedback welcome and happy to submit a PR
>>>>>>>>>>>>>>>>>>>> if folks are interested, though the "Swift Way" would be to have it in its
>>>>>>>>>>>>>>>>>>>> own repo so that it can easily be used from the Swift Package Manager.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> +1 for creating a PR (maybe as a draft initially). Also
>>>>>>>>>>>>>>>>>>> it'll be easier to comment on a PR :)
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> - Cham
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>>> B
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
