On Fri, Oct 13, 2023 at 10:08 AM Joey Tran <joey.t...@schrodinger.com>
wrote:

> That makes sense. Would you suggest the new option simply suppress the
> RuntimeError and use the non-unique label?
>

Yes. (Or, rather, not raise it.)


> Are there places on the SDK side that expect unique labels? Or in
> non-updateable runners?
>

That's a good question. The label eventually ends up here:
https://github.com/apache/beam/blob/release-2.51.0/model/pipeline/src/main/proto/org/apache/beam/model/pipeline/v1/beam_runner_api.proto#L142
which is technically a violation of the spec if these are not unique
(though I guess Java already does this?). More importantly, though, the id
of the transform (the proto contains a map of strings -> transforms) must
be unique.


> Would making `--auto_unique_labels` a mutually exclusive option with
> streaming be a reasonable option? Not sure if stream/batch-specific options
> are a good idea or not... but if there's precedent, then it'd be an easy
> option
>

It's not always possible to even tell at pipeline construction whether the
pipeline will be run in streaming mode. (I suppose one could check later?)
It's generally something we try to avoid though.

Another option is to make the suffix a uuid rather than a single counter.
(This would still have issues with the first application possibly getting
mixed up with a "different" first application unless it was always
appended.)


> On Fri, Oct 13, 2023 at 12:52 PM Robert Bradshaw <rober...@google.com>
> wrote:
>
>>
>> Thanks for the PR.
>>
>> I think we should follow Java and allow non-unique labels, but not
>> provide automatic uniquification, In particular, the danger of using a
>> counter is that one can get accidental (and potentially hard to check)
>> off-by-one collisions. As a concrete example, imagine one partitions a
>> dataset into two collections, each followed by a similarly-named transform.
>>
>>     --> B
>>   /
>> A
>>  \
>>    --> B
>>
>> Uniquification would give something like
>>
>>     --> B
>>   /
>> A
>>  \
>>    --> B_2
>>
>> Suppose one then realizes there's a third case to handle, giving
>>
>>     --> B
>>   /
>> A --> B
>>  \
>>    --> B
>>
>> But this would be uniquified to
>>
>>     --> B
>>   /
>> A --> B_2
>>  \
>>    --> B_3
>>
>> where the old B_2 got renamed to B_3 and a new B_2 got put in its place.
>> This is bad because an updating runner would then attribute old B_2's state
>> to the new B_2 (and also possibly mis-direct any inflight messages). At
>> least with the old, intersecting names we can detect this problem
>> rather than silently give corrupt data.
>>
>>
>> On Fri, Oct 13, 2023 at 7:15 AM Joey Tran <joey.t...@schrodinger.com>
>> wrote:
>>
>>> For posterity: https://github.com/apache/beam/pull/28984
>>>
>>> On Tue, Oct 10, 2023 at 7:29 PM Robert Bradshaw <rober...@google.com>
>>> wrote:
>>>
>>>> I would definitely support a PR making this an option. Changing the
>>>> default would be a rather big change that would require more thought.
>>>>
>>>> On Tue, Oct 10, 2023 at 4:24 PM Joey Tran <joey.t...@schrodinger.com>
>>>> wrote:
>>>>
>>>>> Bump on this. Sorry to pester - I'm trying to get a few teams to adopt
>>>>> Apache Beam at my company and I'm trying to foresee parts of the API they
>>>>> might find inconvenient.
>>>>>
>>>>> If there's a conclusion to make the behavior similar to java, I'm
>>>>> happy to put up a PR
>>>>>
>>>>> On Thu, Oct 5, 2023, 12:49 PM Joey Tran <joey.t...@schrodinger.com>
>>>>> wrote:
>>>>>
>>>>>> Is it really toggleable in Java? I imagine that if it's a toggle it'd
>>>>>> be a very sticky toggle since it'd be easy for PTransforms to 
>>>>>> accidentally
>>>>>> rely on it.
>>>>>>
>>>>>> On Thu, Oct 5, 2023 at 12:33 PM Robert Bradshaw <rober...@google.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Huh. This used to be a hard error in Java, but I guess it's
>>>>>>> togglable with an option now. We should probably add the option to 
>>>>>>> toggle
>>>>>>> Python too. (Unclear what the default should be, but this probably ties
>>>>>>> into re-thinking how pipeline update should work.)
>>>>>>>
>>>>>>> On Thu, Oct 5, 2023 at 4:58 AM Joey Tran <joey.t...@schrodinger.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Makes sense that the requirement is the same, but is the label
>>>>>>>> auto-generation behavior the same? I modified the BeamJava
>>>>>>>> wordcount example[1] to do the regex filter twice in a row, and unlike 
>>>>>>>> the
>>>>>>>> BeamPython example I posted before, it just warns instead of throwing 
>>>>>>>> an
>>>>>>>> exception.
>>>>>>>>
>>>>>>>> Tangentially, is it expected that the Beam playground examples
>>>>>>>> don't have a way to see the outputs of a run example? I have a vague 
>>>>>>>> memory
>>>>>>>> that there used to be a way to navigate to an output file after it's
>>>>>>>> generated but not sure if I just dreamt that up. Playing with the 
>>>>>>>> examples,
>>>>>>>> I wasn't positive if my runs were actually succeeding or not based on 
>>>>>>>> the
>>>>>>>> stdout alone.
>>>>>>>>
>>>>>>>> [1] https://play.beam.apache.org/?sdk=java&shared=mI7WUeje_r2
>>>>>>>> <https://play.beam.apache.org/?sdk=java&shared=mI7WUeje_r2>
>>>>>>>> [2] https://play.beam.apache.org/?sdk=python&shared=hIrm7jvCamW
>>>>>>>>
>>>>>>>> On Wed, Oct 4, 2023 at 12:16 PM Robert Bradshaw via user <
>>>>>>>> u...@beam.apache.org> wrote:
>>>>>>>>
>>>>>>>>> BeamJava and BeamPython have the exact same behavior:
>>>>>>>>> transform names within must be distinct [1]. This is because we do not
>>>>>>>>> necessarily know at pipeline construction time if the pipeline will be
>>>>>>>>> streaming or batch, or if it will be updated in the future, so the 
>>>>>>>>> decision
>>>>>>>>> was made to impose this restriction up front. Both will auto-generate 
>>>>>>>>> a
>>>>>>>>> name for you if one is not given, but will do so deterministically 
>>>>>>>>> (not
>>>>>>>>> depending on some global context) to avoid potential update problems.
>>>>>>>>>
>>>>>>>>> [1] Note that this applies to the fully qualified transform name,
>>>>>>>>> so the naming only has to be distinct within a composite transform 
>>>>>>>>> (or at
>>>>>>>>> the top level--the pipeline itself is isomorphic to a single composite
>>>>>>>>> transform).
>>>>>>>>>
>>>>>>>>> On Wed, Oct 4, 2023 at 3:43 AM Joey Tran <
>>>>>>>>> joey.t...@schrodinger.com> wrote:
>>>>>>>>>
>>>>>>>>>> Cross posting this thread to dev@ to see if this is intentional
>>>>>>>>>> behavior or if it's something worth changing for the python SDK
>>>>>>>>>>
>>>>>>>>>> On Tue, Oct 3, 2023, 10:10 PM XQ Hu via user <
>>>>>>>>>> u...@beam.apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>>> That suggests the default label is created as that, which indeed
>>>>>>>>>>> causes the duplication error.
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Oct 3, 2023 at 9:15 PM Joey Tran <
>>>>>>>>>>> joey.t...@schrodinger.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Not sure what that suggests
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Oct 3, 2023, 6:24 PM XQ Hu via user <
>>>>>>>>>>>> u...@beam.apache.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Looks like this is the current behaviour. If you have `t =
>>>>>>>>>>>>> beam.Filter(identity_filter)`, `t.label` is defined as
>>>>>>>>>>>>> `Filter(identity_filter)`.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Oct 2, 2023 at 9:25 AM Joey Tran <
>>>>>>>>>>>>> joey.t...@schrodinger.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> You don't have to specify the names if the callable you pass
>>>>>>>>>>>>>> in is /different/ for two `beam.Map`s, but  if the callable is 
>>>>>>>>>>>>>> the same you
>>>>>>>>>>>>>> must specify a label. For example, the below will raise an 
>>>>>>>>>>>>>> exception:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>         | beam.Filter(identity_filter)
>>>>>>>>>>>>>>         | beam.Filter(identity_filter)
>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Here's an example on playground that shows the error message
>>>>>>>>>>>>>> you get [1]. I marked every line I added with a "# ++".
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> It's a contrived example, but using a map or filter at the
>>>>>>>>>>>>>> same pipeline level probably comes up often, at least in my 
>>>>>>>>>>>>>> inexperience.
>>>>>>>>>>>>>> For example, you. might have a pipeline that partitions a pcoll 
>>>>>>>>>>>>>> into three
>>>>>>>>>>>>>> different pcolls, runs some transforms on them, and then runs 
>>>>>>>>>>>>>> the same type
>>>>>>>>>>>>>> of filter on each of them.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The case that happens most often for me is using the
>>>>>>>>>>>>>> `assert_that` [2] testing transform. In this case, I think often 
>>>>>>>>>>>>>> users will
>>>>>>>>>>>>>> really have no need for a disambiguating label as they're often 
>>>>>>>>>>>>>> just
>>>>>>>>>>>>>> writing unit tests that test a few different properties of their 
>>>>>>>>>>>>>> workflow.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>> https://play.beam.apache.org/?sdk=python&shared=hIrm7jvCamW
>>>>>>>>>>>>>> [2]
>>>>>>>>>>>>>> https://beam.apache.org/releases/pydoc/2.29.0/apache_beam.testing.util.html#apache_beam.testing.util.assert_that
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Oct 2, 2023 at 9:08 AM Bruno Volpato via user <
>>>>>>>>>>>>>> u...@beam.apache.org> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> If I understand the question correctly, you don't have to
>>>>>>>>>>>>>>> specify those names.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> As Reuven pointed out, it is probably a good idea so you
>>>>>>>>>>>>>>> have a stable / deterministic graph.
>>>>>>>>>>>>>>> But in the Python SDK, you can simply use pcollection |
>>>>>>>>>>>>>>> map_fn, instead of pcollection | 'Map' >> map_fn.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> See an example here
>>>>>>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/cookbook/group_with_coder.py#L100-L116
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Sun, Oct 1, 2023 at 9:08 PM Joey Tran <
>>>>>>>>>>>>>>> joey.t...@schrodinger.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hmm, I'm not sure what you mean by "updating pipelines in
>>>>>>>>>>>>>>>> place". Can you elaborate?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I forgot to mention my question is posed from the context
>>>>>>>>>>>>>>>> of a python SDK user, and afaict, there doesn't seem to be an 
>>>>>>>>>>>>>>>> obvious way
>>>>>>>>>>>>>>>> to autogenerate names/labels. Hearing that the java SDK 
>>>>>>>>>>>>>>>> supports it makes
>>>>>>>>>>>>>>>> me wonder if the python SDK could support it as well though... 
>>>>>>>>>>>>>>>> (If so, I'd
>>>>>>>>>>>>>>>> be happy to do implement it). Currently, it's fairly tedious 
>>>>>>>>>>>>>>>> to have to
>>>>>>>>>>>>>>>> name every instance of a transform that you might reuse in a 
>>>>>>>>>>>>>>>> pipeline, e.g.
>>>>>>>>>>>>>>>> when reapplying the same Map on different pcollections.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Sun, Oct 1, 2023 at 8:12 PM Reuven Lax via user <
>>>>>>>>>>>>>>>> u...@beam.apache.org> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Are you talking about transform names? The main reason was
>>>>>>>>>>>>>>>>> because for runners that support updating pipelines in place, 
>>>>>>>>>>>>>>>>> the only way
>>>>>>>>>>>>>>>>> to do so safely is if the runner can perfectly identify which 
>>>>>>>>>>>>>>>>> transforms in
>>>>>>>>>>>>>>>>> the new graph match the ones in the old graph. There's no 
>>>>>>>>>>>>>>>>> good way to auto
>>>>>>>>>>>>>>>>> generate names that will stay stable across updates - even 
>>>>>>>>>>>>>>>>> small changes to
>>>>>>>>>>>>>>>>> the pipeline might change the order of nodes in the graph, 
>>>>>>>>>>>>>>>>> which could
>>>>>>>>>>>>>>>>> result in a corrupted update.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> However, if you don't care about update, Beam can auto
>>>>>>>>>>>>>>>>> generate these names for you! When you call PCollection.apply 
>>>>>>>>>>>>>>>>> (if using
>>>>>>>>>>>>>>>>> BeamJava), simply omit the name parameter and Beam will auto 
>>>>>>>>>>>>>>>>> generate a
>>>>>>>>>>>>>>>>> unique name for you.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Reuven
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Sat, Sep 30, 2023 at 11:54 AM Joey Tran <
>>>>>>>>>>>>>>>>> joey.t...@schrodinger.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> After writing a few pipelines now, I keep getting tripped
>>>>>>>>>>>>>>>>>> up from accidentally have duplicate labels from using 
>>>>>>>>>>>>>>>>>> multiple of the same
>>>>>>>>>>>>>>>>>> transforms without labeling them. I figure this must be a 
>>>>>>>>>>>>>>>>>> common complaint,
>>>>>>>>>>>>>>>>>> so I was just curious, what the rationale behind this design 
>>>>>>>>>>>>>>>>>> was? My naive
>>>>>>>>>>>>>>>>>> thought off the top of my head is that it'd be more user 
>>>>>>>>>>>>>>>>>> friendly to just
>>>>>>>>>>>>>>>>>> auto increment duplicate transforms, but I figure I must be 
>>>>>>>>>>>>>>>>>> missing
>>>>>>>>>>>>>>>>>> something
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>> Joey
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>

Reply via email to