Re: Does withkeys transform enforce a reshuffle?

2024-01-19 Thread Robert Bradshaw via user
Reshuffle is perfectly fine to use if the goal is just to redistribute
work. It's only deprecated as a "checkpointing" mechanism.

On Fri, Jan 19, 2024 at 9:44 AM Danny McCormick via user wrote:
>
> For runners that support Reshuffle, it should be safe to use. It's been
> "deprecated" for 7 years, but is still heavily used and is often the
> recommended way to do things like this. I actually just added a PR to
> undeprecate it earlier today. Looks like you're using Dataflow, which has
> always supported Reshuffle.
>
> > Also, I looked at the code, and Reshuffle seems to do some GroupBy work
> > internally. But I don't really need a GroupBy.
>
> GroupBy is basically an implementation detail that creates the desired
> shuffling behavior in many runners (runners can also override transform
> implementations if needed for some primitives like this, but that's another
> can of worms). In order to prevent fusion you need some operation that
> forces a shuffle, and GroupBy is one option.
>
> Given that you're using Dataflow, I'd also recommend checking out
> https://cloud.google.com/dataflow/docs/pipeline-lifecycle#prevent_fusion
> which describes how to do this in more detail.
>
> Thanks,
> Danny
>
> On Fri, Jan 19, 2024 at 12:36 PM hsy...@gmail.com  wrote:
>>
>> Also, I looked at the code, and Reshuffle seems to do some GroupBy work
>> internally. But I don't really need a GroupBy.
>>
>> On Fri, Jan 19, 2024 at 9:35 AM hsy...@gmail.com wrote:
>>>
>>> Reshuffle is deprecated
>>>
>>> On Fri, Jan 19, 2024 at 8:25 AM XQ Hu via user wrote:
>>>>
>>>> From just checking the doc here, I do not think it enforces a reshuffle:
>>>> https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.util.html?highlight=withkeys#apache_beam.transforms.util.WithKeys
>>>>
>>>> Have you tried just adding Reshuffle after PubsubLiteIO?
>>>>
>>>> On Thu, Jan 18, 2024 at 8:54 PM hsy...@gmail.com wrote:
>>>>>
>>>>> Hey guys,
>>>>>
>>>>> I have a question: does the WithKeys transform enforce a reshuffle?
>>>>>
>>>>> My pipeline basically looks like this: PubsubLiteIO -> ParDo(..) ->
>>>>> ParDo() -> BigqueryIO.write()
>>>>>
>>>>> The problem is that PubsubLiteIO -> ParDo(..) -> ParDo() are always
>>>>> fused together. But the ParDo is expensive and I want Dataflow to use
>>>>> more workers for it. What's the best way to do that?
>>>>>
>>>>> Regards,


Re: Does withkeys transform enforce a reshuffle?

2024-01-19 Thread Danny McCormick via user
For runners that support Reshuffle, it should be safe to use. It's been
"deprecated" for 7 years, but is still heavily used and is often the
recommended way to do things like this. I actually just added a PR to
undeprecate it earlier today. Looks like you're using Dataflow, which has
always supported Reshuffle.

> Also, I looked at the code, and Reshuffle seems to do some GroupBy work
> internally. But I don't really need a GroupBy.

GroupBy is basically an implementation detail that creates the desired
shuffling behavior in many runners (runners can also override transform
implementations if needed for some primitives like this, but that's another
can of worms). In order to prevent fusion you need some operation that
forces a shuffle, and GroupBy is one option.
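
To make the "implementation detail" point concrete, here is a rough,
runner-free sketch in plain Python of the general idea behind a reshuffle:
pair each element with a throwaway key, group by that key, then drop the
keys again. The names `conceptual_reshuffle`, `num_buckets` and `seed` are
made up for illustration. This is not Beam's actual implementation, only a
model of why a GroupBy shows up inside it even when the user never asked
for one.

```python
import random
from collections import defaultdict


def conceptual_reshuffle(elements, num_buckets=4, seed=None):
    """Model of a reshuffle: key, group, then ungroup.

    Hypothetical helper for illustration only; not Beam code.
    """
    rng = random.Random(seed)

    # Step 1: attach a throwaway key to every element.
    keyed = [(rng.randrange(num_buckets), element) for element in elements]

    # Step 2: group by key. In a real runner this is the shuffle boundary
    # where elements can be redistributed across workers.
    groups = defaultdict(list)
    for key, element in keyed:
        groups[key].append(element)

    # Step 3: drop the keys again so downstream steps see plain elements.
    return [element for key in sorted(groups) for element in groups[key]]


if __name__ == "__main__":
    print(conceptual_reshuffle(["a", "b", "c", "d", "e"], seed=0))
```

The output holds the same elements as the input, possibly in a different
order; the grouping exists only to create the boundary, not to aggregate
anything.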

Given that you're using Dataflow, I'd also recommend checking out
https://cloud.google.com/dataflow/docs/pipeline-lifecycle#prevent_fusion
which describes how to do this in more detail.

Thanks,
Danny

On Fri, Jan 19, 2024 at 12:36 PM hsy...@gmail.com  wrote:

> Also, I looked at the code, and Reshuffle seems to do some GroupBy work
> internally. But I don't really need a GroupBy.
>
> On Fri, Jan 19, 2024 at 9:35 AM hsy...@gmail.com wrote:
>
>> Reshuffle is deprecated
>>
>> On Fri, Jan 19, 2024 at 8:25 AM XQ Hu via user wrote:
>>
>>> From just checking the doc here, I do not think it enforces a reshuffle:
>>> https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.util.html?highlight=withkeys#apache_beam.transforms.util.WithKeys
>>>
>>> Have you tried just adding Reshuffle after PubsubLiteIO?
>>>
>>> On Thu, Jan 18, 2024 at 8:54 PM hsy...@gmail.com wrote:
>>>
>>>> Hey guys,
>>>>
>>>> I have a question: does the WithKeys transform enforce a reshuffle?
>>>>
>>>> My pipeline basically looks like this: PubsubLiteIO -> ParDo(..) ->
>>>> ParDo() -> BigqueryIO.write()
>>>>
>>>> The problem is that PubsubLiteIO -> ParDo(..) -> ParDo() are always
>>>> fused together. But the ParDo is expensive and I want Dataflow to use
>>>> more workers for it. What's the best way to do that?
>>>>
>>>> Regards,




Re: Does withkeys transform enforce a reshuffle?

2024-01-19 Thread hsy...@gmail.com
Also, I looked at the code, and Reshuffle seems to do some GroupBy work
internally. But I don't really need a GroupBy.

On Fri, Jan 19, 2024 at 9:35 AM hsy...@gmail.com  wrote:

> Reshuffle is deprecated
>
> On Fri, Jan 19, 2024 at 8:25 AM XQ Hu via user wrote:
>
>> From just checking the doc here, I do not think it enforces a reshuffle:
>> https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.util.html?highlight=withkeys#apache_beam.transforms.util.WithKeys
>>
>> Have you tried just adding Reshuffle after PubsubLiteIO?
>>
>> On Thu, Jan 18, 2024 at 8:54 PM hsy...@gmail.com wrote:
>>
>>> Hey guys,
>>>
>>> I have a question: does the WithKeys transform enforce a reshuffle?
>>>
>>> My pipeline basically looks like this: PubsubLiteIO -> ParDo(..) ->
>>> ParDo() -> BigqueryIO.write()
>>>
>>> The problem is that PubsubLiteIO -> ParDo(..) -> ParDo() are always
>>> fused together. But the ParDo is expensive and I want Dataflow to use
>>> more workers for it. What's the best way to do that?
>>>
>>> Regards,
>>>
>>>


Re: Does withkeys transform enforce a reshuffle?

2024-01-19 Thread hsy...@gmail.com
Reshuffle is deprecated

On Fri, Jan 19, 2024 at 8:25 AM XQ Hu via user  wrote:

> From just checking the doc here, I do not think it enforces a reshuffle:
> https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.util.html?highlight=withkeys#apache_beam.transforms.util.WithKeys
>
> Have you tried just adding Reshuffle after PubsubLiteIO?
>
> On Thu, Jan 18, 2024 at 8:54 PM hsy...@gmail.com wrote:
>
>> Hey guys,
>>
>> I have a question: does the WithKeys transform enforce a reshuffle?
>>
>> My pipeline basically looks like this: PubsubLiteIO -> ParDo(..) ->
>> ParDo() -> BigqueryIO.write()
>>
>> The problem is that PubsubLiteIO -> ParDo(..) -> ParDo() are always
>> fused together. But the ParDo is expensive and I want Dataflow to use
>> more workers for it. What's the best way to do that?
>>
>> Regards,
>>
>>


Re: Does withkeys transform enforce a reshuffle?

2024-01-19 Thread XQ Hu via user
From just checking the doc here, I do not think it enforces a reshuffle:
https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.util.html?highlight=withkeys#apache_beam.transforms.util.WithKeys

Have you tried just adding Reshuffle after PubsubLiteIO?

On Thu, Jan 18, 2024 at 8:54 PM hsy...@gmail.com  wrote:

> Hey guys,
>
> I have a question: does the WithKeys transform enforce a reshuffle?
>
> My pipeline basically looks like this: PubsubLiteIO -> ParDo(..) ->
> ParDo() -> BigqueryIO.write()
>
> The problem is that PubsubLiteIO -> ParDo(..) -> ParDo() are always fused
> together. But the ParDo is expensive and I want Dataflow to use more
> workers for it. What's the best way to do that?
>
> Regards,
>
>


Does withkeys transform enforce a reshuffle?

2024-01-18 Thread hsy...@gmail.com
Hey guys,

I have a question: does the WithKeys transform enforce a reshuffle?

My pipeline basically looks like this: PubsubLiteIO -> ParDo(..) ->
ParDo() -> BigqueryIO.write()

The problem is that PubsubLiteIO -> ParDo(..) -> ParDo() are always fused
together. But the ParDo is expensive and I want Dataflow to use more
workers for it. What's the best way to do that?

Regards,