Re: Query about auto-inference of numPartitions for `JdbcIO#readWithPartitions`

2024-05-31 Thread XQ Hu via user
You should be able to configure the number of partitions like this:

https://github.com/GoogleCloudPlatform/dataflow-cookbook/blob/main/Java/src/main/java/jdbc/ReadPartitionsJdbc.java#L132
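
For reference, a minimal sketch of that configuration (the driver, connection
URL, table, and column names below are placeholders; the linked cookbook file
is the authoritative example). The key call is withNumPartitions, which
overrides the default of 200:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.KvCoder;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.jdbc.JdbcIO;
import org.apache.beam.sdk.values.KV;

public class ReadWithExplicitPartitions {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create();

    pipeline.apply(
        "ReadWithPartitions",
        JdbcIO.<KV<String, String>>readWithPartitions()
            .withDataSourceConfiguration(
                JdbcIO.DataSourceConfiguration.create(
                        "org.postgresql.Driver",                  // placeholder driver
                        "jdbc:postgresql://localhost:5432/mydb")  // placeholder URL
                    .withUsername("user")
                    .withPassword("password"))
            .withTable("my_tall_table")   // placeholder table
            .withPartitionColumn("id")    // numeric column used to split ranges
            .withNumPartitions(4000)      // explicit override of the 200 default
            .withRowMapper(
                (JdbcIO.RowMapper<KV<String, String>>)
                    rs -> KV.of(rs.getString("id"), rs.getString("payload")))
            .withCoder(KvCoder.of(StringUtf8Coder.of(), StringUtf8Coder.of())));

    pipeline.run().waitUntilFinish();
  }
}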

The code that auto-infers the number of partitions seems to be unreachable (I
haven't checked this carefully). More details are here:
https://issues.apache.org/jira/browse/BEAM-12456
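
Until that auto-inference path is reachable, one workaround is to size the
partition count yourself from a rough row-count estimate (for example, a
COUNT(*) run before submitting the job). A minimal sketch follows; this is a
hypothetical helper, not Beam's internal heuristic, and the 250k
rows-per-partition target is only an assumption to illustrate the idea:

public final class PartitionSizing {
  // Hypothetical helper (not Beam's internal heuristic): pick a partition
  // count so that each partition reads roughly targetRowsPerPartition rows.
  public static int choosePartitionCount(long estimatedRowCount, long targetRowsPerPartition) {
    // Ceiling division: enough partitions to cover all estimated rows.
    long partitions =
        (estimatedRowCount + targetRowsPerPartition - 1) / targetRowsPerPartition;
    // Clamp to at least 1 and to int range; JdbcIO#withNumPartitions takes an int.
    return (int) Math.max(1L, Math.min(partitions, Integer.MAX_VALUE));
  }
}

// Usage, e.g.:
// .withNumPartitions(PartitionSizing.choosePartitionCount(estimatedRows, 250_000L))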

On Fri, May 31, 2024 at 7:40 AM Vardhan Thigle via user <
user@beam.apache.org> wrote:

> Hi Beam Experts,
>
> I have a small query about `JdbcIO#readWithPartitions`.
>
> Context
>
> JdbcIO#readWithPartitions seems to always default to 200 partitions
> (DEFAULT_NUM_PARTITIONS). This is set by default when the object is
> constructed here.
> There seems to be no way to override this with a null value. Hence it
> seems that the code that checks for a null value and tries to auto-infer
> the number of partitions never runs.
> I am trying to use this for reading a tall table of unknown size, and the
> pipeline always defaults to 200 partitions if the value is not set. The
> default of 200 seems to fall short, as a worker goes out of memory in the
> reshuffle stage. Running with a higher number of partitions, such as 4K,
> helps in my test setup. Since the size is not known at the time of
> implementing the pipeline, the auto-inference might help set maxPartitions
> to a reasonable value as per the heuristic decided by the Beam code.
>
> Request for help
>
> Could you please clarify a few doubts around this?
>
>    1. Is this behavior intentional?
>    2. Could you please explain the rationale behind the heuristic at L1398
>    and DEFAULT_NUM_PARTITIONS=200?
>
>
> I have also raised this as issues/31467 in case it needs any changes in
> the implementation.
>
>
> Regards and Thanks,
> Vardhan Thigle,
> +919535346204
>


Query about auto-inference of numPartitions for `JdbcIO#readWithPartitions`

2024-05-31 Thread Vardhan Thigle via user
Hi Beam Experts,

I have a small query about `JdbcIO#readWithPartitions`.

Context

JdbcIO#readWithPartitions seems to always default to 200 partitions
(DEFAULT_NUM_PARTITIONS). This is set by default when the object is
constructed here.
There seems to be no way to override this with a null value. Hence it seems
that the code that checks for a null value and tries to auto-infer the
number of partitions never runs.
I am trying to use this for reading a tall table of unknown size, and the
pipeline always defaults to 200 partitions if the value is not set. The
default of 200 seems to fall short, as a worker goes out of memory in the
reshuffle stage. Running with a higher number of partitions, such as 4K,
helps in my test setup. Since the size is not known at the time of
implementing the pipeline, the auto-inference might help set maxPartitions
to a reasonable value as per the heuristic decided by the Beam code.
Request for help

Could you please clarify a few doubts around this?

   1. Is this behavior intentional?
   2. Could you please explain the rationale behind the heuristic at L1398
   and DEFAULT_NUM_PARTITIONS=200?


I have also raised this as issues/31467 in case it needs any changes in the
implementation.


Regards and Thanks,
Vardhan Thigle,
+919535346204