This is your daily summary of Beam's current high priority issues that may need
attention.
See https://beam.apache.org/contribute/issue-priorities for the meaning and
expectations around issue priorities.
Unassigned P1 Issues:
https://github.com/apache/beam/issues/23179 [Bug]: Parquet size
I'm wondering if it would make sense to have a built-in Beam transformation
for calculating the Cartesian product of PCollections.
Just this past week, I've encountered two separate cases where calculating
a Cartesian product was a bottleneck. The in-memory option of using
something like Python's
In SQL we just don't support cross joins currently [1]. I'm not aware of an
existing implementation of a cross join/cartesian product.
> My team has an internal implementation of a CartesianProduct transform,
based on using hashing to split a pcollection into a finite number of
groups and CoGroupB
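For readers trying to picture the hashing approach, here is a rough pure-Python sketch of one way such a scheme can work. It is not the internal implementation mentioned above, which is not shown in this thread; the function names, bucket counts, and the use of zlib.crc32 are all assumptions made for illustration. In a Beam pipeline, the dictionary grouping below would roughly correspond to the CoGroupByKey step.

```python
import itertools
import zlib
from collections import defaultdict


def stable_bucket(element, num_buckets):
    """Map an element to a bucket using a hash that is stable across processes."""
    return zlib.crc32(repr(element).encode("utf-8")) % num_buckets


def sharded_cross_product(left, right, left_buckets=3, right_buckets=2):
    """Compute left x right by grouping elements under (i, j) bucket keys.

    Hypothetical helper: the dict of groups stands in for a CoGroupByKey.
    """
    groups = defaultdict(lambda: ([], []))
    # Each left element lives in one row i and is replicated to every column j.
    for x in left:
        i = stable_bucket(x, left_buckets)
        for j in range(right_buckets):
            groups[(i, j)][0].append(x)
    # Each right element lives in one column j and is replicated to every row i.
    for y in right:
        j = stable_bucket(y, right_buckets)
        for i in range(left_buckets):
            groups[(i, j)][1].append(y)
    # Every (x, y) pair meets in exactly one group: (bucket(x), bucket(y)).
    for xs, ys in groups.values():
        yield from itertools.product(xs, ys)
```

Each group only holds a fraction of either input, so no single worker needs to see a whole side, at the cost of replicating every element across one row or column of buckets.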
If one of your inputs fits into memory, using side inputs is
definitely the way to go. If neither side fits into memory, the cross
product may be prohibitively large to compute even on a distributed
computing platform (a billion times a billion is big, though I suppose
one may hit memory limits wit
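As a minimal sketch of the side-input suggestion, assuming the Beam Python SDK is installed and that the right-hand PCollection fits in memory on each worker: the helper names pair_with_all and cross_product_with_side_input are hypothetical, while beam.FlatMap and beam.pvalue.AsList are existing Beam APIs.

```python
def pair_with_all(left_element, right_elements):
    """Yield (left, right) for every element of the in-memory right side."""
    for right_element in right_elements:
        yield (left_element, right_element)


def cross_product_with_side_input(left, right):
    """Cross PCollection `left` with PCollection `right` via a side input.

    Assumes `right` is small enough to be materialized as a list per worker.
    """
    # Imported lazily so pair_with_all can be used and tested without Beam.
    import apache_beam as beam

    return left | "CrossWithSideInput" >> beam.FlatMap(
        pair_with_all, right_elements=beam.pvalue.AsList(right))
```

Inside a pipeline this would be called as cross_product_with_side_input(big, small), with the larger collection as the main input.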
>
> > > My team has an internal implementation of a CartesianProduct
> transform, based on using hashing to split a pcollection into a finite
> number of groups and CoGroupByKey.
> >
> > Could this be contributed to Beam?
>
If it would be of broader interest, I would be happy to work on this for
t
On Mon, Sep 19, 2022 at 1:53 PM Stephan Hoyer wrote:
>>
>> > > My team has an internal implementation of a CartesianProduct transform,
>> > > based on using hashing to split a pcollection into a finite number of
>> > > groups and CoGroupByKey.
>> >
>> > Could this be contributed to Beam?
>
>
> I
Many of my Beam pipelines start with partitioning over some large,
statically known number of inputs that could be created from a list of
sequential integers.
In Python, these sequential integers can be efficiently represented with a
range() object, which stores only the start, stop, and step. However
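To illustrate why range() is a compact representation, here is a small sketch that splits a large range into contiguous shards without materializing any integers. The helper name split_range and the shard count are assumptions for illustration and are not part of Beam.

```python
def split_range(r, num_shards):
    """Split a range into at most num_shards contiguous sub-ranges.

    Slicing a range returns another range, so nothing is materialized.
    """
    # Ceiling division, with a floor of 1 so an empty range does not yield step 0.
    shard_size = max(1, -(-len(r) // num_shards))
    return [r[i:i + shard_size] for i in range(0, len(r), shard_size)]


# A range covering 2 * 10**11 integers still costs constant memory.
big = range(0, 10**12, 5)
shards = split_range(big, 4)
```

Each shard is itself a range, so it can be passed around cheaply and only expanded where the work actually happens.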
I got to thinking about this again and ran some benchmarks. The result is
documented in the GitHub issue [1].
tl;dr: we can't realize a huge benefit since we don't actually have an
out-of-band path for exchanging the buffers. However, pickle 5 can yield
improved in-band performance as well, and I
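For context on the in-band versus out-of-band distinction, here is a minimal standard-library sketch of pickle protocol 5. It is not the benchmark from the GitHub issue; the payload size is an arbitrary assumption. Out-of-band only pays off when the buffers can travel by a separate path, which is the piece described as missing above.

```python
import pickle

payload = bytearray(b"x" * 1_000_000)

# In-band: the buffer contents are copied into the pickle stream itself.
in_band = pickle.dumps(pickle.PickleBuffer(payload), protocol=5)

# Out-of-band: the stream holds only a placeholder, and the buffer is handed
# to buffer_callback so the caller can ship it separately. list.append
# returns None, which tells pickle to keep the buffer out of the stream.
buffers = []
out_of_band = pickle.dumps(
    pickle.PickleBuffer(payload), protocol=5, buffer_callback=buffers.append)

# Loading out-of-band data requires supplying the same buffers, in order.
restored_in_band = pickle.loads(in_band)
restored_out_of_band = pickle.loads(out_of_band, buffers=buffers)
```

The out-of-band stream is only a few bytes long, while the in-band stream is slightly larger than the payload.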