date:20220919

Beam High Priority Issue Report (74)

2022-09-19 Thread beamactions

This is your daily summary of Beam's current high priority issues that may need attention. See https://beam.apache.org/contribute/issue-priorities for the meaning and expectations around issue priorities. Unassigned P1 Issues: https://github.com/apache/beam/issues/23179 [Bug]: Parquet size

Cartesian product of PCollections

2022-09-19 Thread Stephan Hoyer via dev

I'm wondering if it would make sense to have a built-in Beam transformation for calculating the Cartesian product of PCollections. Just this past week, I've encountered two separate cases where calculating a Cartesian product was a bottleneck. The in-memory option of using something like Python's

Re: Cartesian product of PCollections

2022-09-19 Thread Brian Hulette via dev

In SQL we just don't support cross joins currently [1]. I'm not aware of an existing implementation of a cross join/cartesian product.
> My team has an internal implementation of a CartesianProduct transform, based on using hashing to split a pcollection into a finite number of groups and CoGroupB

Re: Cartesian product of PCollections

2022-09-19 Thread Robert Bradshaw via dev

If one of your inputs fits into memory, using side inputs is definitely the way to go. If neither side fits into memory, the cross product may be prohibitively large to compute even on a distributed computing platform (a billion times a billion is big, though I suppose one may hit memory limits wit

Re: Cartesian product of PCollections

2022-09-19 Thread Stephan Hoyer via dev

> > My team has an internal implementation of a CartesianProduct transform, based on using hashing to split a pcollection into a finite number of groups and CoGroupByKey.
> Could this be contributed to Beam?
If it would be of broader interest, I would be happy to work on this for t

Re: Cartesian product of PCollections

2022-09-19 Thread Robert Bradshaw via dev

On Mon, Sep 19, 2022 at 1:53 PM Stephan Hoyer wrote:
> > > My team has an internal implementation of a CartesianProduct transform, based on using hashing to split a pcollection into a finite number of groups and CoGroupByKey.
> > Could this be contributed to Beam?
> I
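
The thread describes the internal transform only as "using hashing to split a pcollection into a finite number of groups and CoGroupByKey"; its actual code is not shown. The following is a minimal plain-Python sketch of one way such a scheme could work, not the implementation discussed. The function name `sharded_cartesian_product`, its parameters, and the grid-of-cells layout are all assumptions made for illustration. Each `(i, j)` cell stands in for one grouping key: a left element goes to every cell in its row, a right element to every cell in its column, so each pair meets in exactly one cell.

```python
import itertools
from collections import defaultdict


def sharded_cartesian_product(left, right, num_left_shards=2, num_right_shards=2):
    """Simulates a shard-and-group cross product in plain Python.

    Hypothetical helper: the name, arguments, and layout are illustrative
    assumptions, not an Apache Beam API.
    """
    # Each grid cell (i, j) plays the role of one grouping key and holds
    # the left and right elements routed to it.
    cells = defaultdict(lambda: ([], []))

    # A left element is hashed to one row and copied to every column.
    for x in left:
        i = hash(x) % num_left_shards
        for j in range(num_right_shards):
            cells[(i, j)][0].append(x)

    # A right element is hashed to one column and copied to every row.
    for y in right:
        j = hash(y) % num_right_shards
        for i in range(num_left_shards):
            cells[(i, j)][1].append(y)

    # Within a cell, a local product emits each (x, y) pair exactly once.
    for lefts, rights in cells.values():
        yield from itertools.product(lefts, rights)
```

In a distributed setting the built-in `hash` would need to be replaced by a hash that is stable across processes, since Python randomizes string hashes per process by default.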

beam.Create(range(N)) without building a sequence in memory

2022-09-19 Thread Stephan Hoyer via dev

Many of my Beam pipelines start with partitioning over some large, statically known number of inputs that could be created from a list of sequential integers. In Python, these sequential integers can be efficiently represented with a range() object, which stores the start/stop and interval. However
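
As a small illustration of why a `range()` object is cheap to carry around, the sketch below splits one large range into contiguous sub-ranges without ever materializing the integers. `split_range` is a hypothetical helper, not a Beam API. One possible use, offered as an assumption rather than anything proposed in the thread, would be to pass the handful of sub-ranges to `beam.Create` and expand them downstream with a `FlatMap`.

```python
def split_range(r, num_shards):
    """Yields up to num_shards contiguous sub-ranges that together cover r.

    Hypothetical helper for illustration. Slicing a range returns another
    range object, so no integers are materialized here.
    """
    n = len(r)
    # Ceiling division, with a floor of 1 so an empty range is handled.
    size = max(1, -(-n // num_shards))
    for start in range(0, n, size):
        yield r[start:start + size]
```

For example, `list(split_range(range(10**12), 1000))` produces 1000 small range objects rather than a trillion integers.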

Re: Out of band pickling in Python (pickle5)

2022-09-19 Thread Brian Hulette via dev

I got to thinking about this again and ran some benchmarks. The result is documented in the GitHub issue [1]. tl;dr: we can't realize a huge benefit since we don't actually have an out-of-band path for exchanging the buffers. However, pickle 5 can yield improved in-band performance as well, and I
