[GitHub] beam pull request #2883: [BEAM-2154] Make BigQuery's dynamic-destination sup...

2017-05-08 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/beam/pull/2883


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] beam pull request #2883: [BEAM-2154] Make BigQuery's dynamic-destination sup...

2017-05-03 Thread reuvenlax
GitHub user reuvenlax opened a pull request:

https://github.com/apache/beam/pull/2883

[BEAM-2154] Make BigQuery's dynamic-destination support scale to large 
numbers of destinations

 Generating hundreds or thousands of file write buffers in a single bundle
was causing workers to crash with out-of-memory errors. We now detect when too
many files have been written in a bundle and spill the remaining records to
another PCollection. This PCollection is then grouped by destination before we
write the remaining data to files. We shard each destination key 10 ways to
prevent hotspotting. Tests of up to 10TB of data (going from 20 output tables
up to 4000) were run, and a sharding factor of 10 seems to work quite well on
all runs (and is noticeably faster than not sharding).
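
The spill-and-shard idea described above can be illustrated with a small plain-Python simulation. This is only a sketch and not the code in the pull request: it does not use the Beam API, the function names `process_bundle` and `group_spilled` are hypothetical, the writer limit of 3 is an arbitrary illustrative value, and assigning shards at random is an assumption. Only the sharding factor of 10 comes from the description.

```python
import random
from collections import defaultdict

MAX_WRITERS_PER_BUNDLE = 3  # hypothetical limit, chosen only for illustration
NUM_SHARDS = 10             # sharding factor mentioned in the PR description


def process_bundle(records, max_writers=MAX_WRITERS_PER_BUNDLE,
                   num_shards=NUM_SHARDS, rng=random):
    """Simulate one bundle: write directly until too many writers are open.

    records is an iterable of (destination, row) pairs.
    Returns (written, spilled), where written maps destination -> rows held
    by an in-bundle writer, and spilled is a list of
    ((destination, shard), row) pairs destined for a later grouping step.
    """
    written = {}
    spilled = []
    for destination, row in records:
        if destination in written:
            # A writer for this destination is already open in the bundle.
            written[destination].append(row)
        elif len(written) < max_writers:
            # Still under the limit, so open a new writer.
            written[destination] = [row]
        else:
            # Too many open writers: spill, keyed by destination plus a shard
            # so that one busy destination is spread over several keys.
            shard = rng.randrange(num_shards)
            spilled.append(((destination, shard), row))
    return written, spilled


def group_spilled(spilled):
    """Group spilled rows by their (destination, shard) key."""
    grouped = defaultdict(list)
    for key, row in spilled:
        grouped[key].append(row)
    return dict(grouped)
```

In this sketch each `(destination, shard)` group would then be written to its own file, so at most one write buffer per group is needed at a time, while a single large destination can still be processed as up to 10 independent groups.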


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/reuvenlax/incubator-beam bigquery_scalability

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/beam/pull/2883.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2883


commit 45eb1f8ec1f84a3eefd6d85539e9dc433be4842f
Author: Reuven Lax 
Date:   2017-04-29T14:33:54Z

If too many tables are generated in a bundle, spill and group the results
before writing files. Generating hundreds or thousands of file write buffers in
a single bundle was causing workers to crash with out-of-memory errors.



