This is an automated email from the ASF dual-hosted git repository.
iemejia pushed a change to branch spark-runner_structured-streaming
in repository https://gitbox.apache.org/repos/asf/beam.git.
discard 6bf5a93 Use "sparkMaster" in local mode to obtain number of shuffle
partitions + spotless apply
discard 9de62ab Improve Pardo translation performance: avoid calling a filter
transform when there is only one output tag
discard 9472a41 Add a TODO on perf improvement of Pardo translation
discard 53374e1 After testing performance and correctness, launch pipeline
with dataset.foreach(). Make both test mode and production mode use foreach for
uniformity. Move dataset print as a utility method
discard 9b8fd33 Remove no more needed AggregatorCombinerPerKey (there is only
AggregatorCombiner)
discard d6830fb fixup! Add PipelineResults to Spark structured streaming.
discard 8600bad Print number of leaf datasets
discard 2be3d94 Add spark execution plans extended debug messages.
discard eb3c7d7 Update log4j configuration
discard ba68f6c Add PipelineResults to Spark structured streaming.
discard 8a9d97c Make spotless happy
discard 3fd985f Added metrics sinks and tests
discard 9879bc9 Persist all output Dataset if there are multiple outputs in
pipeline. Enabled Use*Metrics tests
discard f1f58a9 Add a test to check that CombineGlobally preserves windowing
discard 6d4a75d Fix accumulators initialization in Combine that prevented
CombineGlobally to work.
discard b11bd8b Fix javadoc
discard 186e8ac Add setEnableSparkMetricSinks() method
discard 9472863 Add missing dependencies to run Spark Structured Streaming
Runner on Nexmark
discard 6b66509 Add metrics support in DoFn
discard 8943636 Ignore for now not working test testCombineGlobally
discard 7dde4fb Add a test that combine per key preserves windowing
discard 9a7e5a3 Clean groupByKeyTest
discard 94a7be6 add comment in combine globally test
discard d263ff3 Fixed immutable list bug
discard 33c1554 Fix javadoc of AggregatorCombiner
discard 1729bb7 Clean no more needed WindowingHelpers
discard 33f1a39 Clean no more needed RowHelpers
discard 11d504b Clean no more needed KVHelpers
discard 3ff116a Now that there is only Combine.PerKey translation, make only
one Aggregator
discard 5adeca5 Remove CombineGlobally translation because it is less
performant than the beam sdk one (key + combinePerKey.withHotkeyFanout)
discard 4e9c9e1 Remove the mapPartition that adds a key per partition because
otherwise spark will reduce values per key instead of globally
discard 47f3aa5 Fix bug in the window merging logic
discard be71f05 Fix wrong encoder in combineGlobally GBK
discard 29a8e83 Fix case when a window does not merge into any other window
discard 6353f9c Applying a groupByKey avoids, for some reason, the spark
structured streaming framework casting data to Row, which makes it impossible
to deserialize without the coder shipped into the data. For performance reasons
(avoid memory consumption and having to deserialize), we do not ship coder +
data. Also add a mapPartitions before GBK to avoid shuffling
discard 8ad7d4f Revert extractKey while combinePerKey is not done (so that it
compiles)
discard ec7ac95 Fix encoder in combine call
discard 200e6ce Implement merge accumulators part of CombineGlobally
translation with windowing
discard 7b4ba5b Output data after combine
discard 3c5eb1f Implement reduce part of CombineGlobally translation with
windowing
discard 55c6f20 Fix comment about schemas
discard 5e4e034 Update KVHelpers.extractKey() to deal with WindowedValue and
update GBK and CPK
discard 653351f Add TODO in Combine translations
discard 8b5424a Add a test that GBK preserves windowing
discard 9721a02 Improve visibility of debug messages
discard b4c70d9 re-enable reduceFnRunner timers for output
discard 60afc9c Re-code GroupByKeyTranslatorBatch to conserve windowing
instead of unwindowing/windowing(GlobalWindow): simplify code, use
ReduceFnRunner to merge the windows
discard 258d565 Add comment about checkpoint mark
discard 018d781 Update windowAssignTest
discard 6468c43 Put back batch/simpleSourceTest.testBoundedSource
discard 988a6cf Consider null object case on RowHelpers, fixes empty side
inputs tests.
discard a990264 Pass transform based doFnSchemaInformation in ParDo
translation
discard 86d03a4 Fixes ParDo not calling setup and not tearing down if
exception on startBundle
discard 55b98f8 Limit the number of partitions to make tests go 300% faster
discard 0193941 Enable batch Validates Runner tests for Structured Streaming
Runner
discard 91a7fc6 Apply Spotless
discard 1f54355 Update javadoc
discard 07a123b implement source.stop
discard 9a460d0 Ignore spark offsets (cf javadoc)
discard e889206 Use PAssert in Spark Structured Streaming transform tests
discard 66f3928 Rename SparkPipelineResult to
SparkStructuredStreamingPipelineResult This is done to avoid a potential
collision with the one in SparkRunner. However this cannot happen at this
moment because it is package private, so it is also done for consistency.
discard 57ca802 Add SparkStructuredStreamingPipelineOptions and
SparkCommonPipelineOptions - SparkStructuredStreamingPipelineOptions was added
to have the new runner rely only on its specific options.
discard e9da167 Fix logging levels in Spark Structured Streaming translation
discard b3cda3c Fix spotless issues after rebase
discard 26974f0 Pass doFnSchemaInformation to ParDo batch translation
discard 63364db Fix non-vendored imports from Spark Streaming Runner classes
discard ae3526c Merge Spark Structured Streaming runner into main Spark
module. Remove spark-structured-streaming module. Rename Runner to
SparkStructuredStreamingRunner. Remove specific PipelineOptions and Registrar
for Structured Streaming Runner
discard 96b9e8f Fix access level issues, typos and modernize code to Java 8
style
discard 9143b88 Disable never ending test SimpleSourceTest.testUnboundedSource
discard abaf2e2 Apply spotless and fix spotbugs warnings
discard 6d717d9 Deal with checkpoint and offset based read
discard b366b7f Continue impl of offsets for streaming source
discard 5f13f11 Clean streaming source
discard 646b416 Clean unneeded 0 arg constructor in batch source
discard 96898cc Specify checkpointLocation at the pipeline start
discard 3c5181f Add source streaming test
discard 56260a1 Add transform translators registry in
PipelineTranslatorStreaming
discard 6552a53 Add a TODO on spark output modes
discard c9f5de8 Implement first streaming source
discard 9394fb6 Add streaming source initialisation
discard b912e8d Add unchecked warning suppression
discard 53c6409 Added TODO comment for ReshuffleTranslatorBatch
discard ef58a99 Added using CachedSideInputReader
discard 64ef766 Don't use Reshuffle translation
discard 753913e Fix CheckStyle violations
discard c00e1c4 Added SideInput support
discard 9dd08b7 Fix javadoc
discard 21bb0a6 Implement WindowAssignTest
discard cb1c0dd Implement WindowAssignTranslatorBatch
discard 0d2c5ca Cleaning
discard 2ece1d7 Fix encoder bug in combinePerKey
discard ba1b628 Add explanation about receiving a Row as input in the combiner
discard 5a6ae19 Use more generic Row instead of GenericRowWithSchema
discard c72aa4a Fix combine. For unknown reason GenericRowWithSchema is used
as input of combine so extract its content to be able to proceed
discard 2b4bfd3 Update test with Long
discard 2608e96 Fix various type checking issues in Combine.Globally
discard c3565ee Get back to classes in translators resolution because URNs
cannot translate Combine.Globally
discard 91f910f Cleaning
discard 0b96f4d Add CombineGlobally translation to avoid translating
Combine.perKey as a composite transform based on Combine.PerKey (which uses low
perf GBK)
discard 6d19f9f Introduce RowHelpers
discard 7447090 Add combinePerKey and CombineGlobally tests
discard e19f0c4 Fix combiner using KV as input, use binary encoders in place
of accumulatorEncoder and outputEncoder, use helpers, spotless
discard 5b39a09 Introduce WindowingHelpers (and helpers package) and use it
in Pardo, GBK and CombinePerKey
discard 31d80c4 Improve type checking of Tuple2 encoder
discard ecaf6c8 First version of combinePerKey
discard 3476c9e Extract binary schema creation in a helper class
discard 0deca18 Fix getSideInputs
discard 2231e1a Generalize the use of SerializablePipelineOptions in place of
(not serializable) PipelineOptions
discard 2de632f Rename SparkDoFnFilterFunction to DoFnFilterFunction for
consistency
discard 7a608ce Add a test for the most simple possible Combine
discard 059a45d Added "testTwoPardoInRow"
discard bc27442 Fix for test elements container in GroupByKeyTest
discard 91db78e Rename SparkOutputManager for consistency
discard cea7019 Fix kryo issue in GBK translator with a workaround
discard eb57488 Simplify logic of ParDo translator
discard 1c3575f Don't use deprecated sideInput.getWindowingStrategyInternal()
discard f4b21a8 Rename SparkSideInputReader class and rename pruneOutput() to
pruneOutputFilteredByTag()
discard 0a52721 Fixed Javadoc error
discard 3cabda9 Apply spotless
discard 21d66a7 Fix Encoders: create an Encoder for every manipulated type to
avoid Spark fallback to genericRowWithSchema and cast to allow having Encoders
with generic types such as WindowedValue<T> and get the type checking back
discard 97a7a3e Fail in case of having SideInputs or State/Timers
discard 093cd1e Add ComplexSourceTest
discard 94ad34e Remove no more needed putDatasetRaw
discard ffa8d02 Port latest changes of ReadSourceTranslatorBatch to
ReadSourceTranslatorStreaming
discard 8bad879 Fix type checking with Encoder of WindowedValue<T>
discard 8f7610b Add comments and TODO to GroupByKeyTranslatorBatch
discard f5415fb Add GroupByKeyTest
discard 584563c Clean
discard 70c0b37 Address minor review notes
discard 3563aba Add ParDoTest
discard e7a00db Clean
discard dba34e1 Fix split bug
discard b1e0801 Remove bundleSize parameter and always use spark default
parallelism
discard 0586a34 Cleaning
discard b11be3e Fix testMode output to comply with new binary schema
discard 54190bf Fix errorprone
discard 0434229 Comment schema choices
discard fe966cf Serialize windowedValue to byte[] in source to be able to
specify a binary dataset schema and deserialize windowedValue from Row to get a
dataset<WindowedValue>
discard 1933059 First attempt for ParDo primitive implementation
discard 893231a Add flatten test
discard b86e2c9 Enable gradle build scan
discard ab3b891 Enable test mode
discard b9ecb10 Put all transform translators Serializable
discard 1f199db Simplify beam reader creation as it is created once the source
has already been partitioned
discard b005fd0 Fix SourceTest
discard 465040b Move SourceTest to same package as tested class
discard 253c9bd Add serialization test
discard 6e48777 Add SerializationDebugger
discard fb04533 Fix serialization issues
discard 5043370 Clean unneeded fields in DatasetReader
discard af8202e improve readability of options passing to the source
discard 874565f Fix pipeline triggering: use a spark action instead of
writing the dataset
discard 7a9c2c6 Refactor SourceTest to a UTest instead of a main
discard d45bfbb Checkstyle and Findbugs
discard d550ecf Clean
discard 182babd Add empty 0-arg constructor for mock source
discard e9e6692 Add a dummy schema for reader
discard 9cab04c Apply spotless and fix checkstyle
discard 2b4ffdf Use new PipelineOptionsSerializationUtils
discard c029ebe Add missing 0-arg public constructor
discard 3bf20b5 Wire real SourceTransform and not mock and update the test
discard 5cab09d Refactor DatasetSource fields
discard 5b9e4b9 Pass Beam Source and PipelineOptions to the spark DataSource
as serialized strings
discard 2cb1b75 Move Source and translator mocks to a mock package.
discard 35568ab Add ReadSourceTranslatorStreaming
discard 5e00117 Clean
discard 312333f Use raw Encoder<WindowedValue> also in regular
ReadSourceTranslatorBatch
discard 3088c3c Split batch and streaming sources and translators
discard 85b33a5 Run pipeline in batch mode or in streaming mode
discard 2c0fd81 Move DatasetSourceMock to proper batch mode
discard 5bd0039 clean deps
discard a2a0665 Use raw WindowedValue so that spark Encoders could work
(temporary)
discard 4da27b0 fix mock, wire mock in translators and create a main test.
discard 6682665 Add source mocks
discard e084b11 Experiment over using spark Catalog to pass in Beam Source
through spark Table
discard a07ce40 Improve type enforcement in ReadSourceTranslator
discard 39a8b7d Improve exception flow
discard af4cd26 start source instantiation
discard e1c37ac Apply spotless
discard ba25b20 update TODO
discard 6ca4265 Implement read transform
discard 06a9891 Use Iterators.transform() to return Iterable
discard fd56501 Add primitive GroupByKeyTranslatorBatch implementation
discard 9402def Add Flatten transformation translator
discard a041977 Create Datasets manipulation methods
discard 4179dbd Create PCollections manipulation methods
discard 8a2bad8 Add basic pipeline execution. Refactor translatePipeline() to
return the translationContext on which we can run startPipeline()
discard 11d8b23 Added SparkRunnerRegistrar
discard 97eb0de Add precise TODO for multiple TransformTranslator per
transform URN
discard 07090b2 Post-pone batch qualifier in all classes names for readability
discard 63d3ce0 Add TODOs
discard 47b05fd Make codestyle and findbugs happy
discard d0137b9 apply spotless
discard 4ccc2cb Move common translation context components to superclass
discard 1cb7f3b Move SparkTransformOverrides to correct package
discard 03fe418 Improve javadocs
discard 9930dc4 Make transform translation clearer: renaming, comments
discard 992883b Refactoring: -move batch/streaming common translation visitor
and utility methods to PipelineTranslator -rename batch dedicated classes to
Batch* to differentiate with their streaming counterparts -Introduce
TranslationContext for common batch/streaming components
discard eacf593 Initialise BatchTranslationContext
discard 92a63de Organise methods in PipelineTranslator
discard 71738cb Renames: better differentiate pipeline translator from
transform translator
discard ccf25c2 Wire node translators with pipeline translator
discard c555920 Add nodes translators structure
discard 806d1b8 Add global pipeline translation structure
discard ed1013d Start pipeline translation
discard b766961 Add SparkPipelineOptions
discard 3519f0f Fix missing dep
discard 10ae821 Add an empty spark-structured-streaming runner project
targeting spark 2.4.0
add 851339e [BEAM-7869] Add load testing jobs to README.md
add 95bd48c Merge pull request #9230 from
mwalenia/BEAM-7869-load-tests-readme
add 3ca9149 [BEAM-7844] Implementing NodeStat Estimations for all the
nodes
add 7d56a23 Merge pull request #9198 from riazela/RowRateWindowEstimation
add 90d6ba1 [BEAM-7814] Fix BigqueryMatcher._matches when called more
than once.
add 31c8e87 Merge pull request #9224 from udim/bq-matcher-fix
add 1031fdf Fix minor typos (#9192)
add 76424e7 Update Python dependencies page for 2.14.0
add c6c3bce Merge pull request #9055 from rosetn/patch-4
add 30661c1 [BEAM-7079] Add Chicago Taxi Example runner script with
Python scripts
add 846db7d [BEAM-7079] Add Chicago Taxi Example gradle task and Jenkins
job
add dd4283f Merge pull request #8939 from
mwalenia/BEAM-7079-chicago-taxi-example
add 2c8bf9f [BEAM-7868] optionally skip hidden pipeline options
add 47cf7ee Merge pull request #9211 from angoenka/hidden_options
add 7a0bc8d [BEAM-7577] Allow ValueProviders in Datastore Query filters
(#8950)
add d349553 Retry Datastore writes on [Errno 32] Broken pipe
add fef16e9 Merge pull request #8346: [BEAM-7476] Retry Datastore writes
on [Errno 32] Broken pipe
add 149153b [BEAM-7060] Introduce Python3-only test modules (#9223)
add b5e4175 [BEAM-7877] Change the log level when deleting unknown
temporary files in FileBasedSink
add dbcd265 Merge pull request #9227 from ihji/BEAM-7877
add 2c1932a Update design-documents.md
add 6976651 [BEAM-7889] Update RC validation guide due to
run_rc_validation.sh change
add cca9e6d Merge pull request #9241: [BEAM-7889] Update RC validation
guide due to run_rc_validation.sh change
add 522190f [BEAM-7060] Fix
:sdks:python:apache_beam:testing:load_tests:run breakage
add ea08174 Merge pull request #9239: [BEAM-7060] Fix
:sdks:python:apache_beam:testing:load_tests:run breakage
add c5c7076 Fix some copy paste errors
add cb27549 Merge pull request #9194 from TheNeuralBit/coder-cleanup
add 01f3606 [Java] remove unneeded junit dependency.
add 917a613 Merge pull request #9208 from
amaliujia/rw-remove-unneeded-junit
add fc9e0aa [BEAM-7894] Upgrade AWS KPL to version 0.13.1
add e70eb15 Merge pull request #9249: [BEAM-7894] Upgrade AWS KPL to
version 0.13.1
add 9cf6f0d [BEAM-7898] Remove default implementation of getRowCount and
change the name to getTableStatistics
add 0d911b8 Merge pull request #9254 from riazela/TablesStatEstimation
add ec8b65a [BEAM-7389] Add helper conversion samples and simplified tests
add 913f065 Merge pull request #9252 from
davidcavazos/element-wise-with-timestamps
add 08d0146 [BEAM-7607] Per user request, making maxFilesPerBundle public
(#9160)
add 2387cbc [BEAM-7880] Upgrade Jackson databind to version 2.9.9.3
add 8a33913 Merge pull request #9229: [BEAM-7880] Upgrade Jackson
databind to version 2.9.9.3
add 00eef79 [BEAM-7721] Add a new module with BigQuery IO Read
performance test.
add 752d163 [BEAM-7721] Add cleanup to test and change the way of
reporting metrics
add 997bc70 [BEAM-7721] Refactor metric gathering
add 49644d3 Merge pull request #9041: [BEAM-7721] Add a new module with
BigQuery IO Read performance test
add 6252b66 Update Dataflow container images used by unreleased (dev) SDK.
add bd54ac5 Merge pull request #9243 from tvalentyn/patch-59
add a6a57df Increase default chunk size for gRPC commit and get data
streams. The initial choice of chunk size was arbitrary, and there is evidence
from testing that larger chunks improve performance.
add 7618f93 [BEAM-7901] Increase gRPC stream chunk sizes - fix test
add 6159162 [BEAM-7901] Increase default chunk size for gRPC commit and
get data streams.
add e1b2022 Adapt Jet Runner page to runner being released now
add a5c8ae0 Merge pull request #9245: [BEAM-7305] Adapt Jet Runner page
to runner being released now
add b47bc95 [BEAM-7777] Wiring up BeamCostModel
add 5fb50c7 [BEAM-7777] Implementing beamComputeSelfCost for all the rel
nodes
add a694fda Merge pull request #9217 from riazela/BeamCostModel
add 40936ba BEAM-7018: Added regex transform on Python SDK.
add 0a2ddc0 Merge pull request #8859 from mszb/BEAM-7018
add 3d75784 [BEAM-7845] Install Python SDK using pip instead of
setuptools.
add f1996da Merge pull request #9274 from tvalentyn/patch-60
add ee0d83d Update Python 2.7.x restriction
add dda2061 Merge pull request #9134 from rosetn/patch-5
add 24e9ced [BEAM-7833] warn user when --region flag is not explicitly
set (#9173)
add 164f249 Update env variable name in sync script
add b45bc6f [BEAM-6683] add createCrossLanguageValidatesRunner task
add 9678149 Merge pull request #8174: [BEAM-6683] add
createCrossLanguageValidatesRunner task
add 6fa94c9 [BEAM-7862] Moving FakeBigQueryServices to published
artifacts (#9206)
add 7ea9825 [BEAM-7899] Fix Python Dataflow VR tests by specify
sdk_location
add ff0f308 Merge pull request #9269: [BEAM-7899] Fix Python Dataflow VR
tests by specifying sdk_location
add bc2c6ff [BEAM-7678] Fixes bug in output element_type generation in Kv
PipelineVisitor (#9238)
add 1a3eed5 [BEAM-7389] Add code examples for MapTuple and FlatMapTuple
add 5e6ec5d5 Merge pull request #9276 from davidcavazos/map-flatmap-tuple
add 41d6dd9 [BEAM-7915] show cross-language validate runner Flink badge
on github PR template
add d9b43fc Merge pull request #9282 from ihji/BEAM-7915
add 124e6b6 Update python-pipeline-dependencies.md
add 28a4057 Merge pull request #9291 from rosetn/patch-8
add 1d20042 [BEAM-7860] Python Datastore: fix key sort order
add 0325c36 Merge pull request #9240: [BEAM-7860] Python Datastore: fix
key sort order
add eb8a29c Update python-sdk.md
add 2230cc9 Merge pull request #9290 from rosetn/patch-7
add 5504aa7 [BEAM-7918] adding nested row implementation for unnest and
uncollect (#9288)
add f792e2e Add helper functions for reading and writing to PubSub
directly from Python (#9212)
add 1bc2dd8 [BEAM-7060] Use typing in type decorators of core.py
add 67c8e9c Fully qualify use of internal typehints.
add 212596e Merge pull request #9179 [BEAM-7060] Use typing in type
decorators of core.py
add c9e5ea8 [BEAM -7741] Implement SetState for Python SDK (#9090)
add 57d8fd3 [BEAM-7912] Optimize GroupIntoBatches for batch Dataflow
pipelines.
add 7848ead [BEAM-7912] Optimize GroupIntoBatches for batch Dataflow
pipelines.
add f1dc92f [BEAM-7613] HadoopFileSystem can work with more than one
cluster.
add 25ac4ef [BEAM-7613] HadoopFileSystem can work with more than one
cluster.
add 0604e04 [BEAM-7924] Failure in Python 2 postcommit:
crossLanguagePythonJavaFlink
add 497bc77 Merge pull request #9292: [BEAM-7924] Failure in Python 2
postcommit: crossLanguagePythonJavaFlink
add 30b1ff0 [BEAM-7776] Create kubernetes.sh script to use kubectl with
desired namespace and kubeconfig file
add 22b97be [BEAM-7776] Stop using PerfkitBenchmarker in MongoDBIOIT job
add e1d4403 [BEAM-7776] Move repeatable kubernetes jenkins steps to
Kubernetes.groovy
add 0ecd166 Merge pull request #9116: [BEAM-7776] Stop Using Perfkit in
IOIT
add 13450c4 [BEAM-6907] Reuse Python tarball in integration tests
add 1545120 Add missing dependency and source copy in tox test
add b3c2915 Build sdk tarball before running installGcpTest task
add b69c81a [BEAM-6907] Reuse Python tarball in tox & dataflow
integration tests
add 936110f [BEAM-7820] Add basic hot key detection logging in Worker.
(#9270)
add ca050b9 use vendor-bytebuddy in sdks-java-core
add 3e97543 [BEAM-5822] Use vendored bytebuddy in sdks-java-core
add 459e270 Add assertArrayCountEqual, which checks if two containers
have the same elements (#9235)
add c164fbf Move DirectRunner specific classes into direct runner package.
add fffe83a Merge pull request #9297 Move DirectRunner specific classes
into direct runner package.
add 9d131d4 [BEAM-7896] Implementing RateEstimation for KafkaTable with
Unit and Integration Tests
add cd2ab9e Merge pull request #9298 from riazela/KafkaRateEstimation2
add 6454f72 [BEAM-7874], [BEAM-7873] Distributed FnApiRunner bugfixs
(#9218)
add ce3d121 [BEAM-7940] Fix beam_Release_Python_NightlySnapshot
add 59464cf Merge pull request #9307: [BEAM-7940] Fix
beam_Release_Python_NightlySnapshot
add dfd46d8 [BEAM-7846] add test for BEAM-7689
add a2b57e3 Merge pull request #9228 from ihji/BEAM-7846
new 7794f02 Add an empty spark-structured-streaming runner project
targeting spark 2.4.0
new 2558c22 Fix missing dep
new 253123c Add SparkPipelineOptions
new ee8875a Start pipeline translation
new d7df588 Add global pipeline translation structure
new 9fea139 Add nodes translators structure
new a5b9894 Wire node translators with pipeline translator
new 5990104 Renames: better differentiate pipeline translator from
transform translator
new 6f1fda8 Organise methods in PipelineTranslator
new 8abe116 Initialise BatchTranslationContext
new 31af0f1 Refactoring: -move batch/streaming common translation visitor
and utility methods to PipelineTranslator -rename batch dedicated classes to
Batch* to differentiate with their streaming counterparts -Introduce
TranslationContext for common batch/streaming components
new ae2392e Make transform translation clearer: renaming, comments
new 83f44e0 Improve javadocs
new 8875d48d Move SparkTransformOverrides to correct package
new 6ede65c Move common translation context components to superclass
new ff06081 apply spotless
new bf5619d Make codestyle and findbugs happy
new 1c7d381 Add TODOs
new 7612d17 Post-pone batch qualifier in all classes names for readability
new cfd3abf Add precise TODO for multiple TransformTranslator per
transform URN
new cc07798 Added SparkRunnerRegistrar
new d75ac4a Add basic pipeline execution. Refactor translatePipeline() to
return the translationContext on which we can run startPipeline()
new b458c64 Create PCollections manipulation methods
new 8c1a012 Create Datasets manipulation methods
new 9d721a7 Add Flatten transformation translator
new 1298b77 Add primitive GroupByKeyTranslatorBatch implementation
new b0920e3 Use Iterators.transform() to return Iterable
new 3ae60a4 Implement read transform
new 238e57a update TODO
new 5701c75 Apply spotless
new 3cb2a76 start source instantiation
new b8cb742 Improve exception flow
new 6795e4c Improve type enforcement in ReadSourceTranslator
new b57367f Experiment over using spark Catalog to pass in Beam Source
through spark Table
new eb2fa49 Add source mocks
new bea28b1 fix mock, wire mock in translators and create a main test.
new cb03179 Use raw WindowedValue so that spark Encoders could work
(temporary)
new 44644d3 clean deps
new 4aa321c Move DatasetSourceMock to proper batch mode
new f9ed0dd Run pipeline in batch mode or in streaming mode
new 64c8202 Split batch and streaming sources and translators
new d6e905b Use raw Encoder<WindowedValue> also in regular
ReadSourceTranslatorBatch
new b617ba4 Clean
new 08c05d6 Add ReadSourceTranslatorStreaming
new 5b0f9a2 Move Source and translator mocks to a mock package.
new 9d8dd90 Pass Beam Source and PipelineOptions to the spark DataSource
as serialized strings
new ca5f70c Refactor DatasetSource fields
new 2b64bd2 Wire real SourceTransform and not mock and update the test
new ca5a120 Add missing 0-arg public constructor
new 101f6f2 Use new PipelineOptionsSerializationUtils
new 02458a7 Apply spotless and fix checkstyle
new 9251dcb Add a dummy schema for reader
new d12cc14 Add empty 0-arg constructor for mock source
new 037db6e Clean
new bb830be Checkstyle and Findbugs
new 02933bd Refactor SourceTest to a UTest instead of a main
new 0ffd98d Fix pipeline triggering: use a spark action instead of
writing the dataset
new 4936687 improve readability of options passing to the source
new 0e85242 Clean unneeded fields in DatasetReader
new 625056e Fix serialization issues
new 1720e6b Add SerializationDebugger
new 00f2e11 Add serialization test
new 74693b2 Move SourceTest to same package as tested class
new e54cbc6 Fix SourceTest
new 23ca155 Simplify beam reader creation as it is created once the source
has already been partitioned
new 047af3e Put all transform translators Serializable
new 3dc4dc3 Enable test mode
new 3aea53b Enable gradle build scan
new fc54404 Add flatten test
new e0c8fbd First attempt for ParDo primitive implementation
new 675bf94 Serialize windowedValue to byte[] in source to be able to
specify a binary dataset schema and deserialize windowedValue from Row to get a
dataset<WindowedValue>
new e4d4b4f Comment schema choices
new 84bdf70 Fix errorprone
new b4230dc Fix testMode output to comply with new binary schema
new defd5e6 Cleaning
new e12226a Remove bundleSize parameter and always use spark default
parallelism
new 1d31831 Fix split bug
new f47bb0a Clean
new d4acf25 Add ParDoTest
new ae44706 Address minor review notes
new fe19d6c Clean
new 779e621 Add GroupByKeyTest
new fe7fb4e Add comments and TODO to GroupByKeyTranslatorBatch
new 14368ff Fix type checking with Encoder of WindowedValue<T>
new a8e50ad Port latest changes of ReadSourceTranslatorBatch to
ReadSourceTranslatorStreaming
new 21902b6 Remove no more needed putDatasetRaw
new a74d149 Add ComplexSourceTest
new 1b244e9 Fail in case of having SideInputs or State/Timers
new 7c5b778 Fix Encoders: create an Encoder for every manipulated type to
avoid Spark fallback to genericRowWithSchema and cast to allow having Encoders
with generic types such as WindowedValue<T> and get the type checking back
new 2f37994 Apply spotless
new 9bee53f Fixed Javadoc error
new 6d723b1 Rename SparkSideInputReader class and rename pruneOutput() to
pruneOutputFilteredByTag()
new b7a76d9 Don't use deprecated sideInput.getWindowingStrategyInternal()
new 3ba4bc6 Simplify logic of ParDo translator
new 314d935 Fix kryo issue in GBK translator with a workaround
new bc7fab4 Rename SparkOutputManager for consistency
new 2482172 Fix for test elements container in GroupByKeyTest
new 3716673 Added "testTwoPardoInRow"
new 674a048 Add a test for the most simple possible Combine
new 08b580b Rename SparkDoFnFilterFunction to DoFnFilterFunction for
consistency
new 4cbdbb8 Generalize the use of SerializablePipelineOptions in place of
(not serializable) PipelineOptions
new bfc10a2 Fix getSideInputs
new d33508e Extract binary schema creation in a helper class
new 2c465f8 First version of combinePerKey
new 4d3be61 Improve type checking of Tuple2 encoder
new 72d338b Introduce WindowingHelpers (and helpers package) and use it
in Pardo, GBK and CombinePerKey
new 684fc4a Fix combiner using KV as input, use binary encoders in place
of accumulatorEncoder and outputEncoder, use helpers, spotless
new 2a1d74e Add combinePerKey and CombineGlobally tests
new f48c109 Introduce RowHelpers
new 0a88819 Add CombineGlobally translation to avoid translating
Combine.perKey as a composite transform based on Combine.PerKey (which uses low
perf GBK)
new 8d0a8b5 Cleaning
new 3c25348 Get back to classes in translators resolution because URNs
cannot translate Combine.Globally
new b13839d Fix various type checking issues in Combine.Globally
new 00acd7d Update test with Long
new a72afd8 Fix combine. For unknown reason GenericRowWithSchema is used
as input of combine so extract its content to be able to proceed
new a2d1975 Use more generic Row instead of GenericRowWithSchema
new ac67ada Add explanation about receiving a Row as input in the combiner
new 1244549 Fix encoder bug in combinePerKey
new 383d58d Cleaning
new bf2af77 Implement WindowAssignTranslatorBatch
new 41e6a19 Implement WindowAssignTest
new 1355ece Fix javadoc
new d759a19 Added SideInput support
new c879337 Fix CheckStyle violations
new d8ee03e Don't use Reshuffle translation
new 530dfb0 Added using CachedSideInputReader
new cb1a99c Added TODO comment for ReshuffleTranslatorBatch
new 2ad1f15 Add unchecked warning suppression
new 81c0bbe Add streaming source initialisation
new 4030fb0 Implement first streaming source
new cb5dffa Add a TODO on spark output modes
new ce46b9b Add transform translators registry in
PipelineTranslatorStreaming
new 4527615 Add source streaming test
new 6e94948 Specify checkpointLocation at the pipeline start
new 8caa982 Clean unneeded 0 arg constructor in batch source
new afa6a48 Clean streaming source
new b7c68bd Continue impl of offsets for streaming source
new 79e85ec Deal with checkpoint and offset based read
new c68e875 Apply spotless and fix spotbugs warnings
new 19e5fdf Disable never ending test SimpleSourceTest.testUnboundedSource
new c9a8c8c Fix access level issues, typos and modernize code to Java 8
style
new cb72394 Merge Spark Structured Streaming runner into main Spark
module. Remove spark-structured-streaming module. Rename Runner to
SparkStructuredStreamingRunner. Remove specific PipelineOptions and Registrar
for Structured Streaming Runner
new 2628393 Fix non-vendored imports from Spark Streaming Runner classes
new 58f97b8 Pass doFnSchemaInformation to ParDo batch translation
new 03eb450 Fix spotless issues after rebase
new 47376e3 Fix logging levels in Spark Structured Streaming translation
new fc877cd Add SparkStructuredStreamingPipelineOptions and
SparkCommonPipelineOptions - SparkStructuredStreamingPipelineOptions was added
to have the new runner rely only on its specific options.
new 9354339 Rename SparkPipelineResult to
SparkStructuredStreamingPipelineResult This is done to avoid a potential
collision with the one in SparkRunner. However this cannot happen at this
moment because it is package private, so it is also done for consistency.
new d285dd1 Use PAssert in Spark Structured Streaming transform tests
new d2e65df Ignore spark offsets (cf javadoc)
new fdcd346 implement source.stop
new ab6a879 Update javadoc
new e4ef07b Apply Spotless
new c2896b6 Enable batch Validates Runner tests for Structured Streaming
Runner
new 2bfd3d5 Limit the number of partitions to make tests go 300% faster
new 1be0c8a Fix ParDo not calling setup and not tearing down on
exception in startBundle
new a94045c Pass transform based doFnSchemaInformation in ParDo
translation
new cc7a52d Consider the null object case in RowHelpers; fixes empty side
input tests.
new 68e3ae2 Put back batch/simpleSourceTest.testBoundedSource
new a3e29b4 Update windowAssignTest
new c23c07e Add comment about checkpoint mark
new fed93da Re-code GroupByKeyTranslatorBatch to preserve windowing
instead of unwindowing/windowing(GlobalWindow): simplify the code, use
ReduceFnRunner to merge the windows
new 47b7132 re-enable reduceFnRunner timers for output
new 28ee71c Improve visibility of debug messages
new 14c703f Add a test that GBK preserves windowing
new 5dc8c24 Add TODO in Combine translations
new edaa37f Update KVHelpers.extractKey() to deal with WindowedValue and
update GBK and CPK
new 595d9eb Fix comment about schemas
new 960d245 Implement reduce part of CombineGlobally translation with
windowing
new 28ba572 Output data after combine
new 70e3d66 Implement merge accumulators part of CombineGlobally
translation with windowing
new 6de9acf Fix encoder in combine call
new 4f4744a Revert extractKey while combinePerKey is not done (so that it
compiles)
new f36e5c3 Applying a groupByKey avoids, for some reason, the Spark
Structured Streaming framework casting data to Row, which would make it
impossible to deserialize without shipping the coder with the data. For
performance reasons (memory consumption and deserialization cost), we do not
ship coder + data. Also add a mapPartitions before GBK to avoid shuffling
new f00e7d5 Fix case when a window does not merge into any other window
new 4e34632 Fix wrong encoder in combineGlobally GBK
new 1589877 Fix bug in the window merging logic
new 7b6c914 Remove the mapPartition that adds a key per partition;
otherwise Spark reduces values per key instead of globally
new ad0c179 Remove CombineGlobally translation because it is less
performant than the Beam SDK one (key + combinePerKey.withHotKeyFanout)
new 569a9cb Now that there is only Combine.PerKey translation, make only
one Aggregator
new 4b5af9c Clean up no longer needed KVHelpers
new 3bd95df Clean up no longer needed RowHelpers
new 7784f30 Clean up no longer needed WindowingHelpers
new 9601649 Fix javadoc of AggregatorCombiner
new f68ed7a Fixed immutable list bug
new 8c499a5 Add comment in combine globally test
new 6638522 Clean groupByKeyTest
new ff50ccb Add a test that combine per key preserves windowing
new d29c64e Ignore not-yet-working test testCombineGlobally for now
new 5ed3e03 Add metrics support in DoFn
new 51ca79a Add missing dependencies to run Spark Structured Streaming
Runner on Nexmark
new 476bc20 Add setEnableSparkMetricSinks() method
new a797884 Fix javadoc
new 6e9ccdd Fix accumulators initialization in Combine that prevented
CombineGlobally to work.
new dab3c2e Add a test to check that CombineGlobally preserves windowing
new dc939c8 Persist all output Datasets if there are multiple outputs in
the pipeline. Enable Use*Metrics tests
new 4aaf456 Added metrics sinks and tests
new 0e36b19 Make spotless happy
new cde225a Add PipelineResults to Spark structured streaming.
new 3b15128 Update log4j configuration
new ec43374 Add spark execution plans extended debug messages.
new 8aafd50 Print number of leaf datasets
new f8a5046 fixup! Add PipelineResults to Spark structured streaming.
new 61f487f Remove no longer needed AggregatorCombinerPerKey (there is only
AggregatorCombiner)
new 93d425a After testing performance and correctness, launch the pipeline
with dataset.foreach(). Make both test mode and production mode use foreach for
uniformity. Move dataset printing to a utility method
new 0cedc7a Add a TODO on perf improvement of ParDo translation
new a524036 Improve ParDo translation performance: avoid calling a filter
transform when there is only one output tag
new c350188 Use "sparkMaster" in local mode to obtain number of shuffle
partitions + spotless apply
This update added new revisions after undoing existing revisions.
That is to say, some revisions that were in the old version of the
branch are not in the new version. This situation occurs
when a user --force pushes a change and generates a repository
containing something like this:
 * -- * -- B -- O -- O -- O (6bf5a93)
            \
             N -- N -- N refs/heads/spark-runner_structured-streaming (c350188)
You should already have received notification emails for all of the O
revisions, and so the following emails describe only the N revisions
from the common base, B.
Any revisions marked "omit" are not gone; other references still
refer to them. Any revisions marked "discard" are gone forever.
The 208 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails. The revisions
listed as "add" were already present in the repository and have only
been added to this reference.
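The history rewrite described above can be reproduced in a throwaway
repository. This is only an illustrative sketch (the directory, file, and
commit messages are made up, not from the real push): an O revision is
committed, undone with a hard reset, and replaced by an N revision, which is
exactly the state a subsequent `git push --force` would publish.

```shell
set -e
dir=$(mktemp -d)
cd "$dir"
git init -q repo && cd repo
git config user.email you@example.com
git config user.name you

echo base > f && git add f && git commit -qm "B: common base"
echo old  > f && git commit -qam "O: old revision"   # would be marked "discard"
git reset -q --hard HEAD~1                           # undo O, back to B
echo new  > f && git commit -qam "N: new revision"   # would be marked "new"

# After `git push --force`, the remote branch points at N; O is unreachable.
git log --oneline
```

Running it prints only the N and B commits; O survives locally only via the
reflog until it is garbage-collected, which mirrors the note above that
"discard" revisions are gone from the branch.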
Summary of changes:
.github/PULL_REQUEST_TEMPLATE.md | 1 +
.test-infra/jenkins/Kubernetes.groovy | 108 ++++++++
.test-infra/jenkins/README.md | 23 ++
.../job_PerformanceTests_MongoDBIO_IT.groovy | 83 +++---
.../jenkins/job_PerformanceTests_Python.groovy | 15 +-
...ommit_CrossLanguageValidatesRunner_Flink.groovy | 43 +++
...mit_Python_Chicago_Taxi_Example_Dataflow.groovy | 58 ++++
.test-infra/kubernetes/kubernetes.sh | 93 +++++++
.test-infra/metrics/beamgrafana-deploy.yaml | 18 +-
.test-infra/metrics/docker-compose.yml | 10 +-
.test-infra/metrics/sync/github/sync.py | 4 +-
.test-infra/metrics/sync/jenkins/syncjenkins.py | 11 +-
.../org/apache/beam/gradle/BeamModulePlugin.groovy | 187 ++++++++++---
examples/java/README.md | 2 +-
examples/kotlin/README.md | 2 +-
.../Core Transforms/Map/FlatMapElements/task.html | 2 +-
.../java/Core Transforms/Map/MapElements/task.html | 2 +-
ownership/JAVA_DEPENDENCY_OWNERS.yaml | 15 --
runners/core-construction-java/build.gradle | 21 --
.../graph/GreedyPCollectionFusers.java | 4 +-
.../runners/core/construction/ExternalTest.java | 47 ++--
.../expansion/TestExpansionService.java | 58 ----
runners/core-java/build.gradle | 1 -
runners/flink/job-server/flink_job_server.gradle | 30 +--
runners/google-cloud-dataflow-java/build.gradle | 20 +-
.../beam/runners/dataflow/DataflowRunner.java | 77 ++++++
.../dataflow/options/DataflowPipelineOptions.java | 4 +-
.../beam/runners/dataflow/DataflowRunnerTest.java | 63 ++++-
.../worker/DataflowWorkProgressUpdater.java | 55 +---
.../beam/runners/dataflow/worker/HotKeyLogger.java | 77 ++++++
.../dataflow/worker/StreamingDataflowWorker.java | 23 +-
.../worker/windmill/GrpcWindmillServer.java | 8 +-
.../worker/DataflowWorkProgressUpdaterTest.java | 71 ++---
.../worker/DataflowWorkUnitClientTest.java | 1 +
.../runners/dataflow/worker/HotKeyLoggerTest.java | 104 ++++++++
.../worker/StreamingDataflowWorkerTest.java | 15 +-
.../worker/windmill/GrpcWindmillServerTest.java | 2 +-
sdks/go/pkg/beam/pardo.go | 2 +-
sdks/go/pkg/beam/runners/dataflow/dataflow.go | 8 +-
sdks/java/core/build.gradle | 3 +-
.../apache/beam/sdk/coders/BigEndianLongCoder.java | 2 +-
.../beam/sdk/coders/BigEndianShortCoder.java | 2 +-
.../org/apache/beam/sdk/coders/FloatCoder.java | 2 +-
.../apache/beam/sdk/coders/RowCoderGenerator.java | 44 ++--
.../main/java/org/apache/beam/sdk/io/AvroIO.java | 2 +-
.../java/org/apache/beam/sdk/io/FileBasedSink.java | 5 +-
.../main/java/org/apache/beam/sdk/io/FileIO.java | 2 +-
.../beam/sdk/options/PipelineOptionsFactory.java | 4 +-
.../beam/sdk/options/PipelineOptionsReflector.java | 11 +-
.../beam/sdk/options/ProxyInvocationHandler.java | 2 +-
.../beam/sdk/schemas/utils/AutoValueUtils.java | 40 +--
.../beam/sdk/schemas/utils/AvroByteBuddyUtils.java | 22 +-
.../beam/sdk/schemas/utils/ByteBuddyUtils.java | 54 ++--
.../beam/sdk/schemas/utils/ConvertHelpers.java | 24 +-
.../beam/sdk/schemas/utils/JavaBeanUtils.java | 30 +--
.../apache/beam/sdk/schemas/utils/POJOUtils.java | 42 +--
.../beam/sdk/transforms/GroupIntoBatches.java | 5 +
.../java/org/apache/beam/sdk/transforms/ParDo.java | 4 +-
.../reflect/ByteBuddyDoFnInvokerFactory.java | 68 ++---
.../reflect/ByteBuddyOnTimerInvokerFactory.java | 32 +--
.../reflect/StableInvokerNamingStrategy.java | 4 +-
.../beam/sdk/transforms/windowing/PaneInfo.java | 2 +-
.../org/apache/beam/sdk/io/FileBasedSinkTest.java | 16 ++
.../java/org/apache/beam/sdk/io/SimpleSink.java | 4 +
.../sdk/options/PipelineOptionsReflectorTest.java | 20 +-
.../beam/sdk/transforms/ParDoLifecycleTest.java | 2 +-
sdks/java/extensions/sql/build.gradle | 1 +
.../sql/meta/provider/hcatalog/HCatalogTable.java | 7 +
.../beam/sdk/extensions/sql/BeamSqlTable.java | 9 +-
.../sdk/extensions/sql/impl/BeamCalciteTable.java | 2 +-
.../extensions/sql/impl/BeamTableStatistics.java | 2 +-
.../extensions/sql/impl/CalciteQueryPlanner.java | 26 +-
.../extensions/sql/impl/planner/BeamCostModel.java | 253 ++++++++++++++++++
.../sql/impl/planner/NodeStatsMetadata.java | 4 +-
.../sql/impl/rel/BeamAggregationRel.java | 61 ++++-
.../sdk/extensions/sql/impl/rel/BeamCalcRel.java | 30 ++-
.../sdk/extensions/sql/impl/rel/BeamIOSinkRel.java | 9 +-
.../extensions/sql/impl/rel/BeamIOSourceRel.java | 16 +-
.../extensions/sql/impl/rel/BeamIntersectRel.java | 28 +-
.../sdk/extensions/sql/impl/rel/BeamJoinRel.java | 29 +-
.../sdk/extensions/sql/impl/rel/BeamMinusRel.java | 21 +-
.../sdk/extensions/sql/impl/rel/BeamRelNode.java | 22 ++
.../sdk/extensions/sql/impl/rel/BeamSortRel.java | 15 +-
.../extensions/sql/impl/rel/BeamSqlRelUtils.java | 19 ++
.../extensions/sql/impl/rel/BeamUncollectRel.java | 20 +-
.../sdk/extensions/sql/impl/rel/BeamUnionRel.java | 23 +-
.../sdk/extensions/sql/impl/rel/BeamUnnestRel.java | 34 ++-
.../sdk/extensions/sql/impl/rel/BeamValuesRel.java | 10 +-
.../sql/impl/schema/BeamPCollectionTable.java | 7 +
.../sql/meta/provider/bigquery/BigQueryTable.java | 6 +-
.../sql/meta/provider/kafka/BeamKafkaTable.java | 152 ++++++++++-
.../sql/meta/provider/parquet/ParquetTable.java | 7 +
.../meta/provider/pubsub/PubsubIOJsonTable.java | 7 +
.../provider/seqgen/GenerateSequenceTable.java | 7 +
.../sql/meta/provider/test/TestBoundedTable.java | 2 +-
.../sql/meta/provider/test/TestTableProvider.java | 2 +-
.../sql/meta/provider/test/TestUnboundedTable.java | 2 +-
.../sql/meta/provider/text/TextTable.java | 4 +-
.../sql/impl/planner/BeamCostModelTest.java | 103 ++++++++
.../sql/impl/planner/CalciteQueryPlannerTest.java | 73 ++++++
.../extensions/sql/impl/planner/NodeStatsTest.java | 15 ++
...rceRelTest.java => BeamAggregationRelTest.java} | 70 +++--
...amIOSourceRelTest.java => BeamCalcRelTest.java} | 77 ++++--
.../sql/impl/rel/BeamEnumerableConverterTest.java | 6 +
.../sql/impl/rel/BeamIOSourceRelTest.java | 43 ++-
.../sql/impl/rel/BeamIntersectRelTest.java | 27 ++
.../impl/rel/BeamJoinRelBoundedVsBoundedTest.java | 75 ++++++
.../rel/BeamJoinRelUnboundedVsBoundedTest.java | 48 ++++
.../rel/BeamJoinRelUnboundedVsUnboundedTest.java | 50 +++-
.../extensions/sql/impl/rel/BeamMinusRelTest.java | 99 ++++++-
.../extensions/sql/impl/rel/BeamSortRelTest.java | 39 ++-
.../sql/impl/rel/BeamUncollectRelTest.java | 103 ++++++++
.../extensions/sql/impl/rel/BeamUnionRelTest.java | 53 ++++
.../extensions/sql/impl/rel/BeamUnnestRelTest.java | 71 +++++
.../extensions/sql/impl/rel/BeamValuesRelTest.java | 24 ++
.../sql/impl/rule/JoinReorderingTest.java | 6 +-
.../meta/provider/bigquery/BigQueryRowCountIT.java | 6 +-
.../meta/provider/bigquery/BigQueryTestTable.java | 4 +-
.../meta/provider/kafka/BeamKafkaCSVTableTest.java | 118 ++++++++-
.../sql/meta/provider/kafka/KafkaCSVTableIT.java | 292 +++++++++++++++++++++
.../sql/meta/provider/kafka/KafkaCSVTestTable.java | 197 ++++++++++++++
.../sql/meta/provider/kafka/KafkaTestRecord.java | 39 +++
sdks/java/io/bigquery-io-perf-tests/build.gradle | 40 +++
.../BigQueryIOReadPerformanceIT.java | 203 ++++++++++++++
.../sdk/io/gcp/bigquery/BigQueryAvroUtils.java | 2 +-
.../beam/sdk/io/gcp/bigquery/BigQueryHelpers.java | 6 +-
.../beam/sdk/io/gcp/bigquery/BigQueryIO.java | 13 +-
.../beam/sdk/io/gcp/bigquery/BigQueryServices.java | 4 +-
.../beam/sdk/io/gcp/bigquery/BigQueryUtils.java | 6 +
.../sdk/io/gcp/testing}/FakeBigQueryServices.java | 14 +-
.../sdk/io/gcp/testing}/FakeDatasetService.java | 5 +-
.../beam/sdk/io/gcp/testing}/FakeJobService.java | 7 +-
.../beam/sdk/io/gcp/testing}/TableContainer.java | 2 +-
.../apache/beam/sdk/io/gcp/GcpApiSurfaceTest.java | 2 +-
.../sdk/io/gcp/bigquery/BigQueryIOReadTest.java | 3 +
.../gcp/bigquery/BigQueryIOStorageQueryTest.java | 5 +-
.../io/gcp/bigquery/BigQueryIOStorageReadTest.java | 4 +-
.../sdk/io/gcp/bigquery/BigQueryIOWriteTest.java | 3 +
.../beam/sdk/io/gcp/datastore/DatastoreV1Test.java | 12 +-
.../apache/beam/sdk/io/hdfs/HadoopFileSystem.java | 84 +++---
.../sdk/io/hdfs/HadoopFileSystemRegistrar.java | 49 +++-
.../beam/sdk/io/hdfs/HadoopFileSystemTest.java | 12 +-
sdks/java/io/kinesis/build.gradle | 2 +-
sdks/java/testing/expansion-service/build.gradle | 52 ++++
.../beam/sdk/expansion/TestExpansionService.java | 101 +++++++
sdks/python/apache_beam/coders/avro_coder.py | 99 -------
sdks/python/apache_beam/coders/avro_coder_test.py | 71 -----
sdks/python/apache_beam/coders/avro_record.py | 38 +++
sdks/python/apache_beam/coders/coder_impl.py | 23 ++
sdks/python/apache_beam/coders/coders.py | 41 ++-
sdks/python/apache_beam/coders/coders_test.py | 43 +++
.../apache_beam/coders/coders_test_common.py | 1 +
.../snippets/transforms/element_wise/flat_map.py | 27 ++
.../transforms/element_wise/flat_map_test.py | 71 ++---
.../snippets/transforms/element_wise/map.py | 23 ++
.../snippets/transforms/element_wise/map_test.py | 71 ++---
.../transforms/element_wise/with_timestamps.py | 22 ++
.../element_wise/with_timestamps_test.py | 110 ++++----
.../python/apache_beam/examples/wordcount_xlang.py | 2 +-
.../apache_beam/io/external/generate_sequence.py | 5 +-
.../io/external/generate_sequence_test.py | 64 +++++
.../io/external/xlang_parquetio_test.py | 88 +++++++
sdks/python/apache_beam/io/filebasedsink_test.py | 8 +
.../apache_beam/io/gcp/datastore/v1/helper.py | 3 +-
.../io/gcp/datastore/v1/query_splitter_test.py | 11 +-
.../io/gcp/datastore/v1new/query_splitter.py | 63 ++++-
.../io/gcp/datastore/v1new/query_splitter_test.py | 51 +++-
.../apache_beam/io/gcp/datastore/v1new/types.py | 36 ++-
.../io/gcp/datastore/v1new/types_test.py | 26 ++
.../apache_beam/io/gcp/tests/bigquery_matcher.py | 4 +-
sdks/python/apache_beam/io/gcp/tests/utils.py | 57 +++-
sdks/python/apache_beam/io/gcp/tests/utils_test.py | 200 +++++++++++++-
.../python/apache_beam/options/pipeline_options.py | 18 +-
sdks/python/apache_beam/pipeline.py | 6 +-
sdks/python/apache_beam/pipeline_test.py | 34 +++
.../apache_beam/runners/dataflow/internal/names.py | 4 +-
.../apache_beam/runners/direct/direct_userstate.py | 134 +++++++++-
.../runners/portability/expansion_service_test.py | 50 ++--
.../runners/portability/fn_api_runner.py | 115 ++++----
.../runners/portability/local_job_service.py | 9 +-
.../apache_beam/runners/worker/bundle_processor.py | 74 +++++-
.../chicago_taxi}/__init__.py | 0
.../testing/benchmarks/chicago_taxi/preprocess.py | 258 ++++++++++++++++++
.../benchmarks/chicago_taxi/process_tfma.py | 191 ++++++++++++++
.../benchmarks/chicago_taxi/requirements.txt | 25 ++
.../testing/benchmarks/chicago_taxi/run_chicago.sh | 192 ++++++++++++++
.../testing/benchmarks/chicago_taxi/setup.py | 41 +++
.../chicago_taxi/tfdv_analyze_and_validate.py | 225 ++++++++++++++++
.../chicago_taxi/trainer}/__init__.py | 0
.../benchmarks/chicago_taxi/trainer/model.py | 164 ++++++++++++
.../benchmarks/chicago_taxi/trainer/task.py | 189 +++++++++++++
.../benchmarks/chicago_taxi/trainer/taxi.py | 186 +++++++++++++
.../python/apache_beam/testing/extra_assertions.py | 64 +++++
.../apache_beam/testing/extra_assertions_test.py | 71 +++++
.../apache_beam/testing/load_tests/build.gradle | 2 +-
sdks/python/apache_beam/testing/test_pipeline.py | 3 +
sdks/python/apache_beam/transforms/combiners.py | 4 +-
sdks/python/apache_beam/transforms/core.py | 60 +++--
.../python/apache_beam/transforms/external_test.py | 72 +++--
sdks/python/apache_beam/transforms/trigger.py | 16 ++
sdks/python/apache_beam/transforms/userstate.py | 110 +++-----
.../apache_beam/transforms/userstate_test.py | 156 ++++++++++-
sdks/python/apache_beam/transforms/util.py | 222 ++++++++++++++++
sdks/python/apache_beam/transforms/util_test.py | 280 ++++++++++++++++++++
.../typehints/trivial_inference_test.py | 21 --
.../typehints/trivial_inference_test_py3.py | 50 ++++
sdks/python/build.gradle | 34 ++-
sdks/python/container/build.gradle | 2 +-
sdks/python/container/py3/build.gradle | 2 +-
sdks/python/container/run_validatescontainer.sh | 1 +
sdks/python/scripts/generate_pydoc.sh | 3 +
sdks/python/scripts/run_expansion_services.sh | 136 ++++++++++
sdks/python/scripts/run_integration_test.sh | 1 +
sdks/python/scripts/run_pylint.sh | 32 ++-
sdks/python/test-suites/dataflow/py2/build.gradle | 37 ++-
sdks/python/test-suites/dataflow/py35/build.gradle | 17 +-
sdks/python/test-suites/dataflow/py36/build.gradle | 17 +-
sdks/python/test-suites/dataflow/py37/build.gradle | 21 +-
sdks/python/test-suites/portable/py2/build.gradle | 10 +-
sdks/python/tox.ini | 6 +-
settings.gradle | 2 +
website/src/contribute/design-documents.md | 1 +
website/src/contribute/release-guide.md | 7 +-
website/src/contribute/runner-guide.md | 26 +-
website/src/documentation/io/built-in.md | 4 +-
.../patterns/file-processing-patterns.md | 2 +-
website/src/documentation/runners/jet.md | 41 +--
.../src/documentation/sdks/python-dependencies.md | 39 +++
.../sdks/python-pipeline-dependencies.md | 2 +-
.../src/documentation/transforms/python/index.md | 2 +-
.../transforms/python/other/reshuffle.md | 2 +-
website/src/get-started/quickstart-py.md | 2 +-
website/src/roadmap/python-sdk.md | 2 +-
233 files changed, 8204 insertions(+), 1473 deletions(-)
create mode 100644 .test-infra/jenkins/Kubernetes.groovy
create mode 100644
.test-infra/jenkins/job_PostCommit_CrossLanguageValidatesRunner_Flink.groovy
create mode 100644
.test-infra/jenkins/job_PostCommit_Python_Chicago_Taxi_Example_Dataflow.groovy
create mode 100755 .test-infra/kubernetes/kubernetes.sh
delete mode 100644
runners/core-construction-java/src/test/java/org/apache/beam/runners/core/construction/expansion/TestExpansionService.java
create mode 100644
runners/google-cloud-dataflow-java/worker/src/main/java/org/apache/beam/runners/dataflow/worker/HotKeyLogger.java
create mode 100644
runners/google-cloud-dataflow-java/worker/src/test/java/org/apache/beam/runners/dataflow/worker/HotKeyLoggerTest.java
create mode 100644
sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/planner/BeamCostModel.java
create mode 100644
sdks/java/extensions/sql/src/test/java/org/apache/beam/sdk/extensions/sql/impl/planner/BeamCostModelTest.java
create mode 100644
sdks/java/extensions/sql/src/test/java/org/apache/beam/sdk/extensions/sql/impl/planner/CalciteQueryPlannerTest.java
copy
sdks/java/extensions/sql/src/test/java/org/apache/beam/sdk/extensions/sql/impl/rel/{BeamIOSourceRelTest.java
=> BeamAggregationRelTest.java} (60%)
copy
sdks/java/extensions/sql/src/test/java/org/apache/beam/sdk/extensions/sql/impl/rel/{BeamIOSourceRelTest.java
=> BeamCalcRelTest.java} (55%)
create mode 100644
sdks/java/extensions/sql/src/test/java/org/apache/beam/sdk/extensions/sql/impl/rel/BeamUncollectRelTest.java
create mode 100644
sdks/java/extensions/sql/src/test/java/org/apache/beam/sdk/extensions/sql/impl/rel/BeamUnnestRelTest.java
create mode 100644
sdks/java/extensions/sql/src/test/java/org/apache/beam/sdk/extensions/sql/meta/provider/kafka/KafkaCSVTableIT.java
create mode 100644
sdks/java/extensions/sql/src/test/java/org/apache/beam/sdk/extensions/sql/meta/provider/kafka/KafkaCSVTestTable.java
create mode 100644
sdks/java/extensions/sql/src/test/java/org/apache/beam/sdk/extensions/sql/meta/provider/kafka/KafkaTestRecord.java
create mode 100644 sdks/java/io/bigquery-io-perf-tests/build.gradle
create mode 100644
sdks/java/io/bigquery-io-perf-tests/src/test/java/org/apache/beam/sdk/bigqueryioperftests/BigQueryIOReadPerformanceIT.java
rename
sdks/java/io/google-cloud-platform/src/{test/java/org/apache/beam/sdk/io/gcp/bigquery
=> main/java/org/apache/beam/sdk/io/gcp/testing}/FakeBigQueryServices.java
(87%)
rename
sdks/java/io/google-cloud-platform/src/{test/java/org/apache/beam/sdk/io/gcp/bigquery
=> main/java/org/apache/beam/sdk/io/gcp/testing}/FakeDatasetService.java (98%)
rename
sdks/java/io/google-cloud-platform/src/{test/java/org/apache/beam/sdk/io/gcp/bigquery
=> main/java/org/apache/beam/sdk/io/gcp/testing}/FakeJobService.java (98%)
rename
sdks/java/io/google-cloud-platform/src/{test/java/org/apache/beam/sdk/io/gcp/bigquery
=> main/java/org/apache/beam/sdk/io/gcp/testing}/TableContainer.java (97%)
create mode 100644 sdks/java/testing/expansion-service/build.gradle
create mode 100644
sdks/java/testing/expansion-service/src/test/java/org/apache/beam/sdk/expansion/TestExpansionService.java
delete mode 100644 sdks/python/apache_beam/coders/avro_coder.py
delete mode 100644 sdks/python/apache_beam/coders/avro_coder_test.py
create mode 100644 sdks/python/apache_beam/coders/avro_record.py
create mode 100644
sdks/python/apache_beam/io/external/generate_sequence_test.py
create mode 100644 sdks/python/apache_beam/io/external/xlang_parquetio_test.py
copy sdks/python/apache_beam/testing/{load_tests =>
benchmarks/chicago_taxi}/__init__.py (100%)
create mode 100644
sdks/python/apache_beam/testing/benchmarks/chicago_taxi/preprocess.py
create mode 100644
sdks/python/apache_beam/testing/benchmarks/chicago_taxi/process_tfma.py
create mode 100644
sdks/python/apache_beam/testing/benchmarks/chicago_taxi/requirements.txt
create mode 100755
sdks/python/apache_beam/testing/benchmarks/chicago_taxi/run_chicago.sh
create mode 100644
sdks/python/apache_beam/testing/benchmarks/chicago_taxi/setup.py
create mode 100644
sdks/python/apache_beam/testing/benchmarks/chicago_taxi/tfdv_analyze_and_validate.py
copy sdks/python/apache_beam/testing/{load_tests =>
benchmarks/chicago_taxi/trainer}/__init__.py (100%)
create mode 100644
sdks/python/apache_beam/testing/benchmarks/chicago_taxi/trainer/model.py
create mode 100644
sdks/python/apache_beam/testing/benchmarks/chicago_taxi/trainer/task.py
create mode 100644
sdks/python/apache_beam/testing/benchmarks/chicago_taxi/trainer/taxi.py
create mode 100644 sdks/python/apache_beam/testing/extra_assertions.py
create mode 100644 sdks/python/apache_beam/testing/extra_assertions_test.py
create mode 100644
sdks/python/apache_beam/typehints/trivial_inference_test_py3.py
create mode 100755 sdks/python/scripts/run_expansion_services.sh