This is an automated email from the ASF dual-hosted git repository.

echauchot pushed a change to branch spark-runner_structured-streaming
in repository https://gitbox.apache.org/repos/asf/beam.git.
 discard 239bc93 Add setEnableSparkMetricSinks()
 discard db96d6a Add missing dependencies to run Spark Structured Streaming Runner on Nexmark
 discard 59cc5dd Add metrics support in DoFn
 discard fc83520 Ignore for now not working test testCombineGlobally
 discard 59b9e3f Add a test that combine per key preserves windowing
 discard 618cc5c Clean groupByKeyTest
 discard 0347a07 add comment in combine globally test
 discard c5aad8f Fixed immutable list bug
 discard e741781 Fix javadoc of AggregatorCombiner
 discard 85e6df4 Clean not more needed WindowingHelpers
 discard 6add5fe Clean not more needed RowHelpers
 discard df16763 Clean no more needed KVHelpers
 discard bf13e07 Now that there is only Combine.PerKey translation, make only one Aggregator
 discard e80c908 Remove CombineGlobally translation because it is less performant than the beam sdk one (key + combinePerKey.withHotkeyFanout)
 discard 37cdd7e Remove the mapPartition that adds a key per partition because otherwise spark will reduce values per key instead of globally
 discard 38e9eae Fix bug in the window merging logic
 discard dcb3949 Fix wrong encoder in combineGlobally GBK
 discard 11c3792 Fix case when a window does not merge into any other window
 discard 884e5f90 Apply a groupByKey avoids for some reason that the spark structured streaming fmwk casts data to Row which makes it impossible to deserialize without the coder shipped into the data. For performance reasons (avoid memory consumption and having to deserialize), we do not ship coder + data. Also add a mapparitions before GBK to avoid shuffling
 discard f0522dc [to remove] temporary: revert extractKey while combinePerKey is not done (so that it compiles)
 discard 9a269ef Fix encoder in combine call
 discard 8f8bae4 Implement merge accumulators part of CombineGlobally translation with windowing
 discard 4602f83 Output data after combine
 discard bba08b4 Implement reduce part of CombineGlobally translation with windowing
 discard 8d05d46 Fix comment about schemas
 discard 8a4372d Update KVHelpers.extractKey() to deal with WindowedValue and update GBK and CPK
 discard 8cdd143 Add TODO in Combine translations
 discard f86660f Add a test that GBK preserves windowing
 discard d0f5806 fixup Enable UsesBoundedSplittableParDo category of tests Since the default overwrite is already in place and it is breaking only in one subcase
 discard 09e6207 Improve visibility of debug messages
 discard c947455 re-enable reduceFnRunner timers for output
 discard 89df2bf Re-code GroupByKeyTranslatorBatch to conserve windowing instead of unwindowing/windowing(GlobalWindow): simplify code, use ReduceFnRunner to merge the windows
 discard 6ba3d1c Add comment about checkpoint mark
 discard 982197c Update windowAssignTest
 discard 6ae2056 Put back batch/simpleSourceTest.testBoundedSource
 discard 728aa1f Consider null object case on RowHelpers, fixes empty side inputs tests.
 discard 1d9155d fixup hadoop-format is not mandataory to run ValidatesRunner tests
 discard be79a86 Enable UsesSchema tests on ValidatesRunner
 discard ad22b68 Pass transform based doFnSchemaInformation in ParDo translation
 discard 5cf3e1b fixup Enable UsesFailureMessage category of tests
 discard 017497c Fixes ParDo not calling setup and not tearing down if exception on startBundle
 discard d08f110 Limit the number of partitions to make tests go 300% faster
 discard 85a9589 Add Batch Validates Runner tests for Structured Streaming Runner
 discard f2f953a Apply Spotless
 discard bd06201 Update javadoc
 discard f2cac46 implement source.stop
 discard 4786cf1 Ignore spark offsets (cf javadoc)
 discard 496d660 Use PAssert in Spark Structured Streaming transform tests
 discard e542bd3 Rename SparkPipelineResult to SparkStructuredStreamingPipelineResult This is done to avoid an eventual collision with the one in SparkRunner. However this cannot happen at this moment because it is package private, so it is also done for consistency.
 discard 1081207 Add SparkStructuredStreamingPipelineOptions and SparkCommonPipelineOptions - SparkStructuredStreamingPipelineOptions was added to have the new runner rely only on its specific options.
 discard 182f2d9 Fix logging levels in Spark Structured Streaming translation
 discard 40f2135 Fix spotless issues after rebase
 discard 0262dab Add doFnSchemaInformation to ParDo batch translation
 discard e1c8e96 Fix non-vendored imports from Spark Streaming Runner classes
 discard 2605bae Remove specific PipelineOotions and Registrar for Structured Streaming Runner
 discard f12ec6b Rename Runner to SparkStructuredStreamingRunner
 discard dc756a6 Remove spark-structured-streaming module
 discard e76b88b Merge Spark Structured Streaming runner into main Spark module
 discard 17aa0f7 Fix access level issues, typos and modernize code to Java 8 style
 discard 385f1a6 Disable never ending test SimpleSourceTest.testUnboundedSource
 discard ec633d1 Fix spotbogs warnings
 discard be0e9c3 Fix invalid code style with spotless
 discard 4ba0af3 Deal with checkpoint and offset based read
 discard 33b310cc Fllow up on offsets for streaming source
 discard 8a0def6 Clean streaming source
 discard f1340fd Remove unneeded 0 arg constructor in batch source
 discard 482bc7f Clean unneeded 0 arg constructor in batch source
 discard b27b982 Specify checkpointLocation at the pipeline start
 discard 0204d52 Add source streaming test
 discard a0f25eb Add transformators registry in PipelineTranslatorStreaming
 discard 24f45d9 Add a TODO on spark output modes
 discard 5e857be Start streaming source
 discard 2edf26e Add streaming source initialisation
 discard cdfffaf And unchecked warning suppression
 discard 3a83862 Added TODO comment for ReshuffleTranslatorBatch
 discard b0a4272 Added using CachedSideInputReader
 discard 303b6a6 Don't use Reshuffle translation
 discard 7d87b0a Fix CheckStyle violations
 discard 0e1ac77 Added SideInput support
 discard 40ab0cb Temporary fix remove fail on warnings
 discard 9493bca Fix build wordcount: to squash
 discard f7f0da9 Fix javadoc
 discard dccf33a Implement WindowAssignTest
 discard bbb881a Implement WindowAssignTranslatorBatch
 discard fcb2746 Cleaning
 discard 056359b [TO UPGRADE WITH THE 2 SPARK RUNNERS BEFORE MERGE] Change de wordcount build to test on new spark runner
 discard a4591c4 Fix encoder bug in combinePerkey
 discard c844fda Add explanation about receiving a Row as input in the combiner
 discard 32ae8dc Use more generic Row instead of GenericRowWithSchema
 discard ae07b1a Fix combine. For unknown reason GenericRowWithSchema is used as input of combine so extract its content to be able to proceed
 discard 06b4632 Update test with Long
 discard 7b3dfae Fix various type checking issues in Combine.Globally
 discard a936327 Get back to classes in translators resolution because URNs cannot translate Combine.Globally
 discard 42414191 Cleaning
 discard d1e7eea Add CombineGlobally translation to avoid translating Combine.perKey as a composite transform based on Combine.PerKey (which uses low perf GBK)
 discard 86c421b Introduce RowHelpers
 discard 8889f71 Add combinePerKey and CombineGlobally tests
 discard c829100 Fix combiner using KV as input, use binary encoders in place of accumulatorEncoder and outputEncoder, use helpers, spotless
 discard 9ed9d5b Introduce WindowingHelpers (and helpers package) and use it in Pardo, GBK and CombinePerKey
 discard 11a87c8 Improve type checking of Tuple2 encoder
 discard c4e008d First version of combinePerKey
 discard b37ac03 Extract binary schema creation in a helper class
 discard f361a7e Fix getSideInputs
 discard a6e4b37 Generalize the use of SerializablePipelineOptions in place of (not serializable) PipelineOptions
 discard 205f96d Rename SparkDoFnFilterFunction to DoFnFilterFunction for consistency
 discard 8e7fc66 Add a test for the most simple possible Combine
 discard 01cfa11 Added "testTwoPardoInRow"
 discard 7f22722 Fix for test elements container in GroupByKeyTest
 discard 07852b1 Rename SparkOutputManager for consistency
 discard c5b4593 Fix kryo issue in GBK translator with a workaround
 discard 6abc251 Simplify logic of ParDo translator
 discard c9ad8d9 Don't use deprecated sideInput.getWindowingStrategyInternal()
 discard e8d6b77 Rename pruneOutput() to pruneOutputFilteredByTag()
 discard 112704f Rename SparkSideInputReader class
 discard e730fa8 Fixed Javadoc error
 discard 7951f7b Apply spotless
 discard 67c7d19 Fix Encoders: create an Encoder for every manipulated type to avoid Spark fallback to genericRowWithSchema and cast to allow having Encoders with generic types such as WindowedValue<T> and get the type checking back
 discard 76fd11c Fail in case of having SideInouts or State/Timers
 discard 09a5ba2 Add ComplexSourceTest
 discard 8b1b17e Remove no more needed putDatasetRaw
 discard 1faf89b Port latest changes of ReadSourceTranslatorBatch to ReadSourceTranslatorStreaming
 discard 720681f Fix type checking with Encoder of WindowedValue<T>
 discard 62d4f5d Add comments and TODO to GroupByKeyTranslatorBatch
 discard 562ebfa Add GroupByKeyTest
 discard 45ef1af Clean
 discard 92b991a Address minor review notes
 discard e2fca21 Add ParDoTest
 discard f896c3b Cleaning
 discard c89ac48 Fix split bug
 discard eb76dce Remove bundleSize parameter and always use spark default parallelism
 discard f70bb56 Cleaning
 discard 5e185b3 Fix testMode output to comply with new binary schema
 discard 63053ec Fix errorprone
 discard 418c61c Comment schema choices
 discard 253e350 Serialize windowedValue to byte[] in source to be able to specify a binary dataset schema and deserialize windowedValue from Row to get a dataset<WindowedValue>
 discard df5409e First attempt for ParDo primitive implementation
 discard 2255618 Add flatten test
 discard b3f018d Enable gradle build scan
 discard d208890 Enable test mode
 discard 59777ba Put all transform translators Serializable
 discard 8a4b4af Simplify beam reader creation as it created once the source as already been partitioned
 discard ac00f29 Fix SourceTest
 discard 53b9370 Move SourceTest to same package as tested class
 discard cc0ab21 Add serialization test
 discard fd63639 Fix SerializationDebugger
 discard e97b5be Add SerializationDebugger
 discard 5835df2 Fix serialization issues
 discard ed569f5 Cleaning unneeded fields in DatasetReader
 discard fcf47dd improve readability of options passing to the source
 discard a83aa58 Fix pipeline triggering: use a spark action instead of writing the dataset
 discard bed6a02 Deactivate deps resolution forcing that prevent using correct spark transitive dep
 discard 79640b0 Refactor SourceTest to a UTest instaed of a main
 discard 8096514 Checkstyle and Findbugs
 discard 0518568 Clean
 discard 92c4cf2 Add empty 0-arg constructor for mock source
 discard 6f33718 Add a dummy schema for reader
 discard 5da4168 Fix checkstyle
 discard a3f5ca1 Use new PipelineOptionsSerializationUtils
 discard b048aa7 Apply spotless
 discard 90d9426 Add missing 0-arg public constructor
 discard fc8d7f4 Wire real SourceTransform and not mock and update the test
 discard 2707178 Refactor DatasetSource fields
 discard 3c7688f Pass Beam Source and PipelineOptions to the spark DataSource as serialized strings
 discard c9c2b3b Move Source and translator mocks to a mock package.
 discard 78db97c Add ReadSourceTranslatorStreaming
 discard d56ef09 Cleaning
 discard 46c60cd Use raw Encoder<WindowedValue> also in regular ReadSourceTranslatorBatch
 discard c706109 Split batch and streaming sources and translators
 discard 37b0a2a Run pipeline in batch mode or in streaming mode
 discard f0211b5 Move DatasetSourceMock to proper batch mode
 discard 4b40103 clean deps
 discard 6724b88 Use raw WindowedValue so that spark Encoders could work (temporary)
 discard 89d910e fix mock, wire mock in translators and create a main test.
 discard 590fe0a Add source mocks
 discard f3d50d4 Experiment over using spark Catalog to pass in Beam Source through spark Table
 discard 90c92af Improve type enforcement in ReadSourceTranslator
 discard 50088ea Improve exception flow
 discard 8bdcd4c start source instanciation
 discard 2d8db45 Apply spotless
 discard 261bb7c update TODO
 discard 9e33e38 Implement read transform
 discard df0a396 Use Iterators.transform() to return Iterable
 discard e5b3b64 Add primitive GroupByKeyTranslatorBatch implementation
 discard 9c3b75a Add Flatten transformation translator
 discard 356a830 Create Datasets manipulation methods
 discard 073cf1e Create PCollections manipulation methods
 discard aa1234e Add basic pipeline execution. Refactor translatePipeline() to return the translationContext on which we can run startPipeline()
 discard 27c7fd8 Added SparkRunnerRegistrar
 discard e4a334b Add precise TODO for multiple TransformTranslator per transform URN
 discard c084b60 Post-pone batch qualifier in all classes names for readability
 discard dce77db Add TODOs
 discard 9351397 Make codestyle and firebug happy
 discard 22c8338 apply spotless for e-formatting
     new af2e92b apply spotless
     new 3504543 Make codestyle and firebug happy
     new 419211a Add TODOs
     new 60bc61b Post-pone batch qualifier in all classes names for readability
     new d23f4fd Add precise TODO for multiple TransformTranslator per transform URN
     new 52da1ea Added SparkRunnerRegistrar
     new e3ac83a Add basic pipeline execution. Refactor translatePipeline() to return the translationContext on which we can run startPipeline()
     new 7d7dcc6 Create PCollections manipulation methods
     new 6ff5192 Create Datasets manipulation methods
     new b045f0d Add Flatten transformation translator
     new 3491edf Add primitive GroupByKeyTranslatorBatch implementation
     new 6a9021d Use Iterators.transform() to return Iterable
     new dfcfa6a Implement read transform
     new 039976a update TODO
     new 4b55ed6 Apply spotless
     new 9b3a72e start source instanciation
     new aec130e Improve exception flow
     new 6cfdf31 Improve type enforcement in ReadSourceTranslator
     new bc6b4e6 Experiment over using spark Catalog to pass in Beam Source through spark Table
     new 8a0343a Add source mocks
     new f99d1d1 fix mock, wire mock in translators and create a main test.
     new b5cc52a Use raw WindowedValue so that spark Encoders could work (temporary)
     new 80b91bd clean deps
     new a0243bf Move DatasetSourceMock to proper batch mode
     new 0b9875f Run pipeline in batch mode or in streaming mode
     new 772efc4 Split batch and streaming sources and translators
     new 98d6e1e Use raw Encoder<WindowedValue> also in regular ReadSourceTranslatorBatch
     new 2dca3c2 Clean
     new 5559890 Add ReadSourceTranslatorStreaming
     new 8bfa1a5 Move Source and translator mocks to a mock package.
     new a96039e Pass Beam Source and PipelineOptions to the spark DataSource as serialized strings
     new 0c64b46 Refactor DatasetSource fields
     new 84f1963 Wire real SourceTransform and not mock and update the test
     new af3dfad Add missing 0-arg public constructor
     new c38d1d2 Use new PipelineOptionsSerializationUtils
     new 73d8c71 Apply spotless and fix checkstyle
     new 0bdbf93 Add a dummy schema for reader
     new 2bc0d23 Add empty 0-arg constructor for mock source
     new b9acd02 Clean
     new ab70e9e Checkstyle and Findbugs
     new 4f6d5b9 Refactor SourceTest to a UTest instaed of a main
     new 6173a33 Deactivate deps resolution forcing that prevent using correct spark transitive dep
     new 38b1514 Fix pipeline triggering: use a spark action instead of writing the dataset
     new 3b868db improve readability of options passing to the source
     new 0cfde2b Clean unneeded fields in DatasetReader
     new ec13b5b Fix serialization issues
     new 8ba03d6f Add SerializationDebugger
     new 4d18207 Add serialization test
     new 155a65b Move SourceTest to same package as tested class
     new b6a6912 Fix SourceTest
     new b39024a Simplify beam reader creation as it created once the source as already been partitioned
     new 33754a8 Put all transform translators Serializable
     new 7c4ff5c Enable test mode
     new 21c426e Enable gradle build scan
     new 5067ca1 Add flatten test
     new 000381c First attempt for ParDo primitive implementation
     new d9d424f Serialize windowedValue to byte[] in source to be able to specify a binary dataset schema and deserialize windowedValue from Row to get a dataset<WindowedValue>
     new 80aa7ea Comment schema choices
     new 355fa9e Fix errorprone
     new 152d858 Fix testMode output to comply with new binary schema
     new 53ddaa0 Cleaning
     new 6b87e80 Remove bundleSize parameter and always use spark default parallelism
     new bab98f1 Fix split bug
     new afc53ee Clean
     new 32bf4da Add ParDoTest
     new cd58952 Address minor review notes
     new 8e999c9 Clean
     new df89095 Add GroupByKeyTest
     new 6eb2ff1 Add comments and TODO to GroupByKeyTranslatorBatch
     new 55a6419 Fix type checking with Encoder of WindowedValue<T>
     new 4f55612 Port latest changes of ReadSourceTranslatorBatch to ReadSourceTranslatorStreaming
     new e4948df Remove no more needed putDatasetRaw
     new d3c6e42 Add ComplexSourceTest
     new 24a0f7d Fail in case of having SideInouts or State/Timers
     new fe00df0 Fix Encoders: create an Encoder for every manipulated type to avoid Spark fallback to genericRowWithSchema and cast to allow having Encoders with generic types such as WindowedValue<T> and get the type checking back
     new 6efe901 Apply spotless
     new 8407488 Fixed Javadoc error
     new 1f6c3b7 Rename SparkSideInputReader class and rename pruneOutput() to pruneOutputFilteredByTag()
     new c8cefc8 Don't use deprecated sideInput.getWindowingStrategyInternal()
     new 72bf4d3 Simplify logic of ParDo translator
     new 106e54e Fix kryo issue in GBK translator with a workaround
     new 8ede26e Rename SparkOutputManager for consistency
     new 2b32d52 Fix for test elements container in GroupByKeyTest
     new d7104ee Added "testTwoPardoInRow"
     new 3bbb0d1 Add a test for the most simple possible Combine
     new a33477f Rename SparkDoFnFilterFunction to DoFnFilterFunction for consistency
     new 28d0be3 Generalize the use of SerializablePipelineOptions in place of (not serializable) PipelineOptions
     new 05a1a11 Fix getSideInputs
     new c7256da Extract binary schema creation in a helper class
     new d1bafd1 First version of combinePerKey
     new 9407138 Improve type checking of Tuple2 encoder
     new 6da1c02 Introduce WindowingHelpers (and helpers package) and use it in Pardo, GBK and CombinePerKey
     new 9b3c604 Fix combiner using KV as input, use binary encoders in place of accumulatorEncoder and outputEncoder, use helpers, spotless
     new 11f6322 Add combinePerKey and CombineGlobally tests
     new 7629b22 Introduce RowHelpers
     new 0c48a94 Add CombineGlobally translation to avoid translating Combine.perKey as a composite transform based on Combine.PerKey (which uses low perf GBK)
     new b390509 Cleaning
     new 571b3e9 Get back to classes in translators resolution because URNs cannot translate Combine.Globally
     new 3ede119 Fix various type checking issues in Combine.Globally
     new ac7f271 Update test with Long
     new c453b90 Fix combine. For unknown reason GenericRowWithSchema is used as input of combine so extract its content to be able to proceed
     new 5d86507 Use more generic Row instead of GenericRowWithSchema
     new 0bd6370 Add explanation about receiving a Row as input in the combiner
     new 2a8d92e Fix encoder bug in combinePerkey
     new 976783c [TO UPGRADE WITH THE 2 SPARK RUNNERS BEFORE MERGE] Change de wordcount build to test on new spark runner
     new e712187 Cleaning
     new dafe5c4 Implement WindowAssignTranslatorBatch
     new 07bceca Implement WindowAssignTest
     new f7f3b54 Fix javadoc
     new 8e029bd Fix build wordcount: disable deps versions forcing, remove fail on warnings
     new 2128544 Added SideInput support
     new d705c22 Fix CheckStyle violations
     new d629b1c Don't use Reshuffle translation
     new 519c9b2 Added using CachedSideInputReader
     new 9ff127d Added TODO comment for ReshuffleTranslatorBatch
     new 8bd0f66 And unchecked warning suppression
     new cb261f2 Add streaming source initialisation
     new 9803c07 Implement first streaming source
     new 10fac36 Add a TODO on spark output modes
     new c46ce22 Add transformators registry in PipelineTranslatorStreaming
     new a14698f Add source streaming test
     new e0ecd02 Specify checkpointLocation at the pipeline start
     new 4ef8635 Clean unneeded 0 arg constructor in batch source
     new b9af1b5 Clean streaming source
     new 559e80f Continue impl of offsets for streaming source
     new 1bc4d75 Deal with checkpoint and offset based read
     new 85635ff Apply spotless and fix spotbugs warnings
     new df02413 Disable never ending test SimpleSourceTest.testUnboundedSource
     new 51630e2 Fix access level issues, typos and modernize code to Java 8 style
     new 07b01b1 Merge Spark Structured Streaming runner into main Spark module. Remove spark-structured-streaming module. Rename Runner to SparkStructuredStreamingRunner. Remove specific PipelineOotions and Registrar for Structured Streaming Runner
     new f36c3b8 Fix non-vendored imports from Spark Streaming Runner classes
     new 8602eaa Pass doFnSchemaInformation to ParDo batch translation
     new 1d48ae5 Fix spotless issues after rebase
     new 45781a8 Fix logging levels in Spark Structured Streaming translation
     new 4f0e0e6 Add SparkStructuredStreamingPipelineOptions and SparkCommonPipelineOptions - SparkStructuredStreamingPipelineOptions was added to have the new runner rely only on its specific options.
     new 9d6caa1 Rename SparkPipelineResult to SparkStructuredStreamingPipelineResult This is done to avoid an eventual collision with the one in SparkRunner. However this cannot happen at this moment because it is package private, so it is also done for consistency.
     new 4b568da Use PAssert in Spark Structured Streaming transform tests
     new e04d455 Ignore spark offsets (cf javadoc)
     new 82ed0fc implement source.stop
     new a75e88a Update javadoc
     new 72f18c7 Apply Spotless
     new a9bdab6 Add Batch Validates Runner tests for Structured Streaming Runner
     new 20d5a68 Limit the number of partitions to make tests go 300% faster
     new d551903 Fixes ParDo not calling setup and not tearing down if exception on startBundle
     new 785c0d9 fixup Enable UsesFailureMessage and UsesSchema categories on validatesRunner tests
     new e2eb99a Pass transform based doFnSchemaInformation in ParDo translation
     new b0f9afd fixup hadoop-format is not mandataory to run ValidatesRunner tests
     new a5ca4df Consider null object case on RowHelpers, fixes empty side inputs tests.
     new 8ee2e60 Put back batch/simpleSourceTest.testBoundedSource
     new f0c415c Update windowAssignTest
     new ff937d9 Add comment about checkpoint mark
     new d2cebcc Re-code GroupByKeyTranslatorBatch to conserve windowing instead of unwindowing/windowing(GlobalWindow): simplify code, use ReduceFnRunner to merge the windows
     new 32756ef re-enable reduceFnRunner timers for output
     new 4d75434 Improve visibility of debug messages
     new 8dad2a7 fixup Enable UsesBoundedSplittableParDo category of tests Since the default overwrite is already in place and it is breaking only in one subcase
     new 0c30d20 Add a test that GBK preserves windowing
     new 2c16deb Add TODO in Combine translations
     new 6a6b959 Update KVHelpers.extractKey() to deal with WindowedValue and update GBK and CPK
     new 50dac73 Fix comment about schemas
     new dea9155 Implement reduce part of CombineGlobally translation with windowing
     new 49293d7 Output data after combine
     new a269fa26 Implement merge accumulators part of CombineGlobally translation with windowing
     new c4e3f44 Fix encoder in combine call
     new f4f186b Revert extractKey while combinePerKey is not done (so that it compiles)
     new a210466 Apply a groupByKey avoids for some reason that the spark structured streaming fmwk casts data to Row which makes it impossible to deserialize without the coder shipped into the data. For performance reasons (avoid memory consumption and having to deserialize), we do not ship coder + data. Also add a mapparitions before GBK to avoid shuffling
     new ac06721 Fix case when a window does not merge into any other window
     new 5a5ca48 Fix wrong encoder in combineGlobally GBK
     new aa5543f Fix bug in the window merging logic
     new d869c50 Remove the mapPartition that adds a key per partition because otherwise spark will reduce values per key instead of globally
     new 8569c9d Remove CombineGlobally translation because it is less performant than the beam sdk one (key + combinePerKey.withHotkeyFanout)
     new 1c28c63 Now that there is only Combine.PerKey translation, make only one Aggregator
     new 930974f Clean no more needed KVHelpers
     new a0cd463 Clean not more needed RowHelpers
     new 319f4c9 Clean not more needed WindowingHelpers
     new 4110d8f Fix javadoc of AggregatorCombiner
     new 19779af Fixed immutable list bug
     new 41a2027 add comment in combine globally test
     new 418df85 Clean groupByKeyTest
     new 996a296 Add a test that combine per key preserves windowing
     new 7803025 Ignore for now not working test testCombineGlobally
     new d5f322b Add metrics support in DoFn
     new 7356ec2 Add missing dependencies to run Spark Structured Streaming Runner on Nexmark

This update added new revisions after undoing existing revisions. That is to say, some revisions that were in the old version of the branch are not in the new version. This situation occurs when a user --force pushes a change and generates a repository containing something like this:

 * -- * -- B -- O -- O -- O   (239bc93)
            \
             N -- N -- N   refs/heads/spark-runner_structured-streaming (7356ec2)

You should already have received notification emails for all of the O revisions, and so the following emails describe only the N revisions from the common base, B.

Any revisions marked "omit" are not gone; other references still refer to them. Any revisions marked "discard" are gone forever.

The 182 revisions listed above as "new" are entirely new to this repository and will be described in separate emails. The revisions listed as "add" were already present in the repository and have only been added to this reference.

Summary of changes:
 .../structuredstreaming/SparkStructuredStreamingPipelineOptions.java | 2 --
 1 file changed, 2 deletions(-)
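[Editor's note] The O/N situation described above can be reproduced in a throwaway repository. The sketch below is illustrative only (all file names and commit messages are invented, not taken from the beam repository): it builds a base commit B and an old tip O, rewrites history the way a force push does to create a new tip N, then verifies that O is no longer reachable from the branch.

```shell
#!/usr/bin/env bash
# Illustrative sketch: reproduce the B -- O / B -- N history shown above.
set -e
tmp=$(mktemp -d)
git init -q "$tmp/repo"
cd "$tmp/repo"
git config user.email "demo@example.com"
git config user.name "demo"

echo base > f.txt; git add f.txt; git commit -qm "B: common base"
echo old  > f.txt; git commit -qam "O: old revision"
old_tip=$(git rev-parse HEAD)

git reset -q --hard HEAD~1            # undo O: this is what a force push discards
echo new > f.txt; git commit -qam "N: new revision"
new_tip=$(git rev-parse HEAD)

# O is no longer an ancestor of the branch tip, i.e. "gone" from the ref:
if git merge-base --is-ancestor "$old_tip" "$new_tip"; then
  echo "O still reachable"
else
  echo "O discarded from the branch"    # prints this line
fi
```

Until garbage collection runs, a discarded commit like `$old_tip` can still be inspected locally via `git log $old_tip` or recovered through `git reflog`, which is why "gone forever" applies only to the branch reference itself.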