[ https://issues.apache.org/jira/browse/BEAM-9835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Kyle Weaver updated BEAM-9835:
------------------------------
    Comment: was deleted

(was: I am leaving this as a starter task for an incoming intern.

The failure can be reproduced by running the following command in your local Beam repo:

./gradlew :sdks:python:test-suites:portable:py2:sparkValidatesRunner -Ptests="test_multimap_multiside_input"

This test uses the same PCollection as a side input multiple times. The reason this test fails is that, since the Spark portable runner keys broadcasts [1] by PCollection ID, we end up with duplicate keys. Since PCollections are immutable, it is only necessary to broadcast a PCollection once, no matter how many times it is used as a side input.

Extra credit: does this same bug affect the classic Spark runner? If so, that should be fixed as well.

[1] https://spark.apache.org/docs/2.4.5/api/java/org/apache/spark/broadcast/Broadcast.html)

> test_multimap_multiside_input failing on Spark Python
> -----------------------------------------------------
>
>                 Key: BEAM-9835
>                 URL: https://issues.apache.org/jira/browse/BEAM-9835
>             Project: Beam
>          Issue Type: Bug
>          Components: test-failures
>            Reporter: Kyle Weaver
>            Priority: Major
>
> beam_PostCommit_Python_VR_Spark is red.
> 18:32:46 ERROR: test_multimap_multiside_input (__main__.SparkRunnerTest)
> 18:32:46 ----------------------------------------------------------------------
> 18:32:46 Traceback (most recent call last):
> 18:32:46   File "apache_beam/runners/portability/fn_api_runner/fn_runner_test.py", line 265, in test_multimap_multiside_input
> 18:32:46     equal_to([('a', [1, 3], [1, 2, 3]), ('b', [2], [1, 2, 3])]))
> 18:32:46   File "apache_beam/pipeline.py", line 529, in __exit__
> 18:32:46     self.run().wait_until_finish()
> 18:32:46   File "apache_beam/runners/portability/portable_runner.py", line 571, in wait_until_finish
> 18:32:46     (self._job_id, self._state, self._last_error_message()))
> 18:32:46 RuntimeError: Pipeline test_multimap_multiside_input_1588026700.62_3808162b-fc6a-4eb0-be3a-3efd819560f7 failed in state FAILED: java.lang.IllegalArgumentException: Multiple entries with same key: ref_PCollection_PCollection_21=(Broadcast(37),WindowedValue$FullWindowedValueCoder(KvCoder(ByteArrayCoder,VarLongCoder),GlobalWindow$Coder)) and ref_PCollection_PCollection_21=(Broadcast(36),WindowedValue$FullWindowedValueCoder(KvCoder(ByteArrayCoder,VarLongCoder),GlobalWindow$Coder))
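
For illustration only, the failure mode described in the deleted comment can be sketched in a few lines of Python. This is not the Spark runner's code (which is Java); every name below (register_strict, register_once, make_fake_broadcast, the "pc_21" ID) is hypothetical, and the fake broadcast function merely hands out numbered handles. The sketch assumes the runner collects one broadcast per side-input use into a map that rejects duplicate keys, which is what the "Multiple entries with same key" message suggests, and shows that broadcasting each PCollection ID only once avoids the collision.

```python
import itertools


class DuplicateKeyError(ValueError):
    """Raised when the same PCollection ID is registered twice."""


def make_fake_broadcast(start=36):
    """Return a stand-in broadcast function that hands out numbered handles."""
    counter = itertools.count(start)

    def broadcast(pcoll_id):
        # A real runner would ship the PCollection's data to the workers here.
        return "Broadcast(%d)" % next(counter)

    return broadcast


def register_strict(side_input_pcoll_ids, broadcast):
    """Broadcast once per side-input use; reject duplicate keys.

    Mimics the assumed failing behavior: a PCollection used as a side
    input twice is broadcast twice and collides on its own ID.
    """
    registry = {}
    for pcoll_id in side_input_pcoll_ids:
        handle = broadcast(pcoll_id)
        if pcoll_id in registry:
            raise DuplicateKeyError(
                "Multiple entries with same key: %s=%s and %s=%s"
                % (pcoll_id, handle, pcoll_id, registry[pcoll_id]))
        registry[pcoll_id] = handle
    return registry


def register_once(side_input_pcoll_ids, broadcast):
    """Broadcast each distinct PCollection ID only once.

    Safe because PCollections are immutable: every use of the same
    PCollection as a side input can share a single broadcast.
    """
    registry = {}
    for pcoll_id in side_input_pcoll_ids:
        if pcoll_id not in registry:
            registry[pcoll_id] = broadcast(pcoll_id)
    return registry
```

With the same PCollection ID listed twice, e.g. ["pc_21", "pc_21"], register_strict raises DuplicateKeyError, while register_once returns a single entry and calls the broadcast function only once.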