[ https://issues.apache.org/jira/browse/BEAM-9835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17102843#comment-17102843 ]

Kyle Weaver commented on BEAM-9835:
-----------------------------------

I am leaving this as a starter task for an incoming intern.

The failure can be reproduced by running the following command in your local
Beam repo:

./gradlew :sdks:python:test-suites:portable:py2:sparkValidatesRunner -Ptests="test_multimap_multiside_input"

This test uses the same PCollection as a side input multiple times. It fails
because the Spark portable runner keys broadcasts [1] by PCollection ID and
creates one broadcast per side input use, so the same PCollection ID is
registered more than once and we end up with duplicate keys. Since PCollections
are immutable, it is only necessary to broadcast a PCollection once, no matter
how many times it is used as a side input.
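
To make the failure mode concrete, here is a minimal sketch in plain Python. It
is not the runner's code: build_strict, broadcast_per_use, broadcast_once, and
the "Broadcast(n)" strings are hypothetical stand-ins, assuming only that the
runner collects (PCollection ID, broadcast) pairs into a map that rejects
duplicate keys.

```python
from typing import Dict, List, Tuple


class DuplicateKeyError(ValueError):
    """Raised when a key is registered twice (stand-in for a strict map builder)."""


def build_strict(entries: List[Tuple[str, str]]) -> Dict[str, str]:
    """Build a map, refusing duplicate keys instead of silently overwriting."""
    result: Dict[str, str] = {}
    for key, value in entries:
        if key in result:
            raise DuplicateKeyError("Multiple entries with same key: " + key)
        result[key] = value
    return result


def broadcast_per_use(side_input_ids: List[str]) -> List[Tuple[str, str]]:
    """Hypothetical buggy behavior: one broadcast for every side input use."""
    return [
        (pc_id, "Broadcast({})".format(n))
        for n, pc_id in enumerate(side_input_ids)
    ]


def broadcast_once(side_input_ids: List[str]) -> List[Tuple[str, str]]:
    """Hypothetical fix: broadcast each distinct PCollection ID only once."""
    seen: Dict[str, str] = {}
    for pc_id in side_input_ids:
        if pc_id not in seen:
            seen[pc_id] = "Broadcast({})".format(len(seen))
    return list(seen.items())


# The same PCollection is used as a side input twice, as in the failing test.
uses = ["ref_PCollection_PCollection_21", "ref_PCollection_PCollection_21"]

try:
    build_strict(broadcast_per_use(uses))
    failed = False
except DuplicateKeyError:
    failed = True

deduplicated = build_strict(broadcast_once(uses))
```

With the per-use variant the strict map rejects the second entry, while the
deduplicated variant yields a single broadcast for the PCollection. Skipping
IDs that were already broadcast is safe here only because PCollections are
immutable.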

Extra credit: does this same bug affect the classic Spark runner? If so, that 
should be fixed as well.

[1] https://spark.apache.org/docs/2.4.5/api/java/org/apache/spark/broadcast/Broadcast.html

> test_multimap_multiside_input failing on Spark Python
> -----------------------------------------------------
>
>                 Key: BEAM-9835
>                 URL: https://issues.apache.org/jira/browse/BEAM-9835
>             Project: Beam
>          Issue Type: Bug
>          Components: test-failures
>            Reporter: Kyle Weaver
>            Priority: Major
>
> beam_PostCommit_Python_VR_Spark is red.
> 18:32:46 ERROR: test_multimap_multiside_input (__main__.SparkRunnerTest)
> 18:32:46 ----------------------------------------------------------------------
> 18:32:46 Traceback (most recent call last):
> 18:32:46   File "apache_beam/runners/portability/fn_api_runner/fn_runner_test.py", line 265, in test_multimap_multiside_input
> 18:32:46     equal_to([('a', [1, 3], [1, 2, 3]), ('b', [2], [1, 2, 3])]))
> 18:32:46   File "apache_beam/pipeline.py", line 529, in __exit__
> 18:32:46     self.run().wait_until_finish()
> 18:32:46   File "apache_beam/runners/portability/portable_runner.py", line 571, in wait_until_finish
> 18:32:46     (self._job_id, self._state, self._last_error_message()))
> 18:32:46 RuntimeError: Pipeline test_multimap_multiside_input_1588026700.62_3808162b-fc6a-4eb0-be3a-3efd819560f7 failed in state FAILED: java.lang.IllegalArgumentException: Multiple entries with same key: ref_PCollection_PCollection_21=(Broadcast(37),WindowedValue$FullWindowedValueCoder(KvCoder(ByteArrayCoder,VarLongCoder),GlobalWindow$Coder)) and ref_PCollection_PCollection_21=(Broadcast(36),WindowedValue$FullWindowedValueCoder(KvCoder(ByteArrayCoder,VarLongCoder),GlobalWindow$Coder))



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
