Hi Tim,
can you please elaborate more about some parts?
1) What happens actually in your case? What is the specific settings you
use?
3) Can you share stacktrace? Is it always the same, or does it change?
The mentioned GroupCombineFunctions.java:202 comes from a Reshuffle,
which seems to make a little sense to me regarding the logic you
described. Do you use Reshuffle transform or does it expand from some
other transform?
Jan
On 10/3/19 9:24 AM, Tim Robertson wrote:
Hi all,
We haven't dug enough into this to know where to log issues, but I'll
start by sharing here.
After upgrading from Beam 2.10.0 to 2.15.0 we see issues on
SparkRunner - we suspect all of this related.
1. spark.default.parallelism is not respected
2. File writing (Avro) with dynamic destinations (grouped into folders
by a field name) consistently fail with
org.apache.beam.sdk.util.UserCodeException:
java.nio.file.FileAlreadyExistsException: Unable to rename resource
hdfs://ha-nn/pipelines/export-20190930-0854/.temp-beam-d4fd89ed-fc7a-4b1e-aceb-68f9d72d50f0/6e086f60-8bda-4d0e-b29d-1b47fdfc88c0
to
hdfs://ha-nn/pipelines/export-20190930-0854/7c9d2aec-f762-11e1-a439-00145eb45e9a/verbatimHBaseExport-00000-of-00001.avro
as destination already exists and couldn't be deleted.
3. GBK operations that run over 500M small records consistently fail
with OOM. We tried different configs with 48GB, 60GB, 80GB executor
memory
Our pipelines run are batch, simple transformations with either an
HBaseSnapshot to Avro files or a merge of records in Avro (the GBK
issue) pushed to ElasticSearch (it fails upstream of the
ElasticsearchIO in the GBK stage).
We notice operations that were mapToPair in 2.10.0 become repartition
operations ( (mapToPair at GroupCombineFunctions.java:68 becomes
repartition at GroupCombineFunctions.java:202)) which might be related
to this and looks surprising.
I'll report more as we learn. If anyone has any immediate ideas based
on their commits or reviews or if you wish an tests run on other Beam
versions please say.
Thanks,
Tim