Hi Tim,

Can you please elaborate on a few points?

1) What actually happens in your case? What specific settings do you use?

3) Can you share the stack trace? Is it always the same, or does it change?

The GroupCombineFunctions.java:202 you mention comes from a Reshuffle, which makes some sense to me given the logic you described. Do you use the Reshuffle transform directly, or is it expanded from some other transform?

Jan

On 10/3/19 9:24 AM, Tim Robertson wrote:
Hi all,

We haven't dug enough into this to know where to log issues, but I'll start by sharing here.

After upgrading from Beam 2.10.0 to 2.15.0 we see issues on SparkRunner - we suspect all of this is related.

1. spark.default.parallelism is not respected

2. File writing (Avro) with dynamic destinations (grouped into folders by a field name) consistently fails with org.apache.beam.sdk.util.UserCodeException: java.nio.file.FileAlreadyExistsException: Unable to rename resource hdfs://ha-nn/pipelines/export-20190930-0854/.temp-beam-d4fd89ed-fc7a-4b1e-aceb-68f9d72d50f0/6e086f60-8bda-4d0e-b29d-1b47fdfc88c0 to hdfs://ha-nn/pipelines/export-20190930-0854/7c9d2aec-f762-11e1-a439-00145eb45e9a/verbatimHBaseExport-00000-of-00001.avro as destination already exists and couldn't be deleted.

3. GBK operations that run over 500M small records consistently fail with OOM. We tried different configs with 48GB, 60GB, and 80GB of executor memory.
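
To illustrate the exception type in point 2, here is a minimal local sketch. It assumes plain java.nio.file on a local temp directory rather than HDFS or Beam's own file handling, and the class, method, and file names are hypothetical. It only shows that a second no-overwrite rename onto an already existing destination fails with FileAlreadyExistsException, which is one way such an error could arise if two attempts finalize to the same output name; we have not confirmed that this is what happens in our pipeline.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical local analogue of a temp-file-to-final-file rename; not Beam code.
public class RenameCollisionSketch {

    /** Moves src to dst without overwriting; returns false if dst already exists. */
    static boolean renameNoOverwrite(Path src, Path dst) throws IOException {
        try {
            // No REPLACE_EXISTING option, so an existing destination is an error.
            Files.move(src, dst);
            return true;
        } catch (FileAlreadyExistsException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("export-sketch");
        // Two temp files standing in for two attempts at the same output shard.
        Path first = Files.write(dir.resolve("temp-a"), new byte[] {1});
        Path second = Files.write(dir.resolve("temp-b"), new byte[] {2});
        Path dst = dir.resolve("out-00000-of-00001.avro");

        System.out.println("first rename succeeded:  " + renameNoOverwrite(first, dst));
        System.out.println("second rename succeeded: " + renameNoOverwrite(second, dst));
    }
}
```

In this sketch the first rename succeeds and the second one is rejected because the destination is already there, leaving the second temp file in place.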

Our pipelines are batch, simple transformations with either an HBaseSnapshot to Avro files or a merge of records in Avro (the GBK issue) pushed to ElasticSearch (it fails upstream of the ElasticsearchIO in the GBK stage).

We notice that operations that were mapToPair in 2.10.0 become repartition operations (mapToPair at GroupCombineFunctions.java:68 becomes repartition at GroupCombineFunctions.java:202), which might be related to this and looks surprising.

I'll report more as we learn. If anyone has any immediate ideas based on their commits or reviews, or if you wish any tests run on other Beam versions, please say.

Thanks,
Tim


