Re: Beam 2.15.0 SparkRunner issues

Jan Lukavský Thu, 03 Oct 2019 04:21:41 -0700

Hi Tim,

can you please elaborate more about some parts?

1) What actually happens in your case? What specific settings do you use?


3) Can you share stacktrace? Is it always the same, or does it change?

The mentioned GroupCombineFunctions.java:202 comes from a Reshuffle, which seems to make a little sense to me regarding the logic you described. Do you use the Reshuffle transform, or does it expand from some other transform?


Jan

On 10/3/19 9:24 AM, Tim Robertson wrote:

Hi all,
We haven't dug enough into this to know where to log issues, but I'll start by sharing here.
After upgrading from Beam 2.10.0 to 2.15.0 we see issues on SparkRunner - we suspect all of this is related.
1. spark.default.parallelism is not respected
2. File writing (Avro) with dynamic destinations (grouped into folders by a field name) consistently fails with org.apache.beam.sdk.util.UserCodeException: java.nio.file.FileAlreadyExistsException: Unable to rename resource hdfs://ha-nn/pipelines/export-20190930-0854/.temp-beam-d4fd89ed-fc7a-4b1e-aceb-68f9d72d50f0/6e086f60-8bda-4d0e-b29d-1b47fdfc88c0 to hdfs://ha-nn/pipelines/export-20190930-0854/7c9d2aec-f762-11e1-a439-00145eb45e9a/verbatimHBaseExport-00000-of-00001.avro as destination already exists and couldn't be deleted.
3. GBK operations that run over 500M small records consistently fail with OOM. We tried different configs with 48GB, 60GB, 80GB executor memory.
The pipelines we run are batch, with simple transformations: either an HBaseSnapshot to Avro files, or a merge of records in Avro (the GBK issue) pushed to ElasticSearch (it fails upstream of the ElasticsearchIO, in the GBK stage).
We notice operations that were mapToPair in 2.10.0 become repartition operations (mapToPair at GroupCombineFunctions.java:68 becomes repartition at GroupCombineFunctions.java:202), which might be related to this and looks surprising.
I'll report more as we learn. If anyone has any immediate ideas based on their commits or reviews, or if you wish any tests run on other Beam versions, please say.
Thanks,
Tim
