[ https://issues.apache.org/jira/browse/BEAM-9434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Emiliano Capoccia updated BEAM-9434: ------------------------------------ Summary: Performance improvements processing a large number of Avro files in S3+Spark (was: Performance improvements processiong a large number of Avro files in S3+Spark) > Performance improvements processing a large number of Avro files in S3+Spark > ---------------------------------------------------------------------------- > > Key: BEAM-9434 > URL: https://issues.apache.org/jira/browse/BEAM-9434 > Project: Beam > Issue Type: Improvement > Components: io-java-aws, sdk-java-core > Affects Versions: 2.19.0 > Reporter: Emiliano Capoccia > Assignee: Emiliano Capoccia > Priority: Minor > Time Spent: 0.5h > Remaining Estimate: 0h > > There is a performance issue when processing in Spark on K8S a large number > of small Avro files (tens of thousands or more). > The recommended way of reading a pattern of Avro files in Beam is by means of: > > {code:java} > PCollection<AvroGenClass> records = p.apply(AvroIO.read(AvroGenClass.class) > .from("s3://my-bucket/path-to/*.avro").withHintMatchesManyFiles()) > {code} > However, in the case of many small files the above results in the entire > reading taking place in a single task/node, which is considerably slow and > has scalability issues. > The option of omitting the hint is not viable, as it results in too many > tasks being spawn and the cluster busy doing coordination of tiny tasks with > high overhead. > There are a few workarounds on the internet which mainly revolve around > compacting the input files before processing, so that a reduced number of > bulky files is processed in parallel. > -- This message was sent by Atlassian Jira (v8.3.4#803005)