[ https://issues.apache.org/jira/browse/HUDI-656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated HUDI-656:
--------------------------------
    Labels: pull-request-available  (was: )

Write Performance - Driver spends too much time creating Parquet DataSource after writes
----------------------------------------------------------------------------------------

                Key: HUDI-656
                URL: https://issues.apache.org/jira/browse/HUDI-656
            Project: Apache Hudi (incubating)
         Issue Type: Improvement
         Components: Performance, Spark Integration
           Reporter: Udit Mehrotra
           Assignee: Udit Mehrotra
           Priority: Major
             Labels: pull-request-available
            Fix For: 0.6.0

h2. Problem Statement

We have noticed this performance bottleneck at EMR, and it has also been reported here: [https://github.com/apache/incubator-hudi/issues/1371]

For writes through the DataSource API, Hudi uses [this code|https://github.com/apache/incubator-hudi/blob/master/hudi-spark/src/main/scala/org/apache/hudi/DefaultSource.scala#L85] to create the Spark relation. It uses HoodieSparkSqlWriter to write the DataFrame, and afterwards it [returns|https://github.com/apache/incubator-hudi/blob/master/hudi-spark/src/main/scala/org/apache/hudi/DefaultSource.scala#L92] a relation created through the Parquet data source [here|https://github.com/apache/incubator-hudi/blob/master/hudi-spark/src/main/scala/org/apache/hudi/DefaultSource.scala#L72].

While creating this Parquet data source, Spark builds an *InMemoryFileIndex* [here|https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L371], which performs a file listing of the base path.
While the listing itself is [parallelized|https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala#L289], the filter we pass, *HoodieROTablePathFilter*, is applied [sequentially|https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala#L294] on the driver to every one of the thousands of files returned by the listing. Spark does not parallelize this step, and it takes a long time, most likely because of the filter's logic, so the driver ends up spending its time just filtering. We have seen this take 10-12 minutes for only 50 partitions on S3, all of it after the write has already finished.

This time is essentially wasted, because we do not actually need to return a usable relation after a write: Spark never uses the returned relation [here|https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/SaveIntoDataSourceCommand.scala#L45] either way, and the write path returns an empty set of rows. Eliminating this step would significantly reduce write time across all types of writes.

h2. Proposed Solution

Return an empty Spark relation after the write, which cuts out all the unnecessary time spent creating a Parquet relation that never gets used.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
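The proposed solution could be sketched as below. This is a minimal, hypothetical illustration (not the actual Hudi patch), assuming Spark 2.4's BaseRelation/TableScan APIs; the class name EmptyRelation is invented here for clarity:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, TableScan}
import org.apache.spark.sql.types.StructType

// A relation that carries the written DataFrame's schema but scans nothing.
// Returning this from DefaultSource.createRelation (instead of building a
// Parquet relation) skips the InMemoryFileIndex listing and the sequential
// HoodieROTablePathFilter pass on the driver.
class EmptyRelation(override val sqlContext: SQLContext,
                    override val schema: StructType)
  extends BaseRelation with TableScan {

  // Spark never consumes the rows of the relation returned from a write
  // (see SaveIntoDataSourceCommand), so an empty RDD is sufficient.
  override def buildScan(): RDD[Row] = sqlContext.sparkContext.emptyRDD[Row]
}

// Hypothetical usage inside DefaultSource.createRelation, after the write:
//   HoodieSparkSqlWriter.write(sqlContext, mode, parameters, df)
//   new EmptyRelation(sqlContext, df.schema)
```

Because SaveIntoDataSourceCommand discards the relation, this change is invisible to callers of df.write while removing the post-write listing cost entirely.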