[ https://issues.apache.org/jira/browse/HUDI-656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-656:
--------------------------------
    Labels: pull-request-available  (was: )

> Write Performance - Driver spends too much time creating Parquet DataSource 
> after writes
> ----------------------------------------------------------------------------------------
>
>                 Key: HUDI-656
>                 URL: https://issues.apache.org/jira/browse/HUDI-656
>             Project: Apache Hudi (incubating)
>          Issue Type: Improvement
>          Components: Performance, Spark Integration
>            Reporter: Udit Mehrotra
>            Assignee: Udit Mehrotra
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.6.0
>
>
> h2. Problem Statement
> We have noticed this performance bottleneck at EMR, and it has also been 
> reported here: [https://github.com/apache/incubator-hudi/issues/1371]
> For writes through the DataSource API, Hudi uses 
> [this|https://github.com/apache/incubator-hudi/blob/master/hudi-spark/src/main/scala/org/apache/hudi/DefaultSource.scala#L85]
>  to create the Spark relation. It first uses HoodieSparkSqlWriter to write the 
> dataframe, and afterwards tries to 
> [return|https://github.com/apache/incubator-hudi/blob/master/hudi-spark/src/main/scala/org/apache/hudi/DefaultSource.scala#L92]
>  a relation by creating it through the Parquet data source 
> [here|https://github.com/apache/incubator-hudi/blob/master/hudi-spark/src/main/scala/org/apache/hudi/DefaultSource.scala#L72].
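> For reference, a condensed sketch of that flow (simplified from the linked 
> DefaultSource code; signatures abbreviated, not the exact source):
> {code:scala}
> import org.apache.spark.sql.{DataFrame, SQLContext, SaveMode}
> import org.apache.spark.sql.sources.BaseRelation
>
> // Condensed sketch of the DefaultSource write path (not the exact code).
> def createRelation(sqlContext: SQLContext,
>                    mode: SaveMode,
>                    optParams: Map[String, String],
>                    df: DataFrame): BaseRelation = {
>   // 1. Perform the actual Hudi write.
>   HoodieSparkSqlWriter.write(sqlContext, mode, optParams, df)
>   // 2. Build a relation to return; this resolves the Parquet data
>   //    source and triggers the expensive file listing and filtering.
>   createRelation(sqlContext, optParams, df.schema)
> }
> {code}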
> In the process of creating this Parquet data source, Spark creates an 
> *InMemoryFileIndex* 
> [here|https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L371],
>  as part of which it lists all files under the base path. While the listing 
> itself is 
> [parallelized|https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala#L289],
>  the filter we pass, *HoodieROTablePathFilter*, is applied 
> [sequentially|https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala#L294]
>  on the driver to every one of the thousands of files returned by the listing. 
> Spark does not parallelize this step, and it takes a long time, most likely 
> because of the filter's logic, so the driver ends up spending all of its time 
> filtering. We have seen this take 10-12 minutes for just 50 partitions on S3, 
> all of it after the write itself has finished.
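> The relevant logic in Spark looks roughly like this (condensed from the 
> linked InMemoryFileIndex code):
> {code:scala}
> import org.apache.hadoop.fs.{FileStatus, PathFilter}
>
> // The listing itself runs in parallel across the cluster, but the
> // PathFilter is then applied file-by-file on the driver:
> def applyFilter(statuses: Seq[FileStatus], filter: PathFilter): Seq[FileStatus] =
>   if (filter != null) {
>     // With Hudi, `filter` is HoodieROTablePathFilter, whose accept()
>     // does non-trivial work per file, so this sequential pass is what
>     // keeps the driver busy for minutes.
>     statuses.filter(f => filter.accept(f.getPath))
>   } else {
>     statuses
>   }
> {code}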
> Solving this will significantly reduce write time for all types of writes. 
> The time is essentially wasted, because we do not actually need to return a 
> usable relation after the write: Spark never uses the returned relation anyway 
> [here|https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/SaveIntoDataSourceCommand.scala#L45],
>  and the write path returns an empty set of rows.
> h2. Proposed Solution
> The proposal is to return an empty Spark relation after the write, which cuts 
> out all the unnecessary time spent creating a Parquet relation that never 
> gets used.
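> One possible shape for such a relation (a hypothetical sketch; the class name 
> and details are illustrative, not a final implementation):
> {code:scala}
> import org.apache.spark.rdd.RDD
> import org.apache.spark.sql.{Row, SQLContext}
> import org.apache.spark.sql.sources.{BaseRelation, TableScan}
> import org.apache.spark.sql.types.StructType
>
> // An empty relation: preserves the written schema but performs no
> // file listing or path filtering when returned after a write.
> class EmptyRelation(val sqlContext: SQLContext,
>                     override val schema: StructType)
>   extends BaseRelation with TableScan {
>
>   // No files to scan; just hand Spark an empty RDD of rows.
>   override def buildScan(): RDD[Row] =
>     sqlContext.sparkContext.emptyRDD[Row]
> }
> {code}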



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
