abhaygupta3390 opened a new issue #1371: [SUPPORT] Upsert for S3 Hudi dataset 
with large partitions takes a lot of time in writing
URL: https://github.com/apache/incubator-hudi/issues/1371
 
 
   **Describe the problem you faced**
   
   I have a Spark batch/stream-processing application in which I take a batch of upserts and write the result to an S3 location in Hudi format. The application runs on an EMR cluster.
   The dataset has 3 partition columns, and the overall cardinality of the partitions is roughly 200 × 2 × 12 ≈ 4,800.
   After the commit and clean are done, the method `createRelation` is invoked; it takes roughly 9-10 minutes, and this keeps growing as the partition cardinality increases.
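   
   For reference, a minimal sketch of the write call described above (the `upsertDf` DataFrame, the field names, and the option values are placeholders, not the actual job):
   
   ```scala
   import org.apache.spark.sql.SaveMode
   
   // upsertDf is a hypothetical DataFrame holding the batch of upserts.
   upsertDf.write
     .format("org.apache.hudi")
     .option("hoodie.table.name", "<table_name>")
     .option("hoodie.datasource.write.operation", "upsert")
     .option("hoodie.datasource.write.recordkey.field", "record_key") // placeholder
     .option("hoodie.datasource.write.precombine.field", "ts")        // placeholder
     // three partition columns, as described above; multiple fields would also
     // need a complex key generator configured
     .option("hoodie.datasource.write.partitionpath.field", "part1,part2,part3")
     .mode(SaveMode.Append)
     .save("s3a://<path>/<table_name>")
   ```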
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Update a few records in a Hudi dataset at an S3 location that has a lot of partitions (see the sketch below)
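   
   For example (hypothetical keys and columns, written back with the same options as the sketch above):
   
   ```scala
   // Hypothetical: `spark` is the active SparkSession on the EMR cluster.
   import spark.implicits._
   
   // A handful of records whose keys already exist in the table, so the write
   // is a pure update touching only a few of the ~4,800 partitions.
   val fewUpdates = Seq(
     ("key-1", "a", "2019", "11", "updated-value"),
     ("key-2", "b", "2019", "12", "updated-value")
   ).toDF("record_key", "part1", "part2", "part3", "value")
   ```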
   
   **Expected behavior**
   
   Since I am writing the DataFrame to the path in append mode, I expect the write to be complete at the point the commit happens
   
   **Environment Description**
   
   * Hudi version : 0.5.1-incubating
   
   * Spark version : 2.4.4
   
   * Hive version : Hive 2.3.6-amzn-1
   
   * Hadoop version : Amazon 2.8.5
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : no
   
   
   **Additional context**
   
   EMR version: emr-5.29.0
   
   **Stacktrace**
   Debug logs for one iteration of 
`org.apache.hudi.hadoop.HoodieROTablePathFilter#accept` in 
`org.apache.spark.sql.execution.datasources.InMemoryFileIndex#listLeafFiles`:
   
   ```
   20/02/20 13:53:27 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient 
from <s3Path>
   20/02/20 13:53:27 INFO FSUtils: Hadoop Configuration: fs.defaultFS: 
[hdfs:/<emr_node>:8020], Config:[Configuration: core-default.xml, 
core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, 
yarn-site.xml, hdfs-default.xml, hdfs-site.xml, __spark_hadoop_conf__.xml, 
file:/mnt1/yarn/usercache/hadoop/appcache/<app_id>/<container_id>/hive-site.xml],
 FileSystem: [S3AFileSystem{uri=<s3_bucket>, 
workingDir=<s3_bucket>/user/hadoop, inputPolicy=normal, partSize=104857600, 
enableMultiObjectsDelete=true, maxKeys=5000, readAhead=65536, 
blockSize=33554432, multiPartThreshold=2147483647, 
boundedExecutor=BlockingThreadPoolExecutorService{SemaphoredDelegatingExecutor{permitCount=25,
 available=25, waiting=0}, activeCount=0}, 
unboundedExecutor=java.util.concurrent.ThreadPoolExecutor@7427ca41[Running, 
pool size = 9, active threads = 0, queued tasks = 0, completed tasks = 9], 
statistics {2524201 bytes read, 932558 bytes written, 1969 read ops, 0 large 
read ops, 224 write ops}, metrics {{Context=S3AFileSystem} 
{FileSystemId=b1dbe2e3-50de-4c75-a12e-4d4d5059b9a7-sense-datawarehouse} 
{fsURI=s3a://sense-datawarehouse} {files_created=8} {files_copied=0} 
{files_copied_bytes=0} {files_deleted=9} {fake_directories_deleted=24} 
{directories_created=1} {directories_deleted=0} {ignored_errors=0} 
{op_copy_from_local_file=0} {op_exists=316} {op_get_file_status=1371} 
{op_glob_status=2} {op_is_directory=222} {op_is_file=0} {op_list_files=166} 
{op_list_located_status=0} {op_list_status=431} {op_mkdirs=0} {op_rename=0} 
{object_copy_requests=0} {object_delete_requests=11} 
{object_list_requests=1686} {object_continue_list_requests=0} 
{object_metadata_requests=2460} {object_multipart_aborted=0} 
{object_put_bytes=932558} {object_put_requests=9} 
{object_put_requests_completed=9} {stream_write_failures=0} 
{stream_write_block_uploads=0} {stream_write_block_uploads_committed=0} 
{stream_write_block_uploads_aborted=0} {stream_write_total_time=0} 
{stream_write_total_data=0} {object_put_requests_active=0} 
{object_put_bytes_pending=0} {stream_write_block_uploads_active=0} 
{stream_write_block_uploads_pending=0} 
{stream_write_block_uploads_data_pending=0} {stream_read_fully_operations=0} 
{stream_opened=194} {stream_bytes_skipped_on_seek=0} {stream_closed=194} 
{stream_bytes_backwards_on_seek=0} {stream_bytes_read=2524201} 
{stream_read_operations_incomplete=476} {stream_bytes_discarded_in_abort=0} 
{stream_close_operations=194} {stream_read_operations=2766} {stream_aborted=0} 
{stream_forward_seek_operations=0} {stream_backward_seek_operations=0} 
{stream_seek_operations=0} {stream_bytes_read_in_close=0} 
{stream_read_exceptions=0} }}]
   20/02/20 13:53:27 INFO HoodieTableConfig: Loading table properties from 
s3a://<path>/<table_name>/.hoodie/hoodie.properties
   20/02/20 13:53:27 INFO HoodieTableMetaClient: Finished Loading Table of type 
COPY_ON_WRITE(version=1) from s3a://<path>/<table_name>
   20/02/20 13:53:27 INFO HoodieActiveTimeline: Loaded instants 
[[20200220113826__clean__COMPLETED], [20200220113826__commit__COMPLETED], 
[20200220114307__clean__COMPLETED], [20200220114307__commit__COMPLETED], 
[20200220114742__clean__COMPLETED], [20200220114742__commit__COMPLETED], 
[20200220115229__clean__COMPLETED], [20200220115229__commit__COMPLETED], 
[20200220115716__clean__COMPLETED], [20200220115716__commit__COMPLETED], 
[20200220120158__clean__COMPLETED], [20200220120158__commit__COMPLETED], 
[20200220120630__clean__COMPLETED], [20200220120630__commit__COMPLETED], 
[20200220121120__clean__COMPLETED], [20200220121120__commit__COMPLETED], 
[20200220121605__clean__COMPLETED], [20200220121605__commit__COMPLETED], 
[20200220122055__clean__COMPLETED], [20200220122055__commit__COMPLETED], 
[20200220122552__clean__COMPLETED], [20200220122552__commit__COMPLETED], 
[20200220123052__clean__COMPLETED], [20200220123052__commit__COMPLETED], 
[20200220123556__clean__COMPLETED], [20200220123556__commit__COMPLETED], 
[20200220124053__clean__COMPLETED], [20200220124053__commit__COMPLETED], 
[20200220124553__clean__COMPLETED], [20200220124553__commit__COMPLETED], 
[20200220125055__clean__COMPLETED], [20200220125055__commit__COMPLETED], 
[20200220125600__clean__COMPLETED], [20200220125600__commit__COMPLETED], 
[20200220130650__clean__COMPLETED], [20200220130650__commit__COMPLETED], 
[20200220131306__clean__COMPLETED], [20200220131306__commit__COMPLETED], 
[20200220131919__clean__COMPLETED], [20200220131919__commit__COMPLETED], 
[20200220132515__clean__COMPLETED], [20200220132515__commit__COMPLETED], 
[20200220134152__clean__COMPLETED], [20200220134152__commit__COMPLETED], 
[20200220135148__clean__COMPLETED], [20200220135148__commit__COMPLETED]]
   20/02/20 13:53:27 INFO HoodieTableFileSystemView: Adding file-groups for 
partition :2775011610391456677/2019/11, #FileGroups=1
   20/02/20 13:53:27 INFO AbstractTableFileSystemView: addFilesToView: 
NumFiles=3, FileGroupsCreationTime=0, StoreTimeTaken=0
   20/02/20 13:53:27 INFO HoodieROTablePathFilter: Based on hoodie metadata 
from base path: s3a:/<path>/<table_name>, caching 1 files under 
s3a://<path>/<table_name>/<part_col1>/<part_col2>/<part_col3>
   ```
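   
   For context: per the logs above, each `HoodieROTablePathFilter#accept` iteration reloads the table metadata and timeline for the partition path being listed, so the `InMemoryFileIndex` listing cost grows with the number of partitions. A minimal sketch of a read that exercises the same listing path (the base path is a placeholder; the glob depth matches the three partition columns):
   
   ```scala
   // `spark` is assumed to be the active SparkSession.
   // Loading through the Hudi datasource builds an InMemoryFileIndex over every
   // matched partition path, applying HoodieROTablePathFilter to each file.
   val df = spark.read
     .format("org.apache.hudi")
     .load("s3a://<path>/<table_name>/*/*/*")
   ```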
   Below is a screenshot of the Spark UI showing the time gap between the step above and the start of the next step:
    
   <img width="1661" alt="Screen Shot 2020-03-04 at 5 11 54 PM" src="https://user-images.githubusercontent.com/8233790/75879865-2d5acf80-5e42-11ea-853b-35825e55d889.png">
   
