hudi-bot opened a new issue, #14691:
URL: https://github.com/apache/hudi/issues/14691

I am using Hudi 0.5.0.

I am trying to write into a COW table that has 350 columns. HoodieSparkSQLWriter takes more than 6 minutes to merge, and I don't see anything spilling to disk. Is there any tuning to improve this? I have disabled pre-combine via option("hoodie.combine.before.upsert", "false"); setting it to true did not make much difference.

```
20/11/05 07:43:37 INFO DefaultSource: Constructing hoodie (as parquet) data source with options :Map(hoodie.datasource.write.insert.drop.duplicates -> false, hoodie.datasource.hive_sync.database -> default, hoodie.parquet.small.file.limit -> 134217728, hoodie.copyonwrite.record.size.estimate -> 160, hoodie.insert.shuffle.parallelism -> 1000, path -> /projects/cdp/data/cdp_reporting/trr_test2, hoodie.datasource.write.precombine.field -> request_id, hoodie.datasource.hive_sync.partition_fields -> , hoodie.datasource.write.payload.class -> com.cybs.cdp.reporting.custom.CustomOverWriteWithLatestAvroPayload, hoodie.datasource.hive_sync.partition_extractor_class -> org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor, hoodie.parquet.max.file.size -> 268435456, hoodie.datasource.write.streaming.retry.interval.ms -> 2000, hoodie.datasource.hive_sync.table -> unknown, hoodie.datasource.write.streaming.ignore.failed.batch -> true, hoodie.datasource.write.operation -> upsert, hoodie.parquet.compression.codec -> snappy, hoodie.datasource.hive_sync.enable -> false, hoodie.datasource.write.recordkey.field -> request_id, hoodie.datasource.view.type -> read_optimized, hoodie.table.name -> trr2, hoodie.datasource.hive_sync.jdbcurl -> jdbc:hive2://localhost:10000, hoodie.datasource.write.table.type -> COPY_ON_WRITE, hoodie.memory.merge.max.size -> 2004857600000, hoodie.datasource.write.storage.type -> COPY_ON_WRITE, hoodie.cleaner.policy -> KEEP_LATEST_FILE_VERSIONS, hoodie.datasource.hive_sync.username -> hive, hoodie.datasource.write.streaming.retry.count -> 3, hoodie.combine.before.upsert -> false, hoodie.datasource.hive_sync.password -> hive, hoodie.datasource.write.keygenerator.class -> org.apache.hudi.ComplexKeyGenerator, hoodie.keep.max.commits -> 3, hoodie.upsert.shuffle.parallelism -> 1000, hoodie.datasource.hive_sync.assume_date_partitioning -> false, hoodie.cleaner.commits.retained -> 1, hoodie.keep.min.commits -> 2, hoodie.datasource.write.partitionpath.field -> transaction_month, hoodie.datasource.write.commitmeta.key.prefix -> _, hoodie.index.bloom.num_entries -> 1500000)
```

Code snippet:

```scala
val responseDF = trrDF.write.format("org.apache.hudi").
  option("hoodie.insert.shuffle.parallelism", "1000").
  option("hoodie.upsert.shuffle.parallelism", "1000").
  option("hoodie.datasource.write.table.type", "COPY_ON_WRITE").
  option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL).
  option(PRECOMBINE_FIELD_OPT_KEY, "request_id").
  option("hoodie.memory.merge.max.size", "2004857600000").
  option(PARTITIONPATH_FIELD_OPT_KEY, "transaction_month").
  option(KEYGENERATOR_CLASS_OPT_KEY, "org.apache.hudi.ComplexKeyGenerator").
  option(PAYLOAD_CLASS_OPT_KEY, "com.cybs.cdp.reporting.custom.CustomOverWriteWithLatestAvroPayload").
  option("hoodie.cleaner.policy", "KEEP_LATEST_FILE_VERSIONS").
  option("hoodie.cleaner.commits.retained", "1").
  option("hoodie.keep.min.commits", "2").
  option("hoodie.keep.max.commits", "3").
  option("hoodie.index.bloom.num_entries", "1500000").
  option("hoodie.copyonwrite.record.size.estimate", "160").
  option("hoodie.parquet.max.file.size", String.valueOf(256 * 1024 * 1024)).
  option("hoodie.parquet.small.file.limit", String.valueOf(128 * 1024 * 1024)).
  option("hoodie.parquet.compression.codec", "snappy").
  option("hoodie.combine.before.upsert", "false").
  option(RECORDKEY_FIELD_OPT_KEY, "request_id").
  option(TABLE_NAME, "trr2").
  mode(Append).
  save("/projects/cdp/data/cdp_reporting/trr_test2")
```
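One thing that stands out in the options above: hoodie.memory.merge.max.size is set to 2004857600000 bytes (roughly 1.8 TB), far above the executor memory visible in the logs (about 12 GB free), so the merge-time spillable map will effectively never spill. The sketch below just writes the size-valued options as computed constants instead of magic numbers; the ~2 GB merge budget is my illustrative assumption, not a value taken from this issue:

```scala
// Sketch only: the size-valued Hudi options from the snippet above,
// expressed as computed byte counts. The 2 GB merge budget is an
// assumed intent; the issue itself sets 2004857600000 (~1.8 TB).
object SizeOptions {
  val MB: Long = 1024L * 1024L

  val maxParquetFileSize: Long = 256 * MB   // hoodie.parquet.max.file.size    = 268435456
  val smallFileLimit: Long     = 128 * MB   // hoodie.parquet.small.file.limit = 134217728
  val mergeMaxSize: Long       = 2048 * MB  // hypothetical ~2 GB merge budget

  def main(args: Array[String]): Unit = {
    println(maxParquetFileSize) // 268435456
    println(smallFileLimit)     // 134217728
  }
}
```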

![image-2020-11-04-23-54-12-187.png](https://issues.apache.org/jira/secure/attachment/13014753/image-2020-11-04-23-54-12-187.png)

Tasks corresponding to stage 20:

![image-2020-11-04-23-58-00-554.png](https://issues.apache.org/jira/secure/attachment/13014752/image-2020-11-04-23-58-00-554.png)

Logs from one of the executors:

```
20/11/05 06:17:21 INFO TorrentBroadcast: Started reading broadcast variable 20
20/11/05 06:17:21 INFO MemoryStore: Block broadcast_20_piece0 stored as bytes in memory (estimated size 87.2 KB, free 12.1 GB)
20/11/05 06:17:21 INFO TorrentBroadcast: Reading broadcast variable 20 took 4 ms
20/11/05 06:17:21 INFO MemoryStore: Block broadcast_20 stored as values in memory (estimated size 239.4 KB, free 12.1 GB)
20/11/05 06:17:21 INFO MapOutputTrackerWorker: Don't have map outputs for shuffle 8, fetching them
20/11/05 06:17:21 INFO MapOutputTrackerWorker: Doing the fetch; tracker endpoint = NettyRpcEndpointRef(spark://[email protected]:33406)
20/11/05 06:17:21 INFO MapOutputTrackerWorker: Got the output locations
20/11/05 06:17:21 INFO ShuffleBlockFetcherIterator: Getting 1000 non-empty blocks out of 1000 blocks
20/11/05 06:17:21 INFO ShuffleBlockFetcherIterator: Started 11 remote fetches in 3 ms
20/11/05 06:17:22 INFO FSUtils: Hadoop Configuration: fs.defaultFS: [hdfs://oprhqanameservice], Config:[Configuration: ], FileSystem: [DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-1395542441_138, ugi=svchdc36q (auth:SIMPLE)]]]
20/11/05 06:17:22 INFO HoodieMergeHandle: MaxMemoryPerPartitionMerge => 2004857600000
20/11/05 06:17:22 INFO DiskBasedMap: Spilling to file location /tmp/19d435cf-e581-4b80-b286-9aa092587c6f in host (10.160.39.146) with hostname (sl73caehdn0708.visa.com)
20/11/05 06:17:22 INFO HoodieRecordSizeEstimator: SizeOfRecord => 2552 SizeOfSchema => 273456
20/11/05 06:17:22 INFO ExternalSpillableMap: Estimated Payload size => 2664
20/11/05 06:17:22 INFO ExternalSpillableMap: New Estimated Payload size => 1688
20/11/05 06:17:37 INFO HoodieMergeHandle: Number of entries in MemoryBasedMap => 1476470, Total size in bytes of MemoryBasedMap => 2492281440, Number of entries in DiskBasedMap => 0, Size of file spilled to disk => 0
20/11/05 06:17:37 INFO FileSystemViewManager: Creating View Manager with storage type :MEMORY
20/11/05 06:17:37 INFO FileSystemViewManager: Creating in-memory based Table View
20/11/05 06:17:37 INFO FileSystemViewManager: Creating InMemory based view for basePath /projects/cdp/data/cdp_reporting/trr_test2
20/11/05 06:17:37 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from /projects/cdp/data/cdp_reporting/trr_test2
20/11/05 06:17:37 INFO FSUtils: Hadoop Configuration: fs.defaultFS: [hdfs://oprhqanameservice], Config:[Configuration: ], FileSystem: [DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-1395542441_138, ugi=svchdc36q (auth:SIMPLE)]]]
20/11/05 06:17:37 INFO HoodieTableConfig: Loading dataset properties from /projects/cdp/data/cdp_reporting/trr_test2/.hoodie/hoodie.properties
20/11/05 06:17:37 INFO HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE from /projects/cdp/data/cdp_reporting/trr_test2
20/11/05 06:17:37 INFO HoodieTableMetaClient: Loading Active commit timeline for /projects/cdp/data/cdp_reporting/trr_test2
20/11/05 06:17:37 INFO HoodieActiveTimeline: Loaded instants java.util.stream.ReferencePipeline$Head@7579056
20/11/05 06:17:37 INFO AbstractTableFileSystemView: Building file system view for partition (202010)
20/11/05 06:17:37 INFO AbstractTableFileSystemView: #files found in partition (202010) =27, Time taken =1
20/11/05 06:17:37 INFO HoodieTableFileSystemView: Adding file-groups for partition :202010, #FileGroups=8
20/11/05 06:17:37 INFO AbstractTableFileSystemView: addFilesToView: NumFiles=27, FileGroupsCreationTime=2, StoreTimeTaken=0
20/11/05 06:17:37 INFO AbstractTableFileSystemView: Time to load partition (202010) =4
20/11/05 06:17:37 INFO HoodieMergeHandle: partitionPath:202010, fileId to be merged:3a404978-eaad-4825-b88a-dc24fff0c623-0
20/11/05 06:17:37 INFO HoodieMergeHandle: Merging new data into oldPath /projects/cdp/data/cdp_reporting/trr_test2/202010/3a404978-eaad-4825-b88a-dc24fff0c623-0_3-24-10680_20201105055955.parquet, as newPath /projects/cdp/data/cdp_reporting/trr_test2/202010/3a404978-eaad-4825-b88a-dc24fff0c623-0_4-24-10681_20201105061418.parquet
20/11/05 06:17:37 INFO HoodieWriteHandle: Creating Marker Path=/projects/cdp/data/cdp_reporting/trr_test2/.hoodie/.temp/20201105061418/202010/3a404978-eaad-4825-b88a-dc24fff0c623-0_4-24-10681_20201105061418.marker
20/11/05 06:17:37 INFO FSUtils: Hadoop Configuration: fs.defaultFS: [hdfs://oprhqanameservice], Config:[Configuration: ], FileSystem: [DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-1395542441_138, ugi=svchdc36q (auth:SIMPLE)]]]
20/11/05 06:17:37 INFO FSUtils: Hadoop Configuration: fs.defaultFS: [hdfs://oprhqanameservice], Config:[Configuration: ], FileSystem: [DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-1395542441_138, ugi=svchdc36q (auth:SIMPLE)]]]
20/11/05 06:17:38 INFO FSUtils: Hadoop Configuration: fs.defaultFS: [hdfs://oprhqanameservice], Config:[Configuration: ], FileSystem: [DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-1395542441_138, ugi=svchdc36q (auth:SIMPLE)]]]
20/11/05 06:17:38 INFO CodecPool: Got brand-new compressor [.snappy]
20/11/05 06:17:38 INFO FSUtils: Hadoop Configuration: fs.defaultFS: [hdfs://oprhqanameservice], Config:[Configuration: ], FileSystem: [DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-1395542441_138, ugi=svchdc36q (auth:SIMPLE)]]]
20/11/05 06:17:38 INFO FSUtils: Hadoop Configuration: fs.defaultFS: [hdfs://oprhqanameservice], Config:[Configuration: ], FileSystem: [DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-1395542441_138, ugi=svchdc36q (auth:SIMPLE)]]]
20/11/05 06:17:42 INFO IteratorBasedQueueProducer: starting to buffer records
20/11/05 06:17:42 INFO BoundedInMemoryExecutor: starting consumer thread
*20/11/05 06:17:42 INFO CodecPool: Got brand-new decompressor [.snappy]*
*20/11/05 06:23:41 INFO IteratorBasedQueueProducer: finished buffering records*
*20/11/05 06:23:41 INFO BoundedInMemoryExecutor: Queue Consumption is done; notifying producer threads*
20/11/05 06:23:48 INFO HoodieMergeHandle: MergeHandle for partitionPath 202010 fileID 3a404978-eaad-4825-b88a-dc24fff0c623-0, took 385963 ms.
20/11/05 06:23:48 INFO MemoryStore: Block rdd_59_4 stored as bytes in memory (estimated size 304.0 B, free 12.1 GB)
20/11/05 06:23:48 INFO Executor: Finished task 4.0 in stage 24.0 (TID 10681). 1010 bytes result sent to driver
20/11/05 06:23:49 INFO CoarseGrainedExecutorBackend: Got assigned task 10686
20/11/05 06:23:49 INFO Executor: Running task 4.0 in stage 30.0 (TID 10686)
20/11/05 06:23:49 INFO TorrentBroadcast: Started reading broadcast variable 21
20/11/05 06:23:49 INFO MemoryStore: Block broadcast_21_piece0 stored as bytes in memory (estimated size 87.2 KB, free 12.1 GB)
20/11/05 06:23:49 INFO TorrentBroadcast: Reading broadcast variable 21 took 4 ms
20/11/05 06:23:49 INFO MemoryStore: Block broadcast_21 stored as values in memory (estimated size 239.6 KB, free 12.1 GB)
20/11/05 06:23:50 INFO BlockManager: Found block rdd_59_4 locally
20/11/05 06:23:50 INFO Executor: Finished task 4.0 in stage 30.0 (TID 10686). 1103 bytes result sent to driver
20/11/05 06:23:51 INFO CoarseGrainedExecutorBackend: Driver commanded a shutdown
20/11/05 06:23:51 INFO MemoryStore: MemoryStore cleared
20/11/05 06:23:51 INFO BlockManager: BlockManager stopped
20/11/05 06:23:51 INFO ShutdownHookManager: Shutdown hook called
```
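As a sanity check on the timings: the highlighted producer window (06:17:42 to 06:23:41) and the MergeHandle report of 385963 ms describe the same stretch of time, which matches the "more than 6 minutes" figure above. A quick bit of arithmetic confirms it:

```scala
// Quick arithmetic on the merge duration reported in the executor log.
object MergeTiming {
  def main(args: Array[String]): Unit = {
    val mergeMillis = 385963L          // from: "MergeHandle ... took 385963 ms."
    val minutes = mergeMillis / 60000.0
    println(f"$minutes%.2f minutes")   // ≈ 6.43 minutes
  }
}
```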
   
   ## JIRA info
   
   - Link: https://issues.apache.org/jira/browse/HUDI-1372
   - Type: Task
   - Attachment(s):
     - 05/Nov/20 07:54; Selvaraj.periyasamy1983; image-2020-11-04-23-54-12-187.png; https://issues.apache.org/jira/secure/attachment/13014753/image-2020-11-04-23-54-12-187.png
     - 05/Nov/20 07:58; Selvaraj.periyasamy1983; image-2020-11-04-23-58-00-554.png; https://issues.apache.org/jira/secure/attachment/13014752/image-2020-11-04-23-58-00-554.png
     - 05/Nov/20 08:00; Selvaraj.periyasamy1983; image-2020-11-05-00-00-24-066.png; https://issues.apache.org/jira/secure/attachment/13014751/image-2020-11-05-00-00-24-066.png


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
