[ https://issues.apache.org/jira/browse/CASSANDRA-18464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723662#comment-17723662 ]
Amit Pawar commented on CASSANDRA-18464:
----------------------------------------

Thank you [~maedhroz] and [~mck] for your suggestions. I would like to share my observations one last time before jumping to an implementation using native APIs.

I believe enabling Direct I/O through JNA is going to be better than using native APIs. Why? Because of the number of buffers required and the unnecessary copying of the same data before the OS flush call is initiated. The current implementations are described below.

# Encrypted and Compressed segments
## A ByteBuffer is allocated on the Java GC heap by the "COMMIT-LOG-ALLOCATOR" thread to hold the data (threads write their updates into this buffer).
## The periodic syncing thread initiates a flush operation on the recent data.
### The recent data is duplicated, and memory for the duplicate copy is allocated on the Java GC heap.
### Compression or encryption is done.
### FileChannel.write is called with the "compressed or encrypted" data to flush it.
### FileChannel.write copies the incoming data again into an internal buffer, which is allocated on the Java GC heap (when the file was opened).
### FileChannel.write then calls the C library write function to flush the data.
### The syncing thread finishes flushing.
## Advantage of this implementation
### Memory for the ByteBuffer is allocated on the Java GC heap.
## Disadvantages of this implementation
### Data is copied twice before the C library write function is called.
### Allocating large objects on the Java GC heap increases GC pressure. The default CommitLog segment size is 32MB, and increasing it will increase this pressure further.
### Affects scaling.
# MemoryMapped segment
## Memory for the ByteBuffer is allocated using FileChannel.map, which uses the mmap system call to get memory from the OS and not from the Java GC heap.
## The syncing thread initiates the flush operation by calling the C library "msync" function and is not involved in any copy operation.
## Advantages
### No copying of data; threads directly update the file content mapped through the mmap system call.
### Implementation is simpler.
## Disadvantages
### The syncing thread calls the msync system call and fails to exploit the available disk throughput.
### msync does not update the file in a serialized manner, so the bandwidth available with Direct I/O is not utilized.
### Scaling is poor.

If a native implementation is preferred, it will be similar to the "Encrypted and Compressed" segment type. The differences between the JNA and native API implementations are given below.

# Using native APIs to enable the Direct I/O feature
## The implementation is almost the same as for the "Encrypted and Compressed" segment. Memory for the ByteBuffer needs to be allocated (for threads to update the file), and an internal buffer for FileChannel is allocated for the flush/write call.
## The periodic syncing thread initiates a flush operation on the recent data.
### FileChannel.write is called to flush the data.
### FileChannel.write copies the incoming data to an internal buffer that is allocated on the Java GC heap (when the file was opened).
### FileChannel.write then calls the C library write function to flush the data available in the internal buffer.
### The syncing thread finishes flushing.
## Advantages
### Performance will be better than the current implementation.
### Scaling will be better due to the Direct I/O feature.
## Disadvantages
### Data needs to be copied at least once to the FileChannel internal buffer before the actual sync call is initiated.
### Allocating large objects on the Java GC heap increases GC pressure. Increasing the CommitLog segment size will increase this pressure further.
### An NVMe disk may need a different block size to obtain high disk throughput. Do the JVM FileChannel classes provide knobs to control the preferred block size?
# Using JNA to enable Direct I/O
## The implementation is almost the same as for the MemoryMapped segment. Memory for the ByteBuffer is allocated using ByteBuffer.allocateDirect to get memory from the OS, and this is where threads write their updates, similar to MemoryMappedSegment. This buffer is not backed by any file.
## The syncer thread initiates the flush operation.
### The ByteBuffer holding the data is converted to a JNA pointer.
### The C library write function is called through JNA.
### Syncing is done.
## Advantages
### Performance will be better than the native implementation, as this thread is not involved in any copy operation.
### GC pressure will be lower, as memory for the ByteBuffer is allocated off-heap with ByteBuffer.allocateDirect and not on the Java GC heap. Changing the CommitLog segment size will not be an issue because the memory comes from the OS.
### Scaling will be better due to the Direct I/O feature.
### Buffer alignment can be controlled through the yaml file based on disk type (only if necessary).
### Block size is configurable through the yaml file based on disk type, which will help to get high disk throughput.
## Disadvantage
### Direct I/O needs to be enabled through JNA and not through native APIs. JNA is already used in Cassandra, so this should not be an issue.

So, I feel the JNA implementation should be preferred because of its advantages and performance. Are these points good enough to consider the JNA implementation for enabling the Direct I/O feature?

> Enable Direct I/O For CommitLog Files
> -------------------------------------
>
> Key: CASSANDRA-18464
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18464
> Project: Cassandra
> Issue Type: New Feature
> Components: Local/Commit Log
> Reporter: Josh McKenzie
> Assignee: Amit Pawar
> Priority: Normal
> Fix For: 5.x
>
> Attachments: UseDirectIOFeatureForCommitLogFiles.patch
>
>
> Relocating from [dev@ email thread.|https://lists.apache.org/thread/j6ny17q2rhkp7jxvwxm69dd6v1dozjrg]
>
> I shared my investigation of the CommitLog I/O issue on large core count systems in my previous email dated July-22; the link to the thread is given below.
> [https://lists.apache.org/thread/xc5ocog2qz2v2gnj4xlw5hbthfqytx2n]
> Basically, two solutions looked possible to improve the CommitLog I/O.
> # Multi-threaded syncing
> # Using Direct I/O through JNA
> I worked on the 2nd option considering the following benefits compared to the first one:
> # Direct I/O read/write throughput is very high compared to non-Direct I/O. Learnt through FIO benchmarking.
> # It reduces kernel file cache usage, which in turn reduces kernel I/O activity for CommitLog files only.
> # Overall CPU usage is reduced for flush activity. JVisualVM shows CPU usage < 30% for the CommitLog syncer thread with the Direct I/O feature.
> # The Direct I/O implementation is easier than the multi-threaded one.
> As per the community suggestion, less code complexity is good to have. Direct I/O enablement looked promising, but there was one issue.
> Java version 8 does not have native support for enabling Direct I/O, so using the JNA library is a must. The same implementation should also work across other versions of Java (like 11 and beyond).
> I have completed the Direct I/O implementation, and a summary of the changes in the attached patch is given below.
> # This implementation does not use Java file channels; the file is opened through JNA to use the Direct I/O feature.
> # New segments are defined, named “DirectIOSegment” for Direct I/O and “NonDirectIOSegment” for non-direct I/O (NonDirectIOSegment is for test purposes only).
> # The JNA write call is used to flush the changes.
> # New helper functions are defined in NativeLibrary.java and the platform-specific file. Currently tested on Linux only.
> # The patch allows the user to configure the optimum block size and alignment if the default values are not OK for the CommitLog disk.
> # The following configuration options are provided in the cassandra.yaml file:
> a. use_jna_for_commitlog_io: to use the JNA feature
> b. use_direct_io_for_commitlog: to use the Direct I/O feature
> c. direct_io_minimum_block_alignment: 512 (default)
> d. nvme_disk_block_size: 32MiB (default; can be changed as per the required size)
> The test matrix is complex, so the CommitLog-related test cases and the TPCx-IoT benchmark were tested.
> It works with both Java 8 and 11. Compressed and Encrypted segments are not supported yet; they can be enabled later based on community feedback.
> The following improvements are seen with Direct I/O enabled:
> # 32 cores >= ~15%
> # 64 cores >= ~80%
> Also, I would like to share another observation here. Reading CommitLog files with Direct I/O might help in reducing node bring-up time after a node crash.
> Tested with commit ID: 91f6a9aca8d3c22a03e68aa901a0b154d960ab07
> The attached patch enables the Direct I/O feature for CommitLog files. Please check and share your feedback.
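
To make the alignment settings in the description more concrete, here is a minimal sketch of the arithmetic that Direct I/O typically requires: the buffer start address and the flush length both need to be multiples of the block alignment. This is only an illustration under stated assumptions and is not taken from the attached patch. The class and method names (DirectIoAlignment, alignUp, allocateAligned) are hypothetical, the value 512 is borrowed from the direct_io_minimum_block_alignment default above, and ByteBuffer.alignedSlice needs Java 9 or later. The sketch performs no file I/O.

```java
import java.nio.ByteBuffer;

// Hypothetical helper for illustration only; not part of the attached patch.
final class DirectIoAlignment {
    private DirectIoAlignment() {}

    // Rounds length up to the next multiple of alignment (a power of two),
    // e.g. so that a flush covers whole blocks.
    static long alignUp(long length, int alignment) {
        checkPowerOfTwo(alignment);
        long mask = (long) alignment - 1;
        return (length + mask) & ~mask;
    }

    // Returns an off-heap buffer of exactly `capacity` bytes whose start
    // address is a multiple of `alignment`. One extra block is allocated so
    // that an aligned window of the requested size always fits.
    static ByteBuffer allocateAligned(int capacity, int alignment) {
        checkPowerOfTwo(alignment);
        if (capacity <= 0 || capacity % alignment != 0) {
            throw new IllegalArgumentException("capacity must be a positive multiple of alignment");
        }
        ByteBuffer raw = ByteBuffer.allocateDirect(capacity + alignment);
        ByteBuffer aligned = raw.alignedSlice(alignment); // Java 9+
        aligned.limit(capacity);
        return aligned.slice();
    }

    private static void checkPowerOfTwo(int alignment) {
        if (alignment <= 0 || Integer.bitCount(alignment) != 1) {
            throw new IllegalArgumentException("alignment must be a positive power of two");
        }
    }
}
```

A buffer obtained this way could then be handed to a write call issued through JNA, or, on Java 10 and later, to a FileChannel opened with ExtendedOpenOption.DIRECT. Which of the two performs better on a given disk is exactly the question discussed in the comment above and is not settled by this sketch.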