[ 
https://issues.apache.org/jira/browse/CASSANDRA-18464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17723662#comment-17723662
 ] 

Amit Pawar commented on CASSANDRA-18464:
----------------------------------------

Thank you [~maedhroz] and [~mck] for your suggestions. I would like to share my 
observations one last time before jumping to an implementation using native APIs.

Enabling the Direct I/O feature using JNA is going to be better than using 
native APIs.

Why?
Because of the number of buffers required and the unnecessary copying of the 
same data before the OS flush call is initiated. Let's look at the current 
implementations, described below.
 # Encrypted and Compressed segments
 ## A ByteBuffer is allocated on the Java GC heap by the "COMMIT-LOG-ALLOCATOR" 
thread to hold the data (threads write their updates into this buffer).
 ## The periodic syncing thread initiates a flush operation on recent data.
 ### Recent data is duplicated, and memory for this duplicate copy is allocated 
on the Java GC heap.
 ### Compression or encryption is done.
 ### FileChannel.write is called with the "compressed or encrypted" data to 
flush it.
 ### FileChannel.write copies the incoming data again, into an internal buffer 
that is allocated on the Java GC heap (when the file was opened).
 ### FileChannel.write then calls the C library write function to flush the 
data.
 ### The syncing thread finishes flushing.
 ## Advantage of this implementation
 ### Memory for the ByteBuffer is allocated on the Java GC heap.
 ## Disadvantages of this implementation
 ### Data is copied twice before the C library write function is called.
 ### Allocating large objects on the Java GC heap increases GC pressure. The 
default CommitLog segment size is 32MB, and increasing this size will increase 
this pressure further.
 ### Scaling is affected.
 # MemoryMapped segment
 ## Memory for the ByteBuffer is allocated using FileChannel.map, which uses the 
mmap system call to get the memory from the OS and not from the Java GC heap.
 ## The syncing thread initiates the flush operation by calling the C library 
"msync" function and is not involved in any copy operation.
 ## Advantages
 ### No copying of data; threads directly update the file content mapped 
through the mmap system call.
 ### The implementation is simpler.
 ## Disadvantages
 ### The syncing thread calls the msync system call and fails to exploit the 
available disk throughput.
 ### msync does not update the file in a serialized manner, so the bandwidth 
available with Direct I/O is not utilized.
 ### Scaling is poor.
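
For illustration, the two existing flush paths described above can be sketched with plain JDK calls. This is a minimal sketch, not Cassandra's actual segment code: the class and method names (SegmentFlushSketch, flushViaWrite, flushViaMmap, roundTrip) are hypothetical, and compression/encryption is left out.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch; names are illustrative and not taken from Cassandra.
public class SegmentFlushSketch {

    /** Write path: data sits in a heap buffer and is handed to FileChannel.write. */
    static void flushViaWrite(Path file, byte[] data) throws IOException {
        ByteBuffer heap = ByteBuffer.allocate(data.length); // lives on the Java GC heap
        heap.put(data);
        heap.flip();
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.WRITE)) {
            while (heap.hasRemaining()) {
                ch.write(heap); // the channel copies the data out of the heap buffer
            }
            ch.force(false); // ask the OS to flush file content to the device
        }
    }

    /** mmap path: writers update the mapping directly; force() issues msync on Linux. */
    static void flushViaMmap(Path file, byte[] data) throws IOException {
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            MappedByteBuffer mapped = ch.map(FileChannel.MapMode.READ_WRITE, 0, data.length);
            mapped.put(data); // no intermediate copy, the mapping is the file content
            mapped.force();
        }
    }

    /** Helper for trying both paths: writes to a temp file and reads it back. */
    static byte[] roundTrip(boolean useMmap, byte[] data) {
        try {
            Path file = Files.createTempFile("segment-sketch", ".log");
            try {
                if (useMmap) {
                    flushViaMmap(file, data);
                } else {
                    flushViaWrite(file, data);
                }
                return Files.readAllBytes(file);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Both paths end up with the same bytes on disk; the difference is where the buffer lives and how many times the data is copied before the flush.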

If the native version of the implementation is preferred, then it will be 
similar to the "Encrypted and Compressed" segment type. The difference between 
the JNA and native API implementations is given below.
 # Using native APIs to enable the Direct I/O feature
 ## The implementation is almost the same as the "Encrypted and Compressed" 
segment. Memory for the ByteBuffer needs to be allocated (for threads to update 
the file), and an internal buffer for FileChannel is allocated for the 
flush/write call.
 ## The periodic syncing thread initiates a flush operation on recent data.
 ### FileChannel.write is called to flush the data.
 ### FileChannel.write copies the incoming data to an internal buffer that is 
allocated on the Java GC heap (when the file was opened).
 ### FileChannel.write then calls the C library write function to flush the 
data available in the internal buffer.
 ### The syncing thread finishes flushing.
 ## Advantages
 ### Performance will be better than the current implementation.
 ### Scaling will be better due to the Direct I/O feature.
 ## Disadvantages
 ### Data needs to be copied at least once to the FileChannel internal buffer 
before the actual sync call is initiated.
 ### Allocating large objects on the Java GC heap increases GC pressure. 
Increasing the CommitLog segment size will increase this pressure further.
 ### An NVMe disk may need a different block size to obtain high disk 
throughput. Do the JVM FileChannel classes provide knobs to control the 
preferred block size?
 # Using JNA to enable Direct I/O
 ## The implementation is almost the same as the MemoryMapped segment. Memory is 
allocated for the ByteBuffer using ByteBuffer.allocateDirect to get memory from 
the OS, and this is where threads write their updates, similar to 
MemoryMappedSegment. This buffer is not backed by any file.
 ## The syncer thread initiates the flush operation.
 ### The ByteBuffer holding the data is converted to a JNA pointer.
 ### The C library write function is called through JNA.
 ### Syncing is done.
 ## Advantages
 ### Performance will be better than the native version of the implementation, 
as the syncing thread is not involved in any copy operation.
 ### GC pressure will be lower, as memory for the ByteBuffer is allocated from 
the OS and not on the Java GC heap. Changing the CommitLog segment size will not 
be an issue because of this OS memory allocation.
 ### Scaling will be better due to the Direct I/O feature.
 ### Buffer alignment can be controlled based on disk type (only if necessary) 
through the yaml file.
 ### Block size is configurable through the yaml file based on disk type, and 
this will help to get high disk throughput.
 ## Disadvantage
 ### Direct I/O needs to be enabled through JNA and not through native APIs. 
JNA is already used in Cassandra, so this should not be an issue.
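
For comparison, here is a minimal sketch of what the native API route could look like on JDK 10 or later, where com.sun.nio.file.ExtendedOpenOption.DIRECT, ByteBuffer.alignedSlice and FileStore.getBlockSize are available (none of these exist on Java 8). It is not the attached patch and not the JNA approach: the class and method names are hypothetical, the payload is zero-padded to a whole block, and the fallback to a buffered channel when the file system rejects Direct I/O is an assumption made only to keep the sketch runnable anywhere.

```java
import com.sun.nio.file.ExtendedOpenOption;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch (JDK 10+); names are illustrative and not taken from the patch.
public class DirectIoSketch {

    /** Rounds n up to the next multiple of blockSize. */
    static int roundUp(int n, int blockSize) {
        return Math.toIntExact(((long) n + blockSize - 1) / blockSize * blockSize);
    }

    /** Off-heap buffer whose start address is aligned to blockSize (a power of two). */
    static ByteBuffer alignedBuffer(int capacity, int blockSize) {
        ByteBuffer aligned = ByteBuffer.allocateDirect(capacity + blockSize - 1)
                                       .alignedSlice(blockSize);
        aligned.limit(capacity);
        return aligned;
    }

    /** Tries Direct I/O first; falls back to a buffered channel if it is rejected. */
    static FileChannel openForWrite(Path file) throws IOException {
        try {
            return FileChannel.open(file, StandardOpenOption.WRITE, ExtendedOpenOption.DIRECT);
        } catch (IOException | UnsupportedOperationException e) {
            return FileChannel.open(file, StandardOpenOption.WRITE);
        }
    }

    /** Writes payload, zero-padded to a whole number of blocks, from an aligned buffer. */
    static void writePadded(Path file, byte[] payload) throws IOException {
        int blockSize = Math.toIntExact(Files.getFileStore(file).getBlockSize());
        ByteBuffer buf = alignedBuffer(roundUp(payload.length, blockSize), blockSize);
        buf.put(payload);
        buf.position(0); // limit stays at the padded length; padding bytes are zero
        try (FileChannel ch = openForWrite(file)) {
            while (buf.hasRemaining()) {
                ch.write(buf);
            }
            ch.force(false);
        }
    }

    /** Helper for trying the sketch: writes to a temp file and reads it back. */
    static byte[] roundTrip(byte[] payload) {
        try {
            Path file = Files.createTempFile("direct-io-sketch", ".log");
            try {
                writePadded(file, payload);
                return Files.readAllBytes(file);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In this sketch the block size comes from the file store rather than from a yaml setting, so a configurable block size or alignment as listed above would need to be passed in explicitly.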

So, I feel the JNA version of the implementation should be preferred due to its 
advantages and performance. Are these points good enough to justify the JNA 
version of the implementation for enabling the Direct I/O feature?

> Enable Direct I/O For CommitLog Files
> -------------------------------------
>
>                 Key: CASSANDRA-18464
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-18464
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Local/Commit Log
>            Reporter: Josh McKenzie
>            Assignee: Amit Pawar
>            Priority: Normal
>             Fix For: 5.x
>
>         Attachments: UseDirectIOFeatureForCommitLogFiles.patch
>
>
> Relocating from [dev@ email 
> thread.|https://lists.apache.org/thread/j6ny17q2rhkp7jxvwxm69dd6v1dozjrg]
>  
> I shared my investigation of the Commitlog I/O issue on large core count 
> systems in my previous email dated July-22, and the link to the thread is 
> given below.
> [https://lists.apache.org/thread/xc5ocog2qz2v2gnj4xlw5hbthfqytx2n]
> Basically, two solutions looked possible to improve the CommitLog I/O.
>  # Multi-threaded syncing
>  # Using Direct-IO through JNA
> I worked on the 2nd option, considering the following benefits compared to 
> the first one
>  # Direct I/O read/write throughput is very high compared to non-Direct I/O. 
> Learnt through FIO benchmarking.
>  # Reduces kernel file cache usage, which in turn reduces kernel I/O activity 
> for Commitlog files only.
>  # Overall CPU usage is reduced for flush activity. JVisualVM shows CPU usage 
> < 30% for the Commitlog syncer thread with the Direct I/O feature.
>  # The Direct I/O implementation is easier compared to multi-threaded syncing.
> As per the community suggestion, lower code complexity is good to have. Direct 
> I/O enablement looked promising but there was one issue. 
> Java version 8 does not have native support to enable Direct I/O. So, JNA 
> library usage is a must. The same implementation should also work across other 
> versions of Java (like 11 and beyond).
> I have completed the Direct I/O implementation, and a summary of the attached 
> patch changes is given below.
>  # This implementation does not use Java file channels; the file is opened 
> through JNA to use the Direct I/O feature.
>  # New segments are defined, named “DirectIOSegment” for Direct I/O and 
> “NonDirectIOSegment” for non-direct I/O (NonDirectIOSegment is for test 
> purposes only).
>  # The JNA write call is used to flush the changes.
>  # New helper functions are defined in NativeLibrary.java and a platform 
> specific file. Currently tested on Linux only.
>  # The patch allows the user to configure the optimum block size and alignment 
> if the default values are not OK for the CommitLog disk.
>  # Following configuration options are provided in Cassandra.yaml file
> a. use_jna_for_commitlog_io : to use jna feature
> b. use_direct_io_for_commitlog : to use Direct I/O feature.
> c. direct_io_minimum_block_alignment: 512 (default)
> d. nvme_disk_block_size: 32MiB (default and can be changed as per the 
> required size)
>  The test matrix is complex, so CommitLog related testcases and the TPCx-IOT 
> benchmark were tested. It works with both Java 8 and 11. Compressed and 
> Encrypted segments are not supported yet and can be enabled later based on 
> community feedback.
>  The following improvements are seen with Direct I/O enabled.
>  # 32 cores >= ~15%
>  # 64 cores >= ~80%
>  Also, there is another observation I would like to share here. Reading 
> Commitlog files with Direct I/O might help in reducing node bring-up time 
> after a node crash.
>  Tested with commit ID: 91f6a9aca8d3c22a03e68aa901a0b154d960ab07
>  The attached patch enables the Direct I/O feature for Commitlog files. Please 
> check and share your feedback.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

