Yes, to expand on what Rich said, there was a talk about Intel QAT offload of gzip at the 2018 OpenZFS Developer Summit:

"ZFS Hardware Acceleration with QAT", Weigang Li (Intel)
  wiki:   <https://openzfs.org/wiki/ZFS_Hardware_Acceleration_with_QAT>
  slides: <https://drive.google.com/file/d/0B_J4mRfoVJQRV3ZOd1ZMWkphcV9OYXdWT0FBblVHbVZpSmZj/view?usp=sharing>
  video:  <https://www.youtube.com/watch?v=4zWTU_hnGp0&index=8&list=PLaUVvul17xSe0pC6sCirlZXYqICP09Y8z&t=0s>

The results presented show more than 2x throughput at less than half the CPU usage, with compression ratios similar to software gzip (I'm guessing at the default gzip level).

QAT support has been in ZFS since 0.8.0.

--matt
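For context on how an offload like this coexists with machines that lack the hardware: the gzip path in ZFS tries the accelerator first and falls back to software zlib if the offload is unavailable or fails. Below is a minimal, self-contained sketch of that try-then-fallback pattern, not the actual ZFS code (which lives in module/zfs/gzip.c); qat_compress_hw() and QAT_MIN_BUF_SIZE here are hypothetical stand-ins, and zlib's compress2() stands in for the kernel's internal zlib wrapper.

/*
 * Sketch of the hardware-offload-with-software-fallback pattern.
 * qat_compress_hw() is a hypothetical stand-in for a real QAT call;
 * it always fails here so the software path is exercised.
 * Build with: cc sketch.c -lz
 */
#include <stdio.h>
#include <string.h>
#include <zlib.h>

#define QAT_MIN_BUF_SIZE 4096   /* assumed: offloading tiny buffers isn't worth the setup */

/* Hypothetical accelerator entry point: returns 0 and sets *d_len on success. */
static int
qat_compress_hw(const void *src, size_t s_len, void *dst, size_t *d_len)
{
    (void)src; (void)s_len; (void)dst; (void)d_len;
    return (-1);    /* pretend no QAT hardware is present */
}

/*
 * Compress s_len bytes from src into dst (capacity d_len).
 * Returns the compressed length, or 0 if the data didn't compress.
 */
static size_t
compress_block(const void *src, size_t s_len, void *dst, size_t d_len, int level)
{
    size_t out = d_len;

    /* Try the offload first, but only for buffers big enough to amortize setup. */
    if (s_len >= QAT_MIN_BUF_SIZE &&
        qat_compress_hw(src, s_len, dst, &out) == 0)
        return (out);

    /* Software fallback: plain zlib (zlib-format stream, as ZFS's gzip levels use). */
    uLongf dlen = d_len;
    if (compress2((Bytef *)dst, &dlen, (const Bytef *)src, s_len, level) != Z_OK ||
        dlen >= s_len)
        return (0);     /* error or incompressible: caller stores the block raw */
    return (dlen);
}

int
main(void)
{
    char src[8192], dst[8192];
    memset(src, 'a', sizeof (src));     /* highly compressible input */
    size_t n = compress_block(src, sizeof (src), dst, sizeof (dst), 6);
    printf("compressed %zu -> %zu bytes\n", sizeof (src), n);
    return (0);
}

The real integration also has to cope with asynchronous completion and DMA-safe buffers, which is part of the setup/teardown cost Garrett describes below.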
On Mon, Oct 17, 2022 at 12:15 PM Rich <rincebr...@gmail.com> wrote:

> I believe the Intel QAT support we have will happily offload gzip for
> you, though I don't know if it makes any promises about what level
> equivalent of gzip it hands you back...
>
> - Rich
>
> On Mon, Oct 17, 2022 at 12:02 PM Garrett D'Amore <garr...@damore.org> wrote:
>
>> That's about what I would have expected.
>>
>> Having an offload for high levels of compression (e.g. GZIP 9 or
>> something) would be cool, but I don't think it exists yet. And it would
>> be hard to write that in a way that doesn't punish the folks who
>> *don't* have the offload hardware.
>>
>> - Garrett
>>
>> On Oct 17, 2022, 8:44 AM -0700, Sanjay G Nadkarni via openzfs-developer
>> <developer@lists.open-zfs.org> wrote:
>>
>> We have been doing regular performance runs using various workloads
>> over NFS (v3, v4.1), SMB3, iSCSI, and FC16 & 32 for the past few years,
>> with compression enabled for all datasets and zvols. What we have
>> observed is that, under load, compression consumes the most CPU cycles;
>> after that, it is a toss-up between dnode locking (a well-known issue)
>> and other things that might come into play depending on the protocol.
>>
>> At least in our use cases, checksumming of blocks does not appear to be
>> an issue.
>>
>> -Sanjay
>>
>> On 10/14/22 10:15 AM, Garrett D'Amore wrote:
>>
>> I can tell you from past experience that offloads like the one you are
>> proposing are rarely worth it. The setup and teardown of the mappings
>> needed to transport the data are not necessarily cheap. You can avoid
>> that by having a preallocated region, but then you need to copy the
>> data. Fortunately, in this case you only need to copy once, since the
>> result will be very small compared to the input.
>>
>> Then there is the complexity (additional branches, edge cases, etc.)
>> that has to be coded. These become performance-sapping as well.
>>
>> Add to this that CPUs are always getting faster, and that advancements
>> like SIMD instruction extensions keep shrinking the disparity between
>> the offload and just doing the natural thing inline.
>>
>> At the end of the day, it's often the case that your "offload" is
>> actually a performance killer.
>>
>> The exceptions are when the work is truly expensive. For example,
>> running (in the old days) RSA on an offload engine makes a lot of
>> sense. (I'm not sure it does for elliptic-curve crypto, though.)
>> Running 3DES (again, if you wanted to do that, which you should not)
>> used to make sense. AES used to, but with AES-NI, not anymore. I
>> suspect that for SHA2 it's a toss-up. Fletcher probably does not make
>> sense. If you want to compress, LZJB does not make sense, but GZIP
>> (especially at higher levels) would, if you had such a device.
>>
>> Algorithms are always getting better (newer ones are more optimized for
>> actual CPUs, etc.) and CPUs keep improving; the GPU is probably best
>> reserved for the truly expensive operations it was designed for:
>> complex transforms for 3D rendering, expensive hashing (although I wish
>> that weren't a thing), long-running scientific analysis, machine
>> learning, etc.
>>
>> As an I/O accelerator, not so much.
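To put a number on the Fletcher remark: native Fletcher4 is just four chained 64-bit additions per 32-bit word of input, so it is essentially memory-bandwidth-bound and there is almost nothing for an offload to win back. Here is a minimal scalar sketch, conceptually equivalent to (but not) the actual implementation in module/zcommon/zfs_fletcher.c:

/*
 * Scalar Fletcher4 over a buffer, as ZFS computes it natively:
 * four running 64-bit sums, updated once per 32-bit word.
 * Sketch only; the in-tree version also has SIMD variants that run
 * several such streams in parallel and recombine them.
 */
#include <stdint.h>
#include <stddef.h>

typedef struct { uint64_t a, b, c, d; } fletcher4_t;

static fletcher4_t
fletcher_4(const void *buf, size_t size)
{
    const uint32_t *ip = buf;
    const uint32_t *ipend = ip + (size / sizeof (uint32_t));
    fletcher4_t f = { 0, 0, 0, 0 };

    for (; ip < ipend; ip++) {
        f.a += *ip;     /* plain sum of input words  */
        f.b += f.a;     /* sum of running sums       */
        f.c += f.b;     /* ...and so on, twice more  */
        f.d += f.c;
    }
    return (f);         /* the 256-bit checksum is (a, b, c, d) */
}

Checksumming a 128K record is a single linear pass with four adds per word; on a modern core that costs far less than shipping the same 128K across PCIe and back, which is exactly Garrett's point.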
>> On Oct 14, 2022, 7:52 AM -0700, Thijs Cramer <thijs.cra...@gmail.com> wrote:
>>
>> I've been searching the GitHub repository and the mailing list, but
>> couldn't find any discussion about this. I know it's probably silly,
>> but I would like to understand the workings.
>>
>> Let's say one could offload the checksumming process to a dedicated
>> GPU. This might save some amount of CPU, *but* might increase latency
>> considerably.
>>
>> To my understanding, ZFS uses the Fletcher4 checksum algorithm by
>> default, and this requires a pass over the data in memory as it
>> calculates the checksum. If we skip this step and instead send the data
>> to the GPU, that would also require a pass over the data (no gains
>> there).
>>
>> The actual calculation does not seem that hard for a CPU: there are
>> SIMD instructions that help with calculating checksums, and after a
>> quick pass over the code, it seems they are already used (where
>> available).
>>
>> I think the only time a GPU could calculate checksums 'faster' is with
>> a form of readahead: if you pre-read a lot of data, dump it into the
>> GPU's internal memory, and have the GPU checksum the whole batch in
>> parallel, it might be able to do it faster than a CPU.
>>
>> Has anyone considered the idea?
>>
>> - Thijs
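On the batching idea: the parallelism a GPU would exploit already exists across records, since each block's checksum is independent, and the pipeline already checksums many blocks concurrently on different CPUs. A toy illustration of that block-level parallelism using threads follows; the block count, record size, and names here are arbitrary, and the checksum loop is the same Fletcher4 sketch as above.

/*
 * Toy demonstration that checksum work parallelizes across blocks:
 * each record is independent, so N threads can checksum N records
 * with no coordination. Build with: cc sketch.c -lpthread
 */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NBLOCKS   8
#define BLOCKSIZE (128 * 1024)          /* one 128K record per thread */

typedef struct {
    const uint32_t *data;
    size_t          words;
    uint64_t        sums[4];            /* fletcher4 (a, b, c, d) */
} job_t;

static void *
checksum_job(void *arg)
{
    job_t *j = arg;
    uint64_t a = 0, b = 0, c = 0, d = 0;

    for (size_t i = 0; i < j->words; i++) {
        a += j->data[i];
        b += a;
        c += b;
        d += c;
    }
    j->sums[0] = a; j->sums[1] = b; j->sums[2] = c; j->sums[3] = d;
    return (NULL);
}

int
main(void)
{
    static uint32_t blocks[NBLOCKS][BLOCKSIZE / sizeof (uint32_t)];
    pthread_t tids[NBLOCKS];
    job_t jobs[NBLOCKS];

    for (int i = 0; i < NBLOCKS; i++) {
        memset(blocks[i], i + 1, sizeof (blocks[i]));
        jobs[i] = (job_t){ blocks[i], BLOCKSIZE / sizeof (uint32_t), { 0 } };
        pthread_create(&tids[i], NULL, checksum_job, &jobs[i]);
    }
    for (int i = 0; i < NBLOCKS; i++) {
        pthread_join(tids[i], NULL);
        printf("block %d: a=%llu\n", i, (unsigned long long)jobs[i].sums[0]);
    }
    return (0);
}

Within a single record, though, the four sums form a strict dependency chain, which is why the in-tree SIMD variants compute several interleaved streams and recombine them rather than vectorizing the loop directly; a GPU would face the same recombination problem, plus the PCIe round trip.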