Yes, to expand on what Rich said, there was a talk about Intel QAT offload
of gzip at the 2018 OpenZFS Developer Summit:

"ZFS Hardware Acceleration with QAT" by Weigang Li (Intel)
<https://openzfs.org/wiki/ZFS_Hardware_Acceleration_with_QAT>
slides
<https://drive.google.com/file/d/0B_J4mRfoVJQRV3ZOd1ZMWkphcV9OYXdWT0FBblVHbVZpSmZj/view?usp=sharing>
video
<https://www.youtube.com/watch?v=4zWTU_hnGp0&index=8&list=PLaUVvul17xSe0pC6sCirlZXYqICP09Y8z&t=0s>

The results presented show >2x throughput with <1/2 the CPU used, and
similar compression ratio to software gzip (I'm guessing at the default
gzip level).
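
For a rough feel of what "gzip level" costs and buys in software, here is a
quick sketch using Python's zlib (the data and levels are illustrative, not
from the talk; level 6 is zlib's default and also what gzip(1) uses):

```python
# Illustrative only: how the gzip/deflate level trades output size for CPU.
import zlib

data = b"the quick brown fox jumps over the lazy dog " * 20000  # ~880 KB

for level in (1, 6, 9):
    out = zlib.compress(data, level)
    print(f"level {level}: {len(out)} bytes "
          f"(ratio {len(data) / len(out):.1f}x)")
```

Higher levels spend more CPU searching for matches for a usually modest
ratio gain, which is why "which level the offload is equivalent to" matters
when comparing against software gzip.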

QAT support has been in ZFS since 0.8.0.

--matt

On Mon, Oct 17, 2022 at 12:15 PM Rich <rincebr...@gmail.com> wrote:

> I believe the Intel QAT support we have will happily offload gzip for you,
> though I don't know if it makes any promises about what level equivalent of
> gzip it hands you back...
>
> - Rich
>
> On Mon, Oct 17, 2022 at 12:02 PM Garrett D'Amore <garr...@damore.org>
> wrote:
>
>> That’s about what I would have expected.
>>
>> Having an offload for high levels of compression (e.g. GZIP 9 or
>> something) would be cool, but I don’t think it exists yet.  And it would be
>> hard to write that in a way that doesn’t penalize the folks who *don’t*
>> have the offload hardware.
>>
>>    - Garrett
>>
>> On Oct 17, 2022, 8:44 AM -0700, Sanjay G Nadkarni via openzfs-developer <
>> developer@lists.open-zfs.org>, wrote:
>>
>>
>>
>> We have been doing regular performance runs using various workloads over
>> NFS (v3, v4.1), SMB3, iSCSI, and FC (16 & 32 Gb) for the past few years.
>> Compression is enabled for all datasets and zvols in our runs. What we
>> have observed is that, under load, compression consumes the most CPU
>> cycles; after that it is a toss-up between dnode locking (a well-known
>> issue) and other things that might come into play depending on the
>> protocol.
>>
>> At least in our use cases, checksumming of blocks does not appear to be
>> an issue.
>>
>> -Sanjay
>>
>>
>>
>>
>> On 10/14/22 10:15 AM, Garrett D'Amore wrote:
>>
>> I can tell you from past experience that offloads like the one you are
>> proposing are rarely worth it.  The setup and teardown of the mappings to
>> allow the data transfer are not necessarily cheap.  You can avoid that by
>> having a preallocated region, but then you need to copy the
>> data.  Fortunately, in this case you only need to copy once, since the
>> result will be very small compared to the data.
>>
>> Then there is the complexity (additional branches, edge cases, etc.) that
>> has to be coded.  This becomes performance-sapping as well.
>>
>> Add to this the fact that CPUs are always getting faster, and
>> advancements like extensions to the SIMD instructions mean that the
>> disparity between the offload and just doing the natural thing inline gets
>> ever smaller.
>>
>> At the end of the day, it’s often the case that your “offload” is
>> actually a performance killer.
>>
>> The exceptions to this are when the work is truly expensive.  For
>> example, running (in the old days) RSA on an offload engine makes a lot of
>> sense.  (I’m not sure it does for elliptic curve crypto though.)  Running
>> 3DES (again if you wanted to do that, which you should not) used to make
>> sense.  AES used to, but with AES-NI not anymore.  I suspect that for SHA2
>> it's a toss-up.  Fletcher probably does not make sense.  If you want to
>> compress, LZJB does not make sense, but GZIP (especially at higher levels)
>> would, if you had such a device.
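
That cost gap is easy to check in miniature: a one-pass checksum versus gzip
at a high level over the same buffer. A hedged sketch in Python, with
zlib.crc32 standing in for a cheap checksum like Fletcher (absolute times
vary by machine; only the gap between the two matters):

```python
# Illustrative only: a cheap one-pass checksum vs. high-level deflate
# over the same buffer. crc32 is a stand-in for Fletcher here.
import time
import zlib

data = b"some moderately compressible log line, repeated\n" * 40000  # ~1.9 MB

t0 = time.perf_counter()
crc = zlib.crc32(data)
t1 = time.perf_counter()
out = zlib.compress(data, 9)  # gzip-like deflate at the highest level
t2 = time.perf_counter()

print(f"crc32:  {(t1 - t0) * 1e3:.2f} ms")
print(f"gzip-9: {(t2 - t1) * 1e3:.2f} ms")
```

The compression pass typically costs one to two orders of magnitude more CPU
than the checksum pass, which is why gzip is a plausible offload candidate
while Fletcher is not.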
>>
>> Algorithms are always getting better (newer ones that are more optimized
>> for actual CPUs etc.) and CPUs are always improving — the GPU is probably
>> best reserved for truly expensive operations for which it was designed —
>> complex transforms for 3D rendering, expensive hashing (although I wish
>> that wasn’t a thing), long running scientific analysis, machine learning,
>> etc.
>>
>> As an I/O accelerator, not so much.
>> On Oct 14, 2022, 7:52 AM -0700, Thijs Cramer <thijs.cra...@gmail.com>
>> <thijs.cra...@gmail.com>, wrote:
>>
>> I've been searching the GitHub Repository and the Mailing list, but
>> couldn't find any discussion about this.
>> I know it's probably silly, but I would like to understand the workings.
>>
>> Let's say one could offload the checksumming process to a dedicated GPU.
>> This might save some amount of CPU, *but* might increase latency
>> considerably.
>>
>> To my understanding, ZFS uses the Fletcher4 checksum algorithm by default,
>> and this requires a pass over the data in memory as it calculates the
>> checksum. If we skip this step and instead send the data to the GPU, that
>> would also require a pass over the data (no gains there).
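
For context on why that single pass is so cheap: Fletcher-4 is just four
running sums over the data's 32-bit words. A minimal Python sketch, modeled
on the ZFS algorithm but simplified and unoptimized:

```python
# Simplified Fletcher-4: four running sums over little-endian 32-bit words.
# ZFS keeps the accumulators as 64-bit values; Python ints don't overflow,
# so no explicit masking is needed in a sketch like this.
import struct

def fletcher4(buf: bytes):
    a = b = c = d = 0
    for (word,) in struct.iter_unpack("<I", buf):
        a += word
        b += a
        c += b
        d += c
    return a, b, c, d

print(fletcher4(struct.pack("<2I", 1, 2)))  # -> (3, 4, 5, 6)
```

One addition-chain per word, touching each byte once, is hard for any
offload to beat once you account for moving the data there and back.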
>>
>> The actual calculation is not that hard for a CPU, it seems; there are
>> SIMD instructions for calculating specific checksums, and after a quick
>> pass over the code, it seems they are already used (when available).
>>
>> I think the only time a GPU could calculate checksums 'faster' is with a
>> form of readahead.
>> If you pre-read a lot of data, dump it into the GPU's internal memory,
>> and have the GPU calculate the checksums of all the blocks in parallel,
>> it might be able to do it faster than a CPU.
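
The batching idea above can be sketched on the CPU: per-block checksums are
embarrassingly parallel once the data has been read ahead. In this hedged
sketch, threads stand in for GPU lanes and zlib.crc32 stands in for
Fletcher-4 (crc32 releases the GIL, so the threads really do run
concurrently); the block size mirrors the default ZFS recordsize:

```python
# Sketch: checksum many readahead blocks in parallel, then verify the
# results match a serial pass. Threads stand in for GPU lanes here.
import zlib
from concurrent.futures import ThreadPoolExecutor

BLOCK = 128 * 1024  # 128 KiB, the default ZFS recordsize

def checksum_blocks(data: bytes):
    blocks = [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]
    with ThreadPoolExecutor() as pool:
        return list(pool.map(zlib.crc32, blocks))

data = bytes(range(256)) * 8192  # 2 MiB of sample data
serial = [zlib.crc32(data[i:i + BLOCK]) for i in range(0, len(data), BLOCK)]
assert checksum_blocks(data) == serial
```

The open question the thread raises still applies: the win only exists if
the data is already resident where the parallel engine can reach it, which
for a GPU means paying the transfer cost first.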
>>
>> Has anyone considered the idea?
>>
>> - Thijs
>>

------------------------------------------
openzfs: openzfs-developer
Permalink: 
https://openzfs.topicbox.com/groups/developer/T2be6db01da63a639-M5fc111c994baa4ec7a96b27d
Delivery options: https://openzfs.topicbox.com/groups/developer/subscription
