Cool.  Learn something new every day. ;-)
On Oct 17, 2022, 12:26 PM -0700, Matthew Ahrens via openzfs-developer 
<developer@lists.open-zfs.org>, wrote:
> Yes, to expand on what Rich said, there was a talk about Intel QAT offload of 
> gzip at the 2018 OpenZFS Developer Summit:
> ZFS Hardware Acceleration with QAT
> Weigang Li
> Intel
> slides
> video
> The results presented show >2x throughput with <1/2 the CPU used, and a 
> compression ratio similar to software gzip (I'm guessing with the default 
> gzip level).
>
> QAT support has been in ZFS since 0.8.0.
>
> --matt
>
> > On Mon, Oct 17, 2022 at 12:15 PM Rich <rincebr...@gmail.com> wrote:
> > > I believe the Intel QAT support we have will happily offload gzip for 
> > > you, though I don't know if it makes any promises about what level 
> > > equivalent of gzip it hands you back...
> > >
> > > - Rich
> > >
> > > > On Mon, Oct 17, 2022 at 12:02 PM Garrett D'Amore <garr...@damore.org> 
> > > > wrote:
> > > > > That’s about what I would have expected.
> > > > >
> > > > > Having an offload for high levels of compression (e.g. GZIP 9 or 
> > > > > something) would be cool, but I don’t think it exists yet.  And it 
> > > > > would be hard to write that in a way that doesn’t penalize the 
> > > > > folks who *don’t* have the offload hardware.
> > > > >
> > > > > • Garrett
> > > > >
> > > > > On Oct 17, 2022, 8:44 AM -0700, Sanjay G Nadkarni via 
> > > > > openzfs-developer <developer@lists.open-zfs.org>, wrote:
> > > > > >
> > > > > >
> > > > > > We have been doing regular performance runs using various workloads 
> > > > > > over NFS (v3, v4.1), SMB3, iSCSI, and FC (16 and 32 Gb) for the past 
> > > > > > few years. Compression is enabled for all datasets and zvols in our 
> > > > > > runs. What we have observed is that, under load, compression consumes 
> > > > > > the most CPU cycles; after that it is a toss-up between dnode locking 
> > > > > > (a well-known issue) and other things that may come into play 
> > > > > > depending on the protocol.
> > > > > >
> > > > > > At least in our use cases, checksumming of blocks does not appear 
> > > > > > to be an issue.
> > > > > >
> > > > > > -Sanjay
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > On 10/14/22 10:15 AM, Garrett D'Amore wrote:
> > > > > > > I can tell you from past experience that offloads like what you are 
> > > > > > > proposing are rarely worth it.  The setup and teardown of the 
> > > > > > > mappings to allow the data transport are not necessarily cheap.  
> > > > > > > You can avoid that by having a preallocated region, but then you 
> > > > > > > need to copy the data.  Fortunately for this case you only need 
> > > > > > > to copy once, since the result will be very small compared to the 
> > > > > > > data.
> > > > > > >
> > > > > > > Then there is the complexity (additional branches, edge cases, 
> > > > > > > etc.) that has to be coded.  These become performance-sapping as 
> > > > > > > well.
> > > > > > >
> > > > > > > Add to this the fact that CPUs are always getting faster, and that 
> > > > > > > advancements like SIMD instruction extensions keep shrinking the 
> > > > > > > disparity between the offload and just doing the natural 
> > > > > > > thing inline.
> > > > > > >
> > > > > > > At the end of the day, it’s often the case that your “offload” is 
> > > > > > > actually a performance killer.
> > > > > > >
> > > > > > > The exceptions to this are when the work is truly expensive.  For 
> > > > > > > example, running (in the old days) RSA on an offload engine makes 
> > > > > > > a lot of sense.  (I’m not sure it does for elliptic curve crypto 
> > > > > > > though.)  Running 3DES (again if you wanted to do that, which you 
> > > > > > > should not) used to make sense.  AES used to, but with AES-NI not 
> > > > > > > anymore.  I suspect that for SHA2 it’s a toss-up.  Fletcher 
> > > > > > > probably does not make sense.  If you want to compress, LZJB does 
> > > > > > > not make sense, but GZIP (especially at higher levels) would, if 
> > > > > > > you had such a device.
> > > > > > >
> > > > > > > Algorithms are always getting better (newer ones that are more 
> > > > > > > optimized for actual CPUs etc.) and CPUs are always improving — 
> > > > > > > the GPU is probably best reserved for truly expensive operations 
> > > > > > > for which it was designed — complex transforms for 3D rendering, 
> > > > > > > expensive hashing (although I wish that wasn’t a thing), long 
> > > > > > > running scientific analysis, machine learning, etc.
> > > > > > >
> > > > > > > As an I/O accelerator, not so much.
> > > > > > > On Oct 14, 2022, 7:52 AM -0700, Thijs Cramer 
> > > > > > > <thijs.cra...@gmail.com>, wrote:
> > > > > > > > I've been searching the GitHub Repository and the Mailing list, 
> > > > > > > > but couldn't find any discussion about this.
> > > > > > > > I know it's probably silly, but I would like to understand the 
> > > > > > > > workings.
> > > > > > > >
> > > > > > > > Let's say one could offload the Checksumming process to a 
> > > > > > > > dedicated GPU. This might save some amount of CPU, *but* might 
> > > > > > > > increase latency incredibly.
> > > > > > > >
> > > > > > > > To my understanding, ZFS uses the Fletcher4 checksum algorithm 
> > > > > > > > by default, and this requires a pass over the data in memory as 
> > > > > > > > it calculates the checksum. If we skip this step and instead 
> > > > > > > > send the data to the GPU, that would also require a pass over 
> > > > > > > > the data (no gains there).
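
For context, the Fletcher-4 recurrence that ZFS's default fletcher_4
checksum is built on is a single linear pass over the data with trivial
arithmetic, so it tends to be limited by memory bandwidth rather than by
computation.  A minimal scalar sketch (illustrative only, not the OpenZFS
implementation, which also deals with byte order, block tails, incremental
updates, and SIMD variants):

    #include <stddef.h>
    #include <stdint.h>

    /*
     * Fletcher-4 style running sums over 32-bit words: four 64-bit
     * accumulators, one short addition chain per word.  Sketch only.
     */
    static void
    fletcher4_sketch(const void *data, size_t size, uint64_t cksum[4])
    {
        const uint32_t *w = data;
        uint64_t a = 0, b = 0, c = 0, d = 0;

        for (size_t i = 0; i < size / sizeof (uint32_t); i++) {
            a += w[i];      /* running sum */
            b += a;         /* sum of sums */
            c += b;
            d += c;
        }
        cksum[0] = a;
        cksum[1] = b;
        cksum[2] = c;
        cksum[3] = d;
    }

Every byte is touched exactly once, which is why shipping the same bytes
across the bus to a GPU does not obviously buy anything.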
> > > > > > > >
> > > > > > > > The actual calculation is not that hard for a CPU, it seems; 
> > > > > > > > there are specific SIMD instructions for calculating specific 
> > > > > > > > checksums, and after a quick pass over the code, it seems they 
> > > > > > > > are already used (if available).
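
The "if available" part is typically handled with runtime CPU-feature
dispatch.  Below is a hedged sketch of the general pattern on x86 with
GCC/Clang; the real OpenZFS code selects among its fletcher_4
implementations with its own benchmark-and-select logic, and
fletcher4_avx2_stub here is only a placeholder, not real SIMD code:

    #include <stddef.h>
    #include <stdint.h>

    typedef void (*fletcher4_fn_t)(const void *, size_t, uint64_t[4]);

    /* Scalar fallback: same recurrence as the earlier sketch. */
    static void
    fletcher4_scalar(const void *data, size_t size, uint64_t ck[4])
    {
        const uint32_t *w = data;
        uint64_t a = 0, b = 0, c = 0, d = 0;

        for (size_t i = 0; i < size / sizeof (uint32_t); i++) {
            a += w[i]; b += a; c += b; d += c;
        }
        ck[0] = a; ck[1] = b; ck[2] = c; ck[3] = d;
    }

    /*
     * Stand-in for a vectorized variant; a real one would process several
     * interleaved streams with AVX2 intrinsics and combine the partial
     * sums.  Here it simply falls back to the scalar loop.
     */
    static void
    fletcher4_avx2_stub(const void *data, size_t size, uint64_t ck[4])
    {
        fletcher4_scalar(data, size, ck);
    }

    static fletcher4_fn_t
    pick_fletcher4_impl(void)
    {
        /* GCC/Clang builtin: query the running CPU's feature bits. */
        if (__builtin_cpu_supports("avx2"))
            return (fletcher4_avx2_stub);
        return (fletcher4_scalar);
    }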
> > > > > > > >
> > > > > > > > I think the only time a GPU could calculate checksums 
> > > > > > > > 'faster' is with a form of readahead.
> > > > > > > > If you were to pre-read a lot of data, dump it to the GPU's 
> > > > > > > > internal memory, and have the GPU calculate checksums of the 
> > > > > > > > entire batch in parallel, it might be able to do it faster than 
> > > > > > > > a CPU.
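
To put rough numbers on the batching intuition: the gain comes from
amortizing the fixed per-submission cost (mapping setup, doorbell,
completion handling) over many blocks, but the data still has to cross the
PCIe link.  A toy model in C, with made-up overhead and bandwidth figures
chosen purely for illustration:

    #include <stdio.h>

    int
    main(void)
    {
        /* All numbers are assumptions for illustration only. */
        const double overhead_s = 20e-6;  /* fixed cost per offload submission */
        const double link_bps   = 12e9;   /* host-to-device transfer rate */
        const double cpu_bps    = 8e9;    /* in-memory checksum rate on the CPU */

        for (double batch = 128.0 * 1024; batch <= 128e6; batch *= 4) {
            /* Effective offload rate: fixed overhead plus transfer time. */
            double offload_bps = batch / (overhead_s + batch / link_bps);
            printf("batch %8.0f KiB: offload %5.2f GB/s vs CPU %5.2f GB/s\n",
                batch / 1024, offload_bps / 1e9, cpu_bps / 1e9);
        }
        return (0);
    }

With figures like these, small per-block submissions are dominated by the
fixed overhead, and even large batches only approach the link rate - which
is in the same ballpark as what a CPU already achieves on data that is
sitting in memory anyway.  That is the readahead/batching trade-off in a
nutshell.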
> > > > > > > >
> > > > > > > > Has anyone considered the idea?
> > > > > > > >
> > > > > > > > - Thijs

------------------------------------------
openzfs: openzfs-developer
Permalink: 
https://openzfs.topicbox.com/groups/developer/T2be6db01da63a639-M9d1dd333674db9391bc0a362
Delivery options: https://openzfs.topicbox.com/groups/developer/subscription
