Re: [developer] GPU Accelerated Checksumming

2022-10-17 Thread jasonlee via openzfs-developer
Hi. Instead of offloading individual operations and bringing the results back, 
we have offloaded the entire write pipeline (that we use). This has resulted in 
a 16x increase in write performance with our computational storage processor.

https://github.com/openzfs/zfs/pull/13628

Jason Lee
--
openzfs: openzfs-developer
Permalink: 
https://openzfs.topicbox.com/groups/developer/T2be6db01da63a639-M00c6f7251ec0f7d8ef00e9ff
Delivery options: https://openzfs.topicbox.com/groups/developer/subscription


Re: [developer] GPU Accelerated Checksumming

2022-10-17 Thread Garrett D'Amore
Cool.  Learn something new every day. ;-)

Re: [developer] GPU Accelerated Checksumming

2022-10-17 Thread Matthew Ahrens via openzfs-developer
Yes, to expand on what Rich said, there was a talk about Intel QAT offload
of gzip at the 2018 OpenZFS Developer Summit: "ZFS Hardware Acceleration
with QAT", Weigang Li (Intel) [slides, video].
The results presented show >2x throughput with <1/2 the CPU used, and similar
compression to gzip software (I'm guessing with the default gzip level).

QAT support has been in ZFS since 0.8.0.
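For anyone wanting to try it, the knob QAT accelerates is the ordinary gzip compression property; the pool/dataset name below is hypothetical:

```shell
# Hypothetical pool/dataset. With ZFS built with QAT support and the
# hardware present, gzip compression here is offloaded transparently;
# without it, the same command simply uses the software gzip path.
zfs set compression=gzip-6 tank/data
zfs get compression tank/data
```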

--matt


Re: [developer] GPU Accelerated Checksumming

2022-10-17 Thread Rich
I believe the Intel QAT support we have will happily offload gzip for you,
though I don't know if it makes any promises about what level equivalent of
gzip it hands you back...

- Rich

--
openzfs: openzfs-developer
Permalink: 
https://openzfs.topicbox.com/groups/developer/T2be6db01da63a639-Mdd15974624ca67d893ee40c0

Re: [developer] GPU Accelerated Checksumming

2022-10-17 Thread Garrett D'Amore
That’s about what I would have expected.

Having an offload for high levels of compression (e.g. GZIP 9 or something) 
would be cool, but I don’t think it exists yet.  And it would be hard to write 
that in a way that doesn’t punish things for the folks who *don’t* have the 
offload hardware.

• Garrett


--
openzfs: openzfs-developer
Permalink: 
https://openzfs.topicbox.com/groups/developer/T2be6db01da63a639-M58be0a7684e7d44a39f75747
Delivery options: https://openzfs.topicbox.com/groups/developer/subscription


Re: [developer] GPU Accelerated Checksumming

2022-10-17 Thread Sanjay G Nadkarni via openzfs-developer


We have been doing regular performance runs using various workloads over 
NFS (v3, v4.1), SMB3, iSCSI, and FC16 & 32 for the past few years.  
Compression is enabled for all datasets and zvols in our runs. What we 
have observed is that, under load, compression consumes the most CPU 
cycles; after that it is a toss-up between dnode locking (a well-known 
issue) and other things that might come into play depending on the protocol.


At least in our use cases, checksumming of blocks does not appear to be 
an issue.


-Sanjay



On 10/14/22 10:15 AM, Garrett D'Amore wrote:

I can tell from past experience that offloads like what you are 
proposing are rarely worth it.  The setup and teardown of the 
mappings to allow the data transport are not necessarily cheap.  You 
can avoid that by having a preallocated region, but then you need to 
copy the data.  Fortunately, for this case you only need to copy once, 
since the result will be very small compared to the data.


Then there is the complexity (additional branches, edge cases, etc.) 
that has to be coded.  This becomes performance-sapping as well.


Add to this the fact that CPUs are always getting faster, and 
advancements like extensions to the SIMD instructions mean that the 
disparity between the offload and just doing the natural thing inline 
gets ever smaller.


At the end of the day, it’s often the case that your “offload” is 
actually a performance killer.


The exceptions to this are when the work is truly expensive.  For 
example, running (in the old days) RSA on an offload engine makes a 
lot of sense.  (I’m not sure it does for elliptic curve crypto 
though.)  Running 3DES (again, if you wanted to do that, which you 
should not) used to make sense.  AES used to, but with AES-NI not 
anymore.  I suspect that for SHA2 it’s a toss-up.  Fletcher probably 
does not make sense.  If you want to compress, LZJB does not make 
sense, but GZIP (especially at higher levels) would, if you had such a 
device.


Algorithms are always getting better (newer ones that are more 
optimized for actual CPUs etc.) and CPUs are always improving — the 
GPU is probably best reserved for truly expensive operations for which 
it was designed — complex transforms for 3D rendering, expensive 
hashing (although I wish that wasn’t a thing), long running scientific 
analysis, machine learning, etc.


As an I/O accelerator, not so much.
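Garrett's break-even argument can be made concrete with a toy cost model: an offload pays off only when per-byte savings outweigh the fixed setup/transfer cost. All the numbers below are invented for illustration, not measurements.

```python
# Toy model: offload wins when fixed_ns + n * dev_ns_per_byte beats
# n * cpu_ns_per_byte. Cheap work (fletcher-class) never amortizes the
# fixed cost at ZFS record sizes; expensive work (gzip-9-class) can.
def offload_wins(n_bytes, cpu_ns_per_byte, dev_ns_per_byte, fixed_ns):
    return fixed_ns + n_bytes * dev_ns_per_byte < n_bytes * cpu_ns_per_byte

# Cheap checksum-class work on a 128 KiB record: setup dominates.
print(offload_wins(128 * 1024, 0.1, 0.05, 50_000))   # False
# Expensive gzip-9-class work on the same record: offload wins.
print(offload_wins(128 * 1024, 30.0, 1.0, 50_000))   # True
```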
On Oct 14, 2022, 7:52 AM -0700, Thijs Cramer , 
wrote:
I've been searching the GitHub Repository and the Mailing list, but 
couldn't find any discussion about this.

I know it's probably silly, but I would like to understand the workings.

Let's say one could offload the checksumming process to a dedicated 
GPU. This might save some CPU, *but* might increase latency 
considerably.


To my understanding, ZFS uses the fletcher4 checksum algorithm by 
default, and this requires a pass over the data in memory as it 
calculates the checksum. If we skip this step and instead send the 
data to the GPU, that would also require a pass over the data (no 
gains there).
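For reference, the single pass described above is tiny per word: a sketch of the fletcher4 recurrence as I understand ZFS computes it (64-bit accumulators over little-endian 32-bit words), so treat the details as unverified.

```python
import struct

def fletcher4(data: bytes):
    """Single-pass fletcher4 over little-endian 32-bit words.
    ZFS keeps four 64-bit running sums; Python ints are unbounded,
    so we mask to 64 bits to match."""
    a = b = c = d = 0
    mask = (1 << 64) - 1
    for (w,) in struct.iter_unpack('<I', data):
        a = (a + w) & mask
        b = (b + a) & mask
        c = (c + b) & mask
        d = (d + c) & mask
    return a, b, c, d

print(fletcher4(struct.pack('<2I', 1, 2)))  # (3, 4, 5, 6)
```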


The actual calculation does not seem that hard for a CPU: there are 
specific SIMD instructions for calculating specific checksums, and 
after a quick pass over the code, it seems they are already used (if 
available).


I think the only time a GPU could calculate checksums 'faster' is 
with a form of readahead: if you pre-read a lot of data, dump it into 
the GPU's internal memory, and have the GPU calculate checksums of 
all the blocks in parallel, it might be able to do it faster than a CPU.
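The batching idea can be sketched with CPU threads standing in for GPU lanes (sha256 stands in for the checksum; the record size matches ZFS's default 128 KiB, but both choices are just for illustration):

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

RECORD = 128 * 1024  # ZFS default recordsize

def checksum_batch(data: bytes):
    """Split a read-ahead batch into records and checksum them all
    in parallel, returning one digest per record in order."""
    records = [data[i:i + RECORD] for i in range(0, len(data), RECORD)]
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda r: hashlib.sha256(r).hexdigest(), records))

sums = checksum_batch(b'\x00' * (4 * RECORD))
print(len(sums))       # 4 records -> 4 checksums
print(len(set(sums)))  # identical records -> 1 distinct checksum
```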


Has anyone considered the idea?

- Thijs


--
openzfs: openzfs-developer
Permalink: 
https://openzfs.topicbox.com/groups/developer/T2be6db01da63a639-M718cf4283623ae2e907b2356
Delivery options: https://openzfs.topicbox.com/groups/developer/subscription