Re: [developer] GPU Accelerated Checksumming

2022-10-19 Thread Thijs Cramer
@gregord, there are GPUs that have ECC: 
https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units#Tesla 
(marked with a [g]), usually the more workstation- or datacenter-focused ones. I 
think APUs leverage the main system RAM, in which case they could be using 
ECC, but the raw computational power might be insufficient.

@Udo Grabowski, you are absolutely correct, but that doesn't mean it's not 
a cool feature for high-performance machines :-) Maybe AMD Ryzen APUs would 
even be useful in some cases, and those are more prevalent.

@Adam Moss, nvcomp looks really cool to incorporate into @jasonlee's ZIA PR.

--
openzfs: openzfs-developer
Permalink: 
https://openzfs.topicbox.com/groups/developer/T2be6db01da63a639-Mb099c14434e37b4e700427eb
Delivery options: https://openzfs.topicbox.com/groups/developer/subscription


Re: [developer] GPU Accelerated Checksumming

2022-10-18 Thread Rich
You know, if you'd mentioned this two weeks ago, experimenting with it
might have made it into a slide of my talk...

- Rich

On Tue, Oct 18, 2022 at 11:57 AM Adam Moss  wrote:

> In the spirit of the OP and getting expensive things onto the GPU...
> https://developer.nvidia.com/nvcomp - supports zstd, lz4, gzip and more;
> XX or XXX GB/sec etc.
> (Full disclosure: NV employs me.  I don't do anything storage or
> compression related for them though. :) )
>
> Of course, triggering any of this from kernelspace and amortizing the cost
> of getting the results back over the bus are left as an exercise for the
> reader. :D
>
> On Mon, 17 Oct 2022 at 08:45, Sanjay G Nadkarni via openzfs-developer <
> developer@lists.open-zfs.org> wrote:
>
>>
>> We have been doing regular performance runs using various workloads over
>> NFS(v3,v4.1), SMB3, iSCSI and FC16 & 32 for the past few years.
>> Compression is enabled for all datasets and zvols in our runs. What we have
>> observed is, under load, compression consumes the highest CPU cycles, after
>> that it is a toss up of dnode locking (a well known issue) and other things
>> that might come into play depending on the protocol.
>>
>> At least in our use cases check summing of blocks does not appear to an
>> issue.
>>
>> -Sanjay
>>
>>
>>
>> On 10/14/22 10:15 AM, Garrett D'Amore wrote:
>>
>> I can tell from past experiences that offloads like what you are
>> proposing are rarely worth it.  The set up and tear down of the mappings to
>> allow the data transport are not necessarily cheap.  You can avoid that by
>> having a preallocated region, but then you need to copy the
>> data.  Fortunately for this case you only need to copy once, since the
>> result will be very small compared to the data.
>>
>> Then there is the complexity (additional branches, edge cases, etc.) that
>> have to be coded.  These become performance sapping as well.
>>
>> Add to this the fact that CPUs are always getting faster, and
>> advancements like extensions to the SIMD instructions mean that the
>> disparity between the offload and just doing the natural thing inline gets
>> ever smaller.
>>
>> At the end of the day, it’s often the case that your “offload” is
>> actually a performance killer.
>>
>> The exceptions to this are when the work is truly expensive.  For
>> example, running (in the old days) RSA on an offload engine makes a lot of
>> sense.  (I’m not sure it does for elliptic curve crypto though.)  Running
>> 3DES (again if you wanted to do that, which you should not) used to make
>> sense.  AES used to, but with AES-NI not anymore.  I suspect that for SHA2
>> its a toss up.  Fletcher probably does not make sense.  If you want to
>> compress, LZJB does not make sense, but GZIP (especially at higher levels)
>> would, if you had such a device.
>>
>> Algorithms are always getting better (newer ones that are more optimized
>> for actual CPUs etc.) and CPUs are always improving — the GPU is probably
>> best reserved for truly expensive operations for which it was designed —
>> complex transforms for 3D rendering, expensive hashing (although I wish
>> that wasn’t a thing), long running scientific analysis, machine learning,
>> etc.
>>
>> As an I/O accelerator, not so much.
>> On Oct 14, 2022, 7:52 AM -0700, Thijs Cramer 
>> , wrote:
>>
>> I've been searching the GitHub Repository and the Mailing list, but
>> couldn't find any discussion about this.
>> I know it's probably silly, but I would like to understand the workings.
>>
>> Let's say one could offload the Checksumming process to a dedicated GPU.
>> This might  save some amount of CPU, *but* might increase latency
>> incredibly.
>>
>> To my understanding ZFS uses the Fletcher4 Checksum Algorithm by default,
>> and this requires a pass of the data in-memory as it calculates the
>> checksum. If we skip this step, and instead send the data to the GPU, that
>> would also require a pass of the data (no gains there).
>>
>> The actual calculation is not that hard for a CPU it seems, there are
>> specific SIMD instructions for calculating specific Checksums, and after a
>> quick pass over the code, it seems they are already used (if available).
>>
>> I think the only time that a GPU could calculate checksums 'faster', is
>> with a form of readahead.
>> If you would  pre-read a lot of data, and dump it to the GPU's internal
>> memory, and make the GPU calculate checksums of the entire block in
>> parallel, it might be able to do it faster than a CPU.
>>
>> Has anyone considered the idea?
>>
>> - Thijs
>>

--
openzfs

Re: [developer] GPU Accelerated Checksumming

2022-10-18 Thread René J.V. Bertin
On Tuesday October 18 2022 08:57:25 Adam Moss wrote:

>Of course, triggering any of this from kernelspace and amortizing the cost
>of getting the results back over the bus are left as an exercise for the
>reader. :D

But afterwards ... after letting the filesystem handle databases we get to let 
it do book-keeping, crypto-investments and other computational tasks as well? 
8-)


--
openzfs: openzfs-developer
Permalink: 
https://openzfs.topicbox.com/groups/developer/T2be6db01da63a639-M911bf48fe2dd4a9748cfc935
Delivery options: https://openzfs.topicbox.com/groups/developer/subscription


Re: [developer] GPU Accelerated Checksumming

2022-10-18 Thread Adam Moss
In the spirit of the OP and getting expensive things onto the GPU...
https://developer.nvidia.com/nvcomp - supports zstd, lz4, gzip and more; XX
or XXX GB/sec etc.
(Full disclosure: NV employs me.  I don't do anything storage or
compression related for them though. :) )

Of course, triggering any of this from kernelspace and amortizing the cost
of getting the results back over the bus are left as an exercise for the
reader. :D

On Mon, 17 Oct 2022 at 08:45, Sanjay G Nadkarni via openzfs-developer <
developer@lists.open-zfs.org> wrote:

>
> We have been doing regular performance runs using various workloads over
> NFS(v3,v4.1), SMB3, iSCSI and FC16 & 32 for the past few years.
> Compression is enabled for all datasets and zvols in our runs. What we have
> observed is, under load, compression consumes the highest CPU cycles, after
> that it is a toss up of dnode locking (a well known issue) and other things
> that might come into play depending on the protocol.
>
> At least in our use cases check summing of blocks does not appear to an
> issue.
>
> -Sanjay
>
>
>
> On 10/14/22 10:15 AM, Garrett D'Amore wrote:
>
> I can tell from past experiences that offloads like what you are proposing
> are rarely worth it.  The set up and tear down of the mappings to allow the
> data transport are not necessarily cheap.  You can avoid that by having a
> preallocated region, but then you need to copy the data.  Fortunately for
> this case you only need to copy once, since the result will be very small
> compared to the data.
>
> Then there is the complexity (additional branches, edge cases, etc.) that
> have to be coded.  These become performance sapping as well.
>
> Add to this the fact that CPUs are always getting faster, and advancements
> like extensions to the SIMD instructions mean that the disparity between
> the offload and just doing the natural thing inline gets ever smaller.
>
> At the end of the day, it’s often the case that your “offload” is actually
> a performance killer.
>
> The exceptions to this are when the work is truly expensive.  For example,
> running (in the old days) RSA on an offload engine makes a lot of
> sense.  (I’m not sure it does for elliptic curve crypto though.)  Running
> 3DES (again if you wanted to do that, which you should not) used to make
> sense.  AES used to, but with AES-NI not anymore.  I suspect that for SHA2
> its a toss up.  Fletcher probably does not make sense.  If you want to
> compress, LZJB does not make sense, but GZIP (especially at higher levels)
> would, if you had such a device.
>
> Algorithms are always getting better (newer ones that are more optimized
> for actual CPUs etc.) and CPUs are always improving — the GPU is probably
> best reserved for truly expensive operations for which it was designed —
> complex transforms for 3D rendering, expensive hashing (although I wish
> that wasn’t a thing), long running scientific analysis, machine learning,
> etc.
>
> As an I/O accelerator, not so much.
> On Oct 14, 2022, 7:52 AM -0700, Thijs Cramer 
> , wrote:
>
> I've been searching the GitHub Repository and the Mailing list, but
> couldn't find any discussion about this.
> I know it's probably silly, but I would like to understand the workings.
>
> Let's say one could offload the Checksumming process to a dedicated GPU.
> This might  save some amount of CPU, *but* might increase latency
> incredibly.
>
> To my understanding ZFS uses the Fletcher4 Checksum Algorithm by default,
> and this requires a pass of the data in-memory as it calculates the
> checksum. If we skip this step, and instead send the data to the GPU, that
> would also require a pass of the data (no gains there).
>
> The actual calculation is not that hard for a CPU it seems, there are
> specific SIMD instructions for calculating specific Checksums, and after a
> quick pass over the code, it seems they are already used (if available).
>
> I think the only time that a GPU could calculate checksums 'faster', is
> with a form of readahead.
> If you would  pre-read a lot of data, and dump it to the GPU's internal
> memory, and make the GPU calculate checksums of the entire block in
> parallel, it might be able to do it faster than a CPU.
>
> Has anyone considered the idea?
>
> - Thijs
>

--
openzfs: openzfs-developer
Permalink: 
https://openzfs.topicbox.com/groups/developer/T2be6db01da63a639-M9272d3bd96fc81b98d239469
Delivery options: https://openzfs.topicbox.com/groups/developer/subscription


Re: [developer] GPU Accelerated Checksumming

2022-10-17 Thread jasonlee via openzfs-developer
Hi. Instead of offloading individual operations and bringing the results back, 
we have offloaded the entire write pipeline (that we use). This has resulted in 
a 16x increase in write performance with our computational storage processor.

https://github.com/openzfs/zfs/pull/13628

Jason Lee
--
openzfs: openzfs-developer
Permalink: 
https://openzfs.topicbox.com/groups/developer/T2be6db01da63a639-M00c6f7251ec0f7d8ef00e9ff
Delivery options: https://openzfs.topicbox.com/groups/developer/subscription


Re: [developer] GPU Accelerated Checksumming

2022-10-17 Thread Garrett D'Amore
Cool.  Learn something new every day. ;-)
On Oct 17, 2022, 12:26 PM -0700, Matthew Ahrens via openzfs-developer 
, wrote:
> Yes, to expand on what Rich said, there was a talk about Intel QAT offload of 
> gzip at the 2018 OpenZFS Developer Summit:
> ZFS Hardware Acceleration with QAT, Weigang Li, Intel (slides, video).
> The results presented show >2x throughput with <1/2 the CPU used, and similar 
> compression to gzip software (I'm guessing with the default gzip level).
>
> QAT support has been in ZFS since 0.8.0.
>
> --matt
>
> > On Mon, Oct 17, 2022 at 12:15 PM Rich  wrote:
> > > I believe the Intel QAT support we have will happily offload gzip for 
> > > you, though I don't know if it makes any promises about what level 
> > > equivalent of gzip it hands you back...
> > >
> > > - Rich
> > >
> > > > On Mon, Oct 17, 2022 at 12:02 PM Garrett D'Amore  
> > > > wrote:
> > > > > That’s about what I would have expected.
> > > > >
> > > > > Having an offload for high levels of compression (e.g. GZIP 9 or 
> > > > > something) would be cool, but I don’t think it exists yet.  And it 
> > > > > would be hard to write that in a way that doesn’t punish things for 
> > > > > the folks who *don’t* have the offload hardware.
> > > > >
> > > > > • Garrett
> > > > >
> > > > > On Oct 17, 2022, 8:44 AM -0700, Sanjay G Nadkarni via 
> > > > > openzfs-developer , wrote:
> > > > > >
> > > > > >
> > > > > > We have been doing regular performance runs using various workloads 
> > > > > > over NFS(v3,v4.1), SMB3, iSCSI and FC16 & 32 for the past few 
> > > > > > years. Compression is enabled for all datasets and zvols in our 
> > > > > > runs. What we have observed is, under load, compression consumes 
> > > > > > the highest CPU cycles, after that it is a toss up of dnode locking 
> > > > > > (a well known issue) and other things that might come into play 
> > > > > > depending on the protocol.
> > > > > >
> > > > > > At least in our use cases check summing of blocks does not appear 
> > > > > > to an issue.
> > > > > >
> > > > > > -Sanjay
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > On 10/14/22 10:15 AM, Garrett D'Amore wrote:
> > > > > > > I can tell from past experiences that offloads like what you are 
> > > > > > > proposing are rarely worth it.  The set up and tear down of the 
> > > > > > > mappings to allow the data transport are not necessarily cheap.  
> > > > > > > You can avoid that by having a preallocated region, but then you 
> > > > > > > need to copy the data.  Fortunately for this case you only need 
> > > > > > > to copy once, since the result will be very small compared to the 
> > > > > > > data.
> > > > > > >
> > > > > > > Then there is the complexity (additional branches, edge cases, 
> > > > > > > etc.) that have to be coded.  These become performance sapping as 
> > > > > > > well.
> > > > > > >
> > > > > > > Add to this the fact that CPUs are always getting faster, and 
> > > > > > > advancements like extensions to the SIMD instructions mean that 
> > > > > > > the disparity between the offload and just doing the natural 
> > > > > > > thing inline gets ever smaller.
> > > > > > >
> > > > > > > At the end of the day, it’s often the case that your “offload” is 
> > > > > > > actually a performance killer.
> > > > > > >
> > > > > > > The exceptions to this are when the work is truly expensive.  For 
> > > > > > > example, running (in the old days) RSA on an offload engine makes 
> > > > > > > a lot of sense.  (I’m not sure it does for elliptic curve crypto 
> > > > > > > though.)  Running 3DES (again if you wanted to do that, which you 
> > > > > > > should not) used to make sense.  AES used to, but with AES-NI not 
> > > > > > > anymore.  I suspect that for SHA2 its a toss up.  Fletcher 
> > > > > > > probably does not make sense.  If you want to compress, LZJB does 
> > > > > > > not make sense, but GZIP (especially at higher levels) would, if 
> > > > > > > you had such a device.
> > > > > > >
> > > > > > > Algorithms are always getting better (newer ones that are more 
> > > > > > > optimized for actual CPUs etc.) and CPUs are always improving — 
> > > > > > > the GPU is probably best reserved for truly expensive operations 
> > > > > > > for which it was designed — complex transforms for 3D rendering, 
> > > > > > > expensive hashing (although I wish that wasn’t a thing), long 
> > > > > > > running scientific analysis, machine learning, etc.
> > > > > > >
> > > > > > > As an I/O accelerator, not so much.
> > > > > > > On Oct 14, 2022, 7:52 AM -0700, Thijs Cramer 
> > > > > > > , wrote:
> > > > > > > > I've been searching the GitHub Repository and the Mailing list, 
> > > > > > > > but couldn't find any discussion about this.
> > > > > > > > I know it's probably silly, but I would like to understand the 
> > > > > > > > workings.
> > > > > > > >
> > > > > > > > Let's say one could offload the Checksumming process to a 
> > > > > > > > dedicated GPU. This might save some amount of CPU, *but* might
> > > > > > > > increase latency incredibly. [...]

Re: [developer] GPU Accelerated Checksumming

2022-10-17 Thread Matthew Ahrens via openzfs-developer
Yes, to expand on what Rich said, there was a talk about Intel QAT offload
of gzip at the 2018 OpenZFS Developer Summit:
ZFS Hardware Acceleration with QAT, Weigang Li, Intel (slides, video).
The results presented show >2x throughput with <1/2 the CPU used, and similar
compression to gzip software (I'm guessing with the default gzip level).

QAT support has been in ZFS since 0.8.0.

--matt

On Mon, Oct 17, 2022 at 12:15 PM Rich  wrote:

> I believe the Intel QAT support we have will happily offload gzip for you,
> though I don't know if it makes any promises about what level equivalent of
> gzip it hands you back...
>
> - Rich
>
> On Mon, Oct 17, 2022 at 12:02 PM Garrett D'Amore 
> wrote:
>
>> That’s about what I would have expected.
>>
>> Having an offload for high levels of compression (e.g. GZIP 9 or
>> something) would be cool, but I don’t think it exists yet.  And it would be
>> hard to write that in a way that doesn’t punish things for the folks who
>> *don’t* have the offload hardware.
>>
>>- Garrett
>>
>> On Oct 17, 2022, 8:44 AM -0700, Sanjay G Nadkarni via openzfs-developer <
>> developer@lists.open-zfs.org>, wrote:
>>
>>
>>
>> We have been doing regular performance runs using various workloads over
>> NFS(v3,v4.1), SMB3, iSCSI and FC16 & 32 for the past few years. Compression
>> is enabled for all datasets and zvols in our runs. What we have observed
>> is, under load, compression consumes the highest CPU cycles, after that it
>> is a toss up of dnode locking (a well known issue) and other things that
>> might come into play depending on the protocol.
>>
>> At least in our use cases check summing of blocks does not appear to an
>> issue.
>>
>> -Sanjay
>>
>>
>>
>>
>> On 10/14/22 10:15 AM, Garrett D'Amore wrote:
>>
>> I can tell from past experiences that offloads like what you are
>> proposing are rarely worth it.  The set up and tear down of the mappings to
>> allow the data transport are not necessarily cheap.  You can avoid that by
>> having a preallocated region, but then you need to copy the
>> data.  Fortunately for this case you only need to copy once, since the
>> result will be very small compared to the data.
>>
>> Then there is the complexity (additional branches, edge cases, etc.) that
>> have to be coded.  These become performance sapping as well.
>>
>> Add to this the fact that CPUs are always getting faster, and
>> advancements like extensions to the SIMD instructions mean that the
>> disparity between the offload and just doing the natural thing inline gets
>> ever smaller.
>>
>> At the end of the day, it’s often the case that your “offload” is
>> actually a performance killer.
>>
>> The exceptions to this are when the work is truly expensive.  For
>> example, running (in the old days) RSA on an offload engine makes a lot of
>> sense.  (I’m not sure it does for elliptic curve crypto though.)  Running
>> 3DES (again if you wanted to do that, which you should not) used to make
>> sense.  AES used to, but with AES-NI not anymore.  I suspect that for SHA2
>> its a toss up.  Fletcher probably does not make sense.  If you want to
>> compress, LZJB does not make sense, but GZIP (especially at higher levels)
>> would, if you had such a device.
>>
>> Algorithms are always getting better (newer ones that are more optimized
>> for actual CPUs etc.) and CPUs are always improving — the GPU is probably
>> best reserved for truly expensive operations for which it was designed —
>> complex transforms for 3D rendering, expensive hashing (although I wish
>> that wasn’t a thing), long running scientific analysis, machine learning,
>> etc.
>>
>> As an I/O accelerator, not so much.
>> On Oct 14, 2022, 7:52 AM -0700, Thijs Cramer 
>> , wrote:
>>
>> I've been searching the GitHub Repository and the Mailing list, but
>> couldn't find any discussion about this.
>> I know it's probably silly, but I would like to understand the workings.
>>
>> Let's say one could offload the Checksumming process to a dedicated GPU.
>> This might  save some amount of CPU, *but* might increase latency
>> incredibly.
>>
>> To my understanding ZFS uses the Fletcher4 Checksum Algorithm by default,
>> and this requires a pass of the data in-memory as it calculates the
>> checksum. If we skip this step, and instead send the data to the GPU, that
>> would also require a pass of the data (no gains there).
>>
>> The actual calculation is not that hard for a CPU it seems, there are
>> specific SIMD instructions for calculating specific Checksums, and after a
>> quick pass over the code, it seems they are already used (if available).
>>
>> I think the only time that a GPU could calculate checksums 'faster', is
>> with a form of readahead.
>> If you would pre-read a lot of data, and dump it to the GPU's internal
>> memory, and make the GPU calculate checksums of the entire block in
>> parallel, it might be able to do it faster than a CPU. [...]

Re: [developer] GPU Accelerated Checksumming

2022-10-17 Thread Rich
I believe the Intel QAT support we have will happily offload gzip for you,
though I don't know if it makes any promises about what level equivalent of
gzip it hands you back...

- Rich

On Mon, Oct 17, 2022 at 12:02 PM Garrett D'Amore  wrote:

> That’s about what I would have expected.
>
> Having an offload for high levels of compression (e.g. GZIP 9 or
> something) would be cool, but I don’t think it exists yet.  And it would be
> hard to write that in a way that doesn’t punish things for the folks who
> *don’t* have the offload hardware.
>
>- Garrett
>
> On Oct 17, 2022, 8:44 AM -0700, Sanjay G Nadkarni via openzfs-developer <
> developer@lists.open-zfs.org>, wrote:
>
>
>
> We have been doing regular performance runs using various workloads over
> NFS(v3,v4.1), SMB3, iSCSI and FC16 & 32 for the past few years. Compression
> is enabled for all datasets and zvols in our runs. What we have observed
> is, under load, compression consumes the highest CPU cycles, after that it
> is a toss up of dnode locking (a well known issue) and other things that
> might come into play depending on the protocol.
>
> At least in our use cases check summing of blocks does not appear to an
> issue.
>
> -Sanjay
>
>
>
>
> On 10/14/22 10:15 AM, Garrett D'Amore wrote:
>
> I can tell from past experiences that offloads like what you are proposing
> are rarely worth it.  The set up and tear down of the mappings to allow the
> data transport are not necessarily cheap.  You can avoid that by having a
> preallocated region, but then you need to copy the data.  Fortunately for
> this case you only need to copy once, since the result will be very small
> compared to the data.
>
> Then there is the complexity (additional branches, edge cases, etc.) that
> have to be coded.  These become performance sapping as well.
>
> Add to this the fact that CPUs are always getting faster, and advancements
> like extensions to the SIMD instructions mean that the disparity between
> the offload and just doing the natural thing inline gets ever smaller.
>
> At the end of the day, it’s often the case that your “offload” is actually
> a performance killer.
>
> The exceptions to this are when the work is truly expensive.  For example,
> running (in the old days) RSA on an offload engine makes a lot of
> sense.  (I’m not sure it does for elliptic curve crypto though.)  Running
> 3DES (again if you wanted to do that, which you should not) used to make
> sense.  AES used to, but with AES-NI not anymore.  I suspect that for SHA2
> its a toss up.  Fletcher probably does not make sense.  If you want to
> compress, LZJB does not make sense, but GZIP (especially at higher levels)
> would, if you had such a device.
>
> Algorithms are always getting better (newer ones that are more optimized
> for actual CPUs etc.) and CPUs are always improving — the GPU is probably
> best reserved for truly expensive operations for which it was designed —
> complex transforms for 3D rendering, expensive hashing (although I wish
> that wasn’t a thing), long running scientific analysis, machine learning,
> etc.
>
> As an I/O accelerator, not so much.
> On Oct 14, 2022, 7:52 AM -0700, Thijs Cramer 
> , wrote:
>
> I've been searching the GitHub Repository and the Mailing list, but
> couldn't find any discussion about this.
> I know it's probably silly, but I would like to understand the workings.
>
> Let's say one could offload the Checksumming process to a dedicated GPU.
> This might  save some amount of CPU, *but* might increase latency
> incredibly.
>
> To my understanding ZFS uses the Fletcher4 Checksum Algorithm by default,
> and this requires a pass of the data in-memory as it calculates the
> checksum. If we skip this step, and instead send the data to the GPU, that
> would also require a pass of the data (no gains there).
>
> The actual calculation is not that hard for a CPU it seems, there are
> specific SIMD instructions for calculating specific Checksums, and after a
> quick pass over the code, it seems they are already used (if available).
>
> I think the only time that a GPU could calculate checksums 'faster', is
> with a form of readahead.
> If you would  pre-read a lot of data, and dump it to the GPU's internal
> memory, and make the GPU calculate checksums of the entire block in
> parallel, it might be able to do it faster than a CPU.
>
> Has anyone considered the idea?
>
> - Thijs
>

--
openzfs: openzfs-developer
Permalink: 
https://openzfs.topicbox.com/groups/developer/T2be6db01da63a639-Mdd15974624ca67d893ee40c0
Delivery options: https://openzfs.topicbox.com/groups/developer/subscription

Re: [developer] GPU Accelerated Checksumming

2022-10-17 Thread Garrett D'Amore
That’s about what I would have expected.

Having an offload for high levels of compression (e.g. GZIP 9 or something) 
would be cool, but I don’t think it exists yet.  And it would be hard to write 
that in a way that doesn’t punish things for the folks who *don’t* have the 
offload hardware.
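
The usual way to avoid punishing them is to probe once and dispatch through a
function pointer; a sketch of the pattern (all names hypothetical, nothing that
exists in OpenZFS today):

    /*
     * Hypothetical fallback pattern: probe for an offload device once at
     * init, keep a function pointer, and machines without the hardware
     * only ever see the software path.  Illustration only.
     */
    typedef int (*gzip_fn_t)(void *dst, const void *src, int len, int level);

    static int
    gzip_soft(void *dst, const void *src, int len, int level)
    {
            /* stand-in for the existing software gzip path */
            (void)dst; (void)src; (void)level;
            return (len);
    }

    static int
    gzip_hw(void *dst, const void *src, int len, int level)
    {
            /* stand-in for a hardware-offload path */
            (void)dst; (void)src; (void)level;
            return (len);
    }

    static int
    have_offload_device(void)
    {
            return (0);     /* stand-in probe */
    }

    static gzip_fn_t gzip_impl;

    void
    gzip_impl_init(void)
    {
            gzip_impl = have_offload_device() ? gzip_hw : gzip_soft;
    }

OpenZFS already selects among its fletcher_4 implementations in roughly this
way, so the pattern itself wouldn't be new to the code base.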

• Garrett

On Oct 17, 2022, 8:44 AM -0700, Sanjay G Nadkarni via openzfs-developer 
, wrote:
>
>
> We have been doing regular performance runs using various workloads over 
> NFS(v3,v4.1), SMB3, iSCSI and FC16 & 32 for the past few years. Compression 
> is enabled for all datasets and zvols in our runs. What we have observed is, 
> under load, compression consumes the highest CPU cycles, after that it is a 
> toss up of dnode locking (a well known issue) and other things that might 
> come into play depending on the protocol.
>
> At least in our use cases check summing of blocks does not appear to an issue.
>
> -Sanjay
>
>
>
>
> On 10/14/22 10:15 AM, Garrett D'Amore wrote:
> > I can tell from past experiences that offloads like what you are proposing 
> > are rarely worth it.  The set up and tear down of the mappings to allow the 
> > data transport are not necessarily cheap.  You can avoid that by having a 
> > preallocated region, but then you need to copy the data.  Fortunately for 
> > this case you only need to copy once, since the result will be very small 
> > compared to the data.
> >
> > Then there is the complexity (additional branches, edge cases, etc.) that 
> > have to be coded.  These become performance sapping as well.
> >
> > Add to this the fact that CPUs are always getting faster, and advancements 
> > like extensions to the SIMD instructions mean that the disparity between 
> > the offload and just doing the natural thing inline gets ever smaller.
> >
> > At the end of the day, it’s often the case that your “offload” is actually 
> > a performance killer.
> >
> > The exceptions to this are when the work is truly expensive.  For example, 
> > running (in the old days) RSA on an offload engine makes a lot of sense.  
> > (I’m not sure it does for elliptic curve crypto though.)  Running 3DES 
> > (again if you wanted to do that, which you should not) used to make sense.  
> > AES used to, but with AES-NI not anymore.  I suspect that for SHA2 its a 
> > toss up.  Fletcher probably does not make sense.  If you want to compress, 
> > LZJB does not make sense, but GZIP (especially at higher levels) would, if 
> > you had such a device.
> >
> > Algorithms are always getting better (newer ones that are more optimized 
> > for actual CPUs etc.) and CPUs are always improving — the GPU is probably 
> > best reserved for truly expensive operations for which it was designed — 
> > complex transforms for 3D rendering, expensive hashing (although I wish 
> > that wasn’t a thing), long running scientific analysis, machine learning, 
> > etc.
> >
> > As an I/O accelerator, not so much.
> > On Oct 14, 2022, 7:52 AM -0700, Thijs Cramer , 
> > wrote:
> > > I've been searching the GitHub Repository and the Mailing list, but 
> > > couldn't find any discussion about this.
> > > I know it's probably silly, but I would like to understand the workings.
> > >
> > > Let's say one could offload the Checksumming process to a dedicated GPU. 
> > > This might  save some amount of CPU, *but* might increase latency 
> > > incredibly.
> > >
> > > To my understanding ZFS uses the Fletcher4 Checksum Algorithm by default, 
> > > and this requires a pass of the data in-memory as it calculates the 
> > > checksum. If we skip this step, and instead send the data to the GPU, 
> > > that would also require a pass of the data (no gains there).
> > >
> > > The actual calculation is not that hard for a CPU it seems, there are 
> > > specific SIMD instructions for calculating specific Checksums, and after 
> > > a quick pass over the code, it seems they are already used (if available).
> > >
> > > I think the only time that a GPU could calculate checksums 'faster', is 
> > > with a form of readahead.
> > > If you would  pre-read a lot of data, and dump it to the GPU's internal 
> > > memory, and make the GPU calculate checksums of the entire block in 
> > > parallel, it might be able to do it faster than a CPU.
> > >
> > > Has anyone considered the idea?
> > >
> > > - Thijs

--
openzfs: openzfs-developer
Permalink: 
https://openzfs.topicbox.com/groups/developer/T2be6db01da63a639-M58be0a7684e7d44a39f75747
Delivery options: https://openzfs.topicbox.com/groups/developer/subscription


Re: [developer] GPU Accelerated Checksumming

2022-10-17 Thread Sanjay G Nadkarni via openzfs-developer


We have been doing regular performance runs using various workloads over 
NFS (v3, v4.1), SMB3, iSCSI and FC16 & 32 for the past few years.  
Compression is enabled for all datasets and zvols in our runs. What we 
have observed is that, under load, compression consumes the most CPU 
cycles; after that it is a toss-up between dnode locking (a well-known issue) 
and other things that might come into play depending on the protocol.


At least in our use cases, checksumming of blocks does not appear to be an 
issue.


-Sanjay



On 10/14/22 10:15 AM, Garrett D'Amore wrote:

I can tell from past experiences that offloads like what you are 
proposing are rarely worth it.  The set up and tear down of the 
mappings to allow the data transport are not necessarily cheap.  You 
can avoid that by having a preallocated region, but then you need to 
copy the data.  Fortunately for this case you only need to copy once, 
since the result will be very small compared to the data.


Then there is the complexity (additional branches, edge cases, etc.) 
that have to be coded.  These become performance sapping as well.


Add to this the fact that CPUs are always getting faster, and 
advancements like extensions to the SIMD instructions mean that the 
disparity between the offload and just doing the natural thing inline 
gets ever smaller.


At the end of the day, it’s often the case that your “offload” is 
actually a performance killer.


The exceptions to this are when the work is truly expensive.  For 
example, running (in the old days) RSA on an offload engine makes a 
lot of sense.  (I’m not sure it does for elliptic curve crypto 
though.)  Running 3DES (again if you wanted to do that, which you 
should not) used to make sense.  AES used to, but with AES-NI not 
anymore.  I suspect that for SHA2 its a toss up.  Fletcher probably 
does not make sense.  If you want to compress, LZJB does not make 
sense, but GZIP (especially at higher levels) would, if you had such a 
device.


Algorithms are always getting better (newer ones that are more 
optimized for actual CPUs etc.) and CPUs are always improving — the 
GPU is probably best reserved for truly expensive operations for which 
it was designed — complex transforms for 3D rendering, expensive 
hashing (although I wish that wasn’t a thing), long running scientific 
analysis, machine learning, etc.


As an I/O accelerator, not so much.
On Oct 14, 2022, 7:52 AM -0700, Thijs Cramer , 
wrote:
I've been searching the GitHub Repository and the Mailing list, but 
couldn't find any discussion about this.

I know it's probably silly, but I would like to understand the workings.

Let's say one could offload the Checksumming process to a dedicated 
GPU. This might  save some amount of CPU, *but* might increase 
latency incredibly.


To my understanding ZFS uses the Fletcher4 Checksum Algorithm by 
default, and this requires a pass of the data in-memory as it 
calculates the checksum. If we skip this step, and instead send the 
data to the GPU, that would also require a pass of the data (no gains 
there).


The actual calculation is not that hard for a CPU it seems, there are 
specific SIMD instructions for calculating specific Checksums, and 
after a quick pass over the code, it seems they are already used (if 
available).


I think the only time that a GPU could calculate checksums 'faster', 
is with a form of readahead.
If you would  pre-read a lot of data, and dump it to the GPU's 
internal memory, and make the GPU calculate checksums of the entire 
block in parallel, it might be able to do it faster than a CPU.


Has anyone considered the idea?

- Thijs


--
openzfs: openzfs-developer
Permalink: 
https://openzfs.topicbox.com/groups/developer/T2be6db01da63a639-M718cf4283623ae2e907b2356
Delivery options: https://openzfs.topicbox.com/groups/developer/subscription


Re: [developer] GPU Accelerated Checksumming

2022-10-14 Thread Gregor Kopka (@zfs-discuss)




On 14.10.22 16:51, Thijs Cramer wrote:
I've been searching the GitHub Repository and the Mailing list, but 
couldn't find any discussion about this.

I know it's probably silly, but I would like to understand the workings.

Let's say one could offload the Checksumming process to a dedicated 
GPU. This might  save some amount of CPU, *but* might increase latency 
incredibly.


To my understanding ZFS uses the Fletcher4 Checksum Algorithm by 
default, and this requires a pass of the data in-memory as it 
calculates the checksum. If we skip this step, and instead send the 
data to the GPU, that would also require a pass of the data (no gains 
there).


The actual calculation is not that hard for a CPU it seems, there are 
specific SIMD instructions for calculating specific Checksums, and 
after a quick pass over the code, it seems they are already used (if 
available).


I think the only time that a GPU could calculate checksums 'faster', 
is with a form of readahead.
If you would  pre-read a lot of data, and dump it to the GPU's 
internal memory, and make the GPU calculate checksums of the entire 
block in parallel, it might be able to do it faster than a CPU.


Has anyone considered the idea?


Video cards usually don't have ECC RAM:
healthy data in, garbage out?

Gregor

--
openzfs: openzfs-developer
Permalink: 
https://openzfs.topicbox.com/groups/developer/T2be6db01da63a639-M2ee97cefe293c85251d84022
Delivery options: https://openzfs.topicbox.com/groups/developer/subscription


Re: [developer] GPU Accelerated Checksumming

2022-10-14 Thread Garrett D'Amore
I can tell you from past experience that offloads like what you are proposing are 
rarely worth it.  The setup and teardown of the mappings to allow the data 
transport are not necessarily cheap.  You can avoid that by having a 
preallocated region, but then you need to copy the data.  Fortunately, in this 
case you only need to copy once, since the result will be very small compared 
to the data.

Then there is the complexity (additional branches, edge cases, etc.) that has 
to be coded.  These become performance-sapping as well.

Add to this the fact that CPUs are always getting faster, and advancements like 
extensions to the SIMD instructions mean that the disparity between the offload 
and just doing the natural thing inline gets ever smaller.

At the end of the day, it’s often the case that your “offload” is actually a 
performance killer.

The exceptions to this are when the work is truly expensive.  For example, 
running (in the old days) RSA on an offload engine makes a lot of sense.  (I’m 
not sure it does for elliptic curve crypto though.)  Running 3DES (again if you 
wanted to do that, which you should not) used to make sense.  AES used to, but 
with AES-NI not anymore.  I suspect that for SHA2 it's a toss-up.  Fletcher 
probably does not make sense.  If you want to compress, LZJB does not make 
sense, but GZIP (especially at higher levels) would, if you had such a device.
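
For a sense of scale, the whole Fletcher-4 inner loop is roughly the following 
(a simplified scalar sketch for illustration, not the actual OpenZFS fletcher_4 
code, which differs in details and also carries hand-vectorized variants):

    /*
     * Simplified scalar Fletcher-4 sketch: four 64-bit additions per
     * 32-bit word of input.  Illustration only.
     */
    #include <stdint.h>
    #include <stddef.h>

    typedef struct { uint64_t a, b, c, d; } fletcher4_t;

    static void
    fletcher4_sketch(const uint32_t *buf, size_t words, fletcher4_t *ck)
    {
            uint64_t a = 0, b = 0, c = 0, d = 0;

            for (size_t i = 0; i < words; i++) {
                    a += buf[i];    /* running sum of the words */
                    b += a;         /* sum of the running sums  */
                    c += b;         /* third-order sum          */
                    d += c;         /* fourth-order sum         */
            }
            ck->a = a; ck->b = b; ck->c = c; ck->d = d;
    }

Four integer adds per 32-bit word is essentially memory-bandwidth bound on a 
modern CPU, which is why shipping the data across the bus to compute it 
elsewhere is such a hard sell.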

Algorithms are always getting better (newer ones that are more optimized for 
actual CPUs etc.) and CPUs are always improving — the GPU is probably best 
reserved for truly expensive operations for which it was designed — complex 
transforms for 3D rendering, expensive hashing (although I wish that wasn’t a 
thing), long-running scientific analysis, machine learning, etc.

As an I/O accelerator, not so much.
On Oct 14, 2022, 7:52 AM -0700, Thijs Cramer , wrote:
> I've been searching the GitHub Repository and the Mailing list, but couldn't 
> find any discussion about this.
> I know it's probably silly, but I would like to understand the workings.
>
> Let's say one could offload the Checksumming process to a dedicated GPU. This 
> might  save some amount of CPU, *but* might increase latency incredibly.
>
> To my understanding ZFS uses the Fletcher4 Checksum Algorithm by default, and 
> this requires a pass of the data in-memory as it calculates the checksum. If 
> we skip this step, and instead send the data to the GPU, that would also 
> require a pass of the data (no gains there).
>
> The actual calculation is not that hard for a CPU it seems, there are 
> specific SIMD instructions for calculating specific Checksums, and after a 
> quick pass over the code, it seems they are already used (if available).
>
> I think the only time that a GPU could calculate checksums 'faster', is with 
> a form of readahead.
> If you would  pre-read a lot of data, and dump it to the GPU's internal 
> memory, and make the GPU calculate checksums of the entire block in parallel, 
> it might be able to do it faster than a CPU.
>
> Has anyone considered the idea?
>
> - Thijs

--
openzfs: openzfs-developer
Permalink: 
https://openzfs.topicbox.com/groups/developer/T2be6db01da63a639-M522b09520eb8e026499c20e8
Delivery options: https://openzfs.topicbox.com/groups/developer/subscription


Re: [developer] GPU Accelerated Checksumming

2022-10-14 Thread Nazim Can Bedir via openzfs-developer
In addition to that, even if it does have one, there is a high chance that the 
GPU is already being used for intensive GPGPU calculations.
 Original Message 
On 14 Oct 2022, 18:03, Udo Grabowski (IMK) wrote:

> On 14/10/2022 16:51, Thijs Cramer wrote:
> > I've been searching the GitHub Repository and the Mailing list, but
> > couldn't find any discussion about this.
> > [...]
> > Has anyone considered the idea?
>
> A typical ZFS server simply does not have a GPU.
>
> --
> Dr.Udo Grabowski   Inst.f.Meteorology & Climate Research IMK-ASF-SAT
> http://www.imk-asf.kit.edu/english/sat.php
> KIT - Karlsruhe Institute of Technology   http://www.kit.edu
> Postfach 3640,76021 Karlsruhe,Germany T:(+49)721 608-26026 F:-926026
--
openzfs: openzfs-developer
Permalink: 
https://openzfs.topicbox.com/groups/developer/T2be6db01da63a639-M729ffb93eba9729ad028c8c8
Delivery options: https://openzfs.topicbox.com/groups/developer/subscription


Re: [developer] GPU Accelerated Checksumming

2022-10-14 Thread Udo Grabowski (IMK)

On 14/10/2022 16:51, Thijs Cramer wrote:

I've been searching the GitHub Repository and the Mailing list, but couldn't
find any discussion about this.
I know it's probably silly, but I would like to understand the workings.

Let's say one could offload the Checksumming process to a dedicated GPU. This
might  save some amount of CPU, *but* might increase latency incredibly.

To my understanding ZFS uses the Fletcher4 Checksum Algorithm by default, and
this requires a pass of the data in-memory as it calculates the checksum. If we
skip this step, and instead send the data to the GPU, that would also require a
pass of the data (no gains there).

The actual calculation is not that hard for a CPU it seems, there are specific
SIMD instructions for calculating specific Checksums, and after a quick pass
over the code, it seems they are already used (if available).

I think the only time that a GPU could calculate checksums 'faster', is with a
form of readahead.
If you would  pre-read a lot of data, and dump it to the GPU's internal memory,
and make the GPU calculate checksums of the entire block in parallel, it might
be able to do it faster than a CPU.
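
(For illustration, that batching idea might look something like the naive CUDA
sketch below: one thread per record, each thread running the ordinary
sequential Fletcher-4 over its own record, so all the parallelism comes from
checksumming many records at once.  Nothing like this exists in OpenZFS; a real
attempt would also have to deal with memory coalescing, pinned host buffers and
the cost of moving the data over the bus.)

    #include <stdint.h>

    struct cksum4 { uint64_t a, b, c, d; };

    /* One thread per record; each thread walks its record sequentially. */
    __global__ void
    fletcher4_batch(const uint32_t *data, size_t words_per_record,
        size_t nrecords, struct cksum4 *out)
    {
            size_t rec = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
            if (rec >= nrecords)
                    return;

            const uint32_t *p = data + rec * words_per_record;
            uint64_t a = 0, b = 0, c = 0, d = 0;

            for (size_t i = 0; i < words_per_record; i++) {
                    a += p[i];
                    b += a;
                    c += b;
                    d += c;
            }
            out[rec].a = a; out[rec].b = b;
            out[rec].c = c; out[rec].d = d;
    }

    /*
     * Host side would cudaMemcpy() the batch of records to the device,
     * launch e.g. fletcher4_batch<<<(nrecords + 255) / 256, 256>>>(...),
     * and copy the (tiny) array of checksums back.
     */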

Has anyone considered the idea?


A typical ZFS server simply does not have a GPU.
--
Dr.Udo Grabowski   Inst.f.Meteorology & Climate Research IMK-ASF-SAT
http://www.imk-asf.kit.edu/english/sat.php
KIT - Karlsruhe Institute of Technology   http://www.kit.edu
Postfach 3640,76021 Karlsruhe,Germany T:(+49)721 608-26026 F:-926026





--
openzfs: openzfs-developer
Permalink: 
https://openzfs.topicbox.com/groups/developer/T2be6db01da63a639-Mdbb91bfa30ea2cdfdf9e24db
Delivery options: https://openzfs.topicbox.com/groups/developer/subscription