RE: CRC32C Parallel Computation Optimization on ARM

2023-12-03 Thread Xiang Gao
On Date: Thu, 30 Nov 2023 14:54:26PM -0600, Nathan Bossart wrote:
>>pg_crc32c_armv8.o: CFLAGS += ${CFLAGS_CRC} ${CFLAGS_CRYPTO}
>>
>> It does not work correctly. CFLAGS ='-march=armv8-a+crc,
>> -march=armv8-a+crypto', what actually works is '-march=armv8-a+crypto'.
>>
>> We set a new variable CLAGS_CRC_CRYPTO,In configure.ac,
>>
>> If test x"$CFLAGS_CRC" != x"" || test x"CFLAGS_CRYPTO" != x""; then
>>   CLAGS_CRC_CRYPTO = '-march=armv8-a+crc+crypto'
>> fi
>>
>> then in makefile,
>> pg_crc32c_armv8.o: CFLAGS +=${ CLAGS_CRC_CRYPTO }
>
>Ah, I see.  We need to append +crc and/or +crypto based on what the
>compiler understands.  That seems fine to me...

This is the latest patch. Looking forward to your feedback, thanks!
IMPORTANT NOTICE: The contents of this email and any attachments are 
confidential and may also be privileged. If you are not the intended recipient, 
please notify the sender immediately and do not disclose the contents to any 
other person, use it for any purpose, or store or copy the information in any 
medium. Thank you.


0008-crc32c-parallel-computation-optimization-on-arm.patch
Description:  0008-crc32c-parallel-computation-optimization-on-arm.patch


RE: Question about the Implementation of vector32_is_highbit_set on ARM

2023-11-23 Thread Xiang Gao
On Date: Mon, 20 Nov 2023 16:05:43PM +0700, John Naylor wrote:

>On Wed, Nov 8, 2023 at 2:44=E2=80=AFPM Xiang Gao  wrote:
>>  * function. We could instead adopt the behavior of Arm's vmaxvq_u32(), i=
>.e.
>>  * check each 32-bit element, but that would require an additional mask
>>  * operation on x86.
>>  */
>
>> But I still don't understand why the vmaxvq_u32 intrinsic  is not used on=
 the arm platform.

>The current use case expects all 1's or all 0's in a 32-bit lane. If
>anyone tried using it for arbitrary values, vmaxvq_u32 could give a
>different answer than on x86 using _mm_movemask_epi8, so I think
>that's the origin of that comment. But it's still a maintenance hazard
>as is, since x86 wouldn't work for arbitrary values. It seems the path
>forward is to rename this function to vector32_is_any_lane_set(), as
>in the attached (untested on Arm). That would allow each
>implementation to use the most efficient path, whether it's by 8- or
>32-bit lanes. If we someday needed to look at only the high bits, we
>would need a new function that performed the necessary masking on x86.
>
>It's possible this method could shave cycles on Arm in some 8-bit lane
>cases where we don't actually care about the high bit specifically,
>since the movemask equivalent is slow on that platform, but I haven't
>looked yet.

Thank you for your detailed explanation.
Can I do some testing and submit this patch?
IMPORTANT NOTICE: The contents of this email and any attachments are 
confidential and may also be privileged. If you are not the intended recipient, 
please notify the sender immediately and do not disclose the contents to any 
other person, use it for any purpose, or store or copy the information in any 
medium. Thank you.


RE: CRC32C Parallel Computation Optimization on ARM

2023-11-23 Thread Xiang Gao
On Date: Wed, 22 Nov 2023 15:06:18PM -0600, Nathan Bossart wrote:

>> On Date: Fri, 10 Nov 2023 10:36:08AM -0600, Nathan Bossart wrote:
>>>+__attribute__((target("+crc+crypto")))
>>>
>>>I'm not sure we can assume that all compilers will understand this, and I'm
>>>not sure we need it.
>>
>> CFLAGS_CRC is "-march=armv8-a+crc". Generally, if -march is supported,
>> __attribute__ is also supported.

>IMHO we should stick with CFLAGS_CRC for now.  If we want to switch to
>using __attribute__((target("..."))), I think we should do so in a separate
>patch.  We are cautious about checking the availability of an attribute
>before using it (see c.h), and IIUC we'd need to verify that this works for
>all supported compilers that can target ARM before removing CFLAGS_CRC
>here.

I agree.

>> In addition, I am not sure about the source file pg_crc32c_armv8.c, if
>> CFLAGS_CRC and CFLAGS_CRYPTO are needed at the same time, how should it
>> be expressed in the makefile?
>
>pg_crc32c_armv8.o: CFLAGS += ${CFLAGS_CRC} ${CFLAGS_CRYPTO}

It does not work correctly. CFLAGS ='-march=armv8-a+crc, 
-march=armv8-a+crypto', what actually works is '-march=armv8-a+crypto'.

We set a new variable CLAGS_CRC_CRYPTO,In configure.ac,

If test x"$CFLAGS_CRC" != x"" || test x"CFLAGS_CRYPTO" != x""; then
  CLAGS_CRC_CRYPTO = '-march=armv8-a+crc+crypto'
fi

then in makefile,
pg_crc32c_armv8.o: CFLAGS +=${ CLAGS_CRC_CRYPTO }

And same thing in meson.build.  In src/port/meson.build,

replace_funcs_pos = [
  # arm / aarch64
  ['pg_crc32c_armv8', 'USE_ARMV8_CRC32C', 'crc_crypto'],
  ['pg_crc32c_armv8', 'USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK', 'crc_crypto'],
  ['pg_crc32c_armv8_choose', 'USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK', 
'crc_crypto'],
  ['pg_crc32c_sb8', 'USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK'],
]
'pg_crc32c_armv8' also needs 'crc_crypto' when 'USE_ARMV8_CRC32C'.

Looking forward to your feedback, thanks!

IMPORTANT NOTICE: The contents of this email and any attachments are 
confidential and may also be privileged. If you are not the intended recipient, 
please notify the sender immediately and do not disclose the contents to any 
other person, use it for any purpose, or store or copy the information in any 
medium. Thank you.




RE: CRC32C Parallel Computation Optimization on ARM

2023-11-22 Thread Xiang Gao
On Date: Fri, 10 Nov 2023 10:36:08AM -0600, Nathan Bossart wrote:

>-# all versions of pg_crc32c_armv8.o need CFLAGS_CRC
>-pg_crc32c_armv8.o: CFLAGS+=$(CFLAGS_CRC)
>-pg_crc32c_armv8_shlib.o: CFLAGS+=$(CFLAGS_CRC)
>-pg_crc32c_armv8_srv.o: CFLAGS+=$(CFLAGS_CRC)
>
>Why are these lines deleted?
>
>-  ['pg_crc32c_armv8', 'USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK', 'crc'],
>+  ['pg_crc32c_armv8', 'USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK'],
>
>What is the purpose of this change?

Because I added `__attribute__((target("+crc+crypto")))` before the functions 
that require crc extension and crypto extension, so they are removed here.

>+__attribute__((target("+crc+crypto")))
>
>I'm not sure we can assume that all compilers will understand this, and I'm
>not sure we need it.

CFLAGS_CRC is "-march=armv8-a+crc". Generally, if -march is supported, 
__attribute__ is also supported.
In addition, I am not sure about the source file pg_crc32c_armv8.c, if 
CFLAGS_CRC and CFLAGS_CRYPTO are needed at the same time, how should it be 
expressed in the makefile?



IMPORTANT NOTICE: The contents of this email and any attachments are 
confidential and may also be privileged. If you are not the intended recipient, 
please notify the sender immediately and do not disclose the contents to any 
other person, use it for any purpose, or store or copy the information in any 
medium. Thank you.


0007-crc32c-parallel-computation-optimization-on-arm.patch
Description:  0007-crc32c-parallel-computation-optimization-on-arm.patch


Question about the Implementation of vector32_is_highbit_set on ARM

2023-11-07 Thread Xiang Gao
Hi all,
I have some questions about the implementation of vector32_is_highbit_set  on 
arm.
Below is the comment and the implementation for this function.
/*
 * Exactly like vector8_is_highbit_set except for the input type, so it
 * looks at each byte separately.
 *
 * XXX x86 uses the same underlying type for 8-bit, 16-bit, and 32-bit
 * integer elements, but Arm does not, hence the need for a separate
 * function. We could instead adopt the behavior of Arm's vmaxvq_u32(), i.e.
 * check each 32-bit element, but that would require an additional mask
 * operation on x86.
 */
#ifndef USE_NO_SIMD
static inline bool
vector32_is_highbit_set(const Vector32 v)
{
#if defined(USE_NEON)
return vector8_is_highbit_set((Vector8) v);
#else
return vector8_is_highbit_set(v);
#endif
}
#endif  /* ! USE_NO_SIMD */

But I still don't understand why the vmaxvq_u32 intrinsic  is not used on the 
arm platform.
We have used the macro USE_NEON to distinguish different platforms.
In addition, according to the "Arm Neoverse N1 Software Optimization Guide",
The vmaxvq_u32 intrinsic  has half the latency of vmaxvq_u8 and twice the 
bandwidth.
So I think just use vmaxvq_u32 directly.

Any comments or feedback are welcome.
IMPORTANT NOTICE: The contents of this email and any attachments are 
confidential and may also be privileged. If you are not the intended recipient, 
please notify the sender immediately and do not disclose the contents to any 
other person, use it for any purpose, or store or copy the information in any 
medium. Thank you.




RE: CRC32C Parallel Computation Optimization on ARM

2023-11-07 Thread Xiang Gao
On Mon, 6 Nov 2023 13:16:13PM -0600, Nathan Bossart wrote:
>>> The idea is that we don't want to start forcing runtime checks on builds
>>>where we aren't already doing runtime checks.  IOW if the compiler can use
>>>the ARMv8 CRC instructions with the default compiler flags, we should only
>>>use vmull_p64() if it can also be used with the default compiler flags.
>>
>>This is the newest patch, I think the code for which crc algorithm to
>>choose is a bit complicated. Maybe we can just use USE_ARMV8_VMULL only,
>>and do runtime checks on the vmull_p64 instruction at all times. This
>>will not affect the existing builds, because this is a new instruction
>>and new logic. In addition, it  can also reduce the complexity of the
>>code.

>I don't think we can.  AFAICT a runtime check necessitates a function
>pointer or a branch, both of which incurred an impact on performance in my
>tests.  It looks like this latest patch still does the runtime check even
>for the USE_ARMV8_CRC32C case.

I think I understand what you mean, this is the latest patch. Thank you!
IMPORTANT NOTICE: The contents of this email and any attachments are 
confidential and may also be privileged. If you are not the intended recipient, 
please notify the sender immediately and do not disclose the contents to any 
other person, use it for any purpose, or store or copy the information in any 
medium. Thank you.


0006-crc32c-parallel-computation-optimization-on-arm.patch
Description:  0006-crc32c-parallel-computation-optimization-on-arm.patch


RE: CRC32C Parallel Computation Optimization on ARM

2023-11-03 Thread Xiang Gao
On Date: Thu, 2 Nov 2023 09:35:50AM -0500, Nathan Bossart wrote:

>On Thu, Nov 02, 2023 at 06:17:20AM +0000, Xiang Gao wrote:
>> After reading the discussion, I understand that in order to avoid performance
>> regression in some instances, we need to try our best to avoid runtime 
>> checks.
> >I don't know if I understand it correctly.

>The idea is that we don't want to start forcing runtime checks on builds
>where we aren't already doing runtime checks.  IOW if the compiler can use
>the ARMv8 CRC instructions with the default compiler flags, we should only
>use vmull_p64() if it can also be used with the default compiler flags.  I
>suspect this limitation sounds worse than it actually is in practice.  The
>vast majority of the buildfarm uses runtime checks, and at least some of
>the platforms that don't, such as the Apple M-series machines, seem to
>include +crypto by default.

>Of course, if a compiler picks up +crc but not +crypto in its defaults, we
>could lose the vmull_p64() optimization on that platform.  But as noted in
>the other thread, we can revisit if these kinds of hypothetical situations
>become reality.

>> Could you please give me some suggestions about how to refine this patch?

>Of course.  I think we'll ultimately want to independently check for the
>availability of the new instruction like we do for the other sets of
>intrinsics:
>
>   PGAC_ARMV8_VMULL_INTRINSICS([])
>   if test x"$pgac_armv8_vmull_intrinsics" != x"yes"; then
>   PGAC_ARMV8_VMULL_INTRINSICS([-march=armv8-a+crypto])
>   fi
>
>My current thinking is that we'll want to add USE_ARMV8_VMULL and
>USE_ARMV8_VMULL_WITH_RUNTIME_CHECK and use those to decide exactly what to
>compile.  I'll admit I haven't fully thought through every detail yet, but
>I'm cautiously optimistic that we can avoid too much complexity in the
>autoconf/meson scripts.

Thank you so much!
This is the newest patch, I think the code for which crc algorithm to choose is 
a bit complicated. Maybe we can just use USE_ARMV8_VMULL only, and do runtime 
checks on the vmull_p64 instruction at all times. This will not affect the 
existing builds, because this is a new instruction and new logic. In addition, 
it  can also reduce the complexity of the code.
Very much looking forward to receiving your suggestions, thank you!
IMPORTANT NOTICE: The contents of this email and any attachments are 
confidential and may also be privileged. If you are not the intended recipient, 
please notify the sender immediately and do not disclose the contents to any 
other person, use it for any purpose, or store or copy the information in any 
medium. Thank you.


0005-crc32c-parallel-computation-optimization-on-arm.patch
Description:  0005-crc32c-parallel-computation-optimization-on-arm.patch


RE: CRC32C Parallel Computation Optimization on ARM

2023-11-02 Thread Xiang Gao
On Tue, 31 Oct 2023 15:48:21PM -0500, Nathan Bossart wrote:
>> Thanks.  I went ahead and split this prerequisite part out to a separate
>> thread [0] since it's sort-of unrelated to your proposal here.  It's not
>> really a prerequisite, but I do think it will simplify things a bit.

>Per the other thread [0], we should try to avoid the runtime check when
>possible, as it seems to produce a small regression.  This means that if
>the ARMv8 CRC instructions are found with the default compiler flags, we
>can only use vmull_p64() if it can also be used with the default flags.
>Otherwise, we can just do the runtime check.

>[0] https://postgr.es/m/2620794.1698783160%40sss.pgh.pa.us

After reading the discussion, I understand that in order to avoid performance
regression in some instances, we need to try our best to avoid runtime checks.
I don't know if I understand it correctly.
if so, we need to check whether to use the ARM CRC32C and VMULL instruction
directly or with runtime check. There will be many scenarios here and the code
will be more complicated.
Could you please give me some suggestions about how to refine this patch?
Thanks very much!


IMPORTANT NOTICE: The contents of this email and any attachments are 
confidential and may also be privileged. If you are not the intended recipient, 
please notify the sender immediately and do not disclose the contents to any 
other person, use it for any purpose, or store or copy the information in any 
medium. Thank you.




RE: CRC32C Parallel Computation Optimization on ARM

2023-10-27 Thread Xiang Gao
On Thu, 26 Oct, 2023 11:37:52AM -0500, Nathan Bossart wrote:
>> We consider that a runtime check needs to be done in any scenario.
>> Here we only confirm that the compilation can be successful.
> >A runtime check will be done when choosing which algorithm.
> >You can think of us as merging USE_ARMV8_VMULL and 
> >USE_ARMV8_VMULL_WITH_RUNTIME_CHECK into USE_ARMV8_VMULL.

>Oh.  Looking again, I see that we are using a runtime check for ARM in all
>cases with this patch.  If so, maybe we should just remove
>USE_ARV8_CRC32C_WITH_RUNTIME_CHECK in a prerequisite patch (and have
>USE_ARMV8_CRC32C always do the runtime check).  I suspect there are other
>opportunities to simplify things, too.

Yes, I have been removed USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK in this patch.
IMPORTANT NOTICE: The contents of this email and any attachments are 
confidential and may also be privileged. If you are not the intended recipient, 
please notify the sender immediately and do not disclose the contents to any 
other person, use it for any purpose, or store or copy the information in any 
medium. Thank you.


0004-crc32c-parallel-computation-optimization-on-arm.patch
Description:  0004-crc32c-parallel-computation-optimization-on-arm.patch


RE: CRC32C Parallel Computation Optimization on ARM

2023-10-26 Thread Xiang Gao
On  Tue,  24 Oct,  2023 20:45:39PM -0500, Nathan Bossart wrote:
>I tried this.  pg_waldump on 2 million ~8kB records took around 8.1 seconds
>without the patch and around 7.4 seconds with it (an 8% improvement).
>pg_waldump on 1 million ~16kB records took around 3.2 seconds without the
>patch and around 2.4 seconds with it (a 25% improvement).

Could you please provide details on how to generate these 8kB size or 16kB size 
data? Thanks!


IMPORTANT NOTICE: The contents of this email and any attachments are 
confidential and may also be privileged. If you are not the intended recipient, 
please notify the sender immediately and do not disclose the contents to any 
other person, use it for any purpose, or store or copy the information in any 
medium. Thank you.




RE: CRC32C Parallel Computation Optimization on ARM

2023-10-26 Thread Xiang Gao
On Wed,  25 Oct,  2023 at 10:43:25 -0500, Nathan Bossart wrote:
>+pg_crc32c
>+pg_comp_crc32c_with_vmull_armv8(pg_crc32c crc, const void *data, size_t len)

>It looks like most of this function is duplicated from
>pg_comp_crc32c_armv8().  I understand that we probably need a separate
>function because of the runtime check, but perhaps we could create a common
>static inline helper function with a branch for when vmull_p64() can be
>used.  It's callers would then just provide a boolean to indicate which
>branch to take.

I have modified and remade the patch.

>+# Use ARM VMULL if available and ARM CRC32C intrinsic is avaliable too.
>+if test x"$USE_ARMV8_VMULL" = x"" && (test x"$USE_ARMV8_CRC32C" = x"1" || 
>test  x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"1"); then
>+  if test x"$pgac_armv8_vmull_intrinsics" = x"yes"; then
>+USE_ARMV8_VMULL=1
>+  fi
>+fi

>Hm.  I wonder if we need to switch to a runtime check in some cases.  For
>example, what happens if the ARMv8 intrinsics used today are found with the
>default compiler flags, but vmull_p64() is only available if
>-march=armv8-a+crypto is added?  It looks like the precedent is to use a
>runtime check if we need extra CFLAGS to produce code that uses the
>intrinsics.

We consider that a runtime check needs to be done in any scenario.
Here we only confirm that the compilation can be successful.
A runtime check will be done when choosing which algorithm.
You can think of us as merging USE_ARMV8_VMULL and 
USE_ARMV8_VMULL_WITH_RUNTIME_CHECK into USE_ARMV8_VMULL.

>Separately, I wonder if we should just always do runtime checks for the CRC
>stuff whenever we can produce code with the intrinics, regardless of
>whether we need extra CFLAGS.  The check doesn't look terribly expensive,
>and it might allow us to simplify the code a bit (especially now that we
>support a few different architectures).

Yes, I think so. USE_ARMV8_CRC32C only means that the compilation is successful,
and it does not guarantee that it can run correctly on the local machine.
Therefore, a runtime check is required during actual operation.
Based on the principle of minimal changes, we plan to fix it in the next patch.
If the community agrees, we will continue to improve it later, such as merging 
x86 and arm code, etc.
IMPORTANT NOTICE: The contents of this email and any attachments are 
confidential and may also be privileged. If you are not the intended recipient, 
please notify the sender immediately and do not disclose the contents to any 
other person, use it for any purpose, or store or copy the information in any 
medium. Thank you.


0003-crc32c-parallel-computation-optimization-on-arm.patch
Description:  0003-crc32c-parallel-computation-optimization-on-arm.patch


RE: CRC32C Parallel Computation Optimization on ARM

2023-10-24 Thread Xiang Gao
Thanks for your suggestion, this is the modified patch and two test files.

-Original Message-
From: Michael Paquier 
Sent: Friday, October 20, 2023 4:19 PM
To: Xiang Gao 
Cc: pgsql-hackers@lists.postgresql.org
Subject: Re: CRC32C Parallel Computation Optimization on ARM

On Fri, Oct 20, 2023 at 07:08:58AM +, Xiang Gao wrote:
> This patch uses a parallel computing optimization algorithm to improve
> crc32c computing performance on ARM. The algorithm comes from Intel
> whitepaper:
> crc-iscsi-polynomial-crc32-instruction-paper. Input data is divided
> into three equal-sized blocks.Three parallel blocks (crc0, crc1,
> crc2) for 1024 Bytes.One Block: 42(BLK_LENGTH) * 8(step length:
> crc32c_u64) bytes
>
> Crc32c unitest:
> https://gist.github.com/gaoxyt/138fd53ca1eead8102eeb9204067f7e4
> Crc32c benchmark:
> https://gist.github.com/gaoxyt/4506c10fc06b3501445e32c4257113e9
> It gets ~2x speedup compared to linear Arm crc32c instructions.

Interesting.  Could you attached to this thread the test files you used and the 
results obtained please?  If this data gets deleted from github, then it would 
not be possible to refer back to what you did at the related benchmark results.

Note that your patch is forgetting about meson; it just patches ./configure.
--
Michael
IMPORTANT NOTICE: The contents of this email and any attachments are 
confidential and may also be privileged. If you are not the intended recipient, 
please notify the sender immediately and do not disclose the contents to any 
other person, use it for any purpose, or store or copy the information in any 
medium. Thank you.


0002-crc32c-parallel-computation-optimization-on-arm.patch
Description:  0002-crc32c-parallel-computation-optimization-on-arm.patch
/*
* compile postgres first with different crc32c implementation(use arm vmull_p64 
or not)
* we should comment out some codes about elog in pg_crc32c_armv8_choose.c to 
compile correctly and simply.
* $ gcc   -I ../postgres/_install/include -I 
../postgres/_install/include/server main.c \
* -L ../postgres/build/src/port -l pgport_srv -O2  -o main

* this test was run on Neoverse-N1
* $ ./main.no_vmull
* data size is 512 bytes, and compute crc cost 139 us totally, 0.135742 us per 
loop
* data size is 4096 bytes, and compute crc cost 1061 us totally, 1.036133 us 
per loop

* $ ./main.use_vmull
* data size is 512 bytes, and compute crc cost 101 us totally, 0.098633 us per 
loop
* data size is 4096 bytes, and compute crc cost 540 us totally, 0.527344 us per 
loop

* We can see that the cost of computing crc32c without vmull_p64 is about two 
times than
* the cost that using vmull_p64 when data size is large. and the cost is almost 
same when 
* data size is small.
*/

#include 
#include 
#include 
#include 
#include 

#include "c.h"
#include "port/pg_crc32c.h"

uint64_t
GetTickCount()
{
struct timeval tv;

gettimeofday(, NULL);
return tv.tv_sec * 100 + tv.tv_usec;
}

int
main()
{
#define CASE_CNT 2
uint32_ttest_size[CASE_CNT] = {512, 1024 * 4};

for (int case_cnt = 0; case_cnt < CASE_CNT; case_cnt++)
{
uint8_t*buf = (uint8_t *) malloc(test_size[case_cnt] * 
sizeof(uint8_t));

srand(0);
for (int i = 0; i < test_size[case_cnt]; i++)
{
*(buf + i) = (uint8_t) (rand() % 256u);
}

static const uint32_t kLoop = 1024;
uint32_tcrc = 0;
uint64_tstart = GetTickCount();

INIT_CRC32C(crc);
for (int i = 0; i < kLoop; i++)
{
COMP_CRC32C(crc, buf, test_size[case_cnt]);
}
FIN_CRC32C(crc);
uint64_tstop = GetTickCount();

printf("data size is %d bytes, and compute crc cost %ld us 
totally, %f us per loop\n", test_size[case_cnt], stop - start, (double) (stop - 
start) / kLoop);

free(buf);
}
#undef CASE_CNT
return 0;
}

/***
* We use libcheck(https://github.com/libcheck/check) as unit testing framework.

* compile postgres first with different crc32c implementation(use arm crc32c
* and vmull intrisics or not). we should comment out some codes about elog in
* pg_crc32c_armv8_choose.c to compile correctly and simply.
* $ gcc -I ../postgres/_install/include -I ../postgres/_install/include/server \
  crc32c_unittest.c  -L ../postgres/build/src/port -l pgport_srv  -L 
/usr/local/lib \
  -lcheck  -o crc32c_unittest

* this test was run on Neoverse-N1
* $ ./crc32c_

CRC32C Parallel Computation Optimization on ARM

2023-10-20 Thread Xiang Gao
Hi all

This patch uses a parallel computing optimization algorithm to improve crc32c 
computing performance on ARM. The algorithm comes from Intel whitepaper: 
crc-iscsi-polynomial-crc32-instruction-paper. Input data is divided into three 
equal-sized blocks.Three parallel blocks (crc0, crc1, crc2) for 1024 Bytes.One 
Block: 42(BLK_LENGTH) * 8(step length: crc32c_u64) bytes

Crc32c unitest: https://gist.github.com/gaoxyt/138fd53ca1eead8102eeb9204067f7e4
Crc32c benchmark: 
https://gist.github.com/gaoxyt/4506c10fc06b3501445e32c4257113e9
It gets ~2x speedup compared to linear Arm crc32c instructions.

I'll create a CommitFests ticket for this submission.
Any comments or feedback are welcome.

IMPORTANT NOTICE: The contents of this email and any attachments are 
confidential and may also be privileged. If you are not the intended recipient, 
please notify the sender immediately and do not disclose the contents to any 
other person, use it for any purpose, or store or copy the information in any 
medium. Thank you.


0001-crc32c-parallel-computation-optimization-on-arm.patch
Description:  0001-crc32c-parallel-computation-optimization-on-arm.patch


Optimize Arm64 crc32 implementation in PostgreSQL

2023-08-21 Thread Xiang Gao
Hi all,

Currently PostgreSQL has three different variants of a 32-bit CRC calculation: 
CRC-32C, CRC-32(Ethernet polynomial),
and a legacy CRC-32 version that uses the lookup table. Some ARMv8 (AArch64) 
CPUs implement the CRC32 extension which
is equivalent with CRC-32(Ethernet polynomial), so they can also benefit from 
hardware acceleration.

Can I propose a patch to optimize crc32 calculation with Arm64 specific 
instructions?

Any comments or feedback are welcome.
IMPORTANT NOTICE: The contents of this email and any attachments are 
confidential and may also be privileged. If you are not the intended recipient, 
please notify the sender immediately and do not disclose the contents to any 
other person, use it for any purpose, or store or copy the information in any 
medium. Thank you.