[jira] [Commented] (HADOOP-11660) Add support for hardware crc on ARM aarch64 architecture
[ https://issues.apache.org/jira/browse/HADOOP-11660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386859#comment-14386859 ] Edward Nevill commented on HADOOP-11660: Hi, I have revised the patch to include the changes requested above. I have also updated test_bulk_crc32.c so it prints out the times for 16384 bytes @ 512 bytes per checksum X 100 iterations for both the Castagnoli and Zlib polynomials. The following are the results I get for x86_64 before and after. I have done 5 runs of each. BEFORE {code} [ed@mylittlepony hadoop]$ ./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32 CRC 16384 bytes @ 512 bytes per checksum X 100 iterations = 1.12 CRC 16384 bytes @ 512 bytes per checksum X 100 iterations = 13.8 ./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32: SUCCESS. [ed@mylittlepony hadoop]$ ./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32 CRC 16384 bytes @ 512 bytes per checksum X 100 iterations = 1.12 CRC 16384 bytes @ 512 bytes per checksum X 100 iterations = 13.84 ./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32: SUCCESS. [ed@mylittlepony hadoop]$ ./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32 CRC 16384 bytes @ 512 bytes per checksum X 100 iterations = 1.1 CRC 16384 bytes @ 512 bytes per checksum X 100 iterations = 13.85 ./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32: SUCCESS. [ed@mylittlepony hadoop]$ ./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32 CRC 16384 bytes @ 512 bytes per checksum X 100 iterations = 1.12 CRC 16384 bytes @ 512 bytes per checksum X 100 iterations = 13.94 ./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32: SUCCESS. [ed@mylittlepony hadoop]$ ./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32 CRC 16384 bytes @ 512 bytes per checksum X 100 iterations = 1.12 CRC 16384 bytes @ 512 bytes per checksum X 100 iterations = 13.81 ./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32: SUCCESS. {code} AFTER {code} [ed@mylittlepony hadoop]$ ./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32 CRC 16384 bytes @ 512 bytes per checksum X 100 iterations = 1.11 CRC 16384 bytes @ 512 bytes per checksum X 100 iterations = 13.92 ./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32: SUCCESS. [ed@mylittlepony hadoop]$ ./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32 CRC 16384 bytes @ 512 bytes per checksum X 100 iterations = 1.12 CRC 16384 bytes @ 512 bytes per checksum X 100 iterations = 13.99 ./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32: SUCCESS. [ed@mylittlepony hadoop]$ ./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32 CRC 16384 bytes @ 512 bytes per checksum X 100 iterations = 1.11 CRC 16384 bytes @ 512 bytes per checksum X 100 iterations = 13.9 ./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32: SUCCESS. [ed@mylittlepony hadoop]$ ./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32 CRC 16384 bytes @ 512 bytes per checksum X 100 iterations = 1.12 CRC 16384 bytes @ 512 bytes per checksum X 100 iterations = 13.92 ./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32: SUCCESS. [ed@mylittlepony hadoop]$ ./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32 CRC 16384 bytes @ 512 bytes per checksum X 100 iterations = 1.12 CRC 16384 bytes @ 512 bytes per checksum X 100 iterations = 13.92 ./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32: SUCCESS. {code} Loking at the average time over 5 runs gives BEFORE Castagnoli average = 1.116 sec Zlib average = 13.848 sec AFTER Castagnoli average = 1.116 Zlib average = 13.93 So the performance for the Castagnoli polynomial is the same. For the Zlib poynomial there seems to be a performance degradation of 0.6%. This may be due to experimental error, however this is unaccelerated in any case on x86 because it is not supported on x86 HW and is not used for HDFS. For comparison, on aarch64 partner HW I get the following averages Castagnoli = 3.586 Zlib = 3.580 Many thanks for you help with this, Ed. Add support for hardware crc on ARM aarch64 architecture Key: HADOOP-11660 URL: https://issues.apache.org/jira/browse/HADOOP-11660 Project: Hadoop Common Issue Type: Improvement Components: native Affects Versions: 3.0.0 Environment: ARM aarch64 development platform Reporter: Edward Nevill Assignee: Edward Nevill Priority: Minor Labels: performance Attachments: jira-11660.patch Original Estimate: 48h Remaining Estimate:
[jira] [Commented] (HADOOP-11660) Add support for hardware crc on ARM aarch64 architecture
[ https://issues.apache.org/jira/browse/HADOOP-11660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386921#comment-14386921 ] Hadoop QA commented on HADOOP-11660: {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12708163/jira-11660.patch against trunk revision ae3e8c6. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-common-project/hadoop-common. Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/6025//testReport/ Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/6025//console This message is automatically generated. Add support for hardware crc on ARM aarch64 architecture Key: HADOOP-11660 URL: https://issues.apache.org/jira/browse/HADOOP-11660 Project: Hadoop Common Issue Type: Improvement Components: native Affects Versions: 3.0.0 Environment: ARM aarch64 development platform Reporter: Edward Nevill Assignee: Edward Nevill Priority: Minor Labels: performance Attachments: jira-11660.patch Original Estimate: 48h Remaining Estimate: 48h This patch adds support for hardware crc for ARM's new 64 bit architecture The patch is completely conditionalized on __aarch64__ I have only added support for the non pipelined version as I benchmarked the pipelined version on aarch64 and it showed no performance improvement. The aarch64 version supports both Castagnoli and Zlib CRCs as both of these are supported on ARM aarch64 hardwre. To benchmark this I modified the test_bulk_crc32 test to print out the time taken to CRC a 1MB dataset 1000 times. Before: CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55 After: CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57 So this represents a 5X performance improvement on raw CRC calculation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HADOOP-11660) Add support for hardware crc on ARM aarch64 architecture
[ https://issues.apache.org/jira/browse/HADOOP-11660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14382721#comment-14382721 ] Colin Patrick McCabe commented on HADOOP-11660: --- Thank you for your patience, [~enevill]. I think this is almost ready to commit. {code} 168 ELSEIF (CMAKE_SYSTEM_PROCESSOR STREQUAL aarch64) 169 set(BULK_CRC_ARCH_SOURCE_FIlE ${D}/util/bulk_crc32_aarch64.c) 170 ENDIF() {code} Can you put in a {{MESSAGE}} here that explains that the architecture is unsupported in the ELSE case? We certainly don't want to be losing hardware acceleration without being aware of it. {{bulk_crc32_aarch64.c}}: you should include {{stdint.h}} here for {{uint8_t}}, etc. Even though some other header is probably pulling it in now by accident. {{bulk_crc32_x86.c}}: I would really prefer not to wrap this in a giant {{#if defined(__GNUC__) !defined(__FreeBSD__)}}, especially since we're not wrapping the ARM version like that. If people want this to be compiler and os-specific, it would be better to do it at the CMake level. I would say just take that out and let people fix it if it becomes a problem for them. Can you post before / after performance numbers for x86_64? Maybe you can instrument test_bulk_crc32.c to produce those numbers. It looks like when this was done previously, the test code was not checked in. See: https://issues.apache.org/jira/browse/HADOOP-7446?focusedCommentId=13084519page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13084519 I'm sorry to dump another task on you, but my co-workers will kill me if I regress checksum performance. Thanks again for working on this. As soon as we verify that we haven't regressed perf, and made those minor changes, we should be good to go. Add support for hardware crc on ARM aarch64 architecture Key: HADOOP-11660 URL: https://issues.apache.org/jira/browse/HADOOP-11660 Project: Hadoop Common Issue Type: Improvement Components: native Affects Versions: 3.0.0 Environment: ARM aarch64 development platform Reporter: Edward Nevill Assignee: Edward Nevill Priority: Minor Labels: performance Attachments: jira-11660.patch Original Estimate: 48h Remaining Estimate: 48h This patch adds support for hardware crc for ARM's new 64 bit architecture The patch is completely conditionalized on __aarch64__ I have only added support for the non pipelined version as I benchmarked the pipelined version on aarch64 and it showed no performance improvement. The aarch64 version supports both Castagnoli and Zlib CRCs as both of these are supported on ARM aarch64 hardwre. To benchmark this I modified the test_bulk_crc32 test to print out the time taken to CRC a 1MB dataset 1000 times. Before: CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55 After: CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57 So this represents a 5X performance improvement on raw CRC calculation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HADOOP-11660) Add support for hardware crc on ARM aarch64 architecture
[ https://issues.apache.org/jira/browse/HADOOP-11660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14369568#comment-14369568 ] Edward Nevill commented on HADOOP-11660: Hi, There does not seem to be a variable ARCH. The closest match seems to be CMAKE_SYSTEM_PROCESSOR. I have added a condition in the CMakeLists.txt which does {code} IF (CMAKE_SYSTEM_PROCESSOR MATCHES ^i.86$ OR CMAKE_SYSTEM_PROCESSOR STREQUAL x86_64 OR CMAKE_SYSTEM_PROCESSOR STREQUAL amd64) set(BULK_CRC_ARCH_SOURCE_FIlE ${D}/util/bulk_crc32_x86.c) ELSEIF (CMAKE_SYSTEM_PROCESSOR STREQUAL aarch64) set(BULK_CRC_ARCH_SOURCE_FIlE ${D}/util/bulk_crc32_aarch64.c) ENDIF() {code} And then uses BULK_CRC_ARCH_SOURCE_FIlE in the 2 cases where it is included later. Thanks for your help and apologies for the amount of time this patch is taking, Ed. Add support for hardware crc on ARM aarch64 architecture Key: HADOOP-11660 URL: https://issues.apache.org/jira/browse/HADOOP-11660 Project: Hadoop Common Issue Type: Improvement Components: native Affects Versions: 3.0.0 Environment: ARM aarch64 development platform Reporter: Edward Nevill Assignee: Edward Nevill Priority: Minor Labels: performance Attachments: jira-11660.patch Original Estimate: 48h Remaining Estimate: 48h This patch adds support for hardware crc for ARM's new 64 bit architecture The patch is completely conditionalized on __aarch64__ I have only added support for the non pipelined version as I benchmarked the pipelined version on aarch64 and it showed no performance improvement. The aarch64 version supports both Castagnoli and Zlib CRCs as both of these are supported on ARM aarch64 hardwre. To benchmark this I modified the test_bulk_crc32 test to print out the time taken to CRC a 1MB dataset 1000 times. Before: CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55 After: CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57 So this represents a 5X performance improvement on raw CRC calculation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HADOOP-11660) Add support for hardware crc on ARM aarch64 architecture
[ https://issues.apache.org/jira/browse/HADOOP-11660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14369661#comment-14369661 ] Hadoop QA commented on HADOOP-11660: {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12705611/jira-11660.patch against trunk revision 522c897. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-common-project/hadoop-common. Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/5972//testReport/ Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/5972//console This message is automatically generated. Add support for hardware crc on ARM aarch64 architecture Key: HADOOP-11660 URL: https://issues.apache.org/jira/browse/HADOOP-11660 Project: Hadoop Common Issue Type: Improvement Components: native Affects Versions: 3.0.0 Environment: ARM aarch64 development platform Reporter: Edward Nevill Assignee: Edward Nevill Priority: Minor Labels: performance Attachments: jira-11660.patch Original Estimate: 48h Remaining Estimate: 48h This patch adds support for hardware crc for ARM's new 64 bit architecture The patch is completely conditionalized on __aarch64__ I have only added support for the non pipelined version as I benchmarked the pipelined version on aarch64 and it showed no performance improvement. The aarch64 version supports both Castagnoli and Zlib CRCs as both of these are supported on ARM aarch64 hardwre. To benchmark this I modified the test_bulk_crc32 test to print out the time taken to CRC a 1MB dataset 1000 times. Before: CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55 After: CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57 So this represents a 5X performance improvement on raw CRC calculation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HADOOP-11660) Add support for hardware crc on ARM aarch64 architecture
[ https://issues.apache.org/jira/browse/HADOOP-11660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14368231#comment-14368231 ] Colin Patrick McCabe commented on HADOOP-11660: --- sorry that I have been slow to review this. It is difficult code to review. If you want an example of cmake conditional compilation, look at ./hadoop-hdfs-project/hadoop-hdfs/src/CMakeLists.txt and its {{if (WIN32)}}. In your case you would be checking on the built-in cmake variables, something like ARCH? It should be in the cmake docs Add support for hardware crc on ARM aarch64 architecture Key: HADOOP-11660 URL: https://issues.apache.org/jira/browse/HADOOP-11660 Project: Hadoop Common Issue Type: Improvement Components: native Affects Versions: 3.0.0 Environment: ARM aarch64 development platform Reporter: Edward Nevill Assignee: Edward Nevill Priority: Minor Labels: performance Attachments: jira-11660.patch Original Estimate: 48h Remaining Estimate: 48h This patch adds support for hardware crc for ARM's new 64 bit architecture The patch is completely conditionalized on __aarch64__ I have only added support for the non pipelined version as I benchmarked the pipelined version on aarch64 and it showed no performance improvement. The aarch64 version supports both Castagnoli and Zlib CRCs as both of these are supported on ARM aarch64 hardwre. To benchmark this I modified the test_bulk_crc32 test to print out the time taken to CRC a 1MB dataset 1000 times. Before: CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55 After: CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57 So this represents a 5X performance improvement on raw CRC calculation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HADOOP-11660) Add support for hardware crc on ARM aarch64 architecture
[ https://issues.apache.org/jira/browse/HADOOP-11660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14365399#comment-14365399 ] Edward Nevill commented on HADOOP-11660: Hi Colin, My understanding was that the issue was the no of #ifdefs in the file so I reworked it to reduce it to a single #ifdef X86 ... #elif AARCH64 #endif. I have prepared a new patch which separates x86 and aarch64 into two files bulk_crc32_x86.c and bulk_crc32_aarch64.c. bulk_crc32.c now contains just the fallback software implementation. The cmakefile includes both bulk_crc32_x86.c and bulk_crc32_aarch64.c and the two files are then protected by bulk_crc32_x86.c {code} #ifdef X86 #endif {code} bulk_crc32_aarch64.c {code} #ifdef AARCH64 ... #endif {code} I am not clear how to make cmake select correctly the _x86 or _aarch64 version. If you could help me with this then we can get rid of the final conditionalisation above. I have also added back in support for the optimised 32 bit x86 implemetation. All the best, Ed. Add support for hardware crc on ARM aarch64 architecture Key: HADOOP-11660 URL: https://issues.apache.org/jira/browse/HADOOP-11660 Project: Hadoop Common Issue Type: Improvement Components: native Affects Versions: 3.0.0 Environment: ARM aarch64 development platform Reporter: Edward Nevill Assignee: Edward Nevill Priority: Minor Labels: performance Attachments: jira-11660.patch Original Estimate: 48h Remaining Estimate: 48h This patch adds support for hardware crc for ARM's new 64 bit architecture The patch is completely conditionalized on __aarch64__ I have only added support for the non pipelined version as I benchmarked the pipelined version on aarch64 and it showed no performance improvement. The aarch64 version supports both Castagnoli and Zlib CRCs as both of these are supported on ARM aarch64 hardwre. To benchmark this I modified the test_bulk_crc32 test to print out the time taken to CRC a 1MB dataset 1000 times. Before: CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55 After: CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57 So this represents a 5X performance improvement on raw CRC calculation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HADOOP-11660) Add support for hardware crc on ARM aarch64 architecture
[ https://issues.apache.org/jira/browse/HADOOP-11660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14365521#comment-14365521 ] Hadoop QA commented on HADOOP-11660: {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12705102/jira-11660.patch against trunk revision 7179f94. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-common-project/hadoop-common. Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/5958//testReport/ Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/5958//console This message is automatically generated. Add support for hardware crc on ARM aarch64 architecture Key: HADOOP-11660 URL: https://issues.apache.org/jira/browse/HADOOP-11660 Project: Hadoop Common Issue Type: Improvement Components: native Affects Versions: 3.0.0 Environment: ARM aarch64 development platform Reporter: Edward Nevill Assignee: Edward Nevill Priority: Minor Labels: performance Attachments: jira-11660.patch Original Estimate: 48h Remaining Estimate: 48h This patch adds support for hardware crc for ARM's new 64 bit architecture The patch is completely conditionalized on __aarch64__ I have only added support for the non pipelined version as I benchmarked the pipelined version on aarch64 and it showed no performance improvement. The aarch64 version supports both Castagnoli and Zlib CRCs as both of these are supported on ARM aarch64 hardwre. To benchmark this I modified the test_bulk_crc32 test to print out the time taken to CRC a 1MB dataset 1000 times. Before: CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55 After: CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57 So this represents a 5X performance improvement on raw CRC calculation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HADOOP-11660) Add support for hardware crc on ARM aarch64 architecture
[ https://issues.apache.org/jira/browse/HADOOP-11660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14361313#comment-14361313 ] Colin Patrick McCabe commented on HADOOP-11660: --- Hi Edward, What's the rationale for not splitting the file into multiple files per architecture as I asked in this comment: https://issues.apache.org/jira/browse/HADOOP-11660?focusedCommentId=14350788page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14350788 ? Also, I would prefer not to drop support for architectures, even if they are 32-bit x86. Add support for hardware crc on ARM aarch64 architecture Key: HADOOP-11660 URL: https://issues.apache.org/jira/browse/HADOOP-11660 Project: Hadoop Common Issue Type: Improvement Components: native Affects Versions: 3.0.0 Environment: ARM aarch64 development platform Reporter: Edward Nevill Assignee: Edward Nevill Priority: Minor Labels: performance Attachments: bulk_crc32.c, jira-11660-new.patch, jira-11660.patch Original Estimate: 48h Remaining Estimate: 48h This patch adds support for hardware crc for ARM's new 64 bit architecture The patch is completely conditionalized on __aarch64__ I have only added support for the non pipelined version as I benchmarked the pipelined version on aarch64 and it showed no performance improvement. The aarch64 version supports both Castagnoli and Zlib CRCs as both of these are supported on ARM aarch64 hardwre. To benchmark this I modified the test_bulk_crc32 test to print out the time taken to CRC a 1MB dataset 1000 times. Before: CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55 After: CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57 So this represents a 5X performance improvement on raw CRC calculation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HADOOP-11660) Add support for hardware crc on ARM aarch64 architecture
[ https://issues.apache.org/jira/browse/HADOOP-11660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14355805#comment-14355805 ] Edward Nevill commented on HADOOP-11660: Hi, I have added a new patch which reworks the code and (hopefully) makes it more readable. I have also attached the resultant bulk_crc32.c as the patch is growing rather large and difficult to understand. I have made the following changes 1) I have removed conditionalisaton on USE_PIPELINED. I really do not understand the condition that reads {code} #if (!defined(__FreeBSD__) !defined(WINDOWS)) #define USE_PIPELINED #endif {code} USE_PIPELINED is a feature of the architecture (IE. is there any benefit to pipelining on your particular arch - at the moment x86 or aarch64), not the OS. So I really dont understand this ifdef. If it is the case that some arch does not support pipelining then the arch must still implement a pipelined version, but this can be a basic shell like I have done for the sb8 versions. {code} static void pipelined_crc32c_sb8(uint32_t *crc1, uint32_t *crc2, uint32_t *crc3, const uint8_t *p_buf, size_t block_size, int num_blocks) { assert(num_blocks = 1 num_blocks =3 invalid num_blocks); *crc1 = crc32c_sb8(*crc1, p_buf, block_size); if (num_blocks = 2) *crc2 = crc32c_sb8(*crc2, p_buf+block_size, block_size); if (num_blocks = 3) *crc3 = crc32c_sb8(*crc3, p_buf+2*block_size, block_size); } {code} So, in this case, if there is no support for pipelining, just call the non pipelined version 3 times with each buffer in turn. Doing this make the code a whole lot easier to read IMHO 2) The existing code initialised a variable {code}cached_cpu_supports_crc32{code} and then did {code} if (likely(cached_cpu_supports_crc32)) { crc_update_func = crc32c_hardware; {code} The problem with this is it creates a reference to crc32c_hardware, even when there is no hardware support. This means it is necessary to create dummy crc32_hardware function when we are only doing SW crc. IE. {code} static uint32_t crc32c_hardware(uint32_t crc, const uint8_t* data, size_t length) { // never called! assert(0 hardware crc called on an unsupported platform); return 0; } {code} Instead, what I do is initialise static function pointers in the initialisation function. So for example, for aarch64 I do {code} void __attribute__ ((constructor)) init_cpu_support_flag(void) { unsigned long auxv = getauxval(AT_HWCAP); if (auxv HWCAP_CRC32) { pipelined_crc32c_func = pipelined_crc32c; pipelined_crc32_zlib_func = pipelined_crc32_zlib; } } {code} By default these pointers are statically initialised to the SW routines. IE. {code} // Satically initialise the function pointers to the software versions // If HW is available these will be overridden by the initialisation functions static crc_pipelined_func_t pipelined_crc32c_func = pipelined_crc32c_sb8; static crc_pipelined_func_t pipelined_crc32_zlib_func = pipelined_crc32_zlib_sb8; {code} So if nobody initialises them they fall back to SW. 3) I have removed the implementation for CRC acceleration on 32 bit x86. 32 bit x86 now falls back to SW. Does anyone really build hadoop for 32 bit? If yes, I will gladly put it back. The result of these changes is to vastly reduce the amount of conditionalisation in the code. The conditionalisation is now reduced to {code} #if defined(__amd64__) defined(__GNUC__) !defined(__FreeBSD__) . #elif defined(__aarch64__) // Start ARM 64 architecture . #endif {code} I hope the above changes will make the code a whole lot more readable, All the best, Ed. Add support for hardware crc on ARM aarch64 architecture Key: HADOOP-11660 URL: https://issues.apache.org/jira/browse/HADOOP-11660 Project: Hadoop Common Issue Type: Improvement Components: native Affects Versions: 3.0.0 Environment: ARM aarch64 development platform Reporter: Edward Nevill Assignee: Edward Nevill Priority: Minor Labels: performance Attachments: bulk_crc32.c, jira-11660-new.patch, jira-11660.patch Original Estimate: 48h Remaining Estimate: 48h This patch adds support for hardware crc for ARM's new 64 bit architecture The patch is completely conditionalized on __aarch64__ I have only added support for the non pipelined version as I benchmarked the pipelined version on aarch64 and it showed no performance improvement. The aarch64 version supports both Castagnoli and Zlib CRCs as both of these are supported on ARM aarch64 hardwre. To benchmark this I modified the test_bulk_crc32 test to print out the time taken to CRC a 1MB dataset 1000 times. Before: CRC 1048576 bytes @ 512 bytes per
[jira] [Commented] (HADOOP-11660) Add support for hardware crc on ARM aarch64 architecture
[ https://issues.apache.org/jira/browse/HADOOP-11660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14355816#comment-14355816 ] Hadoop QA commented on HADOOP-11660: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12703755/bulk_crc32.c against trunk revision 64eb068. {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/5910//console This message is automatically generated. Add support for hardware crc on ARM aarch64 architecture Key: HADOOP-11660 URL: https://issues.apache.org/jira/browse/HADOOP-11660 Project: Hadoop Common Issue Type: Improvement Components: native Affects Versions: 3.0.0 Environment: ARM aarch64 development platform Reporter: Edward Nevill Assignee: Edward Nevill Priority: Minor Labels: performance Attachments: bulk_crc32.c, jira-11660-new.patch, jira-11660.patch Original Estimate: 48h Remaining Estimate: 48h This patch adds support for hardware crc for ARM's new 64 bit architecture The patch is completely conditionalized on __aarch64__ I have only added support for the non pipelined version as I benchmarked the pipelined version on aarch64 and it showed no performance improvement. The aarch64 version supports both Castagnoli and Zlib CRCs as both of these are supported on ARM aarch64 hardwre. To benchmark this I modified the test_bulk_crc32 test to print out the time taken to CRC a 1MB dataset 1000 times. Before: CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55 After: CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57 So this represents a 5X performance improvement on raw CRC calculation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HADOOP-11660) Add support for hardware crc on ARM aarch64 architecture
[ https://issues.apache.org/jira/browse/HADOOP-11660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14350212#comment-14350212 ] Hadoop QA commented on HADOOP-11660: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12703031/jira-11660.patch against trunk revision 95bfd08. {color:red}-1 patch{color}. Trunk compilation may be broken. Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/5867//console This message is automatically generated. Add support for hardware crc on ARM aarch64 architecture Key: HADOOP-11660 URL: https://issues.apache.org/jira/browse/HADOOP-11660 Project: Hadoop Common Issue Type: Improvement Components: native Affects Versions: 3.0.0 Environment: ARM aarch64 development platform Reporter: Edward Nevill Assignee: Edward Nevill Priority: Minor Labels: performance Attachments: jira-11660.patch Original Estimate: 48h Remaining Estimate: 48h This patch adds support for hardware crc for ARM's new 64 bit architecture The patch is completely conditionalized on __aarch64__ I have only added support for the non pipelined version as I benchmarked the pipelined version on aarch64 and it showed no performance improvement. The aarch64 version supports both Castagnoli and Zlib CRCs as both of these are supported on ARM aarch64 hardwre. To benchmark this I modified the test_bulk_crc32 test to print out the time taken to CRC a 1MB dataset 1000 times. Before: CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55 After: CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57 So this represents a 5X performance improvement on raw CRC calculation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HADOOP-11660) Add support for hardware crc on ARM aarch64 architecture
[ https://issues.apache.org/jira/browse/HADOOP-11660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14350273#comment-14350273 ] Hadoop QA commented on HADOOP-11660: {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12703031/jira-11660.patch against trunk revision 95bfd08. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-common-project/hadoop-common. Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/5869//testReport/ Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/5869//console This message is automatically generated. Add support for hardware crc on ARM aarch64 architecture Key: HADOOP-11660 URL: https://issues.apache.org/jira/browse/HADOOP-11660 Project: Hadoop Common Issue Type: Improvement Components: native Affects Versions: 3.0.0 Environment: ARM aarch64 development platform Reporter: Edward Nevill Assignee: Edward Nevill Priority: Minor Labels: performance Attachments: jira-11660.patch Original Estimate: 48h Remaining Estimate: 48h This patch adds support for hardware crc for ARM's new 64 bit architecture The patch is completely conditionalized on __aarch64__ I have only added support for the non pipelined version as I benchmarked the pipelined version on aarch64 and it showed no performance improvement. The aarch64 version supports both Castagnoli and Zlib CRCs as both of these are supported on ARM aarch64 hardwre. To benchmark this I modified the test_bulk_crc32 test to print out the time taken to CRC a 1MB dataset 1000 times. Before: CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55 After: CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57 So this represents a 5X performance improvement on raw CRC calculation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HADOOP-11660) Add support for hardware crc on ARM aarch64 architecture
[ https://issues.apache.org/jira/browse/HADOOP-11660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14350788#comment-14350788 ] Colin Patrick McCabe commented on HADOOP-11660: --- Thanks for working on this, [~enevill]. I think the #ifdefs are getting kind of out of control in this file. This isn't really your fault... it's a mess even now. But adding another dimension of #ifdefing just makes my brain hurt. How about we split the hardware-specific code for x86 into {{bulk_crc32_x86.c}}, and the ARM-specific code off into {{bulk_crc32_aarch64.c}}. CMake will compile in the relevant file based on the architecture. I would say both of the platform specific files should implement the same functions, so that the generic part can be compiled against either. I suppose we'd still have the USE_PIPELINED ifdefs, but one dimension of ifdefs is possible to understand (for me, at least). Do you think that's possible? Add support for hardware crc on ARM aarch64 architecture Key: HADOOP-11660 URL: https://issues.apache.org/jira/browse/HADOOP-11660 Project: Hadoop Common Issue Type: Improvement Components: native Affects Versions: 3.0.0 Environment: ARM aarch64 development platform Reporter: Edward Nevill Assignee: Edward Nevill Priority: Minor Labels: performance Attachments: jira-11660.patch Original Estimate: 48h Remaining Estimate: 48h This patch adds support for hardware crc for ARM's new 64 bit architecture The patch is completely conditionalized on __aarch64__ I have only added support for the non pipelined version as I benchmarked the pipelined version on aarch64 and it showed no performance improvement. The aarch64 version supports both Castagnoli and Zlib CRCs as both of these are supported on ARM aarch64 hardwre. To benchmark this I modified the test_bulk_crc32 test to print out the time taken to CRC a 1MB dataset 1000 times. Before: CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55 After: CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57 So this represents a 5X performance improvement on raw CRC calculation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HADOOP-11660) Add support for hardware crc on ARM aarch64 architecture
[ https://issues.apache.org/jira/browse/HADOOP-11660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14348013#comment-14348013 ] Colin Patrick McCabe commented on HADOOP-11660: --- OK. Thanks, Edward. Add support for hardware crc on ARM aarch64 architecture Key: HADOOP-11660 URL: https://issues.apache.org/jira/browse/HADOOP-11660 Project: Hadoop Common Issue Type: Improvement Components: native Affects Versions: 3.0.0 Environment: ARM aarch64 development platform Reporter: Edward Nevill Assignee: Edward Nevill Priority: Minor Labels: performance Original Estimate: 48h Remaining Estimate: 48h This patch adds support for hardware crc for ARM's new 64 bit architecture The patch is completely conditionalized on __aarch64__ I have only added support for the non pipelined version as I benchmarked the pipelined version on aarch64 and it showed no performance improvement. The aarch64 version supports both Castagnoli and Zlib CRCs as both of these are supported on ARM aarch64 hardwre. To benchmark this I modified the test_bulk_crc32 test to print out the time taken to CRC a 1MB dataset 1000 times. Before: CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55 After: CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57 So this represents a 5X performance improvement on raw CRC calculation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HADOOP-11660) Add support for hardware crc on ARM aarch64 architecture
[ https://issues.apache.org/jira/browse/HADOOP-11660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14346793#comment-14346793 ] Edward Nevill commented on HADOOP-11660: Hi, I did some additional benchmarking on pipelining on aarch64 and it would appear that I was mistaken. AArch64 is in fact capable of pipelining. The reason I was not seeing any improvement was that I choose too large a buffer size. In my test I was doing a CRC of 3 x 1MB buffers x 5000 times. The pipeline version shows worse performance because it is doing all three 1MB buffers in parallel whereas the non pipeline version processes the 1MB buffers individually which is more cache efficient. I reduced the buffer size to 16KB and increased the no. of iterations from 5000 to 50. This generated the following results. {code} NON PIPELINED crc1 = 783797200, crc2 = 610683550, crc3 = -1644088667 time = 6.16 PIPELINED crc1 = -2031343782, crc2 = -2043588942, crc3 = 554161471 time = 4.61 {code} I then replaced the CRC instruction with and ADD instruction (which always completes in 1 cycle) and got the following result. {code} NON PIPELINED crc1 = -1928994468, crc2 = -1747836272, crc3 = -674545616 time = 4.13 PIPELINED crc1 = -2096826240, crc2 = -334553600, crc3 = 1911147008 time = 4.19 {code} This clearly shows that the CRC is pipelined because it is effectively able to complete each CRC in a single cycle (because the pipeline version gets the same performance as using ADD). My bad. I will submit another patch within the next couple of days which includes pipelining. Thanks for your patience! Ed. Add support for hardware crc on ARM aarch64 architecture Key: HADOOP-11660 URL: https://issues.apache.org/jira/browse/HADOOP-11660 Project: Hadoop Common Issue Type: Improvement Components: native Affects Versions: 3.0.0 Environment: ARM aarch64 development platform Reporter: Edward Nevill Assignee: Edward Nevill Priority: Minor Labels: performance Attachments: jira-11660.patch Original Estimate: 48h Remaining Estimate: 48h This patch adds support for hardware crc for ARM's new 64 bit architecture The patch is completely conditionalized on __aarch64__ I have only added support for the non pipelined version as I benchmarked the pipelined version on aarch64 and it showed no performance improvement. The aarch64 version supports both Castagnoli and Zlib CRCs as both of these are supported on ARM aarch64 hardwre. To benchmark this I modified the test_bulk_crc32 test to print out the time taken to CRC a 1MB dataset 1000 times. Before: CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55 After: CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57 So this represents a 5X performance improvement on raw CRC calculation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HADOOP-11660) Add support for hardware crc on ARM aarch64 architecture
[ https://issues.apache.org/jira/browse/HADOOP-11660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14345469#comment-14345469 ] Colin Patrick McCabe commented on HADOOP-11660: --- {code} -#if (!defined(__FreeBSD__) !defined(WINDOWS)) +#if (!defined(__FreeBSD__) !defined(WINDOWS)) !defined(__aarch64__) {code} I don't understand the logic here, can you explain? What does pipelining have to do with whether we are on ARM? At minimum this needs a comment. {code} crc_update_func = crc32_zlib_sb8; +#ifdef __aarch64__ + if (likely(cached_cpu_supports_crc32)) +crc_update_func = crc32_zlib_hardware; +#endif {code} Can you just change the definition of {{crc32_zlib_sb8}} when __aarch64__ is defined? We should minimize the number of skid marks in the code. Add support for hardware crc on ARM aarch64 architecture Key: HADOOP-11660 URL: https://issues.apache.org/jira/browse/HADOOP-11660 Project: Hadoop Common Issue Type: Improvement Components: native Affects Versions: 3.0.0 Environment: ARM aarch64 development platform Reporter: Edward Nevill Assignee: Edward Nevill Priority: Minor Labels: performance Fix For: 3.0.0 Attachments: jira-11660.patch Original Estimate: 48h Remaining Estimate: 48h This patch adds support for hardware crc for ARM's new 64 bit architecture The patch is completely conditionalized on __aarch64__ I have only added support for the non pipelined version as I benchmarked the pipelined version on aarch64 and it showed no performance improvement. The aarch64 version supports both Castagnoli and Zlib CRCs as both of these are supported on ARM aarch64 hardwre. To benchmark this I modified the test_bulk_crc32 test to print out the time taken to CRC a 1MB dataset 1000 times. Before: CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55 After: CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57 So this represents a 5X performance improvement on raw CRC calculation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HADOOP-11660) Add support for hardware crc on ARM aarch64 architecture
[ https://issues.apache.org/jira/browse/HADOOP-11660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14345588#comment-14345588 ] Edward Nevill commented on HADOOP-11660: On 3 March 2015 at 18:38, Colin Patrick McCabe (JIRA) j...@apache.org The USE_PIPELINED optimises crc on x86 because x86 has 3 crc units and therefore can essentially calculate 3 crcs in parallel. The code for the x86 loop does while (likely(counter)) { __asm__ __volatile__( crc32q (%7), %0;\n\t crc32q (%7,%6,1), %1;\n\t crc32q (%7,%6,2), %2;\n\t : =r(c1), =r(c2), =r(c3) : 0(c1), 1(c2), 2(c3), r(block_size), r(data) ); data++; counter--; } So, the 2nd crc32q executes in the shadow of the 1st (ie it executes in what would be a result delay from the 1st crc32q in any case). Similarly the 3rd crc32q executes in the shadow or the 1st and 2nd. On aarch64 a crc32 takes 3 exec cycles, but it only has one crc unit, so if we did the same on aarch64 and had say crc32 w0, w0, x3 crc32 w1, w1, x4 crc32 w2, w2, x5 The second crc32 w1, w1, x4 would have to wait for the 1st crc to complete and the 3rd would have to wait for the 2nd, taking 9 cycles in any case, so there is no benefit to pipelining. I did some tests with pipelining on aarch64 and it worked out marginally slower than the non pipelined version. Note that the pipelined_crc32 is within a #ifdef x86 section of code so from that point of view it is architecture specific and if I enabled it I would then have to redundantly implement the pipelined version on aarch64. No. Because not all aarch64 HW has the CRC instructions. So we need all three crc32c_hardware - When called to do a castagnoli CRC on a cpu with HW support crc32_zlib_hardware - When called to do a zlib CRC on a cpu with HW support crc32_zlib_sb8 - When call to do a CRC on a CPU without HW support The problem is that the existing code assumes that HW support means only HW support for the Castagnoli CRC. The way to get around the #ifdef __aarch64__ is to add support for zlib CRC on all archs. This would mean adding code to the x86 side to say that zlib CRC is not available. So what this would involve is. - On the x86 side add a dummy definitions of crc32_zlib_hardware which will never get called but is just there to satisfy the reference. It would have an assert to ensure it is never called. This is much like the existing dummy crc32_hardware at the very end. static uint32_t crc32c_hardware(uint32_t crc, const uint8_t* data, size_t length) { // never called! assert(0 hardware crc called on an unsupported platform); return 0; } - Add an additional variable cached_cpu_supports_crc32_zlib to complement the existing cached_cpu_supports_crc32 variable. - In the init section for x86 void __attribute__ ((constructor)) init_cpu_support_flag(void) { uint32_t ecx = cpuid(CPUID_FEATURES); cached_cpu_supports_crc32 = ecx SSE42_FEATURE_BIT; } add an initialisation cached_cpu_supports_crc32_zlib = 0; (alternatively this could be left out and rely on the static initialisation of 0). - In the init section for aarch64 void __attribute__ ((constructor)) init_cpu_support_flag(void) { unsigned long auxv = getauxval(AT_HWCAP); cached_cpu_supports_crc32 = auxv HWCAP_CRC32; } add an initialisation cached_cpu_supports_crc32_zlib = cached_cpu_supports_crc32; (if CRC is supported on aarch64 both variants must be supported). - Then instead of case CRC32_ZLIB_POLYNOMIAL: crc_update_func = crc32_zlib_sb8; #ifdef __aarch64__ if (likely(cached_cpu_supports_crc32)) crc_update_func = crc32_zlib_hardware; #endif we do case CRC32_ZLIB_POLYNOMIAL: crc_update_func = crc32_zlib_sb8; if (likely(cached_cpu_supports_crc32_zlib)) crc_update_func = crc32_zlib_hardware; This would have the advantage that there is a generic framework for architectures to add support for 1 or both (or none) HW support for CRC without additional #ifdefs. So this would get rid of 2 #ifdef __aarch64__ statements leaving just the #ifdef around the USE_PIPELINE and I do feel that adding a significant amout of code to the aarch64 side to implement a pipelined version which offers no performance advantages would be undesirable. But I'm happy to do whichever you feel is best. Just let me know. All the best, Ed. Add support for hardware crc on ARM aarch64 architecture Key: HADOOP-11660 URL: https://issues.apache.org/jira/browse/HADOOP-11660 Project: Hadoop Common Issue Type: Improvement Components: native Affects Versions: 3.0.0 Environment: ARM aarch64 development platform Reporter: Edward Nevill Assignee: Edward Nevill
[jira] [Commented] (HADOOP-11660) Add support for hardware crc on ARM aarch64 architecture
[ https://issues.apache.org/jira/browse/HADOOP-11660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14345490#comment-14345490 ] Hadoop QA commented on HADOOP-11660: {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12702205/jira-11660.patch against trunk revision 4228de9. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-common-project/hadoop-common. Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/5833//testReport/ Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/5833//console This message is automatically generated. Add support for hardware crc on ARM aarch64 architecture Key: HADOOP-11660 URL: https://issues.apache.org/jira/browse/HADOOP-11660 Project: Hadoop Common Issue Type: Improvement Components: native Affects Versions: 3.0.0 Environment: ARM aarch64 development platform Reporter: Edward Nevill Assignee: Edward Nevill Priority: Minor Labels: performance Fix For: 3.0.0 Attachments: jira-11660.patch Original Estimate: 48h Remaining Estimate: 48h This patch adds support for hardware crc for ARM's new 64 bit architecture The patch is completely conditionalized on __aarch64__ I have only added support for the non pipelined version as I benchmarked the pipelined version on aarch64 and it showed no performance improvement. The aarch64 version supports both Castagnoli and Zlib CRCs as both of these are supported on ARM aarch64 hardwre. To benchmark this I modified the test_bulk_crc32 test to print out the time taken to CRC a 1MB dataset 1000 times. Before: CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55 After: CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57 So this represents a 5X performance improvement on raw CRC calculation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HADOOP-11660) Add support for hardware crc on ARM aarch64 architecture
[ https://issues.apache.org/jira/browse/HADOOP-11660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14345601#comment-14345601 ] Allen Wittenauer commented on HADOOP-11660: --- Please stop setting the fix version. That is set at commit time to reflect which version this change was committed to. Thanks. Add support for hardware crc on ARM aarch64 architecture Key: HADOOP-11660 URL: https://issues.apache.org/jira/browse/HADOOP-11660 Project: Hadoop Common Issue Type: Improvement Components: native Affects Versions: 3.0.0 Environment: ARM aarch64 development platform Reporter: Edward Nevill Assignee: Edward Nevill Priority: Minor Labels: performance Attachments: jira-11660.patch Original Estimate: 48h Remaining Estimate: 48h This patch adds support for hardware crc for ARM's new 64 bit architecture The patch is completely conditionalized on __aarch64__ I have only added support for the non pipelined version as I benchmarked the pipelined version on aarch64 and it showed no performance improvement. The aarch64 version supports both Castagnoli and Zlib CRCs as both of these are supported on ARM aarch64 hardwre. To benchmark this I modified the test_bulk_crc32 test to print out the time taken to CRC a 1MB dataset 1000 times. Before: CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55 After: CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57 So this represents a 5X performance improvement on raw CRC calculation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HADOOP-11660) Add support for hardware crc on ARM aarch64 architecture
[ https://issues.apache.org/jira/browse/HADOOP-11660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14343546#comment-14343546 ] Colin Patrick McCabe commented on HADOOP-11660: --- Hi [~enevill], I don't see the patch? Also, it seems like this should be in branch-2 as well as trunk. Add support for hardware crc on ARM aarch64 architecture Key: HADOOP-11660 URL: https://issues.apache.org/jira/browse/HADOOP-11660 Project: Hadoop Common Issue Type: Improvement Components: native Affects Versions: 3.0.0 Environment: ARM aarch64 development platform Reporter: Edward Nevill Assignee: Edward Nevill Priority: Minor Labels: performance Original Estimate: 48h Remaining Estimate: 48h This patch adds support for hardware crc for ARM's new 64 bit architecture The patch is completely conditionalized on __aarch64__ I have only added support for the non pipelined version as I benchmarked the pipelined version on aarch64 and it showed no performance improvement. The aarch64 version supports both Castagnoli and Zlib CRCs as both of these are supported on ARM aarch64 hardwre. To benchmark this I modified the test_bulk_crc32 test to print out the time taken to CRC a 1MB dataset 1000 times. Before: CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55 After: CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57 So this represents a 5X performance improvement on raw CRC calculation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HADOOP-11660) Add support for hardware crc on ARM aarch64 architecture
[ https://issues.apache.org/jira/browse/HADOOP-11660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14343633#comment-14343633 ] Edward Nevill commented on HADOOP-11660: Hi Colin, The lab went down just as I was preparing the patch. I'll attach it as soon as the lab is back up. All the best, Ed. On 2 March 2015 at 18:34, Colin Patrick McCabe (JIRA) j...@apache.org Add support for hardware crc on ARM aarch64 architecture Key: HADOOP-11660 URL: https://issues.apache.org/jira/browse/HADOOP-11660 Project: Hadoop Common Issue Type: Improvement Components: native Affects Versions: 3.0.0 Environment: ARM aarch64 development platform Reporter: Edward Nevill Assignee: Edward Nevill Priority: Minor Labels: performance Original Estimate: 48h Remaining Estimate: 48h This patch adds support for hardware crc for ARM's new 64 bit architecture The patch is completely conditionalized on __aarch64__ I have only added support for the non pipelined version as I benchmarked the pipelined version on aarch64 and it showed no performance improvement. The aarch64 version supports both Castagnoli and Zlib CRCs as both of these are supported on ARM aarch64 hardwre. To benchmark this I modified the test_bulk_crc32 test to print out the time taken to CRC a 1MB dataset 1000 times. Before: CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55 After: CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57 So this represents a 5X performance improvement on raw CRC calculation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)