[jira] [Commented] (HADOOP-11660) Add support for hardware crc on ARM aarch64 architecture

2015-03-30 Thread Edward Nevill (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-11660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386859#comment-14386859
 ] 

Edward Nevill commented on HADOOP-11660:


Hi,

I have revised the patch to include the changes requested above. I have also 
updated test_bulk_crc32.c so it prints out the times for 16384 bytes @ 512 
bytes per checksum X 100 iterations for both the Castagnoli and Zlib 
polynomials.

The following are the results I get for x86_64 before and after. I have done 5 
runs of each.

BEFORE

{code}
[ed@mylittlepony hadoop]$ 
./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32
CRC 16384 bytes @ 512 bytes per checksum X 100 iterations = 1.12
CRC 16384 bytes @ 512 bytes per checksum X 100 iterations = 13.8
./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32: SUCCESS.
[ed@mylittlepony hadoop]$ 
./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32
CRC 16384 bytes @ 512 bytes per checksum X 100 iterations = 1.12
CRC 16384 bytes @ 512 bytes per checksum X 100 iterations = 13.84
./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32: SUCCESS.
[ed@mylittlepony hadoop]$ 
./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32
CRC 16384 bytes @ 512 bytes per checksum X 100 iterations = 1.1
CRC 16384 bytes @ 512 bytes per checksum X 100 iterations = 13.85
./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32: SUCCESS.
[ed@mylittlepony hadoop]$ 
./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32
CRC 16384 bytes @ 512 bytes per checksum X 100 iterations = 1.12
CRC 16384 bytes @ 512 bytes per checksum X 100 iterations = 13.94
./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32: SUCCESS.
[ed@mylittlepony hadoop]$ 
./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32
CRC 16384 bytes @ 512 bytes per checksum X 100 iterations = 1.12
CRC 16384 bytes @ 512 bytes per checksum X 100 iterations = 13.81
./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32: SUCCESS.
{code}

AFTER

{code}
[ed@mylittlepony hadoop]$ 
./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32
CRC 16384 bytes @ 512 bytes per checksum X 100 iterations = 1.11
CRC 16384 bytes @ 512 bytes per checksum X 100 iterations = 13.92
./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32: SUCCESS.
[ed@mylittlepony hadoop]$ 
./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32
CRC 16384 bytes @ 512 bytes per checksum X 100 iterations = 1.12
CRC 16384 bytes @ 512 bytes per checksum X 100 iterations = 13.99
./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32: SUCCESS.
[ed@mylittlepony hadoop]$ 
./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32
CRC 16384 bytes @ 512 bytes per checksum X 100 iterations = 1.11
CRC 16384 bytes @ 512 bytes per checksum X 100 iterations = 13.9
./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32: SUCCESS.
[ed@mylittlepony hadoop]$ 
./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32
CRC 16384 bytes @ 512 bytes per checksum X 100 iterations = 1.12
CRC 16384 bytes @ 512 bytes per checksum X 100 iterations = 13.92
./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32: SUCCESS.
[ed@mylittlepony hadoop]$ 
./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32
CRC 16384 bytes @ 512 bytes per checksum X 100 iterations = 1.12
CRC 16384 bytes @ 512 bytes per checksum X 100 iterations = 13.92
./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32: SUCCESS.
{code}

Loking at the average time over 5 runs gives

BEFORE

Castagnoli average = 1.116 sec
Zlib average = 13.848 sec

AFTER

Castagnoli average = 1.116
Zlib average = 13.93

So the performance for the Castagnoli polynomial is the same. For the Zlib 
poynomial there seems to be a performance degradation of 0.6%. This may be due 
to experimental error, however this is unaccelerated in any case on x86 because 
it is not supported on x86 HW and is not used for HDFS.

For comparison, on aarch64 partner HW I get the following averages

Castagnoli = 3.586
Zlib = 3.580

Many thanks for you help with this,
Ed.


 Add support for hardware crc on ARM aarch64 architecture
 

 Key: HADOOP-11660
 URL: https://issues.apache.org/jira/browse/HADOOP-11660
 Project: Hadoop Common
  Issue Type: Improvement
  Components: native
Affects Versions: 3.0.0
 Environment: ARM aarch64 development platform
Reporter: Edward Nevill
Assignee: Edward Nevill
Priority: Minor
  Labels: performance
 Attachments: jira-11660.patch

   Original Estimate: 48h
  Remaining Estimate: 

[jira] [Commented] (HADOOP-11660) Add support for hardware crc on ARM aarch64 architecture

2015-03-30 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-11660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386921#comment-14386921
 ] 

Hadoop QA commented on HADOOP-11660:


{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12708163/jira-11660.patch
  against trunk revision ae3e8c6.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-common-project/hadoop-common.

Test results: 
https://builds.apache.org/job/PreCommit-HADOOP-Build/6025//testReport/
Console output: 
https://builds.apache.org/job/PreCommit-HADOOP-Build/6025//console

This message is automatically generated.

 Add support for hardware crc on ARM aarch64 architecture
 

 Key: HADOOP-11660
 URL: https://issues.apache.org/jira/browse/HADOOP-11660
 Project: Hadoop Common
  Issue Type: Improvement
  Components: native
Affects Versions: 3.0.0
 Environment: ARM aarch64 development platform
Reporter: Edward Nevill
Assignee: Edward Nevill
Priority: Minor
  Labels: performance
 Attachments: jira-11660.patch

   Original Estimate: 48h
  Remaining Estimate: 48h

 This patch adds support for hardware crc for ARM's new 64 bit architecture
 The patch is completely conditionalized on __aarch64__
 I have only added support for the non pipelined version as I benchmarked the 
 pipelined version on aarch64 and it showed no performance improvement.
 The aarch64 version supports both Castagnoli and Zlib CRCs as both of these 
 are supported on ARM aarch64 hardwre.
 To benchmark this I modified the test_bulk_crc32 test to print out the time 
 taken to CRC a 1MB dataset 1000 times.
 Before:
 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55
 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55
 After:
 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57
 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57
 So this represents a 5X performance improvement on raw CRC calculation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HADOOP-11660) Add support for hardware crc on ARM aarch64 architecture

2015-03-26 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-11660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14382721#comment-14382721
 ] 

Colin Patrick McCabe commented on HADOOP-11660:
---

Thank you for your patience, [~enevill].  I think this is almost ready to 
commit.

{code}
168 ELSEIF (CMAKE_SYSTEM_PROCESSOR STREQUAL aarch64)
169   set(BULK_CRC_ARCH_SOURCE_FIlE ${D}/util/bulk_crc32_aarch64.c)
170 ENDIF()
{code}
Can you put in a {{MESSAGE}} here that explains that the architecture is 
unsupported in the ELSE case?  We certainly don't want to be losing hardware 
acceleration without being aware of it.

{{bulk_crc32_aarch64.c}}: you should include {{stdint.h}} here for {{uint8_t}}, 
etc.  Even though some other header is probably pulling it in now by accident.

{{bulk_crc32_x86.c}}: I would really prefer not to wrap this in a giant {{#if 
defined(__GNUC__)  !defined(__FreeBSD__)}}, especially since we're not 
wrapping the ARM version like that.  If people want this to be compiler and 
os-specific, it would be better to do it at the CMake level.  I would say just 
take that out and let people fix it if it becomes a problem for them.

Can you post before / after performance numbers for x86_64?  Maybe you can 
instrument test_bulk_crc32.c to produce those numbers.

It looks like when this was done previously, the test code was not checked in.  
See:
https://issues.apache.org/jira/browse/HADOOP-7446?focusedCommentId=13084519page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13084519

I'm sorry to dump another task on you, but my co-workers will kill me if I 
regress checksum performance.

Thanks again for working on this.  As soon as we verify that we haven't 
regressed perf, and made those minor changes, we should be good to go.

 Add support for hardware crc on ARM aarch64 architecture
 

 Key: HADOOP-11660
 URL: https://issues.apache.org/jira/browse/HADOOP-11660
 Project: Hadoop Common
  Issue Type: Improvement
  Components: native
Affects Versions: 3.0.0
 Environment: ARM aarch64 development platform
Reporter: Edward Nevill
Assignee: Edward Nevill
Priority: Minor
  Labels: performance
 Attachments: jira-11660.patch

   Original Estimate: 48h
  Remaining Estimate: 48h

 This patch adds support for hardware crc for ARM's new 64 bit architecture
 The patch is completely conditionalized on __aarch64__
 I have only added support for the non pipelined version as I benchmarked the 
 pipelined version on aarch64 and it showed no performance improvement.
 The aarch64 version supports both Castagnoli and Zlib CRCs as both of these 
 are supported on ARM aarch64 hardwre.
 To benchmark this I modified the test_bulk_crc32 test to print out the time 
 taken to CRC a 1MB dataset 1000 times.
 Before:
 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55
 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55
 After:
 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57
 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57
 So this represents a 5X performance improvement on raw CRC calculation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HADOOP-11660) Add support for hardware crc on ARM aarch64 architecture

2015-03-19 Thread Edward Nevill (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-11660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14369568#comment-14369568
 ] 

Edward Nevill commented on HADOOP-11660:


Hi,

There does not seem to be a variable ARCH. The closest match seems to be 
CMAKE_SYSTEM_PROCESSOR.

I have added a condition in the CMakeLists.txt which does

{code}
IF (CMAKE_SYSTEM_PROCESSOR MATCHES ^i.86$ OR CMAKE_SYSTEM_PROCESSOR STREQUAL 
x86_64 OR CMAKE_SYSTEM_PROCESSOR STREQUAL amd64)
  set(BULK_CRC_ARCH_SOURCE_FIlE ${D}/util/bulk_crc32_x86.c)
ELSEIF (CMAKE_SYSTEM_PROCESSOR STREQUAL aarch64)
  set(BULK_CRC_ARCH_SOURCE_FIlE ${D}/util/bulk_crc32_aarch64.c)
ENDIF()
{code}

And then uses BULK_CRC_ARCH_SOURCE_FIlE in the 2 cases where it is included 
later.

Thanks for your help and apologies for the amount of time this patch is taking,

Ed.


 Add support for hardware crc on ARM aarch64 architecture
 

 Key: HADOOP-11660
 URL: https://issues.apache.org/jira/browse/HADOOP-11660
 Project: Hadoop Common
  Issue Type: Improvement
  Components: native
Affects Versions: 3.0.0
 Environment: ARM aarch64 development platform
Reporter: Edward Nevill
Assignee: Edward Nevill
Priority: Minor
  Labels: performance
 Attachments: jira-11660.patch

   Original Estimate: 48h
  Remaining Estimate: 48h

 This patch adds support for hardware crc for ARM's new 64 bit architecture
 The patch is completely conditionalized on __aarch64__
 I have only added support for the non pipelined version as I benchmarked the 
 pipelined version on aarch64 and it showed no performance improvement.
 The aarch64 version supports both Castagnoli and Zlib CRCs as both of these 
 are supported on ARM aarch64 hardwre.
 To benchmark this I modified the test_bulk_crc32 test to print out the time 
 taken to CRC a 1MB dataset 1000 times.
 Before:
 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55
 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55
 After:
 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57
 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57
 So this represents a 5X performance improvement on raw CRC calculation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HADOOP-11660) Add support for hardware crc on ARM aarch64 architecture

2015-03-19 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-11660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14369661#comment-14369661
 ] 

Hadoop QA commented on HADOOP-11660:


{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12705611/jira-11660.patch
  against trunk revision 522c897.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-common-project/hadoop-common.

Test results: 
https://builds.apache.org/job/PreCommit-HADOOP-Build/5972//testReport/
Console output: 
https://builds.apache.org/job/PreCommit-HADOOP-Build/5972//console

This message is automatically generated.

 Add support for hardware crc on ARM aarch64 architecture
 

 Key: HADOOP-11660
 URL: https://issues.apache.org/jira/browse/HADOOP-11660
 Project: Hadoop Common
  Issue Type: Improvement
  Components: native
Affects Versions: 3.0.0
 Environment: ARM aarch64 development platform
Reporter: Edward Nevill
Assignee: Edward Nevill
Priority: Minor
  Labels: performance
 Attachments: jira-11660.patch

   Original Estimate: 48h
  Remaining Estimate: 48h

 This patch adds support for hardware crc for ARM's new 64 bit architecture
 The patch is completely conditionalized on __aarch64__
 I have only added support for the non pipelined version as I benchmarked the 
 pipelined version on aarch64 and it showed no performance improvement.
 The aarch64 version supports both Castagnoli and Zlib CRCs as both of these 
 are supported on ARM aarch64 hardwre.
 To benchmark this I modified the test_bulk_crc32 test to print out the time 
 taken to CRC a 1MB dataset 1000 times.
 Before:
 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55
 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55
 After:
 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57
 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57
 So this represents a 5X performance improvement on raw CRC calculation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HADOOP-11660) Add support for hardware crc on ARM aarch64 architecture

2015-03-18 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-11660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14368231#comment-14368231
 ] 

Colin Patrick McCabe commented on HADOOP-11660:
---

sorry that I have been slow to review this.  It is difficult code to review.

If you want an example of cmake conditional compilation, look at 
./hadoop-hdfs-project/hadoop-hdfs/src/CMakeLists.txt and its {{if (WIN32)}}.  
In your case you would be checking on the built-in cmake variables, something 
like ARCH?  It should be in the cmake docs

 Add support for hardware crc on ARM aarch64 architecture
 

 Key: HADOOP-11660
 URL: https://issues.apache.org/jira/browse/HADOOP-11660
 Project: Hadoop Common
  Issue Type: Improvement
  Components: native
Affects Versions: 3.0.0
 Environment: ARM aarch64 development platform
Reporter: Edward Nevill
Assignee: Edward Nevill
Priority: Minor
  Labels: performance
 Attachments: jira-11660.patch

   Original Estimate: 48h
  Remaining Estimate: 48h

 This patch adds support for hardware crc for ARM's new 64 bit architecture
 The patch is completely conditionalized on __aarch64__
 I have only added support for the non pipelined version as I benchmarked the 
 pipelined version on aarch64 and it showed no performance improvement.
 The aarch64 version supports both Castagnoli and Zlib CRCs as both of these 
 are supported on ARM aarch64 hardwre.
 To benchmark this I modified the test_bulk_crc32 test to print out the time 
 taken to CRC a 1MB dataset 1000 times.
 Before:
 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55
 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55
 After:
 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57
 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57
 So this represents a 5X performance improvement on raw CRC calculation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HADOOP-11660) Add support for hardware crc on ARM aarch64 architecture

2015-03-17 Thread Edward Nevill (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-11660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14365399#comment-14365399
 ] 

Edward Nevill commented on HADOOP-11660:


Hi Colin,

My understanding was that the issue was the no of #ifdefs in the file so I 
reworked it to reduce it to a single #ifdef X86 ... #elif AARCH64 #endif.

I have prepared a new patch which separates x86 and aarch64 into two files 
bulk_crc32_x86.c and bulk_crc32_aarch64.c. bulk_crc32.c now contains just the 
fallback software implementation.

The cmakefile includes both bulk_crc32_x86.c and bulk_crc32_aarch64.c and the 
two files are then protected by

bulk_crc32_x86.c

{code}
#ifdef X86

#endif
{code}

bulk_crc32_aarch64.c

{code}
#ifdef AARCH64
...
#endif
{code}

I am not clear how to make cmake select correctly the _x86 or _aarch64 version. 
If you could help me with this then we can get rid of the final 
conditionalisation above.

I have also added back in support for the optimised 32 bit x86 implemetation.

All the best,
Ed.


 Add support for hardware crc on ARM aarch64 architecture
 

 Key: HADOOP-11660
 URL: https://issues.apache.org/jira/browse/HADOOP-11660
 Project: Hadoop Common
  Issue Type: Improvement
  Components: native
Affects Versions: 3.0.0
 Environment: ARM aarch64 development platform
Reporter: Edward Nevill
Assignee: Edward Nevill
Priority: Minor
  Labels: performance
 Attachments: jira-11660.patch

   Original Estimate: 48h
  Remaining Estimate: 48h

 This patch adds support for hardware crc for ARM's new 64 bit architecture
 The patch is completely conditionalized on __aarch64__
 I have only added support for the non pipelined version as I benchmarked the 
 pipelined version on aarch64 and it showed no performance improvement.
 The aarch64 version supports both Castagnoli and Zlib CRCs as both of these 
 are supported on ARM aarch64 hardwre.
 To benchmark this I modified the test_bulk_crc32 test to print out the time 
 taken to CRC a 1MB dataset 1000 times.
 Before:
 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55
 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55
 After:
 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57
 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57
 So this represents a 5X performance improvement on raw CRC calculation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HADOOP-11660) Add support for hardware crc on ARM aarch64 architecture

2015-03-17 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-11660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14365521#comment-14365521
 ] 

Hadoop QA commented on HADOOP-11660:


{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12705102/jira-11660.patch
  against trunk revision 7179f94.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-common-project/hadoop-common.

Test results: 
https://builds.apache.org/job/PreCommit-HADOOP-Build/5958//testReport/
Console output: 
https://builds.apache.org/job/PreCommit-HADOOP-Build/5958//console

This message is automatically generated.

 Add support for hardware crc on ARM aarch64 architecture
 

 Key: HADOOP-11660
 URL: https://issues.apache.org/jira/browse/HADOOP-11660
 Project: Hadoop Common
  Issue Type: Improvement
  Components: native
Affects Versions: 3.0.0
 Environment: ARM aarch64 development platform
Reporter: Edward Nevill
Assignee: Edward Nevill
Priority: Minor
  Labels: performance
 Attachments: jira-11660.patch

   Original Estimate: 48h
  Remaining Estimate: 48h

 This patch adds support for hardware crc for ARM's new 64 bit architecture
 The patch is completely conditionalized on __aarch64__
 I have only added support for the non pipelined version as I benchmarked the 
 pipelined version on aarch64 and it showed no performance improvement.
 The aarch64 version supports both Castagnoli and Zlib CRCs as both of these 
 are supported on ARM aarch64 hardwre.
 To benchmark this I modified the test_bulk_crc32 test to print out the time 
 taken to CRC a 1MB dataset 1000 times.
 Before:
 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55
 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55
 After:
 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57
 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57
 So this represents a 5X performance improvement on raw CRC calculation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HADOOP-11660) Add support for hardware crc on ARM aarch64 architecture

2015-03-13 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-11660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14361313#comment-14361313
 ] 

Colin Patrick McCabe commented on HADOOP-11660:
---

Hi Edward,

What's the rationale for not splitting the file into multiple files per 
architecture as I asked in this comment: 
https://issues.apache.org/jira/browse/HADOOP-11660?focusedCommentId=14350788page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14350788
 ?  Also, I would prefer not to drop support for architectures, even if they 
are 32-bit x86.

 Add support for hardware crc on ARM aarch64 architecture
 

 Key: HADOOP-11660
 URL: https://issues.apache.org/jira/browse/HADOOP-11660
 Project: Hadoop Common
  Issue Type: Improvement
  Components: native
Affects Versions: 3.0.0
 Environment: ARM aarch64 development platform
Reporter: Edward Nevill
Assignee: Edward Nevill
Priority: Minor
  Labels: performance
 Attachments: bulk_crc32.c, jira-11660-new.patch, jira-11660.patch

   Original Estimate: 48h
  Remaining Estimate: 48h

 This patch adds support for hardware crc for ARM's new 64 bit architecture
 The patch is completely conditionalized on __aarch64__
 I have only added support for the non pipelined version as I benchmarked the 
 pipelined version on aarch64 and it showed no performance improvement.
 The aarch64 version supports both Castagnoli and Zlib CRCs as both of these 
 are supported on ARM aarch64 hardwre.
 To benchmark this I modified the test_bulk_crc32 test to print out the time 
 taken to CRC a 1MB dataset 1000 times.
 Before:
 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55
 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55
 After:
 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57
 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57
 So this represents a 5X performance improvement on raw CRC calculation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HADOOP-11660) Add support for hardware crc on ARM aarch64 architecture

2015-03-10 Thread Edward Nevill (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-11660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14355805#comment-14355805
 ] 

Edward Nevill commented on HADOOP-11660:


Hi,

I have added a new patch which reworks the code and (hopefully) makes it more 
readable. I have also attached the resultant bulk_crc32.c as the patch is 
growing rather large and difficult to understand.

I have made the following changes

1) I have removed conditionalisaton on USE_PIPELINED. I really do not 
understand the condition that reads

{code}
#if (!defined(__FreeBSD__)  !defined(WINDOWS))
#define USE_PIPELINED
#endif
{code}

USE_PIPELINED is a feature of the architecture (IE. is there any benefit to 
pipelining on your particular arch - at the moment x86 or aarch64), not the OS. 
So I really dont understand this ifdef.

If it is the case that some arch does not support pipelining then the arch must 
still implement a pipelined version, but this can be a basic shell like I have 
done for the sb8 versions.

{code}
static void pipelined_crc32c_sb8(uint32_t *crc1, uint32_t *crc2, uint32_t *crc3,
 const uint8_t *p_buf, size_t block_size, int 
num_blocks) {
  assert(num_blocks = 1  num_blocks =3  invalid num_blocks);
  *crc1 = crc32c_sb8(*crc1, p_buf, block_size);
  if (num_blocks = 2)
*crc2 = crc32c_sb8(*crc2, p_buf+block_size, block_size);
  if (num_blocks = 3)
*crc3 = crc32c_sb8(*crc3, p_buf+2*block_size, block_size);
}
{code}

So, in this case, if there is no support for pipelining, just call the non 
pipelined version 3 times with each buffer in turn.

Doing this make the code a whole lot easier to read IMHO

2) The existing code initialised a variable 
{code}cached_cpu_supports_crc32{code} and then did

{code}
  if (likely(cached_cpu_supports_crc32)) {
crc_update_func = crc32c_hardware;
{code}

The problem with this is it creates a reference to crc32c_hardware, even when 
there is no hardware support. This means it is necessary to create dummy 
crc32_hardware function when we are only doing SW crc. IE.

{code}
static uint32_t crc32c_hardware(uint32_t crc, const uint8_t* data, size_t 
length) {
  // never called!
  assert(0  hardware crc called on an unsupported platform);
  return 0;
}
{code}

Instead, what I do is initialise static function pointers in the initialisation 
function. So for example, for aarch64 I do

{code}
void __attribute__ ((constructor)) init_cpu_support_flag(void) {
  unsigned long auxv = getauxval(AT_HWCAP);
  if (auxv  HWCAP_CRC32) {
pipelined_crc32c_func = pipelined_crc32c;
pipelined_crc32_zlib_func = pipelined_crc32_zlib;
  }
}
{code}

By default these pointers are statically initialised to the SW routines. IE.

{code}
// Satically initialise the function pointers to the software versions
// If HW is available these will be overridden by the initialisation functions
static crc_pipelined_func_t pipelined_crc32c_func = pipelined_crc32c_sb8;
static crc_pipelined_func_t pipelined_crc32_zlib_func = 
pipelined_crc32_zlib_sb8;
{code}

So if nobody initialises them they fall back to SW.

3) I have removed the implementation for CRC acceleration on 32 bit x86. 32 bit 
x86 now falls back to SW.

Does anyone really build hadoop for 32 bit? If yes, I will gladly put it back.

The result of these changes is to vastly reduce the amount of 
conditionalisation in the code. The conditionalisation is now reduced to

{code}
#if defined(__amd64__)  defined(__GNUC__)  !defined(__FreeBSD__)
.
#elif defined(__aarch64__) // Start ARM 64 architecture
.
#endif
{code}

I hope the above changes will make the code a whole lot more readable,

All the best,
Ed.


 Add support for hardware crc on ARM aarch64 architecture
 

 Key: HADOOP-11660
 URL: https://issues.apache.org/jira/browse/HADOOP-11660
 Project: Hadoop Common
  Issue Type: Improvement
  Components: native
Affects Versions: 3.0.0
 Environment: ARM aarch64 development platform
Reporter: Edward Nevill
Assignee: Edward Nevill
Priority: Minor
  Labels: performance
 Attachments: bulk_crc32.c, jira-11660-new.patch, jira-11660.patch

   Original Estimate: 48h
  Remaining Estimate: 48h

 This patch adds support for hardware crc for ARM's new 64 bit architecture
 The patch is completely conditionalized on __aarch64__
 I have only added support for the non pipelined version as I benchmarked the 
 pipelined version on aarch64 and it showed no performance improvement.
 The aarch64 version supports both Castagnoli and Zlib CRCs as both of these 
 are supported on ARM aarch64 hardwre.
 To benchmark this I modified the test_bulk_crc32 test to print out the time 
 taken to CRC a 1MB dataset 1000 times.
 Before:
 CRC 1048576 bytes @ 512 bytes per 

[jira] [Commented] (HADOOP-11660) Add support for hardware crc on ARM aarch64 architecture

2015-03-10 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-11660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14355816#comment-14355816
 ] 

Hadoop QA commented on HADOOP-11660:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12703755/bulk_crc32.c
  against trunk revision 64eb068.

{color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: 
https://builds.apache.org/job/PreCommit-HADOOP-Build/5910//console

This message is automatically generated.

 Add support for hardware crc on ARM aarch64 architecture
 

 Key: HADOOP-11660
 URL: https://issues.apache.org/jira/browse/HADOOP-11660
 Project: Hadoop Common
  Issue Type: Improvement
  Components: native
Affects Versions: 3.0.0
 Environment: ARM aarch64 development platform
Reporter: Edward Nevill
Assignee: Edward Nevill
Priority: Minor
  Labels: performance
 Attachments: bulk_crc32.c, jira-11660-new.patch, jira-11660.patch

   Original Estimate: 48h
  Remaining Estimate: 48h

 This patch adds support for hardware crc for ARM's new 64 bit architecture
 The patch is completely conditionalized on __aarch64__
 I have only added support for the non pipelined version as I benchmarked the 
 pipelined version on aarch64 and it showed no performance improvement.
 The aarch64 version supports both Castagnoli and Zlib CRCs as both of these 
 are supported on ARM aarch64 hardwre.
 To benchmark this I modified the test_bulk_crc32 test to print out the time 
 taken to CRC a 1MB dataset 1000 times.
 Before:
 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55
 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55
 After:
 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57
 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57
 So this represents a 5X performance improvement on raw CRC calculation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HADOOP-11660) Add support for hardware crc on ARM aarch64 architecture

2015-03-06 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-11660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14350212#comment-14350212
 ] 

Hadoop QA commented on HADOOP-11660:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12703031/jira-11660.patch
  against trunk revision 95bfd08.

{color:red}-1 patch{color}.  Trunk compilation may be broken.

Console output: 
https://builds.apache.org/job/PreCommit-HADOOP-Build/5867//console

This message is automatically generated.

 Add support for hardware crc on ARM aarch64 architecture
 

 Key: HADOOP-11660
 URL: https://issues.apache.org/jira/browse/HADOOP-11660
 Project: Hadoop Common
  Issue Type: Improvement
  Components: native
Affects Versions: 3.0.0
 Environment: ARM aarch64 development platform
Reporter: Edward Nevill
Assignee: Edward Nevill
Priority: Minor
  Labels: performance
 Attachments: jira-11660.patch

   Original Estimate: 48h
  Remaining Estimate: 48h

 This patch adds support for hardware crc for ARM's new 64 bit architecture
 The patch is completely conditionalized on __aarch64__
 I have only added support for the non pipelined version as I benchmarked the 
 pipelined version on aarch64 and it showed no performance improvement.
 The aarch64 version supports both Castagnoli and Zlib CRCs as both of these 
 are supported on ARM aarch64 hardwre.
 To benchmark this I modified the test_bulk_crc32 test to print out the time 
 taken to CRC a 1MB dataset 1000 times.
 Before:
 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55
 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55
 After:
 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57
 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57
 So this represents a 5X performance improvement on raw CRC calculation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HADOOP-11660) Add support for hardware crc on ARM aarch64 architecture

2015-03-06 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-11660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14350273#comment-14350273
 ] 

Hadoop QA commented on HADOOP-11660:


{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12703031/jira-11660.patch
  against trunk revision 95bfd08.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-common-project/hadoop-common.

Test results: 
https://builds.apache.org/job/PreCommit-HADOOP-Build/5869//testReport/
Console output: 
https://builds.apache.org/job/PreCommit-HADOOP-Build/5869//console

This message is automatically generated.

 Add support for hardware crc on ARM aarch64 architecture
 

 Key: HADOOP-11660
 URL: https://issues.apache.org/jira/browse/HADOOP-11660
 Project: Hadoop Common
  Issue Type: Improvement
  Components: native
Affects Versions: 3.0.0
 Environment: ARM aarch64 development platform
Reporter: Edward Nevill
Assignee: Edward Nevill
Priority: Minor
  Labels: performance
 Attachments: jira-11660.patch

   Original Estimate: 48h
  Remaining Estimate: 48h

 This patch adds support for hardware crc for ARM's new 64 bit architecture
 The patch is completely conditionalized on __aarch64__
 I have only added support for the non pipelined version as I benchmarked the 
 pipelined version on aarch64 and it showed no performance improvement.
 The aarch64 version supports both Castagnoli and Zlib CRCs as both of these 
 are supported on ARM aarch64 hardwre.
 To benchmark this I modified the test_bulk_crc32 test to print out the time 
 taken to CRC a 1MB dataset 1000 times.
 Before:
 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55
 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55
 After:
 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57
 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57
 So this represents a 5X performance improvement on raw CRC calculation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HADOOP-11660) Add support for hardware crc on ARM aarch64 architecture

2015-03-06 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-11660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14350788#comment-14350788
 ] 

Colin Patrick McCabe commented on HADOOP-11660:
---

Thanks for working on this, [~enevill].  I think the #ifdefs are getting kind 
of out of control in this file.  This isn't really your fault... it's a mess 
even now.  But adding another dimension of #ifdefing just makes my brain hurt.

How about we split the hardware-specific code for x86 into 
{{bulk_crc32_x86.c}}, and the ARM-specific code off into 
{{bulk_crc32_aarch64.c}}.  CMake will compile in the relevant file based on the 
architecture.  I would say both of the platform specific files should implement 
the same functions, so that the generic part can be compiled against either.

I suppose we'd still have the USE_PIPELINED ifdefs, but one dimension of ifdefs 
is possible to understand (for me, at least).

Do you think that's possible?

 Add support for hardware crc on ARM aarch64 architecture
 

 Key: HADOOP-11660
 URL: https://issues.apache.org/jira/browse/HADOOP-11660
 Project: Hadoop Common
  Issue Type: Improvement
  Components: native
Affects Versions: 3.0.0
 Environment: ARM aarch64 development platform
Reporter: Edward Nevill
Assignee: Edward Nevill
Priority: Minor
  Labels: performance
 Attachments: jira-11660.patch

   Original Estimate: 48h
  Remaining Estimate: 48h

 This patch adds support for hardware crc for ARM's new 64 bit architecture
 The patch is completely conditionalized on __aarch64__
 I have only added support for the non pipelined version as I benchmarked the 
 pipelined version on aarch64 and it showed no performance improvement.
 The aarch64 version supports both Castagnoli and Zlib CRCs as both of these 
 are supported on ARM aarch64 hardwre.
 To benchmark this I modified the test_bulk_crc32 test to print out the time 
 taken to CRC a 1MB dataset 1000 times.
 Before:
 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55
 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55
 After:
 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57
 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57
 So this represents a 5X performance improvement on raw CRC calculation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HADOOP-11660) Add support for hardware crc on ARM aarch64 architecture

2015-03-04 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-11660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14348013#comment-14348013
 ] 

Colin Patrick McCabe commented on HADOOP-11660:
---

OK.  Thanks, Edward.

 Add support for hardware crc on ARM aarch64 architecture
 

 Key: HADOOP-11660
 URL: https://issues.apache.org/jira/browse/HADOOP-11660
 Project: Hadoop Common
  Issue Type: Improvement
  Components: native
Affects Versions: 3.0.0
 Environment: ARM aarch64 development platform
Reporter: Edward Nevill
Assignee: Edward Nevill
Priority: Minor
  Labels: performance
   Original Estimate: 48h
  Remaining Estimate: 48h

 This patch adds support for hardware crc for ARM's new 64 bit architecture
 The patch is completely conditionalized on __aarch64__
 I have only added support for the non pipelined version as I benchmarked the 
 pipelined version on aarch64 and it showed no performance improvement.
 The aarch64 version supports both Castagnoli and Zlib CRCs as both of these 
 are supported on ARM aarch64 hardwre.
 To benchmark this I modified the test_bulk_crc32 test to print out the time 
 taken to CRC a 1MB dataset 1000 times.
 Before:
 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55
 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55
 After:
 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57
 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57
 So this represents a 5X performance improvement on raw CRC calculation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HADOOP-11660) Add support for hardware crc on ARM aarch64 architecture

2015-03-04 Thread Edward Nevill (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-11660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14346793#comment-14346793
 ] 

Edward Nevill commented on HADOOP-11660:


Hi,

I did some additional benchmarking on pipelining on aarch64 and it would appear 
that I was mistaken. AArch64 is in fact capable of pipelining. The reason I was 
not seeing any improvement was that I choose too large a buffer size. In my 
test I was doing a CRC of 3 x 1MB buffers x 5000 times. The pipeline version 
shows worse performance because it is doing all three 1MB buffers in parallel 
whereas the non pipeline version processes the 1MB buffers individually which 
is more cache efficient.

I reduced the buffer size to 16KB and increased the no. of iterations from 5000 
to 50. This generated the following results.

{code}
NON PIPELINED crc1 = 783797200, crc2 = 610683550, crc3 = -1644088667
time = 6.16
PIPELINED crc1 = -2031343782, crc2 = -2043588942, crc3 = 554161471
time = 4.61
{code}

I then replaced the CRC instruction with and ADD instruction (which always 
completes in 1 cycle) and got the following result.

{code}
NON PIPELINED crc1 = -1928994468, crc2 = -1747836272, crc3 = -674545616
time = 4.13
PIPELINED crc1 = -2096826240, crc2 = -334553600, crc3 = 1911147008
time = 4.19
{code}

This clearly shows that the CRC is pipelined because it is effectively able to 
complete each CRC in a single cycle (because the pipeline version gets the same 
performance as using ADD).

My bad. I will submit another patch within the next couple of days which 
includes pipelining. Thanks for your patience!

Ed.


 Add support for hardware crc on ARM aarch64 architecture
 

 Key: HADOOP-11660
 URL: https://issues.apache.org/jira/browse/HADOOP-11660
 Project: Hadoop Common
  Issue Type: Improvement
  Components: native
Affects Versions: 3.0.0
 Environment: ARM aarch64 development platform
Reporter: Edward Nevill
Assignee: Edward Nevill
Priority: Minor
  Labels: performance
 Attachments: jira-11660.patch

   Original Estimate: 48h
  Remaining Estimate: 48h

 This patch adds support for hardware crc for ARM's new 64 bit architecture
 The patch is completely conditionalized on __aarch64__
 I have only added support for the non pipelined version as I benchmarked the 
 pipelined version on aarch64 and it showed no performance improvement.
 The aarch64 version supports both Castagnoli and Zlib CRCs as both of these 
 are supported on ARM aarch64 hardwre.
 To benchmark this I modified the test_bulk_crc32 test to print out the time 
 taken to CRC a 1MB dataset 1000 times.
 Before:
 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55
 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55
 After:
 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57
 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57
 So this represents a 5X performance improvement on raw CRC calculation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HADOOP-11660) Add support for hardware crc on ARM aarch64 architecture

2015-03-03 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-11660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14345469#comment-14345469
 ] 

Colin Patrick McCabe commented on HADOOP-11660:
---

{code}
-#if (!defined(__FreeBSD__)  !defined(WINDOWS))
+#if (!defined(__FreeBSD__)  !defined(WINDOWS))  !defined(__aarch64__)
{code}

I don't understand the logic here, can you explain?  What does pipelining have 
to do with whether we are on ARM?  At minimum this needs a comment.  

{code}
   crc_update_func = crc32_zlib_sb8;
+#ifdef __aarch64__
+  if (likely(cached_cpu_supports_crc32))
+crc_update_func = crc32_zlib_hardware;
+#endif
{code}

Can you just change the definition of {{crc32_zlib_sb8}} when __aarch64__ is 
defined?  We should minimize the number of skid marks in the code.

 Add support for hardware crc on ARM aarch64 architecture
 

 Key: HADOOP-11660
 URL: https://issues.apache.org/jira/browse/HADOOP-11660
 Project: Hadoop Common
  Issue Type: Improvement
  Components: native
Affects Versions: 3.0.0
 Environment: ARM aarch64 development platform
Reporter: Edward Nevill
Assignee: Edward Nevill
Priority: Minor
  Labels: performance
 Fix For: 3.0.0

 Attachments: jira-11660.patch

   Original Estimate: 48h
  Remaining Estimate: 48h

 This patch adds support for hardware crc for ARM's new 64 bit architecture
 The patch is completely conditionalized on __aarch64__
 I have only added support for the non pipelined version as I benchmarked the 
 pipelined version on aarch64 and it showed no performance improvement.
 The aarch64 version supports both Castagnoli and Zlib CRCs as both of these 
 are supported on ARM aarch64 hardwre.
 To benchmark this I modified the test_bulk_crc32 test to print out the time 
 taken to CRC a 1MB dataset 1000 times.
 Before:
 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55
 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55
 After:
 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57
 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57
 So this represents a 5X performance improvement on raw CRC calculation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HADOOP-11660) Add support for hardware crc on ARM aarch64 architecture

2015-03-03 Thread Edward Nevill (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-11660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14345588#comment-14345588
 ] 

Edward Nevill commented on HADOOP-11660:


On 3 March 2015 at 18:38, Colin Patrick McCabe (JIRA) j...@apache.org


 The USE_PIPELINED optimises crc on x86 because x86 has 3 crc units and
therefore can essentially calculate 3 crcs in parallel. The code for the
x86 loop does

  while (likely(counter)) {
__asm__ __volatile__(
crc32q (%7), %0;\n\t
crc32q (%7,%6,1), %1;\n\t
crc32q (%7,%6,2), %2;\n\t
 : =r(c1), =r(c2), =r(c3)
 : 0(c1), 1(c2), 2(c3), r(block_size), r(data)
);
data++;
counter--;
  }

So, the 2nd crc32q executes in the shadow of the 1st (ie it executes in
what would be a result delay from the 1st crc32q in any case). Similarly
the 3rd crc32q executes in the shadow or the 1st and 2nd.

On aarch64 a crc32 takes 3 exec cycles, but it only has one crc unit, so if
we did the same on aarch64 and had say

crc32   w0, w0, x3
crc32   w1, w1, x4
crc32   w2, w2, x5

The second crc32 w1, w1, x4 would have to wait for the 1st crc to complete
and the 3rd would have to wait for the 2nd, taking 9 cycles in any case, so
there is no benefit to pipelining.

I did some tests with pipelining on aarch64 and it worked out marginally
slower than the non pipelined version.

Note that the pipelined_crc32 is within a #ifdef x86 section of code so
from that point of view it is architecture specific and if I enabled it I
would then have to redundantly implement the pipelined version on aarch64.



No. Because not all aarch64 HW has the CRC instructions. So we need all
three

crc32c_hardware - When called to do a castagnoli CRC on a cpu with HW
support
crc32_zlib_hardware - When called to do a zlib CRC on a cpu with HW support
crc32_zlib_sb8 - When call to do a CRC on a CPU without HW support

The problem is that the existing code assumes that HW support means only HW
support for the Castagnoli CRC.

The way to get around the #ifdef __aarch64__ is to add support for zlib CRC
on all archs. This would mean adding code to the x86 side to say that zlib
CRC is not available.

So what this would involve is.

- On the x86 side add a dummy definitions of crc32_zlib_hardware which will
never get called but is just there to satisfy the reference. It would have
an assert to ensure it is never called. This is much like the existing
dummy crc32_hardware at the very end.

static uint32_t crc32c_hardware(uint32_t crc, const uint8_t* data, size_t
length) {
  // never called!
  assert(0  hardware crc called on an unsupported platform);
  return 0;
}

- Add an additional variable cached_cpu_supports_crc32_zlib to complement
the existing cached_cpu_supports_crc32 variable.

- In the init section for x86

 void __attribute__ ((constructor)) init_cpu_support_flag(void) {
  uint32_t ecx = cpuid(CPUID_FEATURES);
  cached_cpu_supports_crc32 = ecx  SSE42_FEATURE_BIT;
}

add an initialisation

  cached_cpu_supports_crc32_zlib = 0;

(alternatively this could be left out and rely on the static initialisation
of 0).

- In the init section for aarch64

 void __attribute__ ((constructor)) init_cpu_support_flag(void) {
  unsigned long auxv = getauxval(AT_HWCAP);
  cached_cpu_supports_crc32 = auxv  HWCAP_CRC32;
}

add an initialisation

  cached_cpu_supports_crc32_zlib = cached_cpu_supports_crc32;

(if CRC is supported on aarch64 both variants must be supported).

- Then instead of

case CRC32_ZLIB_POLYNOMIAL:
  crc_update_func = crc32_zlib_sb8;
#ifdef __aarch64__
  if (likely(cached_cpu_supports_crc32))
crc_update_func = crc32_zlib_hardware;
#endif

we do

case CRC32_ZLIB_POLYNOMIAL:
  crc_update_func = crc32_zlib_sb8;
  if (likely(cached_cpu_supports_crc32_zlib))
crc_update_func = crc32_zlib_hardware;

This would have the advantage that there is a generic framework for
architectures to add support for 1 or both (or none) HW support for CRC
without additional #ifdefs.

So this would get rid of 2 #ifdef __aarch64__ statements leaving just the
#ifdef around the USE_PIPELINE and I do feel that adding a significant
amout of code to the aarch64 side to implement a pipelined version which
offers no performance advantages would be undesirable.

But I'm happy to do whichever you feel is best. Just let me know.

All the best,
Ed.





 Add support for hardware crc on ARM aarch64 architecture
 

 Key: HADOOP-11660
 URL: https://issues.apache.org/jira/browse/HADOOP-11660
 Project: Hadoop Common
  Issue Type: Improvement
  Components: native
Affects Versions: 3.0.0
 Environment: ARM aarch64 development platform
Reporter: Edward Nevill
Assignee: Edward Nevill
  

[jira] [Commented] (HADOOP-11660) Add support for hardware crc on ARM aarch64 architecture

2015-03-03 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-11660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14345490#comment-14345490
 ] 

Hadoop QA commented on HADOOP-11660:


{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12702205/jira-11660.patch
  against trunk revision 4228de9.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-common-project/hadoop-common.

Test results: 
https://builds.apache.org/job/PreCommit-HADOOP-Build/5833//testReport/
Console output: 
https://builds.apache.org/job/PreCommit-HADOOP-Build/5833//console

This message is automatically generated.

 Add support for hardware crc on ARM aarch64 architecture
 

 Key: HADOOP-11660
 URL: https://issues.apache.org/jira/browse/HADOOP-11660
 Project: Hadoop Common
  Issue Type: Improvement
  Components: native
Affects Versions: 3.0.0
 Environment: ARM aarch64 development platform
Reporter: Edward Nevill
Assignee: Edward Nevill
Priority: Minor
  Labels: performance
 Fix For: 3.0.0

 Attachments: jira-11660.patch

   Original Estimate: 48h
  Remaining Estimate: 48h

 This patch adds support for hardware crc for ARM's new 64 bit architecture
 The patch is completely conditionalized on __aarch64__
 I have only added support for the non pipelined version as I benchmarked the 
 pipelined version on aarch64 and it showed no performance improvement.
 The aarch64 version supports both Castagnoli and Zlib CRCs as both of these 
 are supported on ARM aarch64 hardwre.
 To benchmark this I modified the test_bulk_crc32 test to print out the time 
 taken to CRC a 1MB dataset 1000 times.
 Before:
 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55
 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55
 After:
 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57
 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57
 So this represents a 5X performance improvement on raw CRC calculation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HADOOP-11660) Add support for hardware crc on ARM aarch64 architecture

2015-03-03 Thread Allen Wittenauer (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-11660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14345601#comment-14345601
 ] 

Allen Wittenauer commented on HADOOP-11660:
---

Please stop setting the fix version. That is set at commit time to reflect 
which version this change was committed to.

Thanks.

 Add support for hardware crc on ARM aarch64 architecture
 

 Key: HADOOP-11660
 URL: https://issues.apache.org/jira/browse/HADOOP-11660
 Project: Hadoop Common
  Issue Type: Improvement
  Components: native
Affects Versions: 3.0.0
 Environment: ARM aarch64 development platform
Reporter: Edward Nevill
Assignee: Edward Nevill
Priority: Minor
  Labels: performance
 Attachments: jira-11660.patch

   Original Estimate: 48h
  Remaining Estimate: 48h

 This patch adds support for hardware crc for ARM's new 64 bit architecture
 The patch is completely conditionalized on __aarch64__
 I have only added support for the non pipelined version as I benchmarked the 
 pipelined version on aarch64 and it showed no performance improvement.
 The aarch64 version supports both Castagnoli and Zlib CRCs as both of these 
 are supported on ARM aarch64 hardwre.
 To benchmark this I modified the test_bulk_crc32 test to print out the time 
 taken to CRC a 1MB dataset 1000 times.
 Before:
 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55
 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55
 After:
 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57
 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57
 So this represents a 5X performance improvement on raw CRC calculation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HADOOP-11660) Add support for hardware crc on ARM aarch64 architecture

2015-03-02 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-11660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14343546#comment-14343546
 ] 

Colin Patrick McCabe commented on HADOOP-11660:
---

Hi [~enevill], I don't see the patch?  Also, it seems like this should be in 
branch-2 as well as trunk. 

 Add support for hardware crc on ARM aarch64 architecture
 

 Key: HADOOP-11660
 URL: https://issues.apache.org/jira/browse/HADOOP-11660
 Project: Hadoop Common
  Issue Type: Improvement
  Components: native
Affects Versions: 3.0.0
 Environment: ARM aarch64 development platform
Reporter: Edward Nevill
Assignee: Edward Nevill
Priority: Minor
  Labels: performance
   Original Estimate: 48h
  Remaining Estimate: 48h

 This patch adds support for hardware crc for ARM's new 64 bit architecture
 The patch is completely conditionalized on __aarch64__
 I have only added support for the non pipelined version as I benchmarked the 
 pipelined version on aarch64 and it showed no performance improvement.
 The aarch64 version supports both Castagnoli and Zlib CRCs as both of these 
 are supported on ARM aarch64 hardwre.
 To benchmark this I modified the test_bulk_crc32 test to print out the time 
 taken to CRC a 1MB dataset 1000 times.
 Before:
 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55
 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55
 After:
 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57
 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57
 So this represents a 5X performance improvement on raw CRC calculation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HADOOP-11660) Add support for hardware crc on ARM aarch64 architecture

2015-03-02 Thread Edward Nevill (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-11660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14343633#comment-14343633
 ] 

Edward Nevill commented on HADOOP-11660:


Hi Colin,

The lab went down just as I was preparing the patch. I'll attach it as soon
as the lab is back up.

All the best,
Ed.

On 2 March 2015 at 18:34, Colin Patrick McCabe (JIRA) j...@apache.org



 Add support for hardware crc on ARM aarch64 architecture
 

 Key: HADOOP-11660
 URL: https://issues.apache.org/jira/browse/HADOOP-11660
 Project: Hadoop Common
  Issue Type: Improvement
  Components: native
Affects Versions: 3.0.0
 Environment: ARM aarch64 development platform
Reporter: Edward Nevill
Assignee: Edward Nevill
Priority: Minor
  Labels: performance
   Original Estimate: 48h
  Remaining Estimate: 48h

 This patch adds support for hardware crc for ARM's new 64 bit architecture
 The patch is completely conditionalized on __aarch64__
 I have only added support for the non pipelined version as I benchmarked the 
 pipelined version on aarch64 and it showed no performance improvement.
 The aarch64 version supports both Castagnoli and Zlib CRCs as both of these 
 are supported on ARM aarch64 hardwre.
 To benchmark this I modified the test_bulk_crc32 test to print out the time 
 taken to CRC a 1MB dataset 1000 times.
 Before:
 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55
 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55
 After:
 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57
 CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57
 So this represents a 5X performance improvement on raw CRC calculation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)