[jira] [Commented] (HADOOP-10641) Introduce Coordination Engine
[ https://issues.apache.org/jira/browse/HADOOP-10641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14027444#comment-14027444 ] Aaron T. Myers commented on HADOOP-10641:
bq. I just want to make sure we are on the same page here. The intent of this jira is not to solve the general problem of distributed consensus. That is, I do not propose to build an implementation of paxos or other coordination algorithms here. This is only to introduce a common interface, so that real implementations such as ZooKeeper could be plugged into hadoop projects.
Totally get that, but I think the point still remains that there's little expertise in this project for defining a common interface for coordination engines in general, and no real reason that the Hadoop project should necessarily be the place where that interface is defined. The ZooKeeper project, a ZK sub-project, or an entirely new TLP makes more sense to me.
> Introduce Coordination Engine
> -----------------------------
>
> Key: HADOOP-10641
> URL: https://issues.apache.org/jira/browse/HADOOP-10641
> Project: Hadoop Common
> Issue Type: New Feature
> Affects Versions: 3.0.0
> Reporter: Konstantin Shvachko
> Assignee: Plamen Jeliazkov
> Attachments: HADOOP-10641.patch, HADOOP-10641.patch, HADOOP-10641.patch
>
> Coordination Engine (CE) is a system that allows the processes of a distributed system to agree on a sequence of events. In order to be reliable, the CE should itself be distributed.
> A Coordination Engine can be based on different algorithms (paxos, raft, 2PC, zab) and have different implementations, depending on use cases and on reliability, availability, and performance requirements.
> The CE should have a common API, so that it can serve as a pluggable component in different projects. The immediate beneficiaries are HDFS (HDFS-6469) and HBase (HBASE-10909).
> The first implementation is proposed to be based on ZooKeeper.
-- This message was sent by Atlassian JIRA (v6.2#6252)
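To make the "common interface" idea concrete, here is a minimal sketch of what a pluggable coordination API could look like. All names below are invented for illustration and are not taken from the HADOOP-10641 patch; the single-node implementation stands in for a real, distributed engine (e.g. one backed by ZooKeeper), which would reach agreement across nodes before delivering proposals.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a pluggable Coordination Engine agrees on a global
// order of proposals and delivers them to listeners in that order.
interface CoordinationEngine {
    /** Callback invoked once per agreed proposal, in the agreed global order. */
    interface AgreementListener {
        void proposalAgreed(long globalSequenceNumber, byte[] proposal);
    }

    void registerListener(AgreementListener listener);

    /** Submit a proposal; the engine decides its position in the global order. */
    void submitProposal(byte[] proposal);
}

// Single-node stand-in: trivially "agrees" by assigning the next sequence
// number. A real implementation would coordinate with its peers first.
class LocalCoordinationEngine implements CoordinationEngine {
    private final List<AgreementListener> listeners = new ArrayList<>();
    private long nextSeq = 1;

    public void registerListener(AgreementListener listener) {
        listeners.add(listener);
    }

    public synchronized void submitProposal(byte[] proposal) {
        long seq = nextSeq++;
        for (AgreementListener l : listeners) {
            l.proposalAgreed(seq, proposal);
        }
    }
}
```

The point of such an interface is exactly the one debated above: callers (HDFS, HBase) depend only on the agreement contract, not on which consensus algorithm sits behind it.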
[jira] [Updated] (HADOOP-10674) Rewrite the PureJavaCrc32 loop for performance improvement
[ https://issues.apache.org/jira/browse/HADOOP-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz Wo Nicholas Sze updated HADOOP-10674:
Attachment: c10674_20140610.patch
c10674_20140610.patch:
- performance improvement from 45% to 60%;
- using java.util.zip.CRC32 for Java 7 or above.
> Rewrite the PureJavaCrc32 loop for performance improvement
> ----------------------------------------------------------
>
> Key: HADOOP-10674
> URL: https://issues.apache.org/jira/browse/HADOOP-10674
> Project: Hadoop Common
> Issue Type: Improvement
> Components: performance, util
> Reporter: Tsz Wo Nicholas Sze
> Assignee: Tsz Wo Nicholas Sze
> Attachments: c10674_20140609.patch, c10674_20140609b.patch, c10674_20140610.patch
>
> Below are some performance improvement opportunities in PureJavaCrc32:
> - eliminate "off += 8; len -= 8;"
> - replace T8_x_start with hard-coded constants
> - eliminate the c0 - c7 local variables
> On my machine, there is a 30% to 50% improvement in most of the cases.
-- This message was sent by Atlassian JIRA (v6.2#6252)
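The patch itself is not reproduced in this thread, but the bullets above describe the classic table-driven "slicing" technique. A minimal sketch of the idea, processing 4 bytes per table-lookup step rather than the 8 the bullets imply (names and structure are illustrative, not the actual PureJavaCrc32 code):

```java
import java.util.zip.CRC32;

public class SlicedCrc32 {
    private static final int[][] T = new int[4][256];
    static {
        // T[0] is the classic byte-at-a-time table for the reflected
        // CRC-32 polynomial 0xEDB88320.
        for (int i = 0; i < 256; i++) {
            int c = i;
            for (int j = 0; j < 8; j++) {
                c = (c >>> 1) ^ ((c & 1) != 0 ? 0xEDB88320 : 0);
            }
            T[0][i] = c;
        }
        // T[k][i] gives the CRC contribution of byte i when it sits k
        // positions deeper in the stream, enabling multi-byte steps.
        for (int k = 1; k < 4; k++) {
            for (int i = 0; i < 256; i++) {
                T[k][i] = (T[k - 1][i] >>> 8) ^ T[0][T[k - 1][i] & 0xFF];
            }
        }
    }

    static int crc32(byte[] buf) {
        int crc = 0xFFFFFFFF;
        int off = 0, len = buf.length;
        // Main loop: fold four input bytes with four independent lookups.
        while (len >= 4) {
            crc ^= (buf[off] & 0xFF) | (buf[off + 1] & 0xFF) << 8
                 | (buf[off + 2] & 0xFF) << 16 | (buf[off + 3] & 0xFF) << 24;
            crc = T[3][crc & 0xFF] ^ T[2][(crc >>> 8) & 0xFF]
                ^ T[1][(crc >>> 16) & 0xFF] ^ T[0][crc >>> 24];
            off += 4;
            len -= 4;
        }
        // Tail: one byte at a time.
        while (len-- > 0) {
            crc = (crc >>> 8) ^ T[0][(crc ^ buf[off++]) & 0xFF];
        }
        return ~crc;
    }
}
```

The four lookups per step have no data dependency on each other, which is what lets superscalar CPUs overlap them; widening the step (as the patch's 8-byte unrolling does) extends the same idea.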
[jira] [Commented] (HADOOP-10674) Rewrite the PureJavaCrc32 loop for performance improvement
[ https://issues.apache.org/jira/browse/HADOOP-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14027352#comment-14027352 ] Tsz Wo Nicholas Sze commented on HADOOP-10674:
A little more improvement.
java.version = 1.6.0_65
java.runtime.name = Java(TM) SE Runtime Environment
java.runtime.version = 1.6.0_65-b14-462-11M4609
java.vm.version = 20.65-b04-462
java.vm.vendor = Apple Inc.
java.vm.name = Java HotSpot(TM) 64-Bit Server VM
java.vm.specification.version = 1.0
java.specification.version = 1.6
os.arch = x86_64
os.name = Mac OS X
os.version = 10.9.3
Performance Table (the unit is MB/sec; the first % diff column compares PureJavaCrc32 to CRC32, and the last two compare PureJavaCrc32new to CRC32 and to PureJavaCrc32, respectively)
|| Num Bytes || CRC32 || PureJavaCrc32 || % diff || PureJavaCrc32new || % diff || % diff ||
| 1 | 17.368 | 174.187 | 902.9% | 173.268 | 897.6% | -0.5% |
| 2 | 34.361 | 281.842 | 720.2% | 275.534 | 701.9% | -2.2% |
| 4 | 65.416 | 329.511 | 403.7% | 324.046 | 395.4% | -1.7% |
| 8 | 111.836 | 624.884 | 458.7% | 674.412 | 503.0% | 7.9% |
| 16 | 177.960 | 767.225 | 331.1% | 954.177 | 436.2% | 24.4% |
| 32 | 243.528 | 926.455 | 280.4% | 1170.222 | 380.5% | 26.3% |
| 64 | 309.750 | 1039.408 | 235.6% | 1453.092 | 369.1% | 39.8% |
| 128 | 359.060 | 1106.300 | 208.1% | 1555.267 | 333.1% | 40.6% |
| 256 | 384.203 | 1128.191 | 193.6% | 1619.925 | 321.6% | 43.6% |
| 512 | 401.706 | 1108.321 | 175.9% | 1683.524 | 319.1% | 51.9% |
| 1024 | 409.730 | 1191.740 | 190.9% | 1755.902 | 328.6% | 47.3% |
| 2048 | 410.262 | 1175.336 | 186.5% | 1786.138 | 335.4% | 52.0% |
| 4096 | 417.109 | 1145.619 | 174.7% | 1768.909 | 324.1% | 54.4% |
| 8192 | 409.864 | 1138.061 | 177.7% | 1810.518 | 341.7% | 59.1% |
| 16384 | 411.105 | 1072.341 | 160.8% | 1750.499 | 325.8% | 63.2% |
| 32768 | 418.411 | 1176.763 | 181.2% | 1790.886 | 328.0% | 52.2% |
| 65536 | 413.055 | 1143.868 | 176.9% | 1792.416 | 333.9% | 56.7% |
| 131072 | 418.510 | 1053.030 | 151.6% | 1790.235 | 327.8% | 70.0% |
| 262144 | 412.248 | 1185.558 | 187.6% | 1800.560 | 336.8% | 51.9% |
| 524288 | 417.332 | 1190.188 | 185.2% | 1812.133 | 334.2% | 52.3% |
| 1048576 | 414.104 | 1119.253 | 170.3% | 1755.396 | 323.9% | 56.8% |
| 2097152 | 419.225 | 1187.693 | 183.3% | 1847.922 | 340.8% | 55.6% |
| 4194304 | 418.692 | 1171.539 | 179.8% | 1787.660 | 327.0% | 52.6% |
| 8388608 | 412.950 | 1159.336 | 180.7% | 1688.320 | 308.8% | 45.6% |
| 16777216 | 416.055 | 1199.445 | 188.3% | 1727.302 | 315.2% | 44.0% |
> Rewrite the PureJavaCrc32 loop for performance improvement
> ----------------------------------------------------------
>
> Key: HADOOP-10674
> URL: https://issues.apache.org/jira/browse/HADOOP-10674
> Project: Hadoop Common
> Issue Type: Improvement
> Components: performance, util
> Reporter: Tsz Wo Nicholas Sze
> Assignee: Tsz Wo Nicholas Sze
> Attachments: c10674_20140609.patch, c10674_20140609b.patch
>
> Below are some performance improvement opportunities in PureJavaCrc32:
> - eliminate "off += 8; len -= 8;"
> - replace T8_x_start with hard-coded constants
> - eliminate the c0 - c7 local variables
> On my machine, there is a 30% to 50% improvement in most of the cases.
-- This message was sent by Atlassian JIRA (v6.2#6252)
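Throughput numbers like those in the table above come from checksumming a buffer repeatedly and dividing bytes processed by elapsed time. The actual Hadoop benchmark is not shown in this thread; a minimal stand-in harness (measuring only java.util.zip.CRC32) might look like:

```java
import java.util.Random;
import java.util.zip.CRC32;

public class CrcBench {
    // Returns throughput in MB/sec for checksumming `data` `trials` times.
    static double mbPerSec(byte[] data, int trials) {
        CRC32 crc = new CRC32();
        long start = System.nanoTime();
        for (int i = 0; i < trials; i++) {
            crc.reset();
            crc.update(data, 0, data.length);
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        return (double) data.length * trials / (1024.0 * 1024.0) / seconds;
    }

    public static void main(String[] args) {
        byte[] data = new byte[65536];
        new Random(0).nextBytes(data);
        // Warm up so the JIT compiles the hot loop before we measure.
        mbPerSec(data, 1000);
        System.out.printf("java.util.zip.CRC32: %.1f MB/sec%n", mbPerSec(data, 2000));
    }
}
```

As the table shows, results vary strongly with buffer size: small buffers are dominated by per-call overhead (where native CRC32 loses badly to pure Java), while large buffers measure the steady-state loop.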
[jira] [Commented] (HADOOP-10389) Native RPCv9 client
[ https://issues.apache.org/jira/browse/HADOOP-10389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14027346#comment-14027346 ] Haohui Mai commented on HADOOP-10389:
{quote}
Currently, the libraries we depend on are: libuv, for portability primitives, protobuf-c, for protobuf functionality, expat, for XML parsing, and liburiparser, for parsing URIs. None of that functionality is provided by the C++ standard library, so your statement is false.
A lot of this code is not new. For example, we were using tree.h (which implements splay trees and rb trees) previously in libhdfs. The maintenance burden was not high. In fact, it was zero, because we never had to fix a bug in tree.h. So once again, your statement is just false.
bq. htable.c got a review because it is new code. I would hardly call reviewing new code a "maintenance burden." And anyway, there is a standard C way to use hash tables... the hcreate_r, hsearch_r, and hdestroy functions. We would like to use the standard way, but Windows doesn't implement these functions.
{quote}
I fail to understand what point you're trying to make. My point is that you can write much less code in a modern language with better standard libraries, which makes things much easier to review and maintain. For example, when you're working on trunk, how many times have you had to put up a 200kb patch like the one in this jira? How many big patches are there in this feature branch? Please be considerate of the reviewers of the patch.
{quote}
Firstly, the challenge of maintaining a consistent C++ coding style is very, very large. ... For example, exceptions harm performance... C++ library APIs have binary compatibility issues
{quote}
Arguably you can implement what you want in C++ and C equally well. Coding styles and performance can be a problem. However, before any of them I'm much more concerned about the correctness of the current code.
For example, I'm seeing that the code allocates {{hadoop_err}} on the common paths, and it has to clean it up on all error paths. I'm also seeing many calls to {{strcpy()}}, as well as calls to {{*printf()}} with non-constant format strings. My questions are: (1) is the code free of memory leaks, buffer overflows, and format-string overflows? (2) does the code always pass function pointers of the correct type? I'm perfectly happy to +1 your patches as long as you can show your code is indeed free of these common defects. Given the amount of code in the branch, it is worth looking at this at some point, rather than waiting until a merge vote is called.
> Native RPCv9 client
> -------------------
>
> Key: HADOOP-10389
> URL: https://issues.apache.org/jira/browse/HADOOP-10389
> Project: Hadoop Common
> Issue Type: Sub-task
> Affects Versions: HADOOP-10388
> Reporter: Binglin Chang
> Assignee: Colin Patrick McCabe
> Attachments: HADOOP-10388.001.patch, HADOOP-10389.002.patch, HADOOP-10389.004.patch, HADOOP-10389.005.patch
>
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HADOOP-10640) Implement Namenode RPCs in HDFS native client
[ https://issues.apache.org/jira/browse/HADOOP-10640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HADOOP-10640: -- Attachment: HADOOP-10640-pnative.004.patch added the comments I discussed above > Implement Namenode RPCs in HDFS native client > - > > Key: HADOOP-10640 > URL: https://issues.apache.org/jira/browse/HADOOP-10640 > Project: Hadoop Common > Issue Type: Sub-task > Components: native >Affects Versions: HADOOP-10388 >Reporter: Colin Patrick McCabe >Assignee: Colin Patrick McCabe > Attachments: HADOOP-10640-pnative.001.patch, > HADOOP-10640-pnative.002.patch, HADOOP-10640-pnative.003.patch, > HADOOP-10640-pnative.004.patch > > > Implement the parts of libhdfs that just involve making RPCs to the Namenode, > such as mkdir, rename, etc. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HADOOP-10389) Native RPCv9 client
[ https://issues.apache.org/jira/browse/HADOOP-10389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14027288#comment-14027288 ] Colin Patrick McCabe commented on HADOOP-10389: --- bq. What make me concerned is that the code has to bring in a lot more dependency in plain C, which has a high cost on maintenance Currently, the libraries we depend on are: {{libuv}}, for portability primitives, {{protobuf-c}}, for protobuf functionality, {{expat}}, for XML parsing, and {{liburiparser}}, for parsing URIs. None of that functionality is provided by the C++ standard library, so your statement is false. bq. For example, this patch at least contains implementation of linked list, splay tress, hash tables, and rb trees. There are a lot of overheads on implementing, reviewing and testing the code. A lot of this code is not new. For example, we were using {{tree.h}} (which implements splay trees and rb trees), previously in libhdfs. The maintenance burden was not high. In fact, it was zero, because we never had to fix a bug in {{tree.h}}. So once again, your statement is just false. {{htable.c}} got a review because it is new code. I would hardly call reviewing new code a "maintenance burden." And anyway, there is a standard C way to use hash tables... the {{hcreate_r}}, {{hsearch_r}}, and {{hdestroy}} functions. We would like to use the standard way, but Windows doesn't implement these functions. bq. For example, do you considering supporting filenames in unicode? That way I think libicu might need to be brought into the picture. First of all, the question of whether we should use libicu is independent of the question of whether we should use C\+\+. libicu has a C interface, and the standard C\+\+ libraries and runtime don't provide any unicode functionality beyond what the standard C libraries provide. Second of all, I see no reason to use libicu. All the strings we are dealing with are UTF-8 supplied to and from protobuf. 
This means that they are null-terminated and can be printed and handled with existing string functions. libicu might come into the picture if we wanted to start normalizing unicode strings or using wide character strings. But we don't need or want to do that.
bq. It looks to me that it is much more compelling to implement the code in a more modern language, say, c++11, where much of the headache right now is taken away by a mature standard library.
C++ first came on the scene in 1983. That is 31 years ago. C++ may be a lot of things, but "modern" isn't one of them. I was a C++ programmer for 10 years. I know the language about as well as anyone can. I specifically chose C for this project because of a few things. Firstly, the challenge of maintaining a consistent C++ coding style is very, very large. This is true even when everyone is a professional C++ programmer working under the same roof. For a project like Hadoop, where C/C++ is not everyone's first language, the challenge is just unsupportable. The C++ learning curve is just much higher than C's. You have to know everything you have to know for C, plus a lot of very tricky things that are unique to C++. There are a lot of contentious issues in the community: use exceptions, or don't use exceptions? Use global constructors, or don't use global constructors? Use boost, or don't use boost? Use C++0x / C++11 / C++14 or use some older standard? Use runtime type information ({{dynamic_cast}}, {{typeid}}), or don't use runtime type information? Operator overloading, or no operator overloading? There are reasonable arguments for each of these positions. For example, exceptions harm performance because of the need to maintain data to do stack unwinding (see http://preshing.com/20110807/the-cost-of-enabling-exception-handling/), and that's just if you don't use them... if you do use them, exceptions turn out to be a lot slower than return codes. They also can make code difficult to follow.
C++ doesn't have checked exceptions, so you can never really know what any function will throw. For this reason, some fairly smart people at Google have decided to ban exceptions from their coding standard. This, in turn, means that it's difficult for libraries to throw exceptions, since open source projects using the Google coding standard (and there are a lot of them) can't deal with exceptions. Of course, without exceptions, certain things in C++ are very hard to do. (By the way, I'm not interested in having the argument for/against exceptions here, just in noting that there is huge fragmentation here and reasonable people on both sides.) A similar story could be told about all the other choices. The net effect is that we have to police a very large set of arbitrary style decisions that just wouldn't come up at all if we just used C. C\+\+ library APIs have binary compatibility issues.
[jira] [Commented] (HADOOP-10656) The password keystore file is not picked by LDAP group mapping
[ https://issues.apache.org/jira/browse/HADOOP-10656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14027275#comment-14027275 ] Hadoop QA commented on HADOOP-10656: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12649705/HADOOP-10656.002.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-common-project/hadoop-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/4043//testReport/ Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/4043//console This message is automatically generated. 
> The password keystore file is not picked by LDAP group mapping > -- > > Key: HADOOP-10656 > URL: https://issues.apache.org/jira/browse/HADOOP-10656 > Project: Hadoop Common > Issue Type: Bug > Components: security >Affects Versions: 2.2.0 >Reporter: Brandon Li >Assignee: Brandon Li > Attachments: HADOOP-10656.002.patch, HADOOP-10656.patch > > > The user configured password file(LDAP_KEYSTORE_PASSWORD_FILE_KEY) will not > be picked by LdapGroupsMapping: > In setConf(): > {noformat} > keystorePass = > conf.get(LDAP_KEYSTORE_PASSWORD_KEY, LDAP_KEYSTORE_PASSWORD_DEFAULT); > if (keystorePass.isEmpty()) { > keystorePass = extractPassword( > conf.get(LDAP_KEYSTORE_PASSWORD_KEY, > LDAP_KEYSTORE_PASSWORD_DEFAULT)); > } > {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
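The snippet in the description above shows the bug: when the direct password key is empty, the fallback re-reads LDAP_KEYSTORE_PASSWORD_KEY instead of the *file* key, so the configured password file is never consulted. A sketch of the corrected lookup, with java.util.Properties standing in for Hadoop's Configuration and the key values/helpers simplified for illustration (they are not copied from the patch):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Properties;

public class LdapPasswordLookup {
    // Key names mirror the constants in the description; values are illustrative.
    static final String LDAP_KEYSTORE_PASSWORD_KEY =
        "hadoop.security.group.mapping.ldap.ssl.keystore.password";
    static final String LDAP_KEYSTORE_PASSWORD_FILE_KEY =
        LDAP_KEYSTORE_PASSWORD_KEY + ".file";

    // Simplified stand-in for LdapGroupsMapping#extractPassword.
    static String extractPassword(String pwFile) throws IOException {
        if (pwFile == null || pwFile.isEmpty()) {
            return "";
        }
        BufferedReader reader = new BufferedReader(new FileReader(pwFile));
        try {
            String line = reader.readLine();
            return line == null ? "" : line.trim();
        } finally {
            reader.close();
        }
    }

    static String keystorePassword(Properties conf) throws IOException {
        String keystorePass = conf.getProperty(LDAP_KEYSTORE_PASSWORD_KEY, "");
        if (keystorePass.isEmpty()) {
            // Fix: fall back to the password *file* key, not the password key again.
            keystorePass = extractPassword(
                conf.getProperty(LDAP_KEYSTORE_PASSWORD_FILE_KEY, ""));
        }
        return keystorePass;
    }
}
```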
[jira] [Commented] (HADOOP-10376) Refactor refresh*Protocols into a single generic refreshConfigProtocol
[ https://issues.apache.org/jira/browse/HADOOP-10376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14027271#comment-14027271 ] Hadoop QA commented on HADOOP-10376: {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12649678/HADOOP-10376.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/4041//testReport/ Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/4041//console This message is automatically generated. 
> Refactor refresh*Protocols into a single generic refreshConfigProtocol > -- > > Key: HADOOP-10376 > URL: https://issues.apache.org/jira/browse/HADOOP-10376 > Project: Hadoop Common > Issue Type: Improvement >Reporter: Chris Li >Assignee: Chris Li >Priority: Minor > Attachments: HADOOP-10376.patch, HADOOP-10376.patch, > HADOOP-10376.patch, HADOOP-10376.patch, RefreshFrameworkProposal.pdf > > > See https://issues.apache.org/jira/browse/HADOOP-10285 > There are starting to be too many refresh*Protocols We can refactor them to > use a single protocol with a variable payload to choose what to do. > Thereafter, we can return an indication of success or failure. -- This message was sent by Atlassian JIRA (v6.2#6252)
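The "single protocol with a variable payload" idea can be sketched as a registry that dispatches a generic refresh call to whichever handler registered the given identifier. Names below are invented for illustration and are not taken from the HADOOP-10376 patch:

```java
import java.util.HashMap;
import java.util.Map;

// One generic refresh RPC carries an identifier plus arguments; a registry
// routes it to the matching handler instead of one protocol per refresh type.
interface RefreshHandler {
    /** Returns a human-readable status; throws on failure. */
    String handleRefresh(String identifier, String[] args);
}

class RefreshRegistry {
    private final Map<String, RefreshHandler> handlers = new HashMap<>();

    void register(String identifier, RefreshHandler handler) {
        handlers.put(identifier, handler);
    }

    String dispatch(String identifier, String[] args) {
        RefreshHandler h = handlers.get(identifier);
        if (h == null) {
            throw new IllegalArgumentException("No handler registered for: " + identifier);
        }
        return h.handleRefresh(identifier, args);
    }
}
```

Returning a status string (or throwing) gives callers the "indication of success or failure" the description asks for.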
[jira] [Updated] (HADOOP-10656) The password keystore file is not picked by LDAP group mapping
[ https://issues.apache.org/jira/browse/HADOOP-10656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Nauroth updated HADOOP-10656:
Hadoop Flags: Reviewed
{code}
    } finally {
      if (reader != null)
        try {
          reader.close();
        } catch (IOException e) {
          LOG.warn("Could not close password file: " + pwFile, e);
        }
    }
{code}
Minor nit-pick: in the above, can you please add curly braces after the if statement, just to clarify the nesting? Alternatively, you could replace that code segment with {{IOUtils#cleanup}}. +1 for the patch after that small change. Thank you again, Brandon!
> The password keystore file is not picked by LDAP group mapping
> --------------------------------------------------------------
>
> Key: HADOOP-10656
> URL: https://issues.apache.org/jira/browse/HADOOP-10656
> Project: Hadoop Common
> Issue Type: Bug
> Components: security
> Affects Versions: 2.2.0
> Reporter: Brandon Li
> Assignee: Brandon Li
> Attachments: HADOOP-10656.002.patch, HADOOP-10656.patch
>
> The user configured password file(LDAP_KEYSTORE_PASSWORD_FILE_KEY) will not be picked by LdapGroupsMapping:
> In setConf():
> {noformat}
> keystorePass =
> conf.get(LDAP_KEYSTORE_PASSWORD_KEY, LDAP_KEYSTORE_PASSWORD_DEFAULT);
> if (keystorePass.isEmpty()) {
> keystorePass = extractPassword(
> conf.get(LDAP_KEYSTORE_PASSWORD_KEY,
> LDAP_KEYSTORE_PASSWORD_DEFAULT));
> }
> {noformat}
-- This message was sent by Atlassian JIRA (v6.2#6252)
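The {{IOUtils#cleanup}} alternative mentioned above closes any number of streams and logs (rather than propagates) close failures. A self-contained sketch of that pattern, with a plain helper standing in for Hadoop's IOUtils and a simplified password reader standing in for the real method:

```java
import java.io.Closeable;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;

public class CloseQuietly {
    // Stand-in for Hadoop's IOUtils#cleanup: close each non-null Closeable,
    // logging instead of propagating any IOException from close().
    static void cleanup(Closeable... closeables) {
        for (Closeable c : closeables) {
            if (c != null) {
                try {
                    c.close();
                } catch (IOException e) {
                    System.err.println("Could not close: " + e.getMessage());
                }
            }
        }
    }

    static String readPassword(File pwFile) throws IOException {
        Reader reader = null;
        try {
            reader = new FileReader(pwFile);
            StringBuilder sb = new StringBuilder();
            int ch;
            while ((ch = reader.read()) != -1) {
                sb.append((char) ch);
            }
            return sb.toString().trim();
        } finally {
            cleanup(reader); // closed on every path; no ambiguous if-nesting
        }
    }
}
```

Putting the close in the helper both fixes the descriptor leak (the reader is closed even if read() throws) and sidesteps the curly-brace nit entirely.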
[jira] [Commented] (HADOOP-6350) Documenting Hadoop metrics
[ https://issues.apache.org/jira/browse/HADOOP-6350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14027235#comment-14027235 ] Hadoop QA commented on HADOOP-6350: --- {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12649691/HADOOP-6350.8.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+0 tests included{color}. The patch appears to be a documentation patch that doesn't require tests. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-common-project/hadoop-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/4042//testReport/ Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/4042//console This message is automatically generated. 
> Documenting Hadoop metrics > -- > > Key: HADOOP-6350 > URL: https://issues.apache.org/jira/browse/HADOOP-6350 > Project: Hadoop Common > Issue Type: Improvement > Components: documentation, metrics >Affects Versions: 3.0.0, 2.1.0-beta >Reporter: Hong Tang >Assignee: Akira AJISAKA > Labels: metrics > Attachments: HADOOP-6350-sample-1.patch, HADOOP-6350-sample-2.patch, > HADOOP-6350-sample-3.patch, HADOOP-6350.4.patch, HADOOP-6350.5.patch, > HADOOP-6350.6.patch, HADOOP-6350.7.patch, HADOOP-6350.8.patch, sample1.png > > > Metrics should be part of public API, and should be clearly documented > similar to HADOOP-5073, so that we can reliably build tools on top of them. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HADOOP-10656) The password keystore file is not picked by LDAP group mapping
[ https://issues.apache.org/jira/browse/HADOOP-10656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14027234#comment-14027234 ] Brandon Li commented on HADOOP-10656: - Thanks for the review, [~cnauroth]. Let's use this JIRA to fix both issues. I've uploaded a new patch. > The password keystore file is not picked by LDAP group mapping > -- > > Key: HADOOP-10656 > URL: https://issues.apache.org/jira/browse/HADOOP-10656 > Project: Hadoop Common > Issue Type: Bug > Components: security >Affects Versions: 2.2.0 >Reporter: Brandon Li >Assignee: Brandon Li > Attachments: HADOOP-10656.002.patch, HADOOP-10656.patch > > > The user configured password file(LDAP_KEYSTORE_PASSWORD_FILE_KEY) will not > be picked by LdapGroupsMapping: > In setConf(): > {noformat} > keystorePass = > conf.get(LDAP_KEYSTORE_PASSWORD_KEY, LDAP_KEYSTORE_PASSWORD_DEFAULT); > if (keystorePass.isEmpty()) { > keystorePass = extractPassword( > conf.get(LDAP_KEYSTORE_PASSWORD_KEY, > LDAP_KEYSTORE_PASSWORD_DEFAULT)); > } > {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HADOOP-10656) The password keystore file is not picked by LDAP group mapping
[ https://issues.apache.org/jira/browse/HADOOP-10656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brandon Li updated HADOOP-10656: Attachment: HADOOP-10656.002.patch > The password keystore file is not picked by LDAP group mapping > -- > > Key: HADOOP-10656 > URL: https://issues.apache.org/jira/browse/HADOOP-10656 > Project: Hadoop Common > Issue Type: Bug > Components: security >Affects Versions: 2.2.0 >Reporter: Brandon Li >Assignee: Brandon Li > Attachments: HADOOP-10656.002.patch, HADOOP-10656.patch > > > The user configured password file(LDAP_KEYSTORE_PASSWORD_FILE_KEY) will not > be picked by LdapGroupsMapping: > In setConf(): > {noformat} > keystorePass = > conf.get(LDAP_KEYSTORE_PASSWORD_KEY, LDAP_KEYSTORE_PASSWORD_DEFAULT); > if (keystorePass.isEmpty()) { > keystorePass = extractPassword( > conf.get(LDAP_KEYSTORE_PASSWORD_KEY, > LDAP_KEYSTORE_PASSWORD_DEFAULT)); > } > {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HADOOP-6350) Documenting Hadoop metrics
[ https://issues.apache.org/jira/browse/HADOOP-6350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira AJISAKA updated HADOOP-6350: -- Attachment: HADOOP-6350.8.patch Updated the patch to fix a typo and remove trailing whitespaces. > Documenting Hadoop metrics > -- > > Key: HADOOP-6350 > URL: https://issues.apache.org/jira/browse/HADOOP-6350 > Project: Hadoop Common > Issue Type: Improvement > Components: documentation, metrics >Affects Versions: 3.0.0, 2.1.0-beta >Reporter: Hong Tang >Assignee: Akira AJISAKA > Labels: metrics > Attachments: HADOOP-6350-sample-1.patch, HADOOP-6350-sample-2.patch, > HADOOP-6350-sample-3.patch, HADOOP-6350.4.patch, HADOOP-6350.5.patch, > HADOOP-6350.6.patch, HADOOP-6350.7.patch, HADOOP-6350.8.patch, sample1.png > > > Metrics should be part of public API, and should be clearly documented > similar to HADOOP-5073, so that we can reliably build tools on top of them. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HADOOP-6350) Documenting Hadoop metrics
[ https://issues.apache.org/jira/browse/HADOOP-6350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14027177#comment-14027177 ] Akira AJISAKA commented on HADOOP-6350:
bq. These are not actually metrics. These are not collected or sinked by MetricsSystem, so users cannot get them by file or ganglia. Users can get this information only by jmx/jconsole.
However, we should document this information as well. I'll create a separate jira for tracking this.
bq. As a separate discussion I think long-term maintenance of this documentation will be challenging.
I agree with you. If this document includes the information registered to MBeans (which can be accessed via jmx or jconsole), the maintenance will become even more challenging.
> Documenting Hadoop metrics
> --------------------------
>
> Key: HADOOP-6350
> URL: https://issues.apache.org/jira/browse/HADOOP-6350
> Project: Hadoop Common
> Issue Type: Improvement
> Components: documentation, metrics
> Affects Versions: 3.0.0, 2.1.0-beta
> Reporter: Hong Tang
> Assignee: Akira AJISAKA
> Labels: metrics
> Attachments: HADOOP-6350-sample-1.patch, HADOOP-6350-sample-2.patch, HADOOP-6350-sample-3.patch, HADOOP-6350.4.patch, HADOOP-6350.5.patch, HADOOP-6350.6.patch, HADOOP-6350.7.patch, sample1.png
>
> Metrics should be part of public API, and should be clearly documented similar to HADOOP-5073, so that we can reliably build tools on top of them.
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HADOOP-10656) The password keystore file is not picked by LDAP group mapping
[ https://issues.apache.org/jira/browse/HADOOP-10656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14027175#comment-14027175 ] Chris Nauroth commented on HADOOP-10656: Hi, [~brandonli]. Nice catch, and thank you for fixing it. This is not directly related to your patch, but I noticed that the {{LdapGroupsMapping#extractPassword}} method is susceptible to a file descriptor leak. If one of the {{Reader#read}} calls throws an {{IOException}}, then we won't close the {{Reader}}. Do you think we could fix this while we're in this class? I think we'd just need to move the {{Reader#close}} call into a finally block. > The password keystore file is not picked by LDAP group mapping > -- > > Key: HADOOP-10656 > URL: https://issues.apache.org/jira/browse/HADOOP-10656 > Project: Hadoop Common > Issue Type: Bug > Components: security >Affects Versions: 2.2.0 >Reporter: Brandon Li >Assignee: Brandon Li > Attachments: HADOOP-10656.patch > > > The user configured password file(LDAP_KEYSTORE_PASSWORD_FILE_KEY) will not > be picked by LdapGroupsMapping: > In setConf(): > {noformat} > keystorePass = > conf.get(LDAP_KEYSTORE_PASSWORD_KEY, LDAP_KEYSTORE_PASSWORD_DEFAULT); > if (keystorePass.isEmpty()) { > keystorePass = extractPassword( > conf.get(LDAP_KEYSTORE_PASSWORD_KEY, > LDAP_KEYSTORE_PASSWORD_DEFAULT)); > } > {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HADOOP-10679) Authorize webui access using ServiceAuthorizationManager
[ https://issues.apache.org/jira/browse/HADOOP-10679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benoy Antony updated HADOOP-10679: -- Description: Currently accessing Hadoop via RPC can be authorized using _ServiceAuthorizationManager_. But there is no uniform authorization of the HTTP access. Some of the servlets check for admin privilege. This creates an inconsistency of authorization between access via RPC vs HTTP. The fix is to enable authorization of the webui access also using _ServiceAuthorizationManager_. was: Currently accessing Hadoop via RPC can be authorized using _ServiceAuthorizationManager_. But there is no uniform authorization of the HTTP access. Some of the servlets check for admin privilege. This creates an inconsistency of authorization between access via RPC vs HTTP. The fix is to enable authorization of the webui access using _ServiceAuthorizationManager_. > Authorize webui access using ServiceAuthorizationManager > > > Key: HADOOP-10679 > URL: https://issues.apache.org/jira/browse/HADOOP-10679 > Project: Hadoop Common > Issue Type: Sub-task > Components: security >Reporter: Benoy Antony >Assignee: Benoy Antony > > Currently accessing Hadoop via RPC can be authorized using > _ServiceAuthorizationManager_. But there is no uniform authorization of the > HTTP access. Some of the servlets check for admin privilege. > This creates an inconsistency of authorization between access via RPC vs > HTTP. > The fix is to enable authorization of the webui access also using > _ServiceAuthorizationManager_. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HADOOP-6350) Documenting Hadoop metrics
[ https://issues.apache.org/jira/browse/HADOOP-6350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14027170#comment-14027170 ] Akira AJISAKA commented on HADOOP-6350: --- Thanks [~arpitagarwal] for the review! bq. I am not sure what this means - Each metrics record contains tags such as ProcessName, SessionId, and Hostname as additional information along with metrics.. How are these tags accessed, I don't see them in jconsole? Perhaps I am missing some basic knowledge, let me know if so. I can see them from jmx as follows: {code} "name" : "Hadoop:service=NameNode,name=FSNamesystem", "modelerType" : "FSNamesystem", "tag.Context" : "dfs", "tag.HAState" : "active", "tag.Hostname" : "trunk", "MissingBlocks" : 0, "ExpiredHeartbeats" : 0, ... {code} Metrics records contain tags for grouping on host/queue/username etc. {quote} Namenode - snapshot metrics are missing. DataNode - DataNodeInfo metrics are missing. DataNode - FsDatasetState metrics are missing. {quote} These are not actually metrics. These are not collected or sinked by {{MetricsSystem}}, so users cannot get them by file or ganglia. Users can get these information only by jmx/jconsole. bq. Nitpick: we should use title case consistently for sub-headings e.g. rpcdetail --> RpcDetailed The title shows the name of the Metrics, so it can be case inconsistent if the name is inconsistent. 
For example, the name "namenode" is set by the following code (NameNodeMetrics.java) : {code} final MetricsRegistry registry = new MetricsRegistry("namenode"); {code} > Documenting Hadoop metrics > -- > > Key: HADOOP-6350 > URL: https://issues.apache.org/jira/browse/HADOOP-6350 > Project: Hadoop Common > Issue Type: Improvement > Components: documentation, metrics >Affects Versions: 3.0.0, 2.1.0-beta >Reporter: Hong Tang >Assignee: Akira AJISAKA > Labels: metrics > Attachments: HADOOP-6350-sample-1.patch, HADOOP-6350-sample-2.patch, > HADOOP-6350-sample-3.patch, HADOOP-6350.4.patch, HADOOP-6350.5.patch, > HADOOP-6350.6.patch, HADOOP-6350.7.patch, sample1.png > > > Metrics should be part of public API, and should be clearly documented > similar to HADOOP-5073, so that we can reliably build tools on top of them. -- This message was sent by Atlassian JIRA (v6.2#6252)
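The tags in the JMX output quoted above are ordinary MBean attributes, so they can be read with the standard java.lang.management/javax.management API. A minimal sketch of that access pattern follows; the Hadoop ObjectName from the sample output requires a running NameNode, so the runnable part queries a bean every JVM exposes instead:

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Hadoop metrics tags (tag.Context, tag.Hostname, ...) are plain MBean
// attributes on beans like "Hadoop:service=NameNode,name=FSNamesystem".
// The helper below shows the generic JMX read; the Hadoop bean name is
// taken from the sample /jmx output and needs a live NameNode, so main()
// exercises the same pattern against a built-in platform bean.
public class JmxTagAccess {
    static Object readAttribute(MBeanServer server, String bean, String attr)
            throws Exception {
        return server.getAttribute(new ObjectName(bean), attr);
    }

    public static void main(String[] args) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        // With a NameNode running, this would be:
        // readAttribute(server, "Hadoop:service=NameNode,name=FSNamesystem",
        //               "tag.Hostname");
        Object uptime = readAttribute(server, "java.lang:type=Runtime", "Uptime");
        System.out.println("read-ok=" + (uptime != null));
    }
}
```

Tools like jconsole use exactly this interface, which is why the tags appear there as attributes of the metrics-record MBean.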
[jira] [Commented] (HADOOP-10668) TestZKFailoverControllerStress#testExpireBackAndForth occasionally fails
[ https://issues.apache.org/jira/browse/HADOOP-10668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14027165#comment-14027165 ] Chris Nauroth commented on HADOOP-10668: It looks like upgrading ZooKeeper didn't fix this. Here is a Jenkins build after HADOOP-9555 was committed, showing a failure in {{TestZKFailoverController#testAutoFailoverOnLostZKSession}}. https://builds.apache.org/job/PreCommit-HADOOP-Build/4039/ Let's keep HADOOP-10668 open. > TestZKFailoverControllerStress#testExpireBackAndForth occasionally fails > > > Key: HADOOP-10668 > URL: https://issues.apache.org/jira/browse/HADOOP-10668 > Project: Hadoop Common > Issue Type: Test >Reporter: Ted Yu >Priority: Minor > Labels: test > > From > https://builds.apache.org/job/PreCommit-HADOOP-Build/4018//testReport/org.apache.hadoop.ha/TestZKFailoverControllerStress/testExpireBackAndForth/ > : > {code} > org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode > at org.apache.zookeeper.server.DataTree.getData(DataTree.java:648) > at org.apache.zookeeper.server.ZKDatabase.getData(ZKDatabase.java:371) > at > org.apache.hadoop.ha.MiniZKFCCluster.expireActiveLockHolder(MiniZKFCCluster.java:199) > at > org.apache.hadoop.ha.MiniZKFCCluster.expireAndVerifyFailover(MiniZKFCCluster.java:234) > at > org.apache.hadoop.ha.TestZKFailoverControllerStress.testExpireBackAndForth(TestZKFailoverControllerStress.java:84) > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HADOOP-9629) Support Windows Azure Storage - Blob as a file system in Hadoop
[ https://issues.apache.org/jira/browse/HADOOP-9629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14027161#comment-14027161 ] Hudson commented on HADOOP-9629: SUCCESS: Integrated in Hadoop-trunk-Commit #5679 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5679/]) HADOOP-9629. Support Windows Azure Storage - Blob as a file system in Hadoop. Contributed by Dexter Bradshaw, Mostafa Elhemali, Xi Fang, Johannes Klein, David Lao, Mike Liddell, Chuan Liu, Lengning Liu, Ivan Mitic, Michael Rys, Alexander Stojanovic, Brian Swan, and Min Wei. (cnauroth: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1601781) * /hadoop/common/trunk/hadoop-common-project/hadoop-common/CHANGES.txt * /hadoop/common/trunk/hadoop-project/pom.xml * /hadoop/common/trunk/hadoop-tools/hadoop-azure * /hadoop/common/trunk/hadoop-tools/hadoop-azure/.gitignore * /hadoop/common/trunk/hadoop-tools/hadoop-azure/README.txt * /hadoop/common/trunk/hadoop-tools/hadoop-azure/dev-support * /hadoop/common/trunk/hadoop-tools/hadoop-azure/dev-support/findbugs-exclude.xml * /hadoop/common/trunk/hadoop-tools/hadoop-azure/pom.xml * /hadoop/common/trunk/hadoop-tools/hadoop-azure/src * /hadoop/common/trunk/hadoop-tools/hadoop-azure/src/config * /hadoop/common/trunk/hadoop-tools/hadoop-azure/src/config/checkstyle.xml * /hadoop/common/trunk/hadoop-tools/hadoop-azure/src/main * /hadoop/common/trunk/hadoop-tools/hadoop-azure/src/main/java * /hadoop/common/trunk/hadoop-tools/hadoop-azure/src/main/java/org * /hadoop/common/trunk/hadoop-tools/hadoop-azure/src/main/java/org/apache * /hadoop/common/trunk/hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop * /hadoop/common/trunk/hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs * /hadoop/common/trunk/hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azure * /hadoop/common/trunk/hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azure/AzureException.java * 
/hadoop/common/trunk/hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azure/AzureNativeFileSystemStore.java * /hadoop/common/trunk/hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azure/BlobMaterialization.java * /hadoop/common/trunk/hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azure/FileMetadata.java * /hadoop/common/trunk/hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azure/KeyProvider.java * /hadoop/common/trunk/hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azure/KeyProviderException.java * /hadoop/common/trunk/hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azure/NativeAzureFileSystem.java * /hadoop/common/trunk/hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azure/NativeFileSystemStore.java * /hadoop/common/trunk/hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azure/PartialListing.java * /hadoop/common/trunk/hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azure/SelfThrottlingIntercept.java * /hadoop/common/trunk/hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azure/SendRequestIntercept.java * /hadoop/common/trunk/hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azure/ShellDecryptionKeyProvider.java * /hadoop/common/trunk/hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azure/SimpleKeyProvider.java * /hadoop/common/trunk/hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azure/StorageInterface.java * /hadoop/common/trunk/hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azure/StorageInterfaceImpl.java * /hadoop/common/trunk/hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azure/Wasb.java * /hadoop/common/trunk/hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azure/WasbFsck.java * /hadoop/common/trunk/hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azure/package.html * /hadoop/common/trunk/hadoop-tools/hadoop-azure/src/test * 
/hadoop/common/trunk/hadoop-tools/hadoop-azure/src/test/java * /hadoop/common/trunk/hadoop-tools/hadoop-azure/src/test/java/org * /hadoop/common/trunk/hadoop-tools/hadoop-azure/src/test/java/org/apache * /hadoop/common/trunk/hadoop-tools/hadoop-azure/src/test/java/org/apache/hadoop * /hadoop/common/trunk/hadoop-tools/hadoop-azure/src/test/java/org/apache/hadoop/fs * /hadoop/common/trunk/hadoop-tools/hadoop-azure/src/test/java/org/apache/hadoop/fs/azure * /hadoop/common/trunk/hadoop-tools/hadoop-azure/src/test/java/org/apache/hadoop/fs/azure/AzureBlobStorageTestAccount.java * /hadoop/common/trunk/hadoop-tools/hadoop-azure/src/test/java/org/apache/hadoop/fs/azure/InMemoryBlockBlobStore.java * /hadoop/common/trunk/hadoop-tools/hadoop-azure/src/test/java/org/apache/hadoop/fs/azure/MockStorageInterface.java * /hadoop/common/trunk/hadoop-tools/hadoop-azure/src/test/java/org/apache/hadoop/fs/azure/NativeAzureFileSystemBaseTest.java * /hadoop/common/trunk/hadoop-tool
[jira] [Updated] (HADOOP-9629) Support Windows Azure Storage - Blob as a file system in Hadoop
[ https://issues.apache.org/jira/browse/HADOOP-9629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Nauroth updated HADOOP-9629: -- Fix Version/s: 3.0.0 I have committed this to trunk. I'm going to let it bake there a bit before merging down to branch-2. I'll keep this issue open until after merging to branch-2. > Support Windows Azure Storage - Blob as a file system in Hadoop > --- > > Key: HADOOP-9629 > URL: https://issues.apache.org/jira/browse/HADOOP-9629 > Project: Hadoop Common > Issue Type: New Feature > Components: tools >Reporter: Mostafa Elhemali >Assignee: Mike Liddell > Fix For: 3.0.0 > > Attachments: HADOOP-9629 - Azure Filesystem - Information for > developers.docx, HADOOP-9629 - Azure Filesystem - Information for > developers.pdf, HADOOP-9629.2.patch, HADOOP-9629.3.patch, HADOOP-9629.patch, > HADOOP-9629.trunk.1.patch, HADOOP-9629.trunk.2.patch, > HADOOP-9629.trunk.3.patch, HADOOP-9629.trunk.4.patch, > HADOOP-9629.trunk.5.patch > > > h2. Description > This JIRA incorporates adding a new file system implementation for accessing > Windows Azure Storage - Blob from within Hadoop, such as using blobs as input > to MR jobs or configuring MR jobs to put their output directly into blob > storage. > h2. High level design > At a high level, the code here extends the FileSystem class to provide an > implementation for accessing blob storage; the scheme wasb is used for > accessing it over HTTP, and wasbs for accessing over HTTPS. We use the URI > scheme: {code}wasb[s]://@/path/to/file{code} to address > individual blobs. We use the standard Azure Java SDK > (com.microsoft.windowsazure) to do most of the work. In order to map a > hierarchical file system over the flat name-value pair nature of blob > storage, we create a specially tagged blob named path/to/dir whenever we > create a directory called path/to/dir, then files under that are stored as > normal blobs path/to/dir/file. 
We have many metrics implemented for it using > the Metrics2 interface. Tests are implemented mostly using a mock > implementation for the Azure SDK functionality, with an option to test > against a real blob storage if configured (instructions provided in > README.txt). > h2. Credits and history > This has been ongoing work for a while, and the early version of this work > can be seen in HADOOP-8079. This JIRA is a significant revision of that and > we'll post the patch here for Hadoop trunk first, then post a patch for > branch-1 as well for backporting the functionality if accepted. Credit for > this work goes to the early team: [~minwei], [~davidlao], [~lengningliu] and > [~stojanovic] as well as multiple people who have taken over this work since > then (hope I don't forget anyone): [~dexterb], Johannes Klein, [~ivanmi], > Michael Rys, [~mostafae], [~brian_swan], [~mikelid], [~xifang], and > [~chuanliu]. > h2. Test > Besides unit tests, we have used WASB as the default file system in our > service product. (HDFS is also used but not as default file system.) Various > customer and test workloads have been run against clusters with > such configurations for quite some time. The current version reflects the > version of the code tested and used in our production environment. -- This message was sent by Atlassian JIRA (v6.2#6252)
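The directory-mapping scheme described in the design section above can be sketched in a few lines: blob storage is a flat key/value namespace, so a directory "path/to/dir" is represented by a specially tagged marker blob under the same name, and files become ordinary blobs like "path/to/dir/file". In this illustrative sketch an in-memory map stands in for blob storage and a zero-length value stands in for the marker tag; none of these names are the actual hadoop-azure code.

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch: mapping a hierarchical file system onto a flat name/value store.
// A TreeMap stands in for the blob container; an empty byte[] stands in
// for the "specially tagged" directory-marker blob.
public class FlatStoreSketch {
    private static final byte[] DIR_MARKER = new byte[0];
    private final Map<String, byte[]> blobs = new TreeMap<>();

    void mkdir(String path)                   { blobs.put(path, DIR_MARKER); }
    void createFile(String path, byte[] data) { blobs.put(path, data); }

    boolean isDirectory(String path) {
        byte[] b = blobs.get(path);
        return b != null && b.length == 0;    // marker blob => directory
    }

    // A prefix scan over the flat namespace recovers a directory listing.
    long countChildren(String dir) {
        String prefix = dir + "/";
        return blobs.keySet().stream().filter(k -> k.startsWith(prefix)).count();
    }

    public static void main(String[] args) {
        FlatStoreSketch fs = new FlatStoreSketch();
        fs.mkdir("path/to/dir");
        fs.createFile("path/to/dir/file", "hello".getBytes());
        System.out.println("dir=" + fs.isDirectory("path/to/dir")
                + " children=" + fs.countChildren("path/to/dir"));
    }
}
```

The marker blob is what lets an empty directory exist at all in a flat store; without it, "path/to/dir" would vanish as soon as its last file was deleted.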
[jira] [Updated] (HADOOP-9629) Support Windows Azure Storage - Blob as a file system in Hadoop
[ https://issues.apache.org/jira/browse/HADOOP-9629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Nauroth updated HADOOP-9629: -- Issue Type: New Feature (was: Improvement) > Support Windows Azure Storage - Blob as a file system in Hadoop > --- > > Key: HADOOP-9629 > URL: https://issues.apache.org/jira/browse/HADOOP-9629 > Project: Hadoop Common > Issue Type: New Feature > Components: tools >Reporter: Mostafa Elhemali >Assignee: Mike Liddell > Attachments: HADOOP-9629 - Azure Filesystem - Information for > developers.docx, HADOOP-9629 - Azure Filesystem - Information for > developers.pdf, HADOOP-9629.2.patch, HADOOP-9629.3.patch, HADOOP-9629.patch, > HADOOP-9629.trunk.1.patch, HADOOP-9629.trunk.2.patch, > HADOOP-9629.trunk.3.patch, HADOOP-9629.trunk.4.patch, > HADOOP-9629.trunk.5.patch > > > h2. Description > This JIRA incorporates adding a new file system implementation for accessing > Windows Azure Storage - Blob from within Hadoop, such as using blobs as input > to MR jobs or configuring MR jobs to put their output directly into blob > storage. > h2. High level design > At a high level, the code here extends the FileSystem class to provide an > implementation for accessing blob storage; the scheme wasb is used for > accessing it over HTTP, and wasbs for accessing over HTTPS. We use the URI > scheme: {code}wasb[s]://@/path/to/file{code} to address > individual blobs. We use the standard Azure Java SDK > (com.microsoft.windowsazure) to do most of the work. In order to map a > hierarchical file system over the flat name-value pair nature of blob > storage, we create a specially tagged blob named path/to/dir whenever we > create a directory called path/to/dir, then files under that are stored as > normal blobs path/to/dir/file. We have many metrics implemented for it using > the Metrics2 interface. 
Tests are implemented mostly using a mock > implementation for the Azure SDK functionality, with an option to test > against a real blob storage if configured (instructions provided in > README.txt). > h2. Credits and history > This has been ongoing work for a while, and the early version of this work > can be seen in HADOOP-8079. This JIRA is a significant revision of that and > we'll post the patch here for Hadoop trunk first, then post a patch for > branch-1 as well for backporting the functionality if accepted. Credit for > this work goes to the early team: [~minwei], [~davidlao], [~lengningliu] and > [~stojanovic] as well as multiple people who have taken over this work since > then (hope I don't forget anyone): [~dexterb], Johannes Klein, [~ivanmi], > Michael Rys, [~mostafae], [~brian_swan], [~mikelid], [~xifang], and > [~chuanliu]. > h2. Test > Besides unit tests, we have used WASB as the default file system in our > service product. (HDFS is also used but not as default file system.) Various > customer and test workloads have been run against clusters with > such configurations for quite some time. The current version reflects the > version of the code tested and used in our production environment. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HADOOP-10679) Authorize webui access using ServiceAuthorizationManager
[ https://issues.apache.org/jira/browse/HADOOP-10679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14027125#comment-14027125 ] Benoy Antony commented on HADOOP-10679:
---
Here is the proposal:
1. Define an AuthorizationFilter.
2. The AuthorizationFilter looks up the ACL in hadoop-policy.xml using a key derived from HttpServletRequest.getServletPath().
3. If the ACL is not found, it defaults to *.
This will inherit the following features (in progress):
Note 1: Administrator can override the default ACL - HADOOP-10649
Note 2: Administrator can specify a reverse ACL - HADOOP-10650
Note 3: Administrator can block/grant access via IPs - HADOOP-10651
Note 4: One can plug in a different AuthZ module - HADOOP-10654
> Authorize webui access using ServiceAuthorizationManager > > > Key: HADOOP-10679 > URL: https://issues.apache.org/jira/browse/HADOOP-10679 > Project: Hadoop Common > Issue Type: Sub-task > Components: security >Reporter: Benoy Antony >Assignee: Benoy Antony > > Currently accessing Hadoop via RPC can be authorized using > _ServiceAuthorizationManager_. But there is no uniform authorization of the > HTTP access. Some of the servlets check for admin privilege. > This creates an inconsistency of authorization between access via RPC vs > HTTP. > The fix is to enable authorization of the webui access using > _ServiceAuthorizationManager_. -- This message was sent by Atlassian JIRA (v6.2#6252)
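Steps 1-3 of the proposal above reduce to a lookup with a wildcard default; a minimal sketch follows. The key format ("security.webui.*.acl") and method names are hypothetical illustrations, not the actual patch.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the AuthorizationFilter's ACL resolution: derive a policy key
// from the servlet path, look it up in the hadoop-policy configuration,
// and fall back to "*" (everyone) when no ACL is configured.
public class AclLookupSketch {
    static String resolveAcl(Map<String, String> policy, String servletPath) {
        // e.g. "/jmx" -> "security.webui.jmx.acl" (hypothetical key format)
        String key = "security.webui" + servletPath.replace('/', '.') + ".acl";
        String acl = policy.get(key);
        return acl != null ? acl : "*";   // step 3: default ACL is "*"
    }

    public static void main(String[] args) {
        Map<String, String> policy = new HashMap<>();
        policy.put("security.webui.jmx.acl", "admins");
        System.out.println(resolveAcl(policy, "/jmx"));     // configured ACL
        System.out.println(resolveAcl(policy, "/stacks"));  // falls back to "*"
    }
}
```

Defaulting to "*" preserves today's behavior for servlets with no configured ACL, which is what makes the filter safe to introduce incrementally.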
[jira] [Commented] (HADOOP-10641) Introduce Coordination Engine
[ https://issues.apache.org/jira/browse/HADOOP-10641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14027111#comment-14027111 ] Konstantin Shvachko commented on HADOOP-10641: -- > there's not much expertise in Hadoop for the general problem of distributed > consensus I just want to make sure we are on the same page here. The intent of this jira is not to solve the general problem of distributed consensus. That is, I do not propose to build an implementation of paxos or other coordination algorithms here. This is only to introduce a common interface, so that real implementations such as ZooKeeper could be plugged into hadoop projects. > Introduce Coordination Engine > - > > Key: HADOOP-10641 > URL: https://issues.apache.org/jira/browse/HADOOP-10641 > Project: Hadoop Common > Issue Type: New Feature >Affects Versions: 3.0.0 >Reporter: Konstantin Shvachko >Assignee: Plamen Jeliazkov > Attachments: HADOOP-10641.patch, HADOOP-10641.patch, > HADOOP-10641.patch > > > Coordination Engine (CE) is a system, which allows to agree on a sequence of > events in a distributed system. In order to be reliable CE should be > distributed by itself. > Coordination Engine can be based on different algorithms (paxos, raft, 2PC, > zab) and have different implementations, depending on use cases, reliability, > availability, and performance requirements. > CE should have a common API, so that it could serve as a pluggable component > in different projects. The immediate beneficiaries are HDFS (HDFS-6469) and > HBase (HBASE-10909). > First implementation is proposed to be based on ZooKeeper. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (HADOOP-10679) Authorize webui access using ServiceAuthorizationManager
Benoy Antony created HADOOP-10679: - Summary: Authorize webui access using ServiceAuthorizationManager Key: HADOOP-10679 URL: https://issues.apache.org/jira/browse/HADOOP-10679 Project: Hadoop Common Issue Type: Sub-task Components: security Reporter: Benoy Antony Assignee: Benoy Antony Currently accessing Hadoop via RPC can be authorized using _ServiceAuthorizationManager_. But there is no uniform authorization of the HTTP access. Some of the servlets check for admin privilege. This creates an inconsistency of authorization between access via RPC vs HTTP. The fix is to enable authorization of the webui access using _ServiceAuthorizationManager_. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HADOOP-10376) Refactor refresh*Protocols into a single generic refreshConfigProtocol
[ https://issues.apache.org/jira/browse/HADOOP-10376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Li updated HADOOP-10376: -- Attachment: HADOOP-10376.patch [~benoyantony] good catch, made mutators synchronized. Also on name, I'm okay with RefreshHandlerRegistry if people think it's more clear. I do like that RefreshRegistry is concise though [~wuzesheng] sounds like a good idea when we replace old refreshprotos in later patches with rewritten ones. > Refactor refresh*Protocols into a single generic refreshConfigProtocol > -- > > Key: HADOOP-10376 > URL: https://issues.apache.org/jira/browse/HADOOP-10376 > Project: Hadoop Common > Issue Type: Improvement >Reporter: Chris Li >Assignee: Chris Li >Priority: Minor > Attachments: HADOOP-10376.patch, HADOOP-10376.patch, > HADOOP-10376.patch, HADOOP-10376.patch, RefreshFrameworkProposal.pdf > > > See https://issues.apache.org/jira/browse/HADOOP-10285 > There are starting to be too many refresh*Protocols We can refactor them to > use a single protocol with a variable payload to choose what to do. > Thereafter, we can return an indication of success or failure. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HADOOP-10389) Native RPCv9 client
[ https://issues.apache.org/jira/browse/HADOOP-10389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026987#comment-14026987 ] Haohui Mai commented on HADOOP-10389:
---
bq. No wheels are being reinvented... we are using libuv for our portability layer and other libraries where appropriate.
What concerns me is that the code has to bring in many more dependencies in plain C, which carries a high maintenance cost. For example, this patch contains at least implementations of linked lists, splay trees, hash tables, and rb trees. There is a lot of overhead in implementing, reviewing, and testing this code. For example, a lot of time has to be wasted on issues like the following: https://issues.apache.org/jira/browse/HADOOP-10640?focusedCommentId=14026841&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14026841 The above link demonstrates that even when the code is copied directly from other places, the overhead of reviewing and maintaining it is not negligible. I anticipate that down the road the problem will only get worse. For example, are you considering supporting filenames in Unicode? In that case, I think libicu might need to be brought into the picture. It seems to me much more compelling to implement the code in a more modern language, say C++11, where much of the current headache is taken away by a mature standard library.
> Native RPCv9 client > --- > > Key: HADOOP-10389 > URL: https://issues.apache.org/jira/browse/HADOOP-10389 > Project: Hadoop Common > Issue Type: Sub-task >Affects Versions: HADOOP-10388 >Reporter: Binglin Chang >Assignee: Colin Patrick McCabe > Attachments: HADOOP-10388.001.patch, HADOOP-10389.002.patch, > HADOOP-10389.004.patch, HADOOP-10389.005.patch > > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HADOOP-10678) Unnecessary synchronization on collection used for only tests
[ https://issues.apache.org/jira/browse/HADOOP-10678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026888#comment-14026888 ] Hadoop QA commented on HADOOP-10678: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12649642/HADOOP-10678.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-common-project/hadoop-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/4040//testReport/ Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/4040//console This message is automatically generated. > Unnecessary synchronization on collection used for only tests > - > > Key: HADOOP-10678 > URL: https://issues.apache.org/jira/browse/HADOOP-10678 > Project: Hadoop Common > Issue Type: Bug > Components: security >Reporter: Benoy Antony >Assignee: Benoy Antony >Priority: Minor > Attachments: HADOOP-10678.patch > > > The function _SecurityUtil.getKerberosInfo()_ is a function used during > authentication and authorization. 
> It has two synchronized blocks and one of them is on testProviders. This is > an unnecessary lock given that the testProviders is empty in real scenario. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HADOOP-10656) The password keystore file is not picked by LDAP group mapping
[ https://issues.apache.org/jira/browse/HADOOP-10656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026872#comment-14026872 ] Hadoop QA commented on HADOOP-10656: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12649638/HADOOP-10656.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-common-project/hadoop-common: org.apache.hadoop.ha.TestZKFailoverController {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/4039//testReport/ Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/4039//console This message is automatically generated. 
> The password keystore file is not picked by LDAP group mapping > -- > > Key: HADOOP-10656 > URL: https://issues.apache.org/jira/browse/HADOOP-10656 > Project: Hadoop Common > Issue Type: Bug > Components: security >Affects Versions: 2.2.0 >Reporter: Brandon Li >Assignee: Brandon Li > Attachments: HADOOP-10656.patch > > > The user configured password file(LDAP_KEYSTORE_PASSWORD_FILE_KEY) will not > be picked by LdapGroupsMapping: > In setConf(): > {noformat} > keystorePass = > conf.get(LDAP_KEYSTORE_PASSWORD_KEY, LDAP_KEYSTORE_PASSWORD_DEFAULT); > if (keystorePass.isEmpty()) { > keystorePass = extractPassword( > conf.get(LDAP_KEYSTORE_PASSWORD_KEY, > LDAP_KEYSTORE_PASSWORD_DEFAULT)); > } > {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
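The snippet quoted above reads {{LDAP_KEYSTORE_PASSWORD_KEY}} twice, so a password supplied only through {{LDAP_KEYSTORE_PASSWORD_FILE_KEY}} is never consulted. A sketch of the apparent intended fallback follows; a plain map stands in for {{Configuration}}, {{extractPassword}} is stubbed, and the key strings are assumptions based on the usual Hadoop naming, not verified constants.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the fix for the LdapGroupsMapping bug: when no inline password
// is configured, fall back to the password *file* key instead of reading
// the password key a second time.
public class KeystorePassSketch {
    // Assumed key names (illustrative):
    static final String PASSWORD_KEY =
            "hadoop.security.group.mapping.ldap.ssl.keystore.password";
    static final String PASSWORD_FILE_KEY = PASSWORD_KEY + ".file";

    // Stub: the real extractPassword reads the password out of the named file.
    static String extractPassword(String pwFile) {
        return pwFile.isEmpty() ? "" : "secret-from-" + pwFile;
    }

    static String resolvePassword(Map<String, String> conf) {
        String pass = conf.getOrDefault(PASSWORD_KEY, "");
        if (pass.isEmpty()) {
            // Fixed: consult the file key, not the password key again.
            pass = extractPassword(conf.getOrDefault(PASSWORD_FILE_KEY, ""));
        }
        return pass;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(PASSWORD_FILE_KEY, "/etc/ldap.pw");  // only the file key is set
        System.out.println(resolvePassword(conf));
    }
}
```

With the original code, this configuration would yield an empty password, because the second lookup re-reads the (empty) inline password key.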
[jira] [Commented] (HADOOP-10668) TestZKFailoverControllerStress#testExpireBackAndForth occasionally fails
[ https://issues.apache.org/jira/browse/HADOOP-10668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026870#comment-14026870 ] Ted Yu commented on HADOOP-10668: - I saw the test failure on Jenkins. Will keep an eye on Jenkins in the future. > TestZKFailoverControllerStress#testExpireBackAndForth occasionally fails > > > Key: HADOOP-10668 > URL: https://issues.apache.org/jira/browse/HADOOP-10668 > Project: Hadoop Common > Issue Type: Test >Reporter: Ted Yu >Priority: Minor > Labels: test > > From > https://builds.apache.org/job/PreCommit-HADOOP-Build/4018//testReport/org.apache.hadoop.ha/TestZKFailoverControllerStress/testExpireBackAndForth/ > : > {code} > org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode > at org.apache.zookeeper.server.DataTree.getData(DataTree.java:648) > at org.apache.zookeeper.server.ZKDatabase.getData(ZKDatabase.java:371) > at > org.apache.hadoop.ha.MiniZKFCCluster.expireActiveLockHolder(MiniZKFCCluster.java:199) > at > org.apache.hadoop.ha.MiniZKFCCluster.expireAndVerifyFailover(MiniZKFCCluster.java:234) > at > org.apache.hadoop.ha.TestZKFailoverControllerStress.testExpireBackAndForth(TestZKFailoverControllerStress.java:84) > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HADOOP-10668) TestZKFailoverControllerStress#testExpireBackAndForth occasionally fails
[ https://issues.apache.org/jira/browse/HADOOP-10668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026845#comment-14026845 ] Chris Nauroth commented on HADOOP-10668: I've committed HADOOP-9555. I've run {{TestZKFailoverControllerStress}} multiple times with no failures. Maybe the ZooKeeper upgrade fixed this by side effect. Ted, do you have a consistent repro, or were you only seeing it on Jenkins? If so, then maybe we need to wait and see how the test behaves on Jenkins over the next several days. > TestZKFailoverControllerStress#testExpireBackAndForth occasionally fails > > > Key: HADOOP-10668 > URL: https://issues.apache.org/jira/browse/HADOOP-10668 > Project: Hadoop Common > Issue Type: Test >Reporter: Ted Yu >Priority: Minor > Labels: test > > From > https://builds.apache.org/job/PreCommit-HADOOP-Build/4018//testReport/org.apache.hadoop.ha/TestZKFailoverControllerStress/testExpireBackAndForth/ > : > {code} > org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode > at org.apache.zookeeper.server.DataTree.getData(DataTree.java:648) > at org.apache.zookeeper.server.ZKDatabase.getData(ZKDatabase.java:371) > at > org.apache.hadoop.ha.MiniZKFCCluster.expireActiveLockHolder(MiniZKFCCluster.java:199) > at > org.apache.hadoop.ha.MiniZKFCCluster.expireAndVerifyFailover(MiniZKFCCluster.java:234) > at > org.apache.hadoop.ha.TestZKFailoverControllerStress.testExpireBackAndForth(TestZKFailoverControllerStress.java:84) > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HADOOP-10678) Unnecessary synchronization on collection used for only tests
[ https://issues.apache.org/jira/browse/HADOOP-10678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benoy Antony updated HADOOP-10678: -- Summary: Unnecessary synchronization on collection used for only tests (was: Unnecessary synchronization on collection used for test) > Unnecessary synchronization on collection used for only tests > - > > Key: HADOOP-10678 > URL: https://issues.apache.org/jira/browse/HADOOP-10678 > Project: Hadoop Common > Issue Type: Bug > Components: security >Reporter: Benoy Antony >Assignee: Benoy Antony >Priority: Minor > Attachments: HADOOP-10678.patch > > > The function _SecurityUtil.getKerberosInfo()_ is a function used during > authentication and authorization. > It has two synchronized blocks and one of them is on testProviders. This is > an unnecessary lock given that the testProviders is empty in real scenario. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HADOOP-10640) Implement Namenode RPCs in HDFS native client
[ https://issues.apache.org/jira/browse/HADOOP-10640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026841#comment-14026841 ] Colin Patrick McCabe commented on HADOOP-10640:
---
bq. Do we need to call free in hadoop_err_prepend.c for asprintf error cases? Docs say no memory is allocated in this case.
The string being freed here was allocated further up in the function. It's not allocated by the (failed) asprintf.
bq. The hash table implementation has an unbounded while loop. Though, it will probably never happen since we guarantee there will always be an open spot, would we add a terminal case to it?
The hash table is never more than half full. Check this code in {{htable_put}}:
{code}
+// Re-hash if we have used more than half of the hash table
+nused = htable->used + 1;
+if (nused >= (htable->capacity / 2)) {
+    ret = htable_realloc(htable, htable->capacity * 2);
+    if (ret)
+        return ret;
+}
+htable_insert_internal(htable->elem, htable->capacity,
+                       htable->hash_fun, key, val);
{code}
I will add a comment to {{htable_insert_internal}} making this invariant clear.
bq. Should the above hash table be modified to allow custom hash functions in the future? Modifications would include ensuring the hash function was within bounds, providing an interface, etc.
Already done :)
{code}
+struct htable *htable_alloc(uint32_t size,
+                            htable_hash_fn_t hash_fun, htable_eq_fn_t eq_fun)
{code}
You can supply your own hash function as the {{hash_fun}} argument.
bq. The config object seems to be using the builder pattern. Wouldn't it make sense to just create a configuration object and provide 'set' and 'get' functions? Unless the configuration object is immutable?
The configuration object is immutable once created. I wanted to avoid multithreading problems with get and set... we've had a lot of those with {{Configuration}} in Hadoop.
This also simplifies the C code, since we can simply use strings from inside the {{hconf}} object without worrying about whether someone is going to {{free}} them while we're using them. This means, for example, that we don't need to copy them inside {{hconf_get}}. All the strings get freed at the end, when the {{hconf}} is freed. I'll add a comment that hconf is immutable. > Implement Namenode RPCs in HDFS native client > - > > Key: HADOOP-10640 > URL: https://issues.apache.org/jira/browse/HADOOP-10640 > Project: Hadoop Common > Issue Type: Sub-task > Components: native >Affects Versions: HADOOP-10388 >Reporter: Colin Patrick McCabe >Assignee: Colin Patrick McCabe > Attachments: HADOOP-10640-pnative.001.patch, > HADOOP-10640-pnative.002.patch, HADOOP-10640-pnative.003.patch > > > Implement the parts of libhdfs that just involve making RPCs to the Namenode, > such as mkdir, rename, etc. -- This message was sent by Atlassian JIRA (v6.2#6252)
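The immutability argument above translates directly into the builder idiom: mutate only while building, then freeze, so readers can hold references with no locking and no defensive copies. A minimal Java sketch of the design choice (hypothetical names, not the actual hconf API):

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Sketch of the design choice discussed above: build mutably, then freeze.
// Once built, the object is immutable, so concurrent get() calls need no
// synchronization and returned values cannot change underneath a caller.
final class ImmutableConf {
    private final Map<String, String> entries;

    private ImmutableConf(Map<String, String> entries) {
        // Private copy + unmodifiable wrapper: nothing can mutate this later.
        this.entries = Collections.unmodifiableMap(new HashMap<>(entries));
    }

    String get(String key) {
        return entries.get(key);
    }

    static final class Builder {
        private final Map<String, String> entries = new HashMap<>();

        Builder set(String key, String value) {
            entries.put(key, value);
            return this;
        }

        ImmutableConf build() {
            return new ImmutableConf(entries);
        }
    }
}
```

This is the Java analogue of the C argument: because the frozen object owns its strings for its whole lifetime, no consumer has to worry about another thread freeing or rewriting them.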
[jira] [Updated] (HADOOP-10678) Unnecessary synchronization on collection used for test
[ https://issues.apache.org/jira/browse/HADOOP-10678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benoy Antony updated HADOOP-10678: -- Status: Patch Available (was: Open) > Unnecessary synchronization on collection used for test > --- > > Key: HADOOP-10678 > URL: https://issues.apache.org/jira/browse/HADOOP-10678 > Project: Hadoop Common > Issue Type: Bug > Components: security >Reporter: Benoy Antony >Assignee: Benoy Antony >Priority: Minor > Attachments: HADOOP-10678.patch > > > The function _SecurityUtil.getKerberosInfo()_ is a function used during > authentication and authorization. > It has two synchronized blocks and one of them is on testProviders. This is > an unnecessary lock given that the testProviders is empty in real scenario. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Moved] (HADOOP-10678) Unnecessary synchronization on collection used for test
[ https://issues.apache.org/jira/browse/HADOOP-10678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benoy Antony moved HDFS-6512 to HADOOP-10678: - Component/s: (was: security) security Key: HADOOP-10678 (was: HDFS-6512) Project: Hadoop Common (was: Hadoop HDFS) > Unnecessary synchronization on collection used for test > --- > > Key: HADOOP-10678 > URL: https://issues.apache.org/jira/browse/HADOOP-10678 > Project: Hadoop Common > Issue Type: Bug > Components: security >Reporter: Benoy Antony >Assignee: Benoy Antony >Priority: Minor > Attachments: HADOOP-10678.patch > > > The function _SecurityUtil.getKerberosInfo()_ is a function used during > authentication and authorization. > It has two synchronized blocks and one of them is on testProviders. This is > an unnecessary lock given that the testProviders is empty in real scenario. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HADOOP-10678) Unnecessary synchronization on collection used for test
[ https://issues.apache.org/jira/browse/HADOOP-10678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benoy Antony updated HADOOP-10678: -- Attachment: HADOOP-10678.patch Attaching the patch for review. No test cases are added since it doesn't change any functionality. > Unnecessary synchronization on collection used for test > --- > > Key: HADOOP-10678 > URL: https://issues.apache.org/jira/browse/HADOOP-10678 > Project: Hadoop Common > Issue Type: Bug > Components: security >Reporter: Benoy Antony >Assignee: Benoy Antony >Priority: Minor > Attachments: HADOOP-10678.patch > > > The function _SecurityUtil.getKerberosInfo()_ is a function used during > authentication and authorization. > It has two synchronized blocks and one of them is on testProviders. This is > an unnecessary lock given that the testProviders is empty in real scenario. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HADOOP-10656) The password keystore file is not picked by LDAP group mapping
[ https://issues.apache.org/jira/browse/HADOOP-10656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brandon Li updated HADOOP-10656: Status: Patch Available (was: Open) > The password keystore file is not picked by LDAP group mapping > -- > > Key: HADOOP-10656 > URL: https://issues.apache.org/jira/browse/HADOOP-10656 > Project: Hadoop Common > Issue Type: Bug > Components: security >Affects Versions: 2.2.0 >Reporter: Brandon Li >Assignee: Brandon Li > Attachments: HADOOP-10656.patch > > > The user configured password file(LDAP_KEYSTORE_PASSWORD_FILE_KEY) will not > be picked by LdapGroupsMapping: > In setConf(): > {noformat} > keystorePass = > conf.get(LDAP_KEYSTORE_PASSWORD_KEY, LDAP_KEYSTORE_PASSWORD_DEFAULT); > if (keystorePass.isEmpty()) { > keystorePass = extractPassword( > conf.get(LDAP_KEYSTORE_PASSWORD_KEY, > LDAP_KEYSTORE_PASSWORD_DEFAULT)); > } > {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
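The snippet in the description shows the apparent bug: the fallback branch re-reads LDAP_KEYSTORE_PASSWORD_KEY instead of the password *file* key, so a configured password file is never consulted. A hedged sketch of the presumable fix (the conf and extractPassword stand-ins are simplified; this is not the attached patch):

```java
import java.util.Map;

// Sketch of the likely fix for the bug quoted above: fall back to the
// password *file* key, not the password key. Key-name constants follow the
// usual LdapGroupsMapping naming but are reproduced here as assumptions.
class LdapPasswordFix {
    static final String LDAP_KEYSTORE_PASSWORD_KEY =
        "hadoop.security.group.mapping.ldap.ssl.keystore.password";
    static final String LDAP_KEYSTORE_PASSWORD_FILE_KEY =
        "hadoop.security.group.mapping.ldap.ssl.keystore.password.file";

    // Stand-in for reading the password out of the file at pwFile.
    static String extractPassword(String pwFile) {
        return pwFile.isEmpty() ? "" : "password-from-" + pwFile;
    }

    static String resolveKeystorePass(Map<String, String> conf) {
        String keystorePass = conf.getOrDefault(LDAP_KEYSTORE_PASSWORD_KEY, "");
        if (keystorePass.isEmpty()) {
            // Fix: consult the configured password file when no inline
            // password is set, instead of re-reading the same key.
            keystorePass = extractPassword(
                conf.getOrDefault(LDAP_KEYSTORE_PASSWORD_FILE_KEY, ""));
        }
        return keystorePass;
    }
}
```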
[jira] [Assigned] (HADOOP-10656) The password keystore file is not picked by LDAP group mapping
[ https://issues.apache.org/jira/browse/HADOOP-10656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brandon Li reassigned HADOOP-10656: --- Assignee: Brandon Li > The password keystore file is not picked by LDAP group mapping > -- > > Key: HADOOP-10656 > URL: https://issues.apache.org/jira/browse/HADOOP-10656 > Project: Hadoop Common > Issue Type: Bug > Components: security >Affects Versions: 2.2.0 >Reporter: Brandon Li >Assignee: Brandon Li > Attachments: HADOOP-10656.patch > > > The user configured password file(LDAP_KEYSTORE_PASSWORD_FILE_KEY) will not > be picked by LdapGroupsMapping: > In setConf(): > {noformat} > keystorePass = > conf.get(LDAP_KEYSTORE_PASSWORD_KEY, LDAP_KEYSTORE_PASSWORD_DEFAULT); > if (keystorePass.isEmpty()) { > keystorePass = extractPassword( > conf.get(LDAP_KEYSTORE_PASSWORD_KEY, > LDAP_KEYSTORE_PASSWORD_DEFAULT)); > } > {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HADOOP-10656) The password keystore file is not picked by LDAP group mapping
[ https://issues.apache.org/jira/browse/HADOOP-10656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brandon Li updated HADOOP-10656: Attachment: HADOOP-10656.patch > The password keystore file is not picked by LDAP group mapping > -- > > Key: HADOOP-10656 > URL: https://issues.apache.org/jira/browse/HADOOP-10656 > Project: Hadoop Common > Issue Type: Bug > Components: security >Affects Versions: 2.2.0 >Reporter: Brandon Li > Attachments: HADOOP-10656.patch > > > The user configured password file(LDAP_KEYSTORE_PASSWORD_FILE_KEY) will not > be picked by LdapGroupsMapping: > In setConf(): > {noformat} > keystorePass = > conf.get(LDAP_KEYSTORE_PASSWORD_KEY, LDAP_KEYSTORE_PASSWORD_DEFAULT); > if (keystorePass.isEmpty()) { > keystorePass = extractPassword( > conf.get(LDAP_KEYSTORE_PASSWORD_KEY, > LDAP_KEYSTORE_PASSWORD_DEFAULT)); > } > {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HADOOP-9555) HA functionality that uses ZooKeeper may experience inadvertent TCP RST and miss session expiration event due to bug in client connection management
[ https://issues.apache.org/jira/browse/HADOOP-9555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Nauroth updated HADOOP-9555: -- Resolution: Fixed Fix Version/s: 2.5.0 3.0.0 Hadoop Flags: Reviewed Status: Resolved (was: Patch Available) I committed this to trunk and branch-2. Arpit, thank you for the review. > HA functionality that uses ZooKeeper may experience inadvertent TCP RST and > miss session expiration event due to bug in client connection management > > > Key: HADOOP-9555 > URL: https://issues.apache.org/jira/browse/HADOOP-9555 > Project: Hadoop Common > Issue Type: Bug > Components: ha >Affects Versions: 3.0.0, 2.4.0 >Reporter: Chris Nauroth >Assignee: Chris Nauroth > Fix For: 3.0.0, 2.5.0 > > Attachments: HADOOP-9555.1.patch > > > ZOOKEEPER-1702 tracks a client connection management bug. The bug can cause > an unexpected TCP RST that ultimately prevents delivery of a session > expiration event. The symptoms of the bug seem to show up more frequently on > Windows than on other platforms (though it's not really a Windows-specific > bug). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HADOOP-9555) HA functionality that uses ZooKeeper may experience inadvertent TCP RST and miss session expiration event due to bug in client connection management
[ https://issues.apache.org/jira/browse/HADOOP-9555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026777#comment-14026777 ] Hudson commented on HADOOP-9555: SUCCESS: Integrated in Hadoop-trunk-Commit #5674 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5674/]) HADOOP-9555. HA functionality that uses ZooKeeper may experience inadvertent TCP RST and miss session expiration event due to bug in client connection management. Contributed by Chris Nauroth. (cnauroth: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1601709) * /hadoop/common/trunk/hadoop-common-project/hadoop-common/CHANGES.txt * /hadoop/common/trunk/hadoop-project/pom.xml > HA functionality that uses ZooKeeper may experience inadvertent TCP RST and > miss session expiration event due to bug in client connection management > > > Key: HADOOP-9555 > URL: https://issues.apache.org/jira/browse/HADOOP-9555 > Project: Hadoop Common > Issue Type: Bug > Components: ha >Affects Versions: 3.0.0, 2.4.0 >Reporter: Chris Nauroth >Assignee: Chris Nauroth > Attachments: HADOOP-9555.1.patch > > > ZOOKEEPER-1702 tracks a client connection management bug. The bug can cause > an unexpected TCP RST that ultimately prevents delivery of a session > expiration event. The symptoms of the bug seem to show up more frequently on > Windows than on other platforms (though it's not really a Windows-specific > bug). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HADOOP-10640) Implement Namenode RPCs in HDFS native client
[ https://issues.apache.org/jira/browse/HADOOP-10640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026747#comment-14026747 ] Abraham Elmahrek commented on HADOOP-10640: --- Awesome stuff Colin. Just a few comments: * Do we need to call free in hadoop_err_prepend.c for asprintf error cases? Docs say no memory is allocated in this case. {code} if (asprintf(&nmsg, "%s: %s", prepend_str, err->msg) < 0) { free(prepend_str); return (struct hadoop_err*)err; } {code} * The hash table implementation has an unbounded while loop. Though, it will probably never happen since we guarantee there will always be an open spot, would we add a terminal case to it? {code} static void htable_insert_internal(struct htable_pair *nelem, uint32_t capacity, htable_hash_fn_t hash_fun, void *key, void *val) { uint32_t i; i = hash_fun(key, capacity); while (1) { if (!nelem[i].key) { nelem[i].key = key; nelem[i].val = val; return; } i++; if (i == capacity) { i = 0; } } } {code} * Should the above hash table be modified to allow custom hash functions in the future? Modifications would include ensuring the hash function was within bounds, providing an interface, etc. * The config object seems to be using the builder pattern. Wouldn't it make sense to just create a configuration object and provide 'set' and 'get' functions? Unless the configuration object is immutable? > Implement Namenode RPCs in HDFS native client > - > > Key: HADOOP-10640 > URL: https://issues.apache.org/jira/browse/HADOOP-10640 > Project: Hadoop Common > Issue Type: Sub-task > Components: native >Affects Versions: HADOOP-10388 >Reporter: Colin Patrick McCabe >Assignee: Colin Patrick McCabe > Attachments: HADOOP-10640-pnative.001.patch, > HADOOP-10640-pnative.002.patch, HADOOP-10640-pnative.003.patch > > > Implement the parts of libhdfs that just involve making RPCs to the Namenode, > such as mkdir, rename, etc. -- This message was sent by Atlassian JIRA (v6.2#6252)
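The termination question raised here is answered elsewhere in the thread by an occupancy invariant: the table is resized before it becomes half full, so a linear probe always reaches an empty slot. A Java transliteration of that invariant (simplified from the C patch, not the patch itself):

```java
// Illustrative Java version of the open-addressing scheme under review.
// The while(1) probe terminates because put() grows the table before
// occupancy reaches capacity / 2, so an empty slot always exists.
class HalfFullTable {
    private Object[] keys;
    private Object[] vals;
    private int used;

    HalfFullTable(int capacity) {
        keys = new Object[capacity];
        vals = new Object[capacity];
    }

    void put(Object key, Object val) {
        // Invariant: re-hash when the table would become half full, mirroring
        // the htable_put check quoted above.
        if (used + 1 >= keys.length / 2) {
            resize(keys.length * 2);
        }
        insertInternal(key, val);
        used++;
    }

    private void insertInternal(Object key, Object val) {
        int i = Math.floorMod(key.hashCode(), keys.length);
        while (true) {              // bounded: the table is never half full here
            if (keys[i] == null) {
                keys[i] = key;
                vals[i] = val;
                return;
            }
            i = (i + 1) % keys.length;
        }
    }

    private void resize(int newCap) {
        Object[] oldKeys = keys, oldVals = vals;
        keys = new Object[newCap];
        vals = new Object[newCap];
        for (int i = 0; i < oldKeys.length; i++) {
            if (oldKeys[i] != null) {
                insertInternal(oldKeys[i], oldVals[i]);
            }
        }
    }

    Object get(Object key) {
        int i = Math.floorMod(key.hashCode(), keys.length);
        while (keys[i] != null) {   // terminates for the same reason: a null
            if (keys[i].equals(key)) {  // slot always exists
                return vals[i];
            }
            i = (i + 1) % keys.length;
        }
        return null;
    }
}
```

The lookup loop relies on the same invariant: because the table is strictly less than half full, probing always hits a null slot for an absent key.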
[jira] [Commented] (HADOOP-9555) HA functionality that uses ZooKeeper may experience inadvertent TCP RST and miss session expiration event due to bug in client connection management
[ https://issues.apache.org/jira/browse/HADOOP-9555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026713#comment-14026713 ] Arpit Agarwal commented on HADOOP-9555: --- +1, thanks for the followup fix Chris! > HA functionality that uses ZooKeeper may experience inadvertent TCP RST and > miss session expiration event due to bug in client connection management > > > Key: HADOOP-9555 > URL: https://issues.apache.org/jira/browse/HADOOP-9555 > Project: Hadoop Common > Issue Type: Bug > Components: ha >Affects Versions: 3.0.0, 2.4.0 >Reporter: Chris Nauroth >Assignee: Chris Nauroth > Attachments: HADOOP-9555.1.patch > > > ZOOKEEPER-1702 tracks a client connection management bug. The bug can cause > an unexpected TCP RST that ultimately prevents delivery of a session > expiration event. The symptoms of the bug seem to show up more frequently on > Windows than on other platforms (though it's not really a Windows-specific > bug). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HADOOP-10646) KeyProvider buildVersionName method should be moved to a utils class
[ https://issues.apache.org/jira/browse/HADOOP-10646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026700#comment-14026700 ] Alejandro Abdelnur commented on HADOOP-10646: - [~lmccay], i was thinking about that method as well. I may have time later this week, if you want to take over and repurpose this JIRA for both methods, go for it. THX > KeyProvider buildVersionName method should be moved to a utils class > > > Key: HADOOP-10646 > URL: https://issues.apache.org/jira/browse/HADOOP-10646 > Project: Hadoop Common > Issue Type: Bug > Components: security >Affects Versions: 3.0.0 >Reporter: Alejandro Abdelnur >Assignee: Alejandro Abdelnur > Fix For: 3.0.0 > > > The buildVersionName() method should not be part of the KeyProvider public > API because keyversions could be opaque (not built based on the keyname and > key generation counter). > KeyProvider implementations may choose to use buildVersionName() for reasons > such as described in HADOOP-10611. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HADOOP-10646) KeyProvider buildVersionName method should be moved to a utils class
[ https://issues.apache.org/jira/browse/HADOOP-10646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026682#comment-14026682 ] Larry McCay commented on HADOOP-10646: -- Hi [~tucu00] - I am looking at moving unnestUri into a utils class as well. What is your status on this jira? We should make it the same utils class. I'm thinking ProviderUtils. Any thoughts? > KeyProvider buildVersionName method should be moved to a utils class > > > Key: HADOOP-10646 > URL: https://issues.apache.org/jira/browse/HADOOP-10646 > Project: Hadoop Common > Issue Type: Bug > Components: security >Affects Versions: 3.0.0 >Reporter: Alejandro Abdelnur >Assignee: Alejandro Abdelnur > Fix For: 3.0.0 > > > The buildVersionName() method should not be part of the KeyProvider public > API because keyversions could be opaque (not built based on the keyname and > key generation counter). > KeyProvider implementations may choose to use buildVersionName() for reasons > such as described in HADOOP-10611. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HADOOP-10646) KeyProvider buildVersionName method should be moved to a utils class
[ https://issues.apache.org/jira/browse/HADOOP-10646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1402#comment-1402 ] Owen O'Malley commented on HADOOP-10646: It is a *feature* to have understandable key version names. The job tracker used to create unique identifiers for job ids and task ids too, but we fixed it to use patterns. As a result you can actually understand what is happening. Do not disable the feature. > KeyProvider buildVersionName method should be moved to a utils class > > > Key: HADOOP-10646 > URL: https://issues.apache.org/jira/browse/HADOOP-10646 > Project: Hadoop Common > Issue Type: Bug > Components: security >Affects Versions: 3.0.0 >Reporter: Alejandro Abdelnur >Assignee: Alejandro Abdelnur > Fix For: 3.0.0 > > > The buildVersionName() method should not be part of the KeyProvider public > API because keyversions could be opaque (not built based on the keyname and > key generation counter). > KeyProvider implementations may choose to use buildVersionName() for reasons > such as described in HADOOP-10611. -- This message was sent by Atlassian JIRA (v6.2#6252)
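For reference, the readability Owen is defending comes from deriving the version name from the key name plus a generation counter, rather than an opaque identifier. A minimal sketch, assuming the conventional `name@N` format (the real method is KeyProvider.buildVersionName):

```java
// Sketch of the human-readable key version naming under discussion:
// the key name plus its generation counter. The "name@N" format is an
// assumption about the convention, not a quote of the actual code.
class VersionNames {
    static String buildVersionName(String name, int version) {
        return name + "@" + version;
    }
}
```

With this pattern an operator can tell at a glance that `mykey@3` is the third generation of `mykey`, which is the understandability argument being made above.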
[jira] [Commented] (HADOOP-10557) FsShell -cp -p does not preserve extended ACLs
[ https://issues.apache.org/jira/browse/HADOOP-10557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026630#comment-14026630 ] Chris Nauroth commented on HADOOP-10557: [~ajisakaa], thank you for working on this, and [~jira.shegalov], thank you for code reviewing. May I take a look before anything gets committed? I expect I'll have time no later than Thursday, 6/12. > FsShell -cp -p does not preserve extended ACLs > -- > > Key: HADOOP-10557 > URL: https://issues.apache.org/jira/browse/HADOOP-10557 > Project: Hadoop Common > Issue Type: Bug >Affects Versions: 2.4.0 >Reporter: Akira AJISAKA >Assignee: Akira AJISAKA > Attachments: HADOOP-10557.2.patch, HADOOP-10557.3.patch, > HADOOP-10557.patch > > > This issue tracks enhancing FsShell cp to > * preserve extended ACLs by -p option > or > * add a new command-line option for preserving extended ACLs. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HADOOP-10677) ExportSnapshot fails on kerberized cluster using s3a
[ https://issues.apache.org/jira/browse/HADOOP-10677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David S. Wang updated HADOOP-10677: --- Attachment: HADOOP-10677-1.patch Thanks to Matteo Bertozzi for the patch! > ExportSnapshot fails on kerberized cluster using s3a > > > Key: HADOOP-10677 > URL: https://issues.apache.org/jira/browse/HADOOP-10677 > Project: Hadoop Common > Issue Type: Bug > Components: fs/s3 >Affects Versions: 2.4.0 >Reporter: David S. Wang >Assignee: David S. Wang > Attachments: HADOOP-10677-1.patch > > > When using HBase ExportSnapshot on a kerberized cluster, exporting to s3a > using HADOOP-10400, we see the following problem: > Caused by: java.lang.IllegalArgumentException: java.net.UnknownHostException: > patch283two > at > org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:414) > The problem seems to be that the patch in HADOOP-10400 does not have > getCanonicalServiceName(). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (HADOOP-10677) ExportSnapshot fails on kerberized cluster using s3a
David S. Wang created HADOOP-10677: -- Summary: ExportSnapshot fails on kerberized cluster using s3a Key: HADOOP-10677 URL: https://issues.apache.org/jira/browse/HADOOP-10677 Project: Hadoop Common Issue Type: Bug Components: fs/s3 Affects Versions: 2.4.0 Reporter: David S. Wang Assignee: David S. Wang Attachments: HADOOP-10677-1.patch When using HBase ExportSnapshot on a kerberized cluster, exporting to s3a using HADOOP-10400, we see the following problem: Caused by: java.lang.IllegalArgumentException: java.net.UnknownHostException: patch283two at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:414) The problem seems to be that the patch in HADOOP-10400 does not have getCanonicalServiceName(). -- This message was sent by Atlassian JIRA (v6.2#6252)
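The failure mode described above can be sketched as follows: on a kerberized cluster the delegation-token machinery asks each FileSystem for a canonical service name, and if an s3a "host" (really a bucket name such as the `patch283two` in the stack trace) reaches hostname resolution, it fails with UnknownHostException. Returning null opts a filesystem out of token service resolution. Simplified stand-ins, not the real Hadoop classes:

```java
// Hedged sketch of the token-service interaction behind this bug. The
// interfaces here are simplified stand-ins for FileSystem and SecurityUtil.
class TokenServiceSketch {
    interface FileSystemLike {
        // The real hook is FileSystem.getCanonicalServiceName(); returning
        // null conventionally means "this filesystem issues no tokens".
        String getCanonicalServiceName();
    }

    static String buildTokenService(FileSystemLike fs) {
        String service = fs.getCanonicalServiceName();
        if (service == null) {
            return null; // no delegation tokens needed for this filesystem
        }
        // The real SecurityUtil.buildTokenService would resolve 'service'
        // as host:port here, which is where a bare bucket name blows up.
        return service;
    }
}
```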
[jira] [Commented] (HADOOP-10607) Create an API to Separate Credentials/Password Storage from Applications
[ https://issues.apache.org/jira/browse/HADOOP-10607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026618#comment-14026618 ] Larry McCay commented on HADOOP-10607: -- I should have a patch to address those points some point today. Thanks for the review, Owen! > Create an API to Separate Credentials/Password Storage from Applications > > > Key: HADOOP-10607 > URL: https://issues.apache.org/jira/browse/HADOOP-10607 > Project: Hadoop Common > Issue Type: New Feature > Components: security >Reporter: Larry McCay >Assignee: Larry McCay > Fix For: 3.0.0 > > Attachments: 10607-2.patch, 10607-3.patch, 10607-4.patch, > 10607-5.patch, 10607-6.patch, 10607-7.patch, 10607-8.patch, 10607-9.patch, > 10607.patch > > > As with the filesystem API, we need to provide a generic mechanism to support > multiple credential storage mechanisms that are potentially from third > parties. > We need the ability to eliminate the storage of passwords and secrets in > clear text within configuration files or within code. > Toward that end, I propose an API that is configured using a list of URLs of > CredentialProviders. The implementation will look for implementations using > the ServiceLoader interface and thus support third party libraries. > Two providers will be included in this patch. One using the credentials cache > in MapReduce jobs and the other using Java KeyStores from either HDFS or > local file system. > A CredShell CLI will also be included in this patch which provides the > ability to manage the credentials within the stores. -- This message was sent by Atlassian JIRA (v6.2#6252)
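The mechanism described in this issue — a configured list of provider URLs, with implementations discovered through ServiceLoader — can be sketched as below. Interface and method names are illustrative assumptions, not the API from the attached patches:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

// Sketch of the discovery mechanism described above: each configured URL is
// offered to every factory registered via META-INF/services, and factories
// that recognize the URL scheme contribute a provider.
class CredentialProviderSketch {
    public interface ProviderFactory {
        // Returns null if this factory does not handle the URL's scheme.
        Provider createProvider(String url);
    }

    public interface Provider {
        char[] getCredentialEntry(String alias);
    }

    static List<Provider> getProviders(List<String> providerUrls) {
        List<Provider> result = new ArrayList<>();
        for (String url : providerUrls) {
            // Third-party implementations only need a services registration;
            // no Hadoop code changes are required to add one.
            for (ProviderFactory f : ServiceLoader.load(ProviderFactory.class)) {
                Provider p = f.createProvider(url);
                if (p != null) {
                    result.add(p);
                }
            }
        }
        return result;
    }
}
```

This is the same extension pattern the FileSystem API uses: configuration names the stores by URL, and the classpath supplies the implementations.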
[jira] [Commented] (HADOOP-10376) Refactor refresh*Protocols into a single generic refreshConfigProtocol
[ https://issues.apache.org/jira/browse/HADOOP-10376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026616#comment-14026616 ] Benoy Antony commented on HADOOP-10376: --- In class _RefreshRegistry_, the state could be updated/accessed by different threads and so you need to synchronize the mutators and accessors. > Refactor refresh*Protocols into a single generic refreshConfigProtocol > -- > > Key: HADOOP-10376 > URL: https://issues.apache.org/jira/browse/HADOOP-10376 > Project: Hadoop Common > Issue Type: Improvement >Reporter: Chris Li >Assignee: Chris Li >Priority: Minor > Attachments: HADOOP-10376.patch, HADOOP-10376.patch, > HADOOP-10376.patch, RefreshFrameworkProposal.pdf > > > See https://issues.apache.org/jira/browse/HADOOP-10285 > There are starting to be too many refresh*Protocols We can refactor them to > use a single protocol with a variable payload to choose what to do. > Thereafter, we can return an indication of success or failure. -- This message was sent by Atlassian JIRA (v6.2#6252)
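The review comment above can be illustrated concretely: guard both the mutators and the accessors of the shared handler map with the same lock, so registration and dispatch from different threads stay consistent. A hedged sketch (names are illustrative, not the patch's code):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the synchronization the review asks for: every method that
// touches the shared handler map, reader or writer, takes the same monitor.
class SynchronizedRefreshRegistry {
    public interface RefreshHandler {
        String handleRefresh(String identifier, String[] args);
    }

    private final Map<String, RefreshHandler> handlers = new HashMap<>();

    synchronized void register(String identifier, RefreshHandler h) {
        handlers.put(identifier, h);
    }

    synchronized void unregister(String identifier) {
        handlers.remove(identifier);
    }

    synchronized String dispatch(String identifier, String[] args) {
        RefreshHandler h = handlers.get(identifier);
        if (h == null) {
            throw new IllegalStateException("No handler for: " + identifier);
        }
        return h.handleRefresh(identifier, args);
    }
}
```

Synchronizing the accessor as well as the mutators matters because an unsynchronized HashMap read concurrent with a put has no visibility or safety guarantees.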
[jira] [Commented] (HADOOP-10376) Refactor refresh*Protocols into a single generic refreshConfigProtocol
[ https://issues.apache.org/jira/browse/HADOOP-10376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026620#comment-14026620 ] Benoy Antony commented on HADOOP-10376: --- Would _RefreshHandlerRegistry_ be a better name than _ RefreshRegistry_ ? > Refactor refresh*Protocols into a single generic refreshConfigProtocol > -- > > Key: HADOOP-10376 > URL: https://issues.apache.org/jira/browse/HADOOP-10376 > Project: Hadoop Common > Issue Type: Improvement >Reporter: Chris Li >Assignee: Chris Li >Priority: Minor > Attachments: HADOOP-10376.patch, HADOOP-10376.patch, > HADOOP-10376.patch, RefreshFrameworkProposal.pdf > > > See https://issues.apache.org/jira/browse/HADOOP-10285 > There are starting to be too many refresh*Protocols We can refactor them to > use a single protocol with a variable payload to choose what to do. > Thereafter, we can return an indication of success or failure. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HADOOP-10676) S3AOutputStream not reading new config knobs for multipart configs
[ https://issues.apache.org/jira/browse/HADOOP-10676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David S. Wang updated HADOOP-10676: --- Attachment: HADOOP-10676-1.patch > S3AOutputStream not reading new config knobs for multipart configs > -- > > Key: HADOOP-10676 > URL: https://issues.apache.org/jira/browse/HADOOP-10676 > Project: Hadoop Common > Issue Type: Bug > Components: fs/s3 >Affects Versions: 2.4.0 >Reporter: David S. Wang >Assignee: David S. Wang > Attachments: HADOOP-10676-1.patch > > > S3AOutputStream.java does not have the code to read the new config knobs for > multipart configs. This patch will add that. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (HADOOP-10676) S3AOutputStream not reading new config knobs for multipart configs
David S. Wang created HADOOP-10676: -- Summary: S3AOutputStream not reading new config knobs for multipart configs Key: HADOOP-10676 URL: https://issues.apache.org/jira/browse/HADOOP-10676 Project: Hadoop Common Issue Type: Bug Components: fs/s3 Affects Versions: 2.4.0 Reporter: David S. Wang Assignee: David S. Wang S3AOutputStream.java does not have the code to read the new config knobs for multipart configs. This patch will add that. -- This message was sent by Atlassian JIRA (v6.2#6252)
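For context, the knobs in question are presumably the s3a multipart settings introduced in HADOOP-10400; in core-site.xml they would look something like the fragment below. Property names are assumed from the s3a patch and values are examples only:

```xml
<!-- Illustrative core-site.xml fragment; property names assumed from the
     s3a work in HADOOP-10400, values are examples only. -->
<property>
  <name>fs.s3a.multipart.size</name>
  <value>104857600</value> <!-- upload in ~100 MB parts -->
</property>
<property>
  <name>fs.s3a.multipart.threshold</name>
  <value>2147483647</value> <!-- switch to multipart above this size, in bytes -->
</property>
```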
[jira] [Updated] (HADOOP-10675) Add server-side encryption functionality to s3a
[ https://issues.apache.org/jira/browse/HADOOP-10675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David S. Wang updated HADOOP-10675: --- Attachment: HADOOP-10675-2.patch > Add server-side encryption functionality to s3a > --- > > Key: HADOOP-10675 > URL: https://issues.apache.org/jira/browse/HADOOP-10675 > Project: Hadoop Common > Issue Type: Improvement > Components: fs/s3 >Affects Versions: 2.4.0 >Reporter: David S. Wang >Assignee: David S. Wang > Attachments: HADOOP-10675-1.patch, HADOOP-10675-2.patch > > > The current patch for s3a in HADOOP-10400 does not have the capability to > specify server-side encryption. This JIRA will track the addition of such > functionality to HADOOP-10400, similar to what was done in HADOOP-10568 for > s3n. -- This message was sent by Atlassian JIRA (v6.2#6252)
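If the s3a support follows the shape of the s3n work in HADOOP-10568, server-side encryption would be enabled by a single configuration property along these lines. The property name below is an assumption about the patch, not a confirmed key:

```xml
<!-- Illustrative core-site.xml fragment; the property name is assumed by
     analogy with the s3n option from HADOOP-10568. -->
<property>
  <name>fs.s3a.server-side-encryption-algorithm</name>
  <value>AES256</value> <!-- S3-managed server-side encryption (SSE-S3) -->
</property>
```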
[jira] [Commented] (HADOOP-9099) NetUtils.normalizeHostName fails on domains where UnknownHost resolves to an IP address
[ https://issues.apache.org/jira/browse/HADOOP-9099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026556#comment-14026556 ] Hudson commented on HADOOP-9099: FAILURE: Integrated in Hadoop-Mapreduce-trunk #1797 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1797/]) Moving CHANGES.txt entry for HADOOP-9099 to the correct section. (atm: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1601482) * /hadoop/common/trunk/hadoop-common-project/hadoop-common/CHANGES.txt > NetUtils.normalizeHostName fails on domains where UnknownHost resolves to an > IP address > --- > > Key: HADOOP-9099 > URL: https://issues.apache.org/jira/browse/HADOOP-9099 > Project: Hadoop Common > Issue Type: Bug > Components: test >Affects Versions: 3.0.0, 1-win >Reporter: Ivan Mitic >Assignee: Ivan Mitic >Priority: Minor > Fix For: 1.2.0, 1-win, 2.5.0 > > Attachments: HADOOP-9099.branch-1-win.patch, HADOOP-9099.trunk.patch > > > I just hit this failure. We should use some more unique string for > "UnknownHost": > Testcase: testNormalizeHostName took 0.007 sec > FAILED > expected:<[65.53.5.181]> but was:<[UnknownHost]> > junit.framework.AssertionFailedError: expected:<[65.53.5.181]> but > was:<[UnknownHost]> > at > org.apache.hadoop.net.TestNetUtils.testNormalizeHostName(TestNetUtils.java:347) > Will post a patch in a bit. -- This message was sent by Atlassian JIRA (v6.2#6252)
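The behavior behind this test failure can be sketched as follows: a name that fails to resolve is returned unchanged, but on networks with wildcard DNS even a junk name like "UnknownHost" resolves to an IP, so the test's expectation flips. Simplified sketch; the real method is NetUtils.normalizeHostName:

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Simplified sketch of the normalization at issue: resolve the name to an
// address if possible, otherwise hand the name back as-is. On wildcard-DNS
// networks the "otherwise" branch is never taken, which broke the test.
class NormalizeHost {
    static String normalizeHostName(String name) {
        try {
            return InetAddress.getByName(name).getHostAddress();
        } catch (UnknownHostException e) {
            return name; // unresolvable: return the original name
        }
    }
}
```

This is why the fix replaces "UnknownHost" with a string far less likely to resolve anywhere, rather than changing the normalization logic itself.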
[jira] [Commented] (HADOOP-10664) TestNetUtils.testNormalizeHostName fails
[ https://issues.apache.org/jira/browse/HADOOP-10664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026560#comment-14026560 ] Hudson commented on HADOOP-10664: - FAILURE: Integrated in Hadoop-Mapreduce-trunk #1797 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1797/]) HADOOP-10664. TestNetUtils.testNormalizeHostName fails. Contributed by Aaron T. Myers. (atm: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1601478) * /hadoop/common/trunk/hadoop-common-project/hadoop-common/CHANGES.txt * /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/net/TestNetUtils.java > TestNetUtils.testNormalizeHostName fails > > > Key: HADOOP-10664 > URL: https://issues.apache.org/jira/browse/HADOOP-10664 > Project: Hadoop Common > Issue Type: Bug >Affects Versions: 2.4.0 >Reporter: Chen He >Assignee: Aaron T. Myers > Labels: test > Fix For: 2.5.0 > > Attachments: HADOOP-10664.patch > > > java.lang.AssertionError: null > at org.junit.Assert.fail(Assert.java:86) > at org.junit.Assert.assertTrue(Assert.java:41) > at org.junit.Assert.assertFalse(Assert.java:64) > at org.junit.Assert.assertFalse(Assert.java:74) > at > org.apache.hadoop.net.TestNetUtils.testNormalizeHostName(TestNetUtils.java:617) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HADOOP-9555) HA functionality that uses ZooKeeper may experience inadvertent TCP RST and miss session expiration event due to bug in client connection management
[ https://issues.apache.org/jira/browse/HADOOP-9555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026513#comment-14026513 ] Chris Nauroth commented on HADOOP-9555: --- bq. -1 tests included. The patch doesn't appear to include any new or modified tests. There are no new tests, but the new version of ZooKeeper, including the ZOOKEEPER-1702 patch, gets our existing ZK-related tests passing consistently on Windows. > HA functionality that uses ZooKeeper may experience inadvertent TCP RST and > miss session expiration event due to bug in client connection management > > > Key: HADOOP-9555 > URL: https://issues.apache.org/jira/browse/HADOOP-9555 > Project: Hadoop Common > Issue Type: Bug > Components: ha >Affects Versions: 3.0.0, 2.4.0 >Reporter: Chris Nauroth >Assignee: Chris Nauroth > Attachments: HADOOP-9555.1.patch > > > ZOOKEEPER-1702 tracks a client connection management bug. The bug can cause > an unexpected TCP RST that ultimately prevents delivery of a session > expiration event. The symptoms of the bug seem to show up more frequently on > Windows than on other platforms (though it's not really a Windows-specific > bug). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HADOOP-10664) TestNetUtils.testNormalizeHostName fails
[ https://issues.apache.org/jira/browse/HADOOP-10664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026458#comment-14026458 ] Hudson commented on HADOOP-10664: - FAILURE: Integrated in Hadoop-Hdfs-trunk #1770 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1770/]) HADOOP-10664. TestNetUtils.testNormalizeHostName fails. Contributed by Aaron T. Myers. (atm: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1601478) * /hadoop/common/trunk/hadoop-common-project/hadoop-common/CHANGES.txt * /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/net/TestNetUtils.java > TestNetUtils.testNormalizeHostName fails > > > Key: HADOOP-10664 > URL: https://issues.apache.org/jira/browse/HADOOP-10664 > Project: Hadoop Common > Issue Type: Bug >Affects Versions: 2.4.0 >Reporter: Chen He >Assignee: Aaron T. Myers > Labels: test > Fix For: 2.5.0 > > Attachments: HADOOP-10664.patch > > > java.lang.AssertionError: null > at org.junit.Assert.fail(Assert.java:86) > at org.junit.Assert.assertTrue(Assert.java:41) > at org.junit.Assert.assertFalse(Assert.java:64) > at org.junit.Assert.assertFalse(Assert.java:74) > at > org.apache.hadoop.net.TestNetUtils.testNormalizeHostName(TestNetUtils.java:617) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HADOOP-9099) NetUtils.normalizeHostName fails on domains where UnknownHost resolves to an IP address
[ https://issues.apache.org/jira/browse/HADOOP-9099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026454#comment-14026454 ] Hudson commented on HADOOP-9099: FAILURE: Integrated in Hadoop-Hdfs-trunk #1770 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1770/]) Moving CHANGES.txt entry for HADOOP-9099 to the correct section. (atm: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1601482) * /hadoop/common/trunk/hadoop-common-project/hadoop-common/CHANGES.txt > NetUtils.normalizeHostName fails on domains where UnknownHost resolves to an > IP address > --- > > Key: HADOOP-9099 > URL: https://issues.apache.org/jira/browse/HADOOP-9099 > Project: Hadoop Common > Issue Type: Bug > Components: test >Affects Versions: 3.0.0, 1-win >Reporter: Ivan Mitic >Assignee: Ivan Mitic >Priority: Minor > Fix For: 1.2.0, 1-win, 2.5.0 > > Attachments: HADOOP-9099.branch-1-win.patch, HADOOP-9099.trunk.patch > > > I just hit this failure. We should use a more unique string than > "UnknownHost": > Testcase: testNormalizeHostName took 0.007 sec > FAILED > expected:<[65.53.5.181]> but was:<[UnknownHost]> > junit.framework.AssertionFailedError: expected:<[65.53.5.181]> but > was:<[UnknownHost]> > at > org.apache.hadoop.net.TestNetUtils.testNormalizeHostName(TestNetUtils.java:347) > Will post a patch in a bit.
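The root cause here is wildcard DNS: some resolvers (ISP "search assist" servers in particular) answer an A record for any name at all, so a test host literally named "UnknownHost" can resolve to a real IP such as 65.53.5.181. One way to make such a test robust is to build a name that can never resolve; the helper below is illustrative only, not the actual HADOOP-9099 patch:

```java
import java.util.UUID;

// Illustrative helper: construct a hostname that cannot resolve even
// behind a wildcard DNS server. RFC 2606 reserves the .invalid TLD for
// names that must never resolve, and a random UUID avoids collisions
// with anything cached or hardcoded.
public class UnresolvableHost {
    static String newName() {
        return "host-" + UUID.randomUUID() + ".invalid";
    }

    public static void main(String[] args) {
        String name = newName();
        System.out.println(name);
        // A test can then assert that normalization of this name falls
        // back to returning the input unchanged, instead of hardcoding
        // a string like "UnknownHost" that wildcard DNS may resolve.
    }
}
```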
[jira] [Commented] (HADOOP-10675) Add server-side encryption functionality to s3a
[ https://issues.apache.org/jira/browse/HADOOP-10675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026435#comment-14026435 ] David S. Wang commented on HADOOP-10675: I'll submit this once HADOOP-10400 is committed; this patch is based on top of HADOOP-10400 and will not apply until then. > Add server-side encryption functionality to s3a > --- > > Key: HADOOP-10675 > URL: https://issues.apache.org/jira/browse/HADOOP-10675 > Project: Hadoop Common > Issue Type: Improvement > Components: fs/s3 >Affects Versions: 2.4.0 >Reporter: David S. Wang >Assignee: David S. Wang > Attachments: HADOOP-10675-1.patch > > > The current patch for s3a in HADOOP-10400 does not have the capability to > specify server-side encryption. This JIRA will track the addition of such > functionality to HADOOP-10400, similar to what was done in HADOOP-10568 for > s3n.
[jira] [Updated] (HADOOP-10675) Add server-side encryption functionality to s3a
[ https://issues.apache.org/jira/browse/HADOOP-10675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David S. Wang updated HADOOP-10675: --- Attachment: HADOOP-10675-1.patch
[jira] [Created] (HADOOP-10675) Add server-side encryption functionality to s3a
David S. Wang created HADOOP-10675: -- Summary: Add server-side encryption functionality to s3a Key: HADOOP-10675 URL: https://issues.apache.org/jira/browse/HADOOP-10675 Project: Hadoop Common Issue Type: Improvement Components: fs/s3 Affects Versions: 2.4.0 Reporter: David S. Wang Assignee: David S. Wang The current patch for s3a in HADOOP-10400 does not have the capability to specify server-side encryption. This JIRA will track the addition of such functionality to HADOOP-10400, similar to what was done in HADOOP-10568 for s3n.
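For comparison, HADOOP-10568 exposed s3n server-side encryption as a single configuration property. A plausible s3a equivalent would look like the core-site.xml fragment below; the property name mirrors the s3n one and is an assumption about this patch, since HADOOP-10675 was not yet committed:

```xml
<!-- Hypothetical core-site.xml fragment. The property name mirrors the
     s3n setting added by HADOOP-10568 (fs.s3n.server-side-encryption-algorithm);
     the final s3a name may differ. AES256 requests SSE-S3 encryption. -->
<property>
  <name>fs.s3a.server-side-encryption-algorithm</name>
  <value>AES256</value>
</property>
```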
[jira] [Commented] (HADOOP-10664) TestNetUtils.testNormalizeHostName fails
[ https://issues.apache.org/jira/browse/HADOOP-10664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026341#comment-14026341 ] Hudson commented on HADOOP-10664: - FAILURE: Integrated in Hadoop-Yarn-trunk #579 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/579/]) HADOOP-10664. TestNetUtils.testNormalizeHostName fails. Contributed by Aaron T. Myers. (atm: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1601478) * /hadoop/common/trunk/hadoop-common-project/hadoop-common/CHANGES.txt * /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/net/TestNetUtils.java
[jira] [Commented] (HADOOP-9099) NetUtils.normalizeHostName fails on domains where UnknownHost resolves to an IP address
[ https://issues.apache.org/jira/browse/HADOOP-9099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026337#comment-14026337 ] Hudson commented on HADOOP-9099: FAILURE: Integrated in Hadoop-Yarn-trunk #579 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/579/]) Moving CHANGES.txt entry for HADOOP-9099 to the correct section. (atm: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1601482) * /hadoop/common/trunk/hadoop-common-project/hadoop-common/CHANGES.txt
[jira] [Commented] (HADOOP-10376) Refactor refresh*Protocols into a single generic refreshConfigProtocol
[ https://issues.apache.org/jira/browse/HADOOP-10376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026247#comment-14026247 ] Zesheng Wu commented on HADOOP-10376: - Hi Chris, The proposal and patch both look great to me, and they resolve my doubt about why the NameNode has so many refresh*Protocols. One more minor suggestion: should we mark the old refresh* functions as deprecated? > Refactor refresh*Protocols into a single generic refreshConfigProtocol > -- > > Key: HADOOP-10376 > URL: https://issues.apache.org/jira/browse/HADOOP-10376 > Project: Hadoop Common > Issue Type: Improvement >Reporter: Chris Li >Assignee: Chris Li >Priority: Minor > Attachments: HADOOP-10376.patch, HADOOP-10376.patch, > HADOOP-10376.patch, RefreshFrameworkProposal.pdf > > > See https://issues.apache.org/jira/browse/HADOOP-10285 > There are starting to be too many refresh*Protocols. We can refactor them to > use a single protocol with a variable payload to choose what to do. > Thereafter, we can return an indication of success or failure.
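The single-entry-point idea in the HADOOP-10376 description can be sketched as a registry that dispatches refresh requests by a string identifier and returns a success/failure indication. All names below are illustrative, based on the description above, not the committed API:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the proposal's idea: instead of one RPC protocol per
// refreshable subsystem, handlers register under a string identifier
// and a single generic refresh(identifier, args) call dispatches to
// them. Names are illustrative, not the actual HADOOP-10376 classes.
public class GenericRefreshRegistry {
    interface RefreshHandler {
        /** @return 0 on success, non-zero on failure (like a shell exit code). */
        int handleRefresh(String[] args);
    }

    private final Map<String, RefreshHandler> handlers = new HashMap<>();

    void register(String identifier, RefreshHandler handler) {
        handlers.put(identifier, handler);
    }

    int refresh(String identifier, String[] args) {
        RefreshHandler h = handlers.get(identifier);
        if (h == null) {
            throw new IllegalArgumentException("no handler registered for " + identifier);
        }
        return h.handleRefresh(args);
    }

    public static void main(String[] args) {
        GenericRefreshRegistry registry = new GenericRefreshRegistry();
        // One subsystem registers once; no new per-subsystem protocol needed.
        registry.register("refreshServiceAcl", a -> 0);
        System.out.println(registry.refresh("refreshServiceAcl", new String[0])); // prints 0
    }
}
```

Under this shape, deprecating the old refresh* methods (as suggested in the comment) would steer callers to the one generic entry point while the legacy protocols are phased out.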