Thanks Chris and Colin for your opinions.

>> [Chris] If Chimera is not successful as an independent project or stalls, 
>> Hadoop and/or Spark and/or $project will have to reabsorb it as maintainers. 
I understand the concern. One point to consider: Chimera is dedicated to a 
specific domain, optimized cryptography, much as Apache Commons Logging is 
dedicated to logging. It is not as fast-moving as other Apache projects. 
Of course, whether it lives as part of Hadoop or separately, both paths have 
uncertainties. I am not strongly opposed to one way or the other. 

From the standpoint of a shared, fundamental piece of code like this, I do 
think Apache Commons might be the best direction to try as a first effort. In 
this direction, we would still need to work with the Apache Commons community 
to get buy-in and acceptance of the proposal. 

On the other hand, for the direction of a sub-project within Hadoop, I am 
uncertain where the sub-project would live and how it would maintain its own 
cadence inside Hadoop. Hadoop has modules like Hadoop Common, Hadoop HDFS, 
Hadoop YARN, and Hadoop MapReduce, and these modules share the same release 
cycle and are released together. Am I right?

>> [Colin] I do think there will be some challenges splitting this 
>> functionality out into a separate jar, because of the way our CLASSPATH 
>> works right now.
Yes, these challenges are common for shared libraries in Java. Just as you 
mentioned, keeping strong API compatibility and using classpath isolation are 
two practical approaches.
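
To make the classpath isolation idea concrete, here is a minimal sketch of one 
way it could look: a child-first class loader that resolves classes from its 
own jars before delegating to the parent, so a Chimera version loaded through 
it would not clash with another version on the application classpath. The 
class name and usage are my own illustration, not an existing Hadoop or 
Chimera API:

    // Hypothetical sketch: a child-first class loader. A library version
    // loaded through it does not clash with the same library elsewhere
    // on the application classpath.
    import java.net.URL;
    import java.net.URLClassLoader;

    public class ChildFirstClassLoader extends URLClassLoader {
        public ChildFirstClassLoader(URL[] jars, ClassLoader parent) {
            super(jars, parent);
        }

        @Override
        protected Class<?> loadClass(String name, boolean resolve)
                throws ClassNotFoundException {
            synchronized (getClassLoadingLock(name)) {
                Class<?> c = findLoadedClass(name);
                if (c == null) {
                    try {
                        c = findClass(name);              // our own jars first
                    } catch (ClassNotFoundException e) {
                        c = super.loadClass(name, false); // then the parent
                    }
                }
                if (resolve) {
                    resolveClass(c);
                }
                return c;
            }
        }
    }

Shading (relocating the Chimera packages into the consumer's own namespace), 
as you suggested, would avoid the conflict without any custom loader, at the 
cost of each consumer carrying its own copy.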

>> [Colin] The really complicated part of bundling JNI code in a jar is that 
>> you need to create jars for every cross product.
Building does get complex across platforms, but it might not be as complex as 
described where the native code is concerned. First, building against JDK7 or 
JDK8 is a common consideration for all Java libraries, I think; it is not 
specific to building the JNI code. (Correct me if I am wrong.) Second, it is 
still possible to isolate the native build so that you don't have to produce 
different versions for Ubuntu and RHEL. Third, if the library dynamically 
links to openssl and the openssl API it uses does not change across versions, 
we don't have to build different versions for that either. 

So the build matrix might reduce to Linux32, Linux64, Windows32, Windows64, Mac... 
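
Assuming each platform's binary is bundled under a per-platform resource path 
inside the jar (the layout and library names below are my own illustration, 
not Chimera's actual one), a small loader could pick the right binary at 
runtime, along these lines:

    // Hypothetical sketch: extract and load the bundled native library
    // matching the current OS and architecture. The "/native/<platform>/"
    // resource layout is an assumption for illustration only.
    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardCopyOption;

    public final class NativeLoader {
        public static void load() throws Exception {
            String os = System.getProperty("os.name").toLowerCase();
            String arch =
                System.getProperty("os.arch").contains("64") ? "64" : "32";
            String platform, libName;
            if (os.contains("win")) {
                platform = "windows" + arch;
                libName = "chimera.dll";
            } else if (os.contains("mac")) {
                platform = "mac";
                libName = "libchimera.dylib";
            } else {
                platform = "linux" + arch;
                libName = "libchimera.so";
            }
            try (InputStream in = NativeLoader.class
                    .getResourceAsStream("/native/" + platform + "/" + libName)) {
                if (in == null) {
                    throw new UnsupportedOperationException(
                        "no bundled native build for " + platform);
                }
                // Copy to a temp file, since System.load needs a file path.
                Path tmp = Files.createTempFile("chimera-", "-" + libName);
                Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
                System.load(tmp.toAbsolutePath().toString());
            }
        }
    }

With a loader like that, one jar per row of the matrix (or a single jar 
containing all the binaries) could cover the platforms without per-distro 
rebuilds, as long as the openssl linkage stays dynamic.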

>> [Colin] So probably we would not choose to bundle openssl.
Agreed. Bundling openssl is not a good idea, considering the need to upgrade 
quickly for vulnerabilities.


Regards,
Haifeng

-----Original Message-----
From: Colin P. McCabe [mailto:cmcc...@apache.org] 
Sent: Wednesday, February 3, 2016 4:56 AM
To: hdfs-dev@hadoop.apache.org
Subject: Re: Hadoop encryption module as Apache Chimera incubator project

It's great to see interest in improving this functionality.  I think Chimera 
could be successful as an Apache project.  I don't have a strong opinion one 
way or the other as to whether it belongs as part of Hadoop or separate.

I do think there will be some challenges splitting this functionality out into 
a separate jar, because of the way our CLASSPATH works right now.  For example, 
let's say that Hadoop depends on Chimera 1.2 and Spark depends on Chimera 1.1.  
Now Spark jobs have two different versions fighting it out on the classpath, 
similar to the situation with Guava and other libraries.  Perhaps if Chimera 
adopts a policy of strong backwards compatibility, we can just always use the 
latest jar, but it still seems likely that there will be problems.  There are 
various classpath isolation ideas that could help here, but they are big 
projects in their own right and we don't have a clear timeline for them.  If 
this does end up being a separate jar, we may need to shade it to avoid all 
these issues.

Bundling the JNI glue code in the jar itself is an interesting idea, which we 
have talked about before for libhadoop.so.  It doesn't really have anything to 
do with the question of TLP vs. non-TLP, of course.
We could do that refactoring in Hadoop itself.  The really complicated part of 
bundling JNI code in a jar is that you need to create jars for every cross 
product of (JVM version, openssl version, operating system).  For example, you 
have the RHEL6 build for openJDK7 using openssl 1.0.1e.  If you change any one 
thing-- say, change openJDK7 to Oracle JDK8, then you might need to rebuild.  
And certainly using Ubuntu would be a rebuild.  And so forth.  This kind of 
clashes with Maven's philosophy of pulling prebuilt jars from the internet.

Kai Zheng's question about whether we would bundle openSSL's libraries is a 
good one.  Given the high rate of new vulnerabilities discovered in that 
library, it seems like bundling would require Hadoop users and vendors to 
update very frequently, much more frequently than Hadoop is traditionally 
updated.  So probably we would not choose to bundle openssl.

best,
Colin

On Tue, Feb 2, 2016 at 12:29 AM, Chris Douglas <cdoug...@apache.org> wrote:
> As a subproject of Hadoop, Chimera could maintain its own cadence.
> There's also no reason why it should maintain dependencies on other 
> parts of Hadoop, if those are separable. How is this solution 
> inadequate?
>
> If Chimera is not successful as an independent project or stalls, 
> Hadoop and/or Spark and/or $project will have to reabsorb it as 
> maintainers. Projects have high mortality in early life, and a fight 
> over inheritance/maintenance is something we'd like to avoid. If, on 
> the other hand, it develops enough of a community where it is 
> obviously viable, then we can (and should) break it out as a TLP (as 
> we have before). If other Apache projects take a dependency on 
> Chimera, we're open to adding them to security@hadoop.
>
> Unlike Yetus, which was largely rewritten right before it was made 
> into a TLP, security in Hadoop has a complicated pedigree. If Chimera 
> eventually becomes a TLP, it seems fair to include those who work on 
> it while it is a subproject. Declared upfront, that criterion is 
> fairer than any post hoc justification, and will lead to a more 
> accurate account of its community than a subset of the Hadoop 
> PMC/committers that volunteer. -C
>
>
> On Mon, Feb 1, 2016 at 9:29 PM, Chen, Haifeng <haifeng.c...@intel.com> wrote:
>> Thanks to all the folks providing feedback and participating in the discussion.
>>
>> @Owen, do you still have any concerns on going forward in the direction of 
>> Apache Commons (or other options, TLP)?
>>
>> Thanks,
>> Haifeng
>>
>> -----Original Message-----
>> From: Chen, Haifeng [mailto:haifeng.c...@intel.com]
>> Sent: Saturday, January 30, 2016 10:52 AM
>> To: hdfs-dev@hadoop.apache.org
>> Subject: RE: Hadoop encryption module as Apache Chimera incubator 
>> project
>>
>>>> I believe encryption is becoming a core part of Hadoop. I think 
>>>> that moving core components out of Hadoop is bad from a project management 
>>>> perspective.
>>
>>> Although it's certainly true that encryption capabilities (in HDFS, YARN, 
>>> etc.) are becoming core to Hadoop, I don't think that should really 
>>> influence whether or not the non-Hadoop-specific encryption routines should 
>>> be part of the Hadoop code base, or part of the code base of another 
>>> project that Hadoop depends on. If Chimera had existed as a library hosted 
>>> at ASF when HDFS encryption was first developed, HDFS probably would have 
>>> just added that as a dependency and been done with it. I don't think we 
>>> would've copy/pasted the code for Chimera into the Hadoop code base.
>>
>> Agree with ATM. I also want to make an additional clarification. I agree 
>> that the encryption capabilities are becoming core to Hadoop, but this 
>> effort is about putting the common, shared encryption routines, such as the 
>> crypto stream implementations, into a scope that can be widely shared across 
>> the Apache ecosystem. It doesn't move Hadoop encryption out of Hadoop (that 
>> is not possible).
>>
>> I agree that making it a separately and independently released project 
>> within Hadoop goes a step beyond the existing approach and solves some 
>> issues (such as the libhadoop.so problem). Frankly speaking, I think it is 
>> not the best option we can try. I also expect that an independent release 
>> project within Hadoop core would complicate Hadoop's existing release 
>> model.
>>
>> Thanks,
>> Haifeng
>>
>> -----Original Message-----
>> From: Aaron T. Myers [mailto:a...@cloudera.com]
>> Sent: Friday, January 29, 2016 9:51 AM
>> To: hdfs-dev@hadoop.apache.org
>> Subject: Re: Hadoop encryption module as Apache Chimera incubator 
>> project
>>
>> On Wed, Jan 27, 2016 at 11:31 AM, Owen O'Malley <omal...@apache.org> wrote:
>>
>>> I believe encryption is becoming a core part of Hadoop. I think that 
>>> moving core components out of Hadoop is bad from a project management 
>>> perspective.
>>>
>>
>> Although it's certainly true that encryption capabilities (in HDFS, 
>> YARN,
>> etc.) are becoming core to Hadoop, I don't think that should really 
>> influence whether or not the non-Hadoop-specific encryption routines should 
>> be part of the Hadoop code base, or part of the code base of another project 
>> that Hadoop depends on. If Chimera had existed as a library hosted at ASF 
>> when HDFS encryption was first developed, HDFS probably would have just 
>> added that as a dependency and been done with it. I don't think we would've 
>> copy/pasted the code for Chimera into the Hadoop code base.
>>
>>
>>> To put it another way, a bug in the encryption routines will likely 
>>> become a security problem that security@hadoop needs to hear about.
>>>
>>> I don't think
>>> adding a separate project in the middle of that communication chain 
>>> is a good idea. The same applies to data corruption problems, and so on...
>>>
>>
>> Isn't the same true of all the libraries that Hadoop currently depends upon? 
>> If the commons-httpclient library (or commons-codec, or commons-io, or 
>> guava, or...) has a security vulnerability, we need to know about it so that 
>> we can update our dependency to a fixed version. This case doesn't seem 
>> materially different than that.
>>
>>
>>>
>>>
>>> > It may be good to keep at generalized place(As in the discussion, 
>>> > we thought that place could be Apache Commons).
>>>
>>>
>>> Apache Commons is a collection of *Java* projects, so Chimera as a 
>>> JNI-based library isn't a natural fit.
>>>
>>
>> Could very well be that Apache Commons's charter would preclude Chimera.
>> You probably know better than I do about that.
>>
>>
>>> Furthermore, Apache Commons doesn't
>>> have its own security list so problems will go to the generic 
>>> secur...@apache.org.
>>>
>>
>> That seems easy enough to remedy, if they wanted to, and besides I'm not 
>> sure why that would influence this discussion. In my experience projects 
>> that don't have a separate security@project.a.o mailing list tend to just 
>> handle security issues on their private@project.a.o mailing list, which 
>> seems fine to me.
>>
>>
>>>
>>> Why do you think that Apache Commons is a better home than Hadoop?
>>>
>>
>> I'm certainly not at all wedded to Apache Commons, that just seemed like a 
>> natural place to put it to me. Could be that a brand new TLP might make more 
>> sense.
>>
>> I *do* think that if other non-Hadoop projects want to make use of Chimera, 
>> which as I understand it is the goal which started this thread, then Chimera 
>> should exist outside of Hadoop so that:
>>
>> a) Projects that have nothing to do with Hadoop can just depend directly on 
>> Chimera, which has nothing Hadoop-specific in there.
>>
>> b) The Hadoop project doesn't have to export/maintain/concern itself with 
>> yet another publicly-consumed interface.
>>
>> c) Chimera can have its own (presumably much faster) release cadence 
>> completely separate from Hadoop.
>>
>> --
>> Aaron T. Myers
>> Software Engineer, Cloudera
