I thought this discussion would switch to common-dev@ now?

>> Would it make sense to also package some of the compression libraries, and 
>> maybe some of the text processing from MapReduce? Evolving some of this code 
>> to a common library with few/no dependencies would be generally useful. As a 
>> subproject, it could have a broader scope that could evolve into a viable 
>> TLP.

Sounds like a great idea, one that would make the potential TLP make even
more sense! I thought it could be organized like Apache Commons: security,
compression, and other common text-related things could live in separate,
independent modules. Perhaps Hadoop conf could also be considered. These
modules could rely on a common utility module. It could still have a Hadoop
background, or be Hadoop-powered, and eventually we would have a good place
for some Hadoop common code to move into, benefiting and impacting an even
broader scope than Hadoop itself.

Regards,
Kai

-----Original Message-----
From: Chris Douglas [mailto:cdoug...@apache.org] 
Sent: Thursday, February 04, 2016 7:26 AM
To: hdfs-dev@hadoop.apache.org
Subject: Re: Hadoop encryption module as Apache Chimera incubator project

I went through the repository, and now understand the reasoning that would 
locate this code in Apache Commons. This isn't proposing to extract much of the 
implementation and it takes none of the integration. It's limited to interfaces 
to crypto libraries and streams/configuration. It might be a reasonable fit for 
commons-codec, but that's a pretty sparse library and driving the release 
cadence might be more complicated. It'd be worth discussing on their lists 
(please also CC common-dev@).
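
For anyone who hasn't looked at the repository: the surface in question is
roughly the stream-wrapper style the JDK's own JCE exposes. A minimal
sketch in that style, using plain javax.crypto classes rather than
Chimera's actual interfaces (the transformation, demo key, and class names
below are illustrative only):

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import javax.crypto.Cipher;
    import javax.crypto.CipherInputStream;
    import javax.crypto.CipherOutputStream;
    import javax.crypto.spec.IvParameterSpec;
    import javax.crypto.spec.SecretKeySpec;

    public class StreamCryptoSketch {
      public static void main(String[] args) throws Exception {
        byte[] key = new byte[16];  // demo key; never hard-code real keys
        byte[] iv = new byte[16];   // demo IV; use a random IV in practice
        SecretKeySpec k = new SecretKeySpec(key, "AES");

        // Encrypt a byte stream.
        Cipher enc = Cipher.getInstance("AES/CTR/NoPadding");
        enc.init(Cipher.ENCRYPT_MODE, k, new IvParameterSpec(iv));
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (CipherOutputStream out = new CipherOutputStream(sink, enc)) {
          out.write("hello, chimera".getBytes("UTF-8"));
        }

        // Decrypt it back.
        Cipher dec = Cipher.getInstance("AES/CTR/NoPadding");
        dec.init(Cipher.DECRYPT_MODE, k, new IvParameterSpec(iv));
        try (CipherInputStream in = new CipherInputStream(
            new ByteArrayInputStream(sink.toByteArray()), dec)) {
          int b;
          StringBuilder sb = new StringBuilder();
          while ((b = in.read()) != -1) sb.append((char) b);
          System.out.println(sb);  // prints "hello, chimera"
        }
      }
    }

What Chimera adds over this shape is the openssl-backed native
implementation behind the same kind of stream interface, which is exactly
the part that drags in JNI.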

Chimera would be a boutique TLP, unless we wanted to draw out more of the 
integration and tooling. Is that a goal you're interested in pursuing? There's 
a tension between keeping this focused and including enough functionality to 
make it viable as an independent component. By way of example, Hadoop's common 
project requires too many dependencies and carries too much historical baggage 
for other projects to rely on.
I agree with Colin/Steve: we don't want this to grow into another guava-like 
dependency that creates more work in conflicts than it saves in 
implementation...

Would it make sense to also package some of the compression libraries, and 
maybe some of the text processing from MapReduce? Evolving some of this code to 
a common library with few/no dependencies would be generally useful. As a 
subproject, it could have a broader scope that could evolve into a viable TLP. 
If the encryption libraries are the only ones you're interested in pulling out, 
then Apache Commons does seem like a better target than a separate project. -C


On Wed, Feb 3, 2016 at 1:49 AM, Chris Douglas <cdoug...@apache.org> wrote:
> On Wed, Feb 3, 2016 at 12:48 AM, Gangumalla, Uma 
> <uma.ganguma...@intel.com> wrote:
>>>From the standpoint of a shared, fundamental piece of code like this,
>>>I do think Apache Commons might be the best direction to try as the
>>>first effort. In this direction, we still need to work with the Apache
>>>Commons community on buying in and accepting the proposal.
>> Makes sense.
>
> Makes sense how?
>
>> For this we should define the independent release cycles for this 
>> project and it would just place under Hadoop tree if we all conclude 
>> with this option at the end.
>
> Yes.
>
>> [Chris]
>>>If Chimera is not successful as an independent project or stalls, 
>>>Hadoop and/or Spark and/or $project will have to reabsorb it as 
>>>maintainers.
>>>
>> I am not so convinced on this point. If we assume the project would be
>> unsuccessful, it could be just as unsuccessful (less maintained) under
>> Hadoop. And if other projects depended on this piece, they would get
>> less support either way. Of course, right now we feel this piece of
>> code is very important, and we feel (expect) it can be successful as an
>> independent project, whether it lives as a separate project outside
>> Hadoop or inside it.
>> So I feel this point should not really sway the discussion.
>
> Sure; code can idle anywhere, but that wasn't the point I was after.
> You propose to extract code from Hadoop, but if Chimera fails then 
> what recourse do we have among the other projects taking a dependency 
> on it? Splitting off another project is feasible, but Chimera should 
> be sustainable before this PMC can divest itself of responsibility for 
> security libraries. That's a pretty low bar.
>
> Bundling the library with the jar is helpful; I've used that before.
> It should prefer (updated) libraries from the environment, if 
> configured. Otherwise it's a pain (or impossible) for ops to patch 
> security bugs. -C
>
>>>-----Original Message-----
>>>From: Colin P. McCabe [mailto:cmcc...@apache.org]
>>>Sent: Wednesday, February 3, 2016 4:56 AM
>>>To: hdfs-dev@hadoop.apache.org
>>>Subject: Re: Hadoop encryption module as Apache Chimera incubator 
>>>project
>>>
>>>It's great to see interest in improving this functionality.  I think 
>>>Chimera could be successful as an Apache project.  I don't have a 
>>>strong opinion one way or the other as to whether it belongs as part 
>>>of Hadoop or separate.
>>>
>>>I do think there will be some challenges splitting this functionality 
>>>out into a separate jar, because of the way our CLASSPATH works right now.
>>>For example, let's say that Hadoop depends on Chimera 1.2 and Spark 
>>>depends on Chimera 1.1.  Now Spark jobs have two different versions 
>>>fighting it out on the classpath, similar to the situation with Guava 
>>>and other libraries.  Perhaps if Chimera adopts a policy of strong 
>>>backwards compatibility, we can just always use the latest jar, but 
>>>it still seems likely that there will be problems.  There are various 
>>>classpath isolation ideas that could help here, but they are big 
>>>projects in their own right and we don't have a clear timeline for 
>>>them.  If this does end up being a separate jar, we may need to shade 
>>>it to avoid all these issues.
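>>>
>>>(As an aside: when two versions do end up fighting it out, a quick way
>>>to see which jar the classloader actually resolved is the usual
>>>code-source check; the snippet below is generic, not Chimera-specific,
>>>and takes any Chimera class name as its argument:
>>>
>>>    public class WhichJar {
>>>      public static void main(String[] args) throws Exception {
>>>        Class<?> c = Class.forName(args[0]);
>>>        System.out.println(
>>>            c.getProtectionDomain().getCodeSource().getLocation());
>>>      }
>>>    }
>>>
>>>Shading sidesteps the problem by relocating the packages so the two
>>>copies can no longer collide.)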
>>>
>>>Bundling the JNI glue code in the jar itself is an interesting idea, 
>>>which we have talked about before for libhadoop.so.  It doesn't 
>>>really have anything to do with the question of TLP vs. non-TLP, of course.
>>>We could do that refactoring in Hadoop itself.  The really 
>>>complicated part of bundling JNI code in a jar is that you need to 
>>>create jars for every cross product of (JVM version, openssl version, 
>>>operating system).
>>>For example, you have the RHEL6 build for openJDK7 using openssl 1.0.1e.
>>>If you change any one thing-- say, change openJDK7 to Oracle JDK8, 
>>>then you might need to rebuild.  And certainly using Ubuntu would be 
>>>a rebuild.  And so forth.  This kind of clashes with Maven's 
>>>philosophy of pulling prebuilt jars from the internet.
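>>>
>>>To make the bundling pattern concrete: the usual approach is a loader
>>>that prefers the library installed in the environment (so ops can patch
>>>security bugs) and only falls back to a copy bundled in the jar, keyed
>>>by platform. A rough sketch; the library and resource names here are
>>>made up:
>>>
>>>    public final class NativeLoader {
>>>      public static void load() throws java.io.IOException {
>>>        try {
>>>          // Prefer the (possibly security-patched) copy found on
>>>          // java.library.path.
>>>          System.loadLibrary("chimera_glue");  // hypothetical name
>>>          return;
>>>        } catch (UnsatisfiedLinkError e) {
>>>          // Fall through to the bundled copy.
>>>        }
>>>        String os = System.getProperty("os.name")
>>>            .toLowerCase().replace(" ", "");
>>>        String arch = System.getProperty("os.arch");
>>>        String res = "/native/" + os + "-" + arch
>>>            + "/libchimera_glue.so";
>>>        try (java.io.InputStream in =
>>>            NativeLoader.class.getResourceAsStream(res)) {
>>>          if (in == null) {
>>>            throw new UnsatisfiedLinkError("no bundled library: " + res);
>>>          }
>>>          // Extract the bundled library to a temp file and load it.
>>>          java.nio.file.Path tmp = java.nio.file.Files
>>>              .createTempFile("libchimera_glue", ".so");
>>>          tmp.toFile().deleteOnExit();
>>>          java.nio.file.Files.copy(in, tmp,
>>>              java.nio.file.StandardCopyOption.REPLACE_EXISTING);
>>>          System.load(tmp.toAbsolutePath().toString());
>>>        }
>>>      }
>>>    }
>>>
>>>The (os, arch) key in the resource path is exactly where the cross
>>>product bites: every bundled .so still has to be built per
>>>OS/JVM/openssl combination.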
>>>
>>>Kai Zheng's question about whether we would bundle openSSL's 
>>>libraries is a good one.  Given the high rate of new vulnerabilities 
>>>discovered in that library, it seems like bundling would require 
>>>Hadoop users and vendors to update very frequently, much more 
>>>frequently than Hadoop is traditionally updated.  So probably we would not 
>>>choose to bundle openssl.
>>>
>>>best,
>>>Colin
>>>
>>>On Tue, Feb 2, 2016 at 12:29 AM, Chris Douglas <cdoug...@apache.org>
>>>wrote:
>>>> As a subproject of Hadoop, Chimera could maintain its own cadence.
>>>> There's also no reason why it should maintain dependencies on other 
>>>> parts of Hadoop, if those are separable. How is this solution 
>>>> inadequate?
>>>>
>>>> If Chimera is not successful as an independent project or stalls, 
>>>> Hadoop and/or Spark and/or $project will have to reabsorb it as 
>>>> maintainers. Projects have high mortality in early life, and a 
>>>> fight over inheritance/maintenance is something we'd like to avoid. 
>>>> If, on the other hand, it develops enough of a community where it 
>>>> is obviously viable, then we can (and should) break it out as a TLP 
>>>> (as we have before). If other Apache projects take a dependency on 
>>>> Chimera, we're open to adding them to security@hadoop.
>>>>
>>>> Unlike Yetus, which was largely rewritten right before it was made 
>>>> into a TLP, security in Hadoop has a complicated pedigree. If 
>>>> Chimera eventually becomes a TLP, it seems fair to include those 
>>>> who work on it while it is a subproject. Declared upfront, that 
>>>> criterion is fairer than any post hoc justification, and will lead 
>>>> to a more accurate account of its community than a subset of the 
>>>> Hadoop PMC/committers that volunteer. -C
>>>>
>>>>
>>>> On Mon, Feb 1, 2016 at 9:29 PM, Chen, Haifeng 
>>>><haifeng.c...@intel.com>
>>>>wrote:
>>>>> Thanks to all the folks for providing feedback and participating in
>>>>>the discussions.
>>>>>
>>>>> @Owen, do you still have any concerns on going forward in the 
>>>>>direction of Apache Commons (or other options, TLP)?
>>>>>
>>>>> Thanks,
>>>>> Haifeng
>>>>>
>>>>> -----Original Message-----
>>>>> From: Chen, Haifeng [mailto:haifeng.c...@intel.com]
>>>>> Sent: Saturday, January 30, 2016 10:52 AM
>>>>> To: hdfs-dev@hadoop.apache.org
>>>>> Subject: RE: Hadoop encryption module as Apache Chimera incubator 
>>>>> project
>>>>>
>>>>>>> I believe encryption is becoming a core part of Hadoop. I think  
>>>>>>>that moving core components out of Hadoop is bad from a project 
>>>>>>>management perspective.
>>>>>
>>>>>> Although it's certainly true that encryption capabilities (in 
>>>>>>HDFS, YARN, etc.) are becoming core to Hadoop, I don't think that 
>>>>>>should really influence whether or not the non-Hadoop-specific 
>>>>>>encryption routines should be part of the Hadoop code base, or 
>>>>>>part of the code base of another project that Hadoop depends on. 
>>>>>>If Chimera had existed as a library hosted at ASF when HDFS 
>>>>>>encryption was first developed, HDFS probably would have just 
>>>>>>added that as a dependency and been done with it. I don't think we 
>>>>>>would've copy/pasted the code for Chimera into the Hadoop code base.
>>>>>
>>>>> Agree with ATM. I also want to make an additional clarification. I
>>>>>agree that the encryption capabilities are becoming core to Hadoop.
>>>>>But this effort is about putting the common, shared encryption
>>>>>routines, such as the crypto stream implementations, into a scope
>>>>>that can be widely shared across the Apache ecosystem. It does not
>>>>>move Hadoop encryption out of Hadoop (that would not be possible).
>>>>>
>>>>> Agreed that making this a separately and independently released
>>>>>project within Hadoop goes a step beyond the existing approach and
>>>>>solves some issues (such as the libhadoop.so problem). Frankly
>>>>>speaking, I don't think it is the best option we can try. I also
>>>>>expect that an independently released project within Hadoop core
>>>>>would complicate the existing release model of Hadoop.
>>>>>
>>>>> Thanks,
>>>>> Haifeng
>>>>>
>>>>> -----Original Message-----
>>>>> From: Aaron T. Myers [mailto:a...@cloudera.com]
>>>>> Sent: Friday, January 29, 2016 9:51 AM
>>>>> To: hdfs-dev@hadoop.apache.org
>>>>> Subject: Re: Hadoop encryption module as Apache Chimera incubator 
>>>>> project
>>>>>
>>>>> On Wed, Jan 27, 2016 at 11:31 AM, Owen O'Malley 
>>>>><omal...@apache.org>
>>>>>wrote:
>>>>>
>>>>>> I believe encryption is becoming a core part of Hadoop. I think 
>>>>>>that  moving core components out of Hadoop is bad from a project 
>>>>>>management perspective.
>>>>>>
>>>>>
>>>>> Although it's certainly true that encryption capabilities (in 
>>>>>HDFS,  YARN,
>>>>> etc.) are becoming core to Hadoop, I don't think that should 
>>>>>really influence whether or not the non-Hadoop-specific encryption 
>>>>>routines should be part of the Hadoop code base, or part of the 
>>>>>code base of another project that Hadoop depends on. If Chimera had 
>>>>>existed as a library hosted at ASF when HDFS encryption was first 
>>>>>developed, HDFS probably would have just added that as a dependency 
>>>>>and been done with it. I don't think we would've copy/pasted the 
>>>>>code for Chimera into the Hadoop code base.
>>>>>
>>>>>
>>>>>> To put it another way, a bug in the encryption routines will 
>>>>>> likely become a security problem that security@hadoop needs to hear 
>>>>>> about.
>>>>>>
>>>>>> I don't think
>>>>>> adding a separate project in the middle of that communication 
>>>>>>chain  is a good idea. The same applies to data corruption 
>>>>>>problems, and so on...
>>>>>>
>>>>>
>>>>> Isn't the same true of all the libraries that Hadoop currently 
>>>>>depends upon? If the commons-httpclient library (or commons-codec, 
>>>>>or commons-io, or guava, or...) has a security vulnerability, we 
>>>>>need to know about it so that we can update our dependency to a fixed 
>>>>>version.
>>>>>This case doesn't seem materially different than that.
>>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>> > It may be good to keep it in a generalized place (as in the
>>>>>> > discussion, we thought that place could be Apache Commons).
>>>>>>
>>>>>>
>>>>>> Apache Commons is a collection of *Java* projects, so Chimera as 
>>>>>> a JNI-based library isn't a natural fit.
>>>>>>
>>>>>
>>>>> Could very well be that Apache Commons's charter would preclude 
>>>>>Chimera.
>>>>> You probably know better than I do about that.
>>>>>
>>>>>
>>>>>> Furthermore, Apache Commons doesn't have its own security list so 
>>>>>> problems will go to the generic secur...@apache.org.
>>>>>>
>>>>>
>>>>> That seems easy enough to remedy, if they wanted to, and besides I'm
>>>>>not sure why that would influence this discussion. In my experience
>>>>>projects that don't have a separate security@project.a.o mailing list
>>>>>tend to just handle security issues on their private@project.a.o
>>>>>mailing list, which seems fine to me.
>>>>>
>>>>>
>>>>>>
>>>>>> Why do you think that Apache Commons is a better home than Hadoop?
>>>>>>
>>>>>
>>>>> I'm certainly not at all wedded to Apache Commons; that just seemed
>>>>>like a natural place to put it to me. Could be that a brand new TLP
>>>>>might make more sense.
>>>>>
>>>>> I *do* think that if other non-Hadoop projects want to make use of
>>>>>Chimera, which as I understand it is the goal which started this
>>>>>thread, then Chimera should exist outside of Hadoop so that:
>>>>>
>>>>> a) Projects that have nothing to do with Hadoop can just depend
>>>>>directly on Chimera, which has nothing Hadoop-specific in there.
>>>>>
>>>>> b) The Hadoop project doesn't have to export/maintain/concern itself
>>>>>with yet another publicly-consumed interface.
>>>>>
>>>>> c) Chimera can have its own (presumably much faster) release cadence
>>>>>completely separate from Hadoop.
>>>>>
>>>>> --
>>>>> Aaron T. Myers
>>>>> Software Engineer, Cloudera
>>
