Re: [VOTE] Shall we adopt the "Defining Hadoop" page

2011-06-26 Thread Roy T. Fielding
On Jun 24, 2011, at 10:08 PM, Doug Cutting wrote:

> On 06/24/2011 07:07 PM, Owen O'Malley wrote:
>> On Jun 24, 2011, at 6:43 AM, Doug Cutting wrote:
>> 
>>> Might it be better to improve the existing Apache trademark policy
>>> page?
>> 
>> When the project is having trouble agreeing, reaching agreement at
>> the foundation level seems unrealistic.
> 
> ASF trademark policy is set by Shane, VP Trademark, not by a committee.

If we apply trademark policy to this discussion, then the only answer possible
is that only releases made by the Apache Hadoop PMC can be called Hadoop.
That is, after all, the essence of board delegation to PMCs and the meaning
of trademarks.

Traditionally, we have also allowed distributions that apply
released security patches, for example as found in 

   http://www.apache.org/dist/httpd/patches/

and turned a blind eye toward changes that are purely to port to a new platform.
I did not write those exceptions down because I don't know what (if any) impact
they might have on enforcement.

I said before that we typically don't argue about distributions that include
revisions that are on a release branch, but that assumed the project is
actually working toward a release of that branch.  I have a hard time believing
that Hadoop's trunk is a release branch.  In any case, this very specific
exception should be entirely decided by the project -- the VP of Trademarks
has no role in deciding what is the purview of each PMC, namely the decision
on what is or is not released in the name of that project.

Roy



Re: [VOTE] Shall we adopt the "Defining Hadoop" page

2011-06-24 Thread Doug Cutting
On 06/24/2011 07:07 PM, Owen O'Malley wrote:
> On Jun 24, 2011, at 6:43 AM, Doug Cutting wrote:
> 
>> Might it be better to improve the existing Apache trademark policy
>> page?
> 
> When the project is having trouble agreeing, reaching agreement at
> the foundation level seems unrealistic.

ASF trademark policy is set by Shane, VP Trademark, not by a committee.

Doug


Re: [VOTE] Shall we adopt the "Defining Hadoop" page

2011-06-24 Thread Owen O'Malley

On Jun 24, 2011, at 6:43 AM, Doug Cutting wrote:

> Might it be better to improve the existing Apache trademark policy page?

When the project is having trouble agreeing, reaching agreement at the 
foundation level seems unrealistic. Let's reach a workable solution for Hadoop, 
see how it functions in practice, iterate and improve, and then we can consider 
pushing it to the entire foundation.

-- Owen

Re: [VOTE] Shall we adopt the "Defining Hadoop" page

2011-06-24 Thread Doug Cutting
On 06/24/2011 10:26 AM, Owen O'Malley wrote:
> Having a clearly stated trademark policy on the website will help
> significantly when contacting organizations that are misusing the
> trademark, so I don't want to postpone this too long. Let's discuss
> it for a week and then call a new vote if we think that is merited.

Might it be better to improve the existing Apache trademark policy page?

http://www.apache.org/foundation/marks/

This way all projects can benefit, e.g., Pig, Hive, Zookeeper, etc.

We might, e.g., propose more examples of acceptable and not-acceptable
uses of Apache marks there, etc.  We can work with trademarks@ to build
a library of boilerplate letters to be sent to folks who use Apache
marks in objectionable ways.

Doug


Re: [VOTE] Shall we adopt the "Defining Hadoop" page

2011-06-24 Thread Owen O'Malley

On Jun 14, 2011, at 3:56 PM, Owen O'Malley wrote:

> All,
> Steve Loughran has done some great work on defining what can be called 
> Hadoop at http://wiki.apache.org/hadoop/Defining%20Hadoop. After some cleanup 
> from Noirin and Shane, I think we've got a really good base. I'd like a vote 
> to approve the content (at the current revision 12) and put the content on 
> our web site.

Binding +1:  Arun, Chris, Ian, Owen
Binding -1: Doug, Eli, Todd

Non-binding +1: Allen, Cos, Matt, Steve
Non-binding -1: Ted, Jeff

Well, technically this passed, but we've been encouraged to discuss it more. 
Personally, I'd love to get more feedback from Larry about how we should 
accomplish the goal of getting packagers to either use Apache Hadoop releases 
or only use "powered by Hadoop." Clearly there is a significant difference of 
opinion about the value of that goal that is unlikely to be resolved by debate.

Having a clearly stated trademark policy on the website will help 
significantly when contacting organizations that are misusing the trademark, 
so I don't want to postpone this too long. Let's discuss it for a week and then 
call a new vote if we think that is merited.

-- Owen

Re: [VOTE] Shall we adopt the "Defining Hadoop" page

2011-06-22 Thread Eric Baldeschwieler
I agree with this.

We need to find a middle ground that achieves three aims:

1) Makes it clear that an ASF release of Hadoop is THE APACHE HADOOP.  Jeff's 
manpower argument actually reinforces this.  We need a very testable definition 
of what an Apache Hadoop release is, or enforcement will be impossible because 
each test of the policy might require a visit to the Supreme Court.  "Its MD5 
matches the MD5 of an Apache release" is a clear definition (see the sketch 
after this list).

2) We need a proposal for derived products that vendors feel are branding 
friendly.  These should be clear enough that users understand the difference 
between a product that packages Apache Hadoop (MD5 test), one that is 
completely open source under the Apache license (easy to test) and one that 
simply uses some subset of the code under a more restrictive license or closed 
source.

3) Compatibility: I think it would be great to harness all this energy around 
compatibility to start a compatibility suite inside the Apache Hadoop project.  
Then we could define "compatible with Apache Hadoop" in a clear way controlled by 
the Apache Hadoop PMC.  With luck, vendors on both sides of the debate will be 
incentivized to contribute to the project this way.  Such a suite would also 
prove useful to the developers of Apache Hadoop.
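
As a rough illustration of the MD5 test in (1), here is a minimal Python sketch; 
the tarball name and the expected digest are hypothetical placeholders, not 
values from any actual release:

    import hashlib

    def md5_of(path, chunk_size=1 << 20):
        """Return the hex MD5 digest of a file, read in chunks."""
        digest = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    # Digest published alongside the official release artifact
    # (hypothetical value, for illustration only).
    expected = "9e107d9d372bb6826bd81d3542a419d6"

    if md5_of("hadoop-0.20.203.0.tar.gz") == expected:
        print("bit-for-bit identical to the Apache release")
    else:
        print("a derived/repackaged work")

Anything that fails that comparison would fall under the derived-product 
wording in (2).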

E14

On Jun 20, 2011, at 10:09 AM, Ted Dunning wrote:

> Great summary Andrew.
> 
> I would add one more precipitating factor here.  That is the arrival of a
> number of products which are very close to the Apache version of Hadoop but
> for which there is no good and widely accepted terminology that gives proper
> credit to their lineage while making clear the distinction from bit-for-bit
> copies of official Apache releases.
> 
> Some products are analogous to hive, pig or hbase in that they are
> independent systems that run ON hadoop (or close equivalents).  These have
> no terminology problem because these products aren't hadoop, but rather use
> hadoop.
> 
> Other products contain Hadoop internally as a critical component but do not
> necessarily expose Hadoop capabilities to the end user (I can't name these
> products, but they exist).  These products have little nomenclatural
> difficulty because the powered-by-Hadoop description fits very well.
> 
> The products with the terminology problem are the ones that add either
> curation and packaging (Cloudera) or substantial additional performance
> enhancing components (MapR).  These products are upwardly compatible with
> Apache Hadoop in that programs that run on Hadoop will very probably run on
> these Hadoop-like systems.  The problem is that there is no good term for
> these products.  They may even contain components that are bit-for-bit
> identical to the corresponding components of Apache releases.  It is fair to say
> that these are not Apache released software, but it is also fair to say that
> there ought to be a better name for the class of these products.
> 
> On Mon, Jun 20, 2011 at 4:39 PM, Andrew Purtell  wrote:
> 
>> Hadoop I think needs to be more careful. What triggered this discussion is
>> the arrival of new players releasing products they call Hadoop but
>> containing severe changes that the community, by way of the ASF umbrella we all
>> work under, had nothing to do with designing or developing. And some of
>> these are being open sourced as a Hadoop. There is no Linus here. Which of
>> these is _the_ Hadoop? As a would-be contributor, which should I select?
>> 



Re: [VOTE] Shall we adopt the "Defining Hadoop" page

2011-06-20 Thread Steve Loughran

On 17/06/2011 19:17, Konstantin Boudnik wrote:

On Fri, Jun 17, 2011 at 12:01PM, Steve Loughran wrote:

On 15/06/11 16:58, Konstantin Boudnik wrote:

On Wed, Jun 15, 2011 at 02:52, Steve Loughran   wrote:

also: banners, stickers and clothing? Can I have T-shirts saying "I broke
the hadoop build" with the logo on, or should it be "I broke the Apache
Hadoop build"?


I think such a T-shirt should be forcefully worn on any person who did
just that.


Here you go with the poster:
http://smartfrog.svn.sourceforge.net/viewvc/smartfrog/trunk/core/hadoop-components/hadoop-ops/doc/breaking_the_hadoop_build.odp?revision=8630

I can add it to hadoop-common SVN for people to work on...


Please do, by all means :) !


https://issues.apache.org/jira/browse/HADOOP-7406

now, what happens if it gets checked in in a way that breaks the build? 
That would be too much.


Re: [VOTE] Shall we adopt the "Defining Hadoop" page

2011-06-18 Thread Shane Curcuru
One clarification: I've only had time to review the wiki document for 
some terminology updates, and not for the overall content yet.  So from 
the trademarks@ point of view, more review is needed before we work on 
making this official.


From the significant amount of discussion in this vote thread, I think 
it might be good to have the Hadoop PMC and trademarks@ work on getting 
a more organized consensus first, before voting on an updated proposed 
Hadoop policy.


- Shane

Owen O'Malley wrote:

All,
   Steve Loughran has done some great work on defining what can be 
called Hadoop at http://wiki.apache.org/hadoop/Defining%20Hadoop. After some 
cleanup from 
Noirin and Shane, I think we've got a really good base. I'd like a vote 
to approve the content (at the current revision 12) and put the content 
on our web site.


Clearly, I'm +1.

-- Owen





Re: [VOTE] Shall we adopt the "Defining Hadoop" page

2011-06-17 Thread Konstantin Boudnik
On Fri, Jun 17, 2011 at 12:01PM, Steve Loughran wrote:
> On 15/06/11 16:58, Konstantin Boudnik wrote:
>> On Wed, Jun 15, 2011 at 02:52, Steve Loughran  wrote:
>
>>>
>>> Regarding the vote, I think the discussion here is interesting and should be
>>> finalised before the vote. It's worth resolving the issues.
>>>
>>> also: banners, stickers and clothing? Can I have T-shirts saying "I broke
>>> the hadoop build" with the logo on, or should it be "I broke the Apache
>>> Hadoop build"?
>>
>> I think such a T-shirt should be forcefully worn on any person who did
>> just that.
>
> Here you go with the poster:
> http://smartfrog.svn.sourceforge.net/viewvc/smartfrog/trunk/core/hadoop-components/hadoop-ops/doc/breaking_the_hadoop_build.odp?revision=8630
>
> I can add it to hadoop-common SVN for people to work on...

Please do, by all means :) !


Re: [VOTE] Shall we adopt the "Defining Hadoop" page

2011-06-17 Thread Steve Loughran

On 15/06/11 16:58, Konstantin Boudnik wrote:

On Wed, Jun 15, 2011 at 02:52, Steve Loughran  wrote:




Regarding the vote, I think the discussion here is interesting and should be
finalised before the vote. It's worth resolving the issues.

also: banners, stickers and clothing? Can I have T-shirts saying "I broke
the hadoop build" with the logo on, or should it be "I broke the Apache
Hadoop build"?


I think such a T-shirt should be forcefully worn on any person who did
just that.


Here you go with the poster:
http://smartfrog.svn.sourceforge.net/viewvc/smartfrog/trunk/core/hadoop-components/hadoop-ops/doc/breaking_the_hadoop_build.odp?revision=8630

I can add it to hadoop-common SVN for people to work on...


Re: [VOTE] Shall we adopt the "Defining Hadoop" page

2011-06-16 Thread Eric Baldeschwieler
If the board does have a stance, I'd love to hear it.  That could usefully end 
this discussion.

Absent that, it seems reasonable for the PMC to make a decision in this area.  
Each project has different use cases and ecosystems, so it may not be 
reasonable to expect a one-size-fits-all solution.  I see no reason not to make 
a local proposal; the board can always clarify.

On Jun 16, 2011, at 11:11 AM, Eli Collins wrote:

> On Thu, Jun 16, 2011 at 10:38 AM, Matthew Foley  wrote:
>> After writing my note to Eric, I realize that Eli and I are guilty of the 
>> same attempt
>> to use legal terminology in an engineering context.  Craig Russell is 
>> absolutely right.
>> If you change one bit, it is a "derived work".
>> 
>> However, we can still allow the trademark to be applied to that work, if it
>> meets licensing criteria.  So what we are arguing about is, "Where is the 
>> boundary
>> line between something we are willing to call 'Apache Hadoop' and something
>> that must be called 'Product XYZ Powered by Apache Hadoop'?"
>> 
>> I'm in favor of a very strict definition.  It needs to be really, really 
>> close to a
>> PMC-approved release.  But I'm open to the argument that a small number
>> of security patches could be necessary for a viable commercial product,
>> and that shouldn't necessarily prevent it from using the trademark.
>> 
>> But I suggest we stop focusing on the term "derived work".  Note that the
>> "Defining Apache Hadoop" draft document we are voting on doesn't use
>> that term.
> 
> See the section titled "Derivative Works". The term "derivative work"
> is used throughout the document.  I think you're right that the key
> point here is not what is and is not a derivative work, but what can
> be called Hadoop.
> 
> Seems like the board should have an ASF-wide stance on what can be
> called Apache X instead of doing this per-project.
> 
> Thanks,
> Eli



Re: [VOTE] Shall we adopt the "Defining Hadoop" page

2011-06-16 Thread Eli Collins
On Thu, Jun 16, 2011 at 10:38 AM, Matthew Foley  wrote:
> After writing my note to Eric, I realize that Eli and I are guilty of the 
> same attempt
> to use legal terminology in an engineering context.  Craig Russell is 
> absolutely right.
> If you change one bit, it is a "derived work".
>
> However, we can still allow the trademark to be applied to that work, if it
> meets licensing criteria.  So what we are arguing about is, "Where is the 
> boundary
> line between something we are willing to call 'Apache Hadoop' and something
> that must be called 'Product XYZ Powered by Apache Hadoop'?"
>
> I'm in favor of a very strict definition.  It needs to be really, really 
> close to a
> PMC-approved release.  But I'm open to the argument that a small number
> of security patches could be necessary for a viable commercial product,
> and that shouldn't necessarily prevent it from using the trademark.
>
> But I suggest we stop focusing on the term "derived work".  Note that the
> "Defining Apache Hadoop" draft document we are voting on doesn't use
> that term.

See the section titled "Derivative Works". The term "derivative work"
is used throughout the document.  I think you're right that the key
point here is not what is and is not a derivative work, but what can
be called Hadoop.

Seems like the board should have an ASF-wide stance on what can be
called Apache X instead of doing this per-project.

Thanks,
Eli


RE: [VOTE] Shall we adopt the "Defining Hadoop" page

2011-06-16 Thread Lawrence Rosen
I'm very confused by this thread. What does trademark law have to do with
derivative work analysis under copyright law? Is there something specific in
our FAQ or trademark policy that confuses these concepts and that we should
clean up?

/Larry

Please cc: trademarks@ because I'm not on the other lists.


> -Original Message-
> From: Eli Collins [mailto:e...@cloudera.com]
> Sent: Thursday, June 16, 2011 9:05 AM
> To: Matthew Foley
> Cc: general@hadoop.apache.org; tradema...@apache.org
> Subject: Re: [VOTE] Shall we adopt the "Defining Hadoop" page
> 
> On Wed, Jun 15, 2011 at 6:17 PM, Matthew Foley 
> wrote:
> > I tend to agree with what I think you are saying, that
> >        * applying a small-number-of-patches that are
> >        * for high-severity-bug-fixes, and
> >        * have been Apache-Hadoop-committed
> > to an Apache Hadoop release should not demote the result to a
> "derived work".
> > However, if so many patches are applied that the result cannot be
> meaningfully
> > correlated with a specific Apache Hadoop release, then it probably
> has
> > become a derived work.
> >
> 
> This is one reason why I think the definition of derived work in the
> draft of the wiki is way too broad. Something that's nothing like
> Hadoop at all but includes a Hadoop jar is given the same label as
> something with a single security patch. I think we can come up with a
> more useful definition of derived work. If we do, that would help us
> draw the distinction between:
> 1. An Apache Hadoop release voted on by the PMC, bit-for-bit identical
> 2. An Apache Hadoop release + backports (e.g. per the above
> definition of backport)
> 3. Something that is powered by Hadoop (e.g. HBase)
> 4. Something that is neither Hadoop nor powered by Hadoop (e.g. the way tc
> Server is not powered by Apache Tomcat)
> 
> Note that the current document does not make an exception for security
> patches. Owen and I made this suggestion on this thread, but the
> writeup we are voting on makes no such exception.
> 
> > But how do we draw a meaningful line across that big gray area?
>  That's why I'd like to
> > see specific text from one of the other projects you cited as an
> example.
> >
> 
> Googling didn't turn up anything in their public archives. This was in
> an email exchange I had with Shane several years ago. Hopefully their
> PMC can chime in.
> 
> Thanks,
> Eli




Re: [VOTE] Shall we adopt the "Defining Hadoop" page

2011-06-16 Thread Matthew Foley
After writing my note to Eric, I realize that Eli and I are guilty of the same 
attempt
to use legal terminology in an engineering context.  Craig Russell is 
absolutely right.
If you change one bit, it is a "derived work".

However, we can still allow the trademark to be applied to that work, if it 
meets licensing criteria.  So what we are arguing about is, "Where is the 
boundary
line between something we are willing to call 'Apache Hadoop' and something
that must be called 'Product XYZ Powered by Apache Hadoop'?"

I'm in favor of a very strict definition.  It needs to be really, really close 
to a
PMC-approved release.  But I'm open to the argument that a small number
of security patches could be necessary for a viable commercial product,
and that shouldn't necessarily prevent it from using the trademark.

But I suggest we stop focusing on the term "derived work".  Note that the 
"Defining Apache Hadoop" draft document we are voting on doesn't use 
that term.

--Matt


On Jun 16, 2011, at 9:05 AM, Eli Collins wrote:

On Wed, Jun 15, 2011 at 6:17 PM, Matthew Foley  wrote:
> I tend to agree with what I think you are saying, that
>* applying a small-number-of-patches that are
>* for high-severity-bug-fixes, and
>* have been Apache-Hadoop-committed
> to an Apache Hadoop release should not demote the result to a "derived work".
> However, if so many patches are applied that the result cannot be meaningfully
> correlated with a specific Apache Hadoop release, then it probably has
> become a derived work.
> 

This is one reason why I think the definition of derived work in the
draft of the wiki is way too broad. Something that's nothing like
Hadoop at all but includes a Hadoop jar is given the same label as
something with a single security patch. I think we can come up with a
more useful definition of derived work. If we do, that would help us
draw the distinction between:
1. An Apache Hadoop release voted on by the PMC, bit-for-bit identical
2. An Apache Hadoop release + backports (e.g. per the above
definition of backport)
3. Something that is powered by Hadoop (e.g. HBase)
4. Something that is neither Hadoop nor powered by Hadoop (e.g. the way tc
Server is not powered by Apache Tomcat)

Note that the current document does not make an exception for security
patches. Owen and I made this suggestion on this thread, but the
writeup we are voting on makes no such exception.

> But how do we draw a meaningful line across that big gray area?  That's why 
> I'd like to
> see specific text from one of the other projects you cited as an example.
> 

Googling didn't turn up anything in their public archives. This was in
an email exchange I had with Shane several years ago. Hopefully their
PMC can chime in.

Thanks,
Eli



Re: [VOTE] Shall we adopt the "Defining Hadoop" page

2011-06-16 Thread Matthew Foley
Hi Eric,
sorry, but drawing a distinction between "Hadoop" and "Apache Hadoop" cannot be 
done under either general trademark usage or the Apache Trademark Policy.  Trademark 
usage is a specialized language just like a programming language, and that 
usage violates the intended semantics of the trademark.

--Matt


On Jun 15, 2011, at 11:35 PM, Eric Sammer wrote:

On Wed, Jun 15, 2011 at 9:47 PM, Ian Holsman <had...@holsman.net> wrote:

so yes .. even a simple patch makes it derived, because it is different.

...and a "dervied work" is fine. Nothing inherently wrong with the term 
derived. I think the question is can one call it "Hadoop?" Note I'm *not* 
saying "Apache Hadoop," just "Hadoop" when the derived work is actually derived 
(to any degree, as Craig R pointed out). Apache Hadoop always and forever means 
the bits voted on by the PMC - no vendor can claim that - but there does appear 
to be plenty of prior examples of "reasonable" use of ASF (and other OSS 
organization) project names in clearly derived works. I do agree there should 
be a policy and it needs to be universally applied to be fair to all involved.

Not to kick up the compatibility dust storm again, but people will always claim 
crazy stuff that may or may not be true. We should just ignore it. Any day of 
the week someone is claiming XYZ compatible either explicitly or implicitly (as 
in client libraries for Foo Project). For cases where a vendor makes a claim 
that isn't true, users will ask, we'll clarify that Apache makes no guarantees 
of derived work compatibility and doesn't certify anything (and specifically 
does the opposite - *NO* guarantees or warranties).

Example uses I think should be fine / acceptable:

YDH (even though it no longer exists, it's a good example) and Y!'s use of 
Hadoop
Facebook Hadoop
Hadoop at eBay
Hadoop at LinkedIn
IBM's use of Hadoop
and yes, CDH*

Even if some / all of the above modify at least a single bit (and may 
*technically* be derived works) everyone understands what they mean. As for the 
confusion, the OSS community has always just said "oh, they patch some stuff, 
you should probably ask them" when confronted with vendor modified versions of 
upstream projects; I've been involved in many of those upstream projects, 
including a Linux distro (downstream). We should always be polite to downstream 
users in redirecting them, but I think redirecting them is fine. It's not 
confusing to users in my experience (we can make it a FAQ or something and just 
point people there) as RedHat, Novell, Oracle, IBM, and many other vendors have 
been happily[1] coexisting with their upstream counterparts for a long time.

I believe we (the collective Apache Hadoop community including those that 
redistribute Hadoop bits in various forms) should focus on producing regular, 
quality releases in a cooperative and constructive environment, and continue to 
require vendors to provide the proper attribution and license information. This 
is in everyone's interest, vendors and direct users alike.

*Disclosure: I work for Cloudera and I think this should apply to anyone and 
everyone, including my employer (with whom I obviously do not clear emails. :))

[1] OK, maybe not always "happily" but mostly so. You know what I mean.

Thanks to Steve L and others for their hard work on this one.
(Sorry for the long email.)

--
Eric Sammer
twitter: esammer
data: www.cloudera.com



Re: [VOTE] Shall we adopt the "Defining Hadoop" page

2011-06-16 Thread Eli Collins
On Wed, Jun 15, 2011 at 6:17 PM, Matthew Foley  wrote:
> I tend to agree with what I think you are saying, that
>        * applying a small-number-of-patches that are
>        * for high-severity-bug-fixes, and
>        * have been Apache-Hadoop-committed
> to an Apache Hadoop release should not demote the result to a "derived work".
> However, if so many patches are applied that the result cannot be meaningfully
> correlated with a specific Apache Hadoop release, then it probably has
> become a derived work.
>

This is one reason why I think the definition of derived work in the
draft of the wiki is way too broad. Something that's nothing like
Hadoop at all but includes a Hadoop jar is given the same label as
something with a single security patch. I think we can come up with a
more useful definition of derived work. If we do, that would help us
draw the distinction between:
1. An Apache Hadoop release voted on by the PMC, bit-for-bit identical
2. An Apache Hadoop release + backports (e.g. per the above
definition of backport)
3. Something that is powered by Hadoop (e.g. HBase)
4. Something that is neither Hadoop nor powered by Hadoop (e.g. the way tc
Server is not powered by Apache Tomcat)

Note that the current document does not make an exception for security
patches. Owen and I made this suggestion on this thread, but the
writeup we are voting on makes no such exception.

> But how do we draw a meaningful line across that big gray area?  That's why 
> I'd like to
> see specific text from one of the other projects you cited as an example.
>

Googling didn't turn up anything in their public archives. This was in
an email exchange I had with Shane several years ago. Hopefully their
PMC can chime in.

Thanks,
Eli


Re: [VOTE] Shall we adopt the "Defining Hadoop" page

2011-06-16 Thread Eli Collins
On Thu, Jun 16, 2011 at 8:02 AM, Owen O'Malley  wrote:
> On Wed, Jun 15, 2011 at 11:35 PM, Eric Sammer  wrote:
>
>> I think the question is can one call it "Hadoop?" Note I'm *not*
>> saying "Apache Hadoop," just "Hadoop" when the derived work is actually
>> derived (to any degree, as Craig R pointed out). Apache Hadoop always and
>> forever means the bits voted on by the PMC - no vendor can claim that - but
>> there does appear to be plenty of prior examples of "reasonable" use of ASF
>> (and other OSS organization) project names in clearly derived works.
>>
>
> Thank you, Eric, for demonstrating why we are fixing it. Apache owns the
> Hadoop trademark. Hadoop is PRECISELY the same as Apache Hadoop. They are
> two names for the same thing. If the Hadoop PMC were to fail to enforce
> that, the Apache board would remove us en masse from the PMC.
>

By this logic the Apache board should remove, en masse, the PMCs of the
HTTP Server, Subversion and Tomcat projects, because they've failed to force
Red Hat, Novell, Ubuntu and others to stop calling their packages Apache X.
Clearly that hasn't happened. Let's let the Apache board speak for
itself.

Thanks,
Eli


Re: [VOTE] Shall we adopt the "Defining Hadoop" page

2011-06-16 Thread Eli Collins
On Thu, Jun 16, 2011 at 7:48 AM, Owen O'Malley  wrote:
> On Wed, Jun 15, 2011 at 9:24 PM, Rottinghuis, Joep 
> wrote:
>
>> It does make sense to me to distinguish between the case when a company
>> seeks to benefit from using the Hadoop name for their product and the case
>> when a company uses Hadoop internally with some minor patches.
>>
>
> If they aren't distributing the version that they use, no one will know or
> care if they have patches applied. Eli is just trying to cloud the real
> issue, which is about distributors and what they call
> their derivative works.
>

I truly don't see distribution as the relevant issue; in particular, I
don't see why the definition of what Hadoop is should change based on
whether or not you distribute it.

>> For example: large company creates a game-show playing appliance and
>> explains that they have used Hadoop for some of the learning tasks. Not
>> allowed if they applied more than 3 patches?
>>
>
> Of course it is allowed. It is only a question of whether you can distribute
> it to others and call it Hadoop.
>

So you want IBM to call what they run Hadoop, unless they put it up on
a website, in which case they can no longer call it Hadoop. What is the
rationale?

Thanks,
Eli


Re: [VOTE] Shall we adopt the "Defining Hadoop" page

2011-06-16 Thread Owen O'Malley
On Wed, Jun 15, 2011 at 11:35 PM, Eric Sammer  wrote:

> I think the question is can one call it "Hadoop?" Note I'm *not*
> saying "Apache Hadoop," just "Hadoop" when the derived work is actually
> derived (to any degree, as Craig R pointed out). Apache Hadoop always and
> forever means the bits voted on by the PMC - no vendor can claim that - but
> there does appear to be plenty of prior examples of "reasonable" use of ASF
> (and other OSS organization) project names in clearly derived works.
>

Thank you, Eric, for demonstrating why we are fixing it. Apache owns the
Hadoop trademark. Hadoop is PRECISELY the same as Apache Hadoop. They are
two names for the same thing. If the Hadoop PMC were to fail to enforce
that, the Apache board would remove us en masse from the PMC.

-- Owen


Re: [VOTE] Shall we adopt the "Defining Hadoop" page

2011-06-16 Thread Owen O'Malley
On Wed, Jun 15, 2011 at 9:24 PM, Rottinghuis, Joep wrote:

> It does make sense to me to distinguish between the case when a company
> seeks to benefit from using the Hadoop name for their product and the case
> when a company uses Hadoop internally with some minor patches.
>

If they aren't distributing the version that they use, no one will know or
care if they have patches applied. Eli is just trying to cloud the real
issue, which is about distributors and what they call
their derivative works.

> For example: large company creates a game-show playing appliance and
> explains that they have used Hadoop for some of the learning tasks. Not
> allowed if they applied more than 3 patches?
>

Of course it is allowed. It is only a question of whether you can distribute
it to others and call it Hadoop.


> Also, if thousands of changes are packaged together into one giant patch,
> is that allowed?
>

No, the exception is strictly for critical security fixes and I would
sincerely hope that those would be released by Apache in very short order.

-- Owen


Re: [VOTE] Shall we adopt the "Defining Hadoop" page

2011-06-16 Thread Steve Loughran

On 16/06/11 07:35, Eric Sammer wrote:

On Wed, Jun 15, 2011 at 9:47 PM, Ian Holsman  wrote:



so yes .. even a simple patch makes it derived, because it is different.



...and a "dervied work" is fine. Nothing inherently wrong with the term
derived. I think the question is can one call it "Hadoop?" Note I'm *not*
saying "Apache Hadoop," just "Hadoop" when the derived work is actually
derived (to any degree, as Craig R pointed out). Apache Hadoop always and
forever means the bits voted on by the PMC - no vendor can claim that - but
there does appear to be plenty of prior examples of "reasonable" use of ASF
(and other OSS organization) project names in clearly derived works. I do
agree there should be a policy and it needs to be universally applied to be
fair to all involved.

Not to kick up the compatibility dust storm again, but people will always
claim crazy stuff that may or may not be true. We should just ignore it.


The issue is branding and trademarks: eventually things get downgraded 
until they become meaningless. If I code an MR engine in Erlang (I have one 
somewhere), can I call it "Hadoop for Erlang"?


Any day of the week someone is claiming XYZ compatible either explicitly or
implicitly (as in client libraries for Foo Project). For cases where a
vendor makes a claim that isn't true, users will ask, we'll clarify that
Apache makes no guarantees of derived work compatibility and doesn't certify
anything (and specifically does the opposite - *NO* guarantees or
warranties).


-BigTop could provide that defensible compatibility statement. 
"Automotive Joe's Crankshaft platform passed the Apache BigTop DFS, MR, 
Mahout and HBase test suites"




Example uses I think should be fine / acceptable:

YDH (even though it no longer exists, it's a good example) and Y!'s use of
Hadoop


-creates confusion and encourages the notion that anything is a 
distribution of Hadoop, which is the situation that the trademarks 
people are trying to crack down on



Facebook Hadoop


-depends on internal vs external


Hadoop at eBay
Hadoop at LinkedIn


details of internal use, as valid as "Hadoop in Steve's house", which, 
given my known network state, is always something to cherish. And while 
I have built my branch up and published it, it's no longer something I 
distribute (though it is in an open SVN repository somewhere). I'm 
working directly with Apache Hadoop 0.20.203 these days.




IBM's use of Hadoop


not sure about IBM distribution of Apache Hadoop, as I presume it has 
the uncommitted patch to work on IBM JVMs (though were someone to commit 
it...)


http://www.alphaworks.ibm.com/tech/idah

The BigInsights product is more explicit and, to me, a good example of 
terminology. Their own brand, description of the benefits, and details 
on what's in there:


"IBM InfoSphere BigInsights Enterprise Edition
For turning complex, internet-scale information into insight, cost 
effectively


IBM® InfoSphere™ BigInsights Enterprise Edition enables new solutions 
that turn large, complex volumes of data into insight, cost effectively. 
InfoSphere BigInsights delivers an enterprise-ready big data solution by 
combining Apache Hadoop, including the MapReduce framework and the 
Hadoop Distributed File Systems (HDFS), with unique technologies and 
capabilities from across IBM."


That gives them the flexibility to swap things around in future (switch 
to GPFS, MapR, Brisk) without having to change their branding.



and yes, CDH*


If you look at the CDH site it's now "Cloudera's Distribution including 
Apache Hadoop". After all, it's Cloudera's data analysis stack including 
Apache Hadoop.





Even if some / all of the above modify at least a single bit (and may
*technically* be derived works) everyone understands what they mean. As for
the confusion, the OSS community has always just said "oh, they patch some
stuff, you should probably ask them" when confronted with vendor modified
versions of upstream projects; I've been involved in many of those upstream
projects, including a Linux distro (downstream). We should always be polite
to downstream users in redirecting them, but I think redirecting them is
fine. It's not confusing to users in my experience (we can make it a FAQ or
something and just point people there) as RedHat, Novell, Oracle, IBM, and
many other vendors have been happily[1] coexisting with their upstream
counterparts for a long time.


co-existence yes; happiness, not always:

http://www.jonobacon.org/2010/07/30/red-hat-canonical-and-gnome-contributions/
http://lwn.net/Articles/374737/
http://gburt.blogspot.com/2011/02/banshee-supporting-gnome-on-ubuntu.html
http://bazaar.launchpad.net/~mozillateam/firefox/firefox-4.0.head/view/head:/debian/patches/ubuntu-codes-amazon.patch

Where Ubuntu are good is that Launchpad is a good entry point for filing 
and tracking any Ubuntu-related problem, and helping to push that 
upstream, so the local issue can be linked to the source issue, letting 
me deal with problems like 

Re: [VOTE] Shall we adopt the "Defining Hadoop" page

2011-06-16 Thread Doug Cutting
-1 if patches that have been committed to trunk are not permitted to
be applied to distributions that are still called "Apache Hadoop".
That's the rule we agreed on some time ago at Roy's suggestion.  Let's
first document the status quo, then, separately, discuss and vote on
changes to it.  Also, such a branding rule should probably be uniform
across Apache projects, not Hadoop-specific.

Doug

On Wed, Jun 15, 2011 at 12:56 AM, Owen O'Malley  wrote:
> All,
>    Steve Loughran has done some great work on defining what can be called
> Hadoop at http://wiki.apache.org/hadoop/Defining%20Hadoop. After some
> cleanup from Noirin and Shane, I think we've got a really good base. I'd
> like a vote to approve the content (at the current revision 12) and put the
> content on our web site.
> Clearly, I'm +1.
> -- Owen


Re: [VOTE] Shall we adopt the "Defining Hadoop" page

2011-06-15 Thread Eric Sammer
On Wed, Jun 15, 2011 at 9:47 PM, Ian Holsman  wrote:

>
> so yes .. even a simple patch makes it derived, because it is different.
>

...and a "dervied work" is fine. Nothing inherently wrong with the term
derived. I think the question is can one call it "Hadoop?" Note I'm *not*
saying "Apache Hadoop," just "Hadoop" when the derived work is actually
derived (to any degree, as Craig R pointed out). Apache Hadoop always and
forever means the bits voted on by the PMC - no vendor can claim that - but
there does appear to be plenty of prior examples of "reasonable" use of ASF
(and other OSS organization) project names in clearly derived works. I do
agree there should be a policy and it needs to be universally applied to be
fair to all involved.

Not to kick up the compatibility dust storm again, but people will always
claim crazy stuff that may or may not be true. We should just ignore it. Any
day of the week someone is claiming XYZ compatible either explicitly or
implicitly (as in client libraries for Foo Project). For cases where a
vendor makes a claim that isn't true, users will ask, we'll clarify that
Apache makes no guarantees of derived work compatibility and doesn't certify
anything (and specifically does the opposite - *NO* guarantees or
warranties).

Example uses I think should be fine / acceptable:

YDH (even though it no longer exists, it's a good example) and Y!'s use of
Hadoop
Facebook Hadoop
Hadoop at eBay
Hadoop at LinkedIn
IBM's use of Hadoop
and yes, CDH*

Even if some / all of the above modify at least a single bit (and may
*technically* be derived works) everyone understands what they mean. As for
the confusion, the OSS community has always just said "oh, they patch some
stuff, you should probably ask them" when confronted with vendor modified
versions of upstream projects; I've been involved in many of those upstream
projects, including a Linux distro (downstream). We should always be polite
to downstream users in redirecting them, but I think redirecting them is
fine. It's not confusing to users in my experience (we can make it a FAQ or
something and just point people there) as RedHat, Novell, Oracle, IBM, and
many other vendors have been happily[1] coexisting with their upstream
counterparts for a long time.

I believe we (the collective Apache Hadoop community including those that
redistribute Hadoop bits in various forms) should focus on producing
regular, quality releases in a cooperative and constructive environment, and
continue to require vendors to provide the proper attribution and license
information. This is in everyone's interest, vendors and direct users alike.

*Disclosure: I work for Cloudera and I think this should apply to anyone and
everyone, including my employer (with whom I obviously do not clear emails.
:))

[1] OK, maybe not always "happily" but mostly so. You know what I mean.

Thanks to Steve L and others for their hard work on this one.
(Sorry for the long email.)

-- 
Eric Sammer
twitter: esammer
data: www.cloudera.com


Re: [VOTE] Shall we adopt the "Defining Hadoop" page

2011-06-15 Thread Ian Holsman

On Jun 16, 2011, at 2:30 PM, Todd Lipcon wrote:

> On Wed, Jun 15, 2011 at 7:19 PM, Craig L Russell
> wrote:
> 
>> There's no ambiguity. Either you ship the bits that the Apache PMC has
>> voted on as a release, or you change it (one bit) and it is no longer what
>> the PMC has voted on. It's a derived work.
>> 
>> The rules for voting in Apache require that if you change a bit in an
>> artifact, you can no longer count votes for the previous artifact. Because
>> the new work is different. A new vote is required.
>> 
> 
> Sorry, but this is just silly. Are you telling me that the httpd package in
> Ubuntu isn't Apache httpd? It has 43 patches applied. Tomcat6 has 17. I'm
> sure every other commonly used piece of software bundled with ubuntu has
> been patched, too. I don't see them calling their packages "Ubuntu HTTP
> server powered by Apache HTTPD". It's just httpd.
> 

well.. for RHEL in the early days of httpd, a configuration that ran on RHEL
would not work on the 'vanilla' httpd.

(they implemented a feature called Include which could take a wildcard, which 
wasn't in the released version of httpd at the time)

even today.. I can't build Redis on my Mac as I am using GNU's libtool instead 
of the one packaged on the Mac. 
http://code.google.com/p/redis/issues/detail?id=443

so yes .. even a simple patch makes it derived, because it is different.


> -Todd
> -- 
> Todd Lipcon
> Software Engineer, Cloudera



Re: [VOTE] Shall we adopt the "Defining Hadoop" page

2011-06-15 Thread Todd Lipcon
On Wed, Jun 15, 2011 at 7:19 PM, Craig L Russell wrote:

> There's no ambiguity. Either you ship the bits that the Apache PMC has
> voted on as a release, or you change it (one bit) and it is no longer what
> the PMC has voted on. It's a derived work.
>
> The rules for voting in Apache require that if you change a bit in an
> artifact, you can no longer count votes for the previous artifact. Because
> the new work is different. A new vote is required.
>

Sorry, but this is just silly. Are you telling me that the httpd package in
Ubuntu isn't Apache httpd? It has 43 patches applied. Tomcat6 has 17. I'm
sure every other commonly used piece of software bundled with ubuntu has
been patched, too. I don't see them calling their packages "Ubuntu HTTP
server powered by Apache HTTPD". It's just httpd.

The httpd in RHEL 5 is the same way. In fact they even provide some nice
metadata in their patches, for example:
httpd-2.0.48-release.patch:Upstream-Status: vendor-specific change
httpd-2.1.10-apctl.patch:Upstream-Status: Vendor-specific changes for better
initscript integration

To me, this is a good thing: allowing vendors to redistribute the software
with some modifications makes it much more accessible to users and
businesses alike, and that's part of why Hadoop has had so much success. So
long as we require the vendors to upstream those modifications back to the
ASF, we get the benefits of these contributions back in the community and
everyone should be happy.

-Todd
-- 
Todd Lipcon
Software Engineer, Cloudera


RE: [VOTE] Shall we adopt the "Defining Hadoop" page

2011-06-15 Thread Rottinghuis, Joep
It does make sense to me to distinguish between the case when a company seeks 
to benefit from using the Hadoop name for their product and the case when a 
company uses Hadoop internally with some minor patches.

For example: large company creates a game-show playing appliance and explains 
that they have used Hadoop for some of the learning tasks. Not allowed if they 
applied more than 3 patches?

Or: a company claims they have a large Hadoop deployment and is looking for 
developers to help them with their Hadoop development work. Is that not allowed? 
What's the alternative? Wanted: Powered by Apache™ Hadoop™ developers?

Also, if thousands of changes are packaged together into one giant patch, is 
that allowed?

Perhaps a similarity index (such as the one used by Git to determine if two 
files are similar enough to be considered a rename) would make sense? If 98% of 
the code is the same, would it be Hadoop if used internally and not 
sold/marketed as a product?
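
As a rough, purely illustrative sketch (not a concrete proposal), such a 
similarity index could be computed over two source trees with Python's difflib; 
the directory names below are hypothetical placeholders:

    import difflib
    from pathlib import Path

    def tree_similarity(a_root, b_root):
        """Average textual similarity (0..1) over files present in both trees."""
        a_files = {p.relative_to(a_root): p for p in Path(a_root).rglob("*") if p.is_file()}
        b_files = {p.relative_to(b_root): p for p in Path(b_root).rglob("*") if p.is_file()}
        shared = sorted(set(a_files) & set(b_files))
        if not shared:
            return 0.0
        ratios = [
            difflib.SequenceMatcher(
                None,
                a_files[rel].read_text(errors="ignore"),
                b_files[rel].read_text(errors="ignore"),
            ).ratio()
            for rel in shared
        ]
        return sum(ratios) / len(ratios)

    # Hypothetical directory names: an unpacked Apache release and a vendor tree.
    score = tree_similarity("apache-hadoop-0.20.203", "vendor-hadoop")
    print("similarity index: %.1f%%" % (score * 100))

The 98% figure above would then just be a threshold applied to that score.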

Cheers,

Joep


From: Eli Collins [e...@cloudera.com]
Sent: Wednesday, June 15, 2011 9:23 AM
To: general@hadoop.apache.org
Cc: Apache Brand Management
Subject: Re: [VOTE] Shall we adopt the "Defining Hadoop" page

On Tue, Jun 14, 2011 at 7:46 PM, Allen Wittenauer  wrote:
>
> On Jun 14, 2011, at 6:45 PM, Eli Collins wrote:
>> Are we really going to go after all the web companies that patch in an
>> enhancement to their current Hadoop build and tell them to stop saying
>> that they are using Hadoop?  You've patched Hadoop many times, should
>> your employer not be able to say they use Hadoop?  I'm -1 on a
>> proposal that does this.
>
> I think there is a big difference between some company that uses 
> Hadoop with some patches internally and a company that puts out a 
> distribution for others to use, usually for-profit.

The wiki makes no such distinction. The PMC will apply the rules
equally to all parties.

According to Owen's email, if you are using a release of Apache Hadoop
and have applied more than 2 security patches or any backports, you are
not using Hadoop.

Thanks,
Eli

Re: [VOTE] Shall we adopt the "Defining Hadoop" page

2011-06-15 Thread Ian Holsman
So to second a point here.

We are not saying you can't patch your distribution, add your own features, 
share it with your friends, or do whatever you want to the code. 
All we're saying is that you can't call that 'Apache Hadoop'.


On Jun 16, 2011, at 12:19 PM, Craig L Russell wrote:

> Hi Matthew,
> 
> I'm sorry I have to disagree.
> 
> If you change a bit in a work, it becomes a derived work. There's no 
> "demotion" involved. Just a definition of derived work.
> 
> There's no ambiguity. Either you ship the bits that the Apache PMC has voted 
> on as a release, or you change it (one bit) and it is no longer what the PMC 
> has voted on. It's a derived work.
> 
> The rules for voting in Apache require that if you change a bit in an 
> artifact, you can no longer count votes for the previous artifact. Because 
> the new work is different. A new vote is required.
> 
> Not gray. Black and white.
> 
> Simple as that.
> 
> Craig
> 
> P.S. for the anthropologists, look at the history of Apache Derby and Sun 
> JavaDB. Meaningful, specific example.
> 
> On Jun 15, 2011, at 6:17 PM, Matthew Foley wrote:
> 
>> I tend to agree with what I think you are saying, that
>>  * applying a small-number-of-patches that are
>>  * for high-severity-bug-fixes, and
>>  * have been Apache-Hadoop-committed
>> to an Apache Hadoop release should not demote the result to a "derived work".
>> However, if so many patches are applied that the result cannot be 
>> meaningfully
>> correlated with a specific Apache Hadoop release, then it probably has
>> become a derived work.
>> 
>> But how do we draw a meaningful line across that big gray area?  That's why 
>> I'd like to
>> see specific text from one of the other projects you cited as an example.
>> 
>> Thanks,
>> --Matt
>> 
>> 
>> On Jun 15, 2011, at 6:02 PM, Eli Collins wrote:
>> 
>> On Wed, Jun 15, 2011 at 10:44 AM, Matthew Foley  wrote:
>>> Eli, you said:
 Putting a build of Hadoop that has 4 security patches applied into the same
 category as a product that has entirely re-worked the code and not
 gotten it checked into trunk does a major disservice to the people who
 contribute to and invest in the project.
>>> 
>>> How would you phrase the distinction, so that it is clear and reasonably 
>>> unambiguous
>>> for people who are not Hadoop developers?  Do the HTTP and Subversion 
>>> policies
>>> draw this distinction, and if so could you please point us at the specific 
>>> text, or
>>> copy that text to this thread?
>>> 
>> 
>> I'll try to find it, this was told to me verbally a while back. Maybe
>> Roy can chime in.
>> 
>> Since there seems to be some confusion around distribution we should
>> make this explicit.  Some people are currently interpreting the
>> guidelines to say that if you patch an Apache Hadoop release yourself
>> then you're still running Apache Hadoop.  But if a vendor patches
>> Apache Hadoop for you then you're not running Apache Hadoop. How about
>> if a subcontractor patches Apache Hadoop for you, then is it Apache
>> Hadoop? This isn't sustainable.
>> 
>> Thanks,
>> Eli
>>> Thanks,
>>> --Matt
>>> 
>>> 
>>> On Jun 15, 2011, at 9:40 AM, Eli Collins wrote:
>>> 
>>> On Tue, Jun 14, 2011 at 7:45 PM, Owen O'Malley  wrote:
 
 On Jun 14, 2011, at 5:48 PM, Eli Collins wrote:
 
> Wrt derivative works, it's not clear from the document, but I think we
> should explicitly adopt the policy of HTTPD and Subversion that
> backported patches from trunk and security fixes are permitted.
 
 Actually, the document is extremely clear that only Apache releases may be 
 called Hadoop.
 
 There was a very long thread about why the rapidly expanding 
 Hadoop-ecosystem is leading to a lot of customer confusion about the 
 different "versions" of Hadoop. We as the Hadoop project don't have the 
 resources or the necessary compatibility test suite to test compatibility 
 between the different sets of cherry picked patches. We also don't have 
 time to ensure that all of the 1,000s of patches applied to 0.20.2 in 
 each of the many (10? 15?) different versions have been committed to 
 trunk. Furthermore, under the Apache license, a company Foo could claim 
 that it is a cherry pick version of Hadoop without releasing their source 
 code that would enable verification.
 
 In summary,
 1. Hadoop is very successful.
 2. There are many different commercial products that are trying to use the 
 Hadoop name.
 3. We can't check or enforce that the cherry pick versions are following 
 the rules.
 4. We don't have a TCK like Java does to validate new versions are 
 compatible.
 5. By far the most fair way to ensure compatibility and fairness between 
 companies is that only Apache Hadoop releases may be called Hadoop.
 
 That said, a package that includes a small number (< 3) of security 
 patches that haven't been released yet

Re: [VOTE] Shall we adopt the "Defining Hadoop" page

2011-06-15 Thread Craig L Russell

Hi Matthew,

I'm sorry I have to disagree.

If you change a bit in a work, it becomes a derived work. There's no  
"demotion" involved. Just a definition of derived work.


There's no ambiguity. Either you ship the bits that the Apache PMC has  
voted on as a release, or you change it (one bit) and it is no longer  
what the PMC has voted on. It's a derived work.


The rules for voting in Apache require that if you change a bit in an  
artifact, you can no longer count votes for the previous artifact.  
Because the new work is different. A new vote is required.


Not gray. Black and white.

Simple as that.

Craig

P.S. for the anthropologists, look at the history of Apache Derby and  
Sun JavaDB. Meaningful, specific example.


On Jun 15, 2011, at 6:17 PM, Matthew Foley wrote:


I tend to agree with what I think you are saying, that
* applying a small-number-of-patches that are
* for high-severity-bug-fixes, and
* have been Apache-Hadoop-committed
to an Apache Hadoop release should not demote the result to a  
"derived work".
However, if so many patches are applied that the result cannot be  
meaningfully

correlated with a specific Apache Hadoop release, then it probably has
become a derived work.

But how do we draw a meaningful line across that big gray area?   
That's why I'd like to
see specific text from one of the other projects you cited as an  
example.


Thanks,
--Matt


On Jun 15, 2011, at 6:02 PM, Eli Collins wrote:

On Wed, Jun 15, 2011 at 10:44 AM, Matthew Foley  wrote:

Eli, you said:
Putting a build of Hadoop that has 4 security patches applied into the same
category as a product that has entirely re-worked the code and not
gotten it checked into trunk does a major disservice to the people who
contribute to and invest in the project.


How would you phrase the distinction, so that it is clear and reasonably
unambiguous for people who are not Hadoop developers?  Do the HTTP and
Subversion policies draw this distinction, and if so could you please point
us at the specific text, or copy that text to this thread?



I'll try to find it, this was told to me verbally a while back. Maybe
Roy can chime in.

Since there seems to be some confusion around distribution we should
make this explicit.  Some people are currently interpreting the
guidelines to say that if you patch an Apache Hadoop release yourself
then you're still running Apache Hadoop.  But if a vendor patches
Apache Hadoop for you then you're not running Apache Hadoop. How about
if a subcontractor patches Apache Hadoop for you, then is it Apache
Hadoop? This isn't sustainable.

Thanks,
Eli

Thanks,
--Matt


On Jun 15, 2011, at 9:40 AM, Eli Collins wrote:

On Tue, Jun 14, 2011 at 7:45 PM, Owen O'Malley  wrote:


On Jun 14, 2011, at 5:48 PM, Eli Collins wrote:

Wrt derivative works, it's not clear from the document, but I think we
should explicitly adopt the policy of HTTPD and Subversion that
backported patches from trunk and security fixes are permitted.


Actually, the document is extremely clear that only Apache  
releases may be called Hadoop.


There was a very long thread about why the rapidly expanding
Hadoop-ecosystem is leading to a lot of customer confusion about
the different "versions" of Hadoop. We as the Hadoop project don't
have the resources or the necessary compatibility test suite to
test compatibility between the different sets of cherry picked
patches. We also don't have time to ensure that all of the 1,000s
of patches applied to 0.20.2 in each of the many (10? 15?)
different versions have been committed to trunk. Furthermore, under
the Apache license, a company Foo could claim that it is a cherry
pick version of Hadoop without releasing their source code that
would enable verification.


In summary,
1. Hadoop is very successful.
2. There are many different commercial products that are trying to  
use the Hadoop name.
3. We can't check or enforce that the cherry pick versions are  
following the rules.
4. We don't have a TCK like Java does to validate new versions are  
compatible.
5. By far the most fair way to ensure compatibility and fairness  
between companies is that only Apache Hadoop releases may be  
called Hadoop.


That said, a package that includes a small number (< 3) of  
security patches that haven't been released yet doesn't seem  
unreasonable.




I've spoken with ops teams at many companies, and I am not aware of
anyone who runs an official release (with just 2 security patches).  By
this definition many of the most valuable contributors to Hadoop,
including Yahoo!, Cloudera, Facebook, etc. are not using Hadoop.  Is
that really the message we want to send? We expect the PMC to enforce
this equally across all parties?

It's a fact of life that companies and ops teams that support Hadoop
need to patch the software before the PMC has the time and/or will to vote
on new releases. This is why HTTP and Subversion allow this.  Putting a


Re: [VOTE] Shall we adopt the "Defining Hadoop" page

2011-06-15 Thread Matthew Foley
I tend to agree with what I think you are saying, that 
* applying a small-number-of-patches that are
* for high-severity-bug-fixes, and
* have been Apache-Hadoop-committed 
to an Apache Hadoop release should not demote the result to a "derived work". 
However, if so many patches are applied that the result cannot be meaningfully 
correlated with a specific Apache Hadoop release, then it probably has
become a derived work.

But how do we draw a meaningful line across that big gray area?  That's why I'd 
like to
see specific text from one of the other projects you cited as an example.

Thanks,
--Matt


On Jun 15, 2011, at 6:02 PM, Eli Collins wrote:

On Wed, Jun 15, 2011 at 10:44 AM, Matthew Foley  wrote:
> Eli, you said:
>> Putting a build of Hadoop that has 4 security patches applied into the same
>> category as a product that has entirely re-worked the code and not
>> gotten it checked into trunk does a major disservice to the people who
>> contribute to and invest in the project.
> 
> How would you phrase the distinction, so that it is clear and reasonably 
> unambiguous
> for people who are not Hadoop developers?  Do the HTTP and Subversion policies
> draw this distinction, and if so could you please point us at the specific 
> text, or
> copy that text to this thread?
> 

I'll try to find it; this was told to me verbally a while back. Maybe
Roy can chime in.

Since there seems to be some confusion around distribution we should
make this explicit.  Some people are currently interpreting the
guidelines to say that if you patch an Apache Hadoop release yourself
then you're still running Apache Hadoop.  But if a vendor patches
Apache Hadoop for you then you're not running Apache Hadoop. How about
if a subcontractor patches Apache Hadoop for you, then is it Apache
Hadoop? This isn't sustainable.

Thanks,
Eli
> Thanks,
> --Matt
> 
> 
> On Jun 15, 2011, at 9:40 AM, Eli Collins wrote:
> 
> On Tue, Jun 14, 2011 at 7:45 PM, Owen O'Malley  wrote:
>> 
>> On Jun 14, 2011, at 5:48 PM, Eli Collins wrote:
>> 
>>> Wrt derivative works, it's not clear from the document, but I think we
>>> should explicitly adopt the policy of HTTPD and Subversion that
>>> backported patches from trunk and security fixes are permitted.
>> 
>> Actually, the document is extremely clear that only Apache releases may be 
>> called Hadoop.
>> 
>> There was a very long thread about why the rapidly expanding 
>> Hadoop ecosystem is leading to a lot of customer confusion about the 
>> different "versions" of Hadoop. We as the Hadoop project don't have the 
>> resources or the necessary compatibility test suite to test compatibility 
>> between the different sets of cherry-picked patches. We also don't have time 
>> to ensure that all of the thousands of patches applied to 0.20.2 in each of 
>> the many (10? 15?) different versions have been committed to trunk. 
>> Furthermore, under the Apache license, a company Foo could claim that it is a 
>> cherry-picked version of Hadoop without releasing the source code that would 
>> enable verification.
>> 
>> In summary,
>>  1. Hadoop is very successful.
>>  2. There are many different commercial products that are trying to use the 
>> Hadoop name.
>>  3. We can't check or enforce that the cherry-picked versions are following 
>> the rules.
>>  4. We don't have a TCK like Java does to validate new versions are 
>> compatible.
>>  5. By far the most fair way to ensure compatibility and fairness between 
>> companies is that only Apache Hadoop releases may be called Hadoop.
>> 
>> That said, a package that includes a small number (< 3) of security patches 
>> that haven't been released yet doesn't seem unreasonable.
>> 
> 
> I've spoken with ops teams at many companies, and I am not aware of
> anyone who runs an official release (with just 2 security patches). By
> this definition many of the most valuable contributors to Hadoop,
> including Yahoo!, Cloudera, Facebook, etc are not using Hadoop.  Is
> that really the message we want to send? We expect the PMC to enforce
> this equally across all parties?
> 
> It's a fact of life that companies and ops teams that support Hadoop
> need to patch the software before the PMC has time and/or will to vote
> on new releases. This is why HTTP and Subversion allow this. Putting a
> build of Hadoop that has 4 security patches applied into the same
> category as a product that has entirely re-worked the code and not
> gotten it checked into trunk does a major disservice to the people who
> contribute to and invest in the project.
> 
> Thanks,
> Eli
> 
> 




Re: [VOTE] Shall we adopt the "Defining Hadoop" page

2011-06-15 Thread Eli Collins
On Wed, Jun 15, 2011 at 10:44 AM, Matthew Foley  wrote:
> Eli, you said:
>> Putting a build of Hadoop that has 4 security patches applied into the same
>> category as a product that has entirely re-worked the code and not
>> gotten it checked into trunk does a major disservice to the people who
>> contribute to and invest in the project.
>
> How would you phrase the distinction, so that it is clear and reasonably 
> unambiguous
> for people who are not Hadoop developers?  Do the HTTP and Subversion policies
> draw this distinction, and if so could you please point us at the specific 
> text, or
> copy that text to this thread?
>

I'll try to find it; this was told to me verbally a while back. Maybe
Roy can chime in.

Since there seems to be some confusion around distribution we should
make this explicit.  Some people are currently interpreting the
guidelines to say that if you patch an Apache Hadoop release yourself
then you're still running Apache Hadoop.  But if a vendor patches
Apache Hadoop for you then you're not running Apache Hadoop. How about
if a subcontractor patches Apache Hadoop for you, then is it Apache
Hadoop? This isn't sustainable.

Thanks,
Eli
> Thanks,
> --Matt
>
>
> On Jun 15, 2011, at 9:40 AM, Eli Collins wrote:
>
> On Tue, Jun 14, 2011 at 7:45 PM, Owen O'Malley  wrote:
>>
>> On Jun 14, 2011, at 5:48 PM, Eli Collins wrote:
>>
>>> Wrt derivative works, it's not clear from the document, but I think we
>>> should explicitly adopt the policy of HTTPD and Subversion that
>>> backported patches from trunk and security fixes are permitted.
>>
>> Actually, the document is extremely clear that only Apache releases may be 
>> called Hadoop.
>>
>> There was a very long thread about why the rapidly expanding 
>> Hadoop ecosystem is leading to a lot of customer confusion about the 
>> different "versions" of Hadoop. We as the Hadoop project don't have the 
>> resources or the necessary compatibility test suite to test compatibility 
>> between the different sets of cherry-picked patches. We also don't have time 
>> to ensure that all of the thousands of patches applied to 0.20.2 in each of 
>> the many (10? 15?) different versions have been committed to trunk. 
>> Furthermore, under the Apache license, a company Foo could claim that it is a 
>> cherry-picked version of Hadoop without releasing the source code that would 
>> enable verification.
>>
>> In summary,
>>  1. Hadoop is very successful.
>>  2. There are many different commercial products that are trying to use the 
>> Hadoop name.
>>  3. We can't check or enforce that the cherry-picked versions are following 
>> the rules.
>>  4. We don't have a TCK like Java does to validate new versions are 
>> compatible.
>>  5. By far the most fair way to ensure compatibility and fairness between 
>> companies is that only Apache Hadoop releases may be called Hadoop.
>>
>> That said, a package that includes a small number (< 3) of security patches 
>> that haven't been released yet doesn't seem unreasonable.
>>
>
> I've spoken with ops teams at many companies, and I am not aware of
> anyone who runs an official release (with just 2 security patches). By
> this definition many of the most valuable contributors to Hadoop,
> including Yahoo!, Cloudera, Facebook, etc are not using Hadoop.  Is
> that really the message we want to send? We expect the PMC to enforce
> this equally across all parties?
>
> It's a fact of life that companies and ops teams that support Hadoop
> need to patch the software before the PMC has time and/or will to vote
> on new releases. This is why HTTP and Subversion allow this. Putting a
> build of Hadoop that has 4 security patches applied into the same
> category as a product that has entirely re-worked the code and not
> gotten it checked into trunk does a major disservice to the people who
> contribute to and invest in the project.
>
> Thanks,
> Eli
>
>


Re: [VOTE] Shall we adopt the "Defining Hadoop" page

2011-06-15 Thread Eli Collins
On Wed, Jun 15, 2011 at 3:42 PM, Chris Douglas  wrote:
> On Wed, Jun 15, 2011 at 3:25 PM, Eli Collins  wrote:
>> But Yahoo! hasn't.  According to this wiki YDH (0.20.100) would *not*
>> be considered Apache Hadoop. For example see HADOOP-6962 which refers
>> to 0.20.9, an internal Yahoo! release, not an official Apache release.
>> Are you really comfortable saying Yahoo! doesn't run Hadoop?
>
> This conclusion does not follow. The guidelines prohibit companies
> from distributing that software as "Hadoop". They take no position on
> whether a company that modifies a release of Apache Hadoop for its
> deployment credits the project.
>

This is independent of distribution. The guideline clearly defines
such an artifact as a "derivative work" and states that "Products that
are derivative works of Apache Hadoop are not Apache Hadoop".
Therefore it's false for the company to claim they are using Apache
Hadoop.

Thanks,
Eli


Re: [VOTE] Shall we adopt the "Defining Hadoop" page

2011-06-15 Thread Chris Douglas
On Wed, Jun 15, 2011 at 3:25 PM, Eli Collins  wrote:
> But Yahoo! hasn't.  According to this wiki YDH (0.20.100) would *not*
> be considered Apache Hadoop. For example see HADOOP-6962 which refers
> to 0.20.9, an internal Yahoo! release, not an official Apache release.
> Are you really comfortable saying Yahoo! doesn't run Hadoop?

This conclusion does not follow. The guidelines prohibit companies
from distributing that software as "Hadoop". They take no position on
whether a company that modifies a release of Apache Hadoop for its
deployment credits the project.

On Wed, Jun 15, 2011 at 11:13 AM, Ted Dunning  wrote:
> In addition, I think that the limitations on usage are too strict.  For
> instance, if "QuickBooks for Windows" [1] doesn't cause Microsoft to sue
> Intuit, then "Joe's Foo for Apache Hadoop" really shouldn't cause any more
> grief.

This analogy is also inexact. QuickBooks is an application running on
Windows, not a replacement for it. -C


Re: [VOTE] Shall we adopt the "Defining Hadoop" page

2011-06-15 Thread Eli Collins
On Wed, Jun 15, 2011 at 11:37 AM, Arun C Murthy  wrote:
>
> On Jun 15, 2011, at 10:10 PM, Eli Collins wrote:
>
>> I've spoken with ops teams at many companies, and I am not aware of
>> anyone who runs an official release (with just 2 security patches). By
>> this definition many of the most valuable contributors to Hadoop,
>> including Yahoo!, Cloudera, Facebook, etc are not using Hadoop.  Is
>> that really the message we want to send? We expect the PMC to enforce
>> this equally across all parties?
>
> This is only a recent (less than 2 yrs) phenomenon with hadoop-0.20 onwards.
>
> I've been on the project for over 5 years now and I've run official Apache
> Hadoop releases at Y! for the majority of that time. From hadoop-0.1 to
> hadoop-0.18.

But Yahoo! hasn't.  According to this wiki YDH (0.20.100) would *not*
be considered Apache Hadoop. For example see HADOOP-6962 which refers
to 0.20.9, an internal Yahoo! release, not an official Apache release.
Are you really comfortable saying Yahoo! doesn't run Hadoop?

Thanks,
Eli


Re: [VOTE] Shall we adopt the "Defining Hadoop" page

2011-06-15 Thread Arun C Murthy


On Jun 15, 2011, at 10:10 PM, Eli Collins wrote:


I've spoken with ops teams at many companies, and I am not aware of
anyone who runs an official release (with just 2 security patches). By
this definition many of the most valuable contributors to Hadoop,
including Yahoo!, Cloudera, Facebook, etc are not using Hadoop.  Is
that really the message we want to send? We expect the PMC to enforce
this equally across all parties?


This is only a recent (less than 2 yrs) phenomenon with hadoop-0.20  
onwards.


I've been on the project for over 5 years now and I've run official  
Apache Hadoop releases at Y! for the majority of that time. From  
hadoop-0.1 to hadoop-0.18.


IAC, getting everyone to run an official release isn't an anti-goal.  
And, as Steve points out, this really doesn't concern internal  
deployments - public redistributables are something I worry about as a  
PMC member with my Apache hat on.


+1 for the current version (Defining Hadoop, last edited 2011-06-09 02:56:39 by OwenOMalley).


thanks,
Arun




Re: [VOTE] Shall we adopt the "Defining Hadoop" page

2011-06-15 Thread Ted Dunning
+1 to what Eli says.  If nobody is running official Hadoop according to this
definition, but everybody thinks that they are running Hadoop, then this
definition is a bit out of whack.  The source of the dissonance is related
to the fact that releases just don't happen often enough in Hadoop.

In addition, I think that the limitations on usage are too strict.  For
instance, if "QuickBooks for Windows" [1] doesn't cause Microsoft to sue
Intuit, then "Joe's Foo for Apache Hadoop" really shouldn't cause any more
grief.

So I would give a (non-binding) -1 to the policy as stated.

[1]
http://quickbooks.intuit.com/product/accounting_software/windows_financial_management_software.jsp

On Wed, Jun 15, 2011 at 6:40 PM, Eli Collins  wrote:

> On Tue, Jun 14, 2011 at 7:45 PM, Owen O'Malley  wrote:
> >
> > On Jun 14, 2011, at 5:48 PM, Eli Collins wrote:
> >
> >> Wrt derivative works, it's not clear from the document, but I think we
> >> should explicitly adopt the policy of HTTPD and Subversion that
> >> backported patches from trunk and security fixes are permitted.
> >
> > Actually, the document is extremely clear that only Apache releases may
> be called Hadoop.
> >
> There was a very long thread about why the rapidly expanding
> Hadoop ecosystem is leading to a lot of customer confusion about the
> different "versions" of Hadoop. We as the Hadoop project don't have the
> resources or the necessary compatibility test suite to test compatibility
> between the different sets of cherry-picked patches. We also don't have time
> to ensure that all of the thousands of patches applied to 0.20.2 in each of
> the many (10? 15?) different versions have been committed to trunk.
> Furthermore, under the Apache license, a company Foo could claim that it is a
> cherry-picked version of Hadoop without releasing the source code that would
> enable verification.
> >
> > In summary,
> >  1. Hadoop is very successful.
> >  2. There are many different commercial products that are trying to use
> the Hadoop name.
> >  3. We can't check or enforce that the cherry-picked versions are following
> the rules.
> >  4. We don't have a TCK like Java does to validate new versions are
> compatible.
> >  5. By far the most fair way to ensure compatibility and fairness between
> companies is that only Apache Hadoop releases may be called Hadoop.
> >
> > That said, a package that includes a small number (< 3) of security
> patches that haven't been released yet doesn't seem unreasonable.
> >
>
> I've spoken with ops teams at many companies, and I am not aware of
> anyone who runs an official release (with just 2 security patches). By
> this definition many of the most valuable contributors to Hadoop,
> including Yahoo!, Cloudera, Facebook, etc are not using Hadoop.  Is
> that really the message we want to send? We expect the PMC to enforce
> this equally across all parties?
>
> It's a fact of life that companies and ops teams that support Hadoop
> need to patch the software before the PMC has time and/or will to vote
> on new releases. This is why HTTP and Subversion allow this. Putting a
> build of Hadoop that has 4 security patches applied into the same
> category as a product that has entirely re-worked the code and not
> gotten it checked into trunk does a major disservice to the people who
> contribute to and invest in the project.
>
> Thanks,
> Eli
>


Re: [VOTE] Shall we adopt the "Defining Hadoop" page

2011-06-15 Thread Matthew Foley
Oh, and while I can't officially vote, I think this page is extremely well done 
and
I strongly support it.  

As an editorial note, however, I would remove the last paragraph in the 
"Compatibility" 
section, referencing the email thread (that I contributed to at length :-) ).  
That thread went all over the place and would be misleading to the typical
reader.  The distillation on this twiki page IS normative and not confusing,
and we should leave it at that.

Best,
--Matt

On Jun 15, 2011, at 10:44 AM, Matthew Foley wrote:

Eli, you said:
> Putting a build of Hadoop that has 4 security patches applied into the same
> category as a product that has entirely re-worked the code and not
> gotten it checked into trunk does a major disservice to the people who
> contribute to and invest in the project.

How would you phrase the distinction, so that it is clear and reasonably 
unambiguous
for people who are not Hadoop developers?  Do the HTTP and Subversion policies
draw this distinction, and if so could you please point us at the specific 
text, or 
copy that text to this thread?

Thanks,
--Matt


On Jun 15, 2011, at 9:40 AM, Eli Collins wrote:

On Tue, Jun 14, 2011 at 7:45 PM, Owen O'Malley  wrote:
> 
> On Jun 14, 2011, at 5:48 PM, Eli Collins wrote:
> 
>> Wrt derivative works, it's not clear from the document, but I think we
>> should explicitly adopt the policy of HTTPD and Subversion that
>> backported patches from trunk and security fixes are permitted.
> 
> Actually, the document is extremely clear that only Apache releases may be 
> called Hadoop.
> 
> There was a very long thread about why the rapidly expanding Hadoop 
> ecosystem is leading to a lot of customer confusion about the different 
> "versions" of Hadoop. We as the Hadoop project don't have the resources or 
> the necessary compatibility test suite to test compatibility between the 
> different sets of cherry-picked patches. We also don't have time to ensure 
> that all of the thousands of patches applied to 0.20.2 in each of the many 
> (10? 15?) different versions have been committed to trunk. Furthermore, under 
> the Apache license, a company Foo could claim that it is a cherry-picked 
> version of Hadoop without releasing the source code that would enable 
> verification.
> 
> In summary,
> 1. Hadoop is very successful.
> 2. There are many different commercial products that are trying to use the 
> Hadoop name.
> 3. We can't check or enforce that the cherry-picked versions are following the 
> rules.
> 4. We don't have a TCK like Java does to validate new versions are compatible.
> 5. By far the most fair way to ensure compatibility and fairness between 
> companies is that only Apache Hadoop releases may be called Hadoop.
> 
> That said, a package that includes a small number (< 3) of security patches 
> that haven't been released yet doesn't seem unreasonable.
> 

I've spoken with ops teams at many companies, and I am not aware of
anyone who runs an official release (with just 2 security patches). By
this definition many of the most valuable contributors to Hadoop,
including Yahoo!, Cloudera, Facebook, etc are not using Hadoop.  Is
that really the message we want to send? We expect the PMC to enforce
this equally across all parties?

It's a fact of life that companies and ops teams that support Hadoop
need to patch the software before the PMC has time and/or will to vote
on new releases. This is why HTTP and Subversion allow this. Putting a
build of Hadoop that has 4 security patches applied into the same
category as a product that has entirely re-worked the code and not
gotten it checked into trunk does a major disservice to the people who
contribute to and invest in the project.

Thanks,
Eli




Re: [VOTE] Shall we adopt the "Defining Hadoop" page

2011-06-15 Thread Matthew Foley
Eli, you said:
> Putting a build of Hadoop that has 4 security patches applied into the same
> category as a product that has entirely re-worked the code and not
> gotten it checked into trunk does a major disservice to the people who
> contribute to and invest in the project.

How would you phrase the distinction, so that it is clear and reasonably 
unambiguous
for people who are not Hadoop developers?  Do the HTTP and Subversion policies
draw this distinction, and if so could you please point us at the specific 
text, or 
copy that text to this thread?

Thanks,
--Matt


On Jun 15, 2011, at 9:40 AM, Eli Collins wrote:

On Tue, Jun 14, 2011 at 7:45 PM, Owen O'Malley  wrote:
> 
> On Jun 14, 2011, at 5:48 PM, Eli Collins wrote:
> 
>> Wrt derivative works, it's not clear from the document, but I think we
>> should explicitly adopt the policy of HTTPD and Subversion that
>> backported patches from trunk and security fixes are permitted.
> 
> Actually, the document is extremely clear that only Apache releases may be 
> called Hadoop.
> 
> There was a very long thread about why the rapidly expanding Hadoop 
> ecosystem is leading to a lot of customer confusion about the different 
> "versions" of Hadoop. We as the Hadoop project don't have the resources or 
> the necessary compatibility test suite to test compatibility between the 
> different sets of cherry-picked patches. We also don't have time to ensure 
> that all of the thousands of patches applied to 0.20.2 in each of the many 
> (10? 15?) different versions have been committed to trunk. Furthermore, under 
> the Apache license, a company Foo could claim that it is a cherry-picked 
> version of Hadoop without releasing the source code that would enable 
> verification.
> 
> In summary,
>  1. Hadoop is very successful.
>  2. There are many different commercial products that are trying to use the 
> Hadoop name.
>  3. We can't check or enforce that the cherry-picked versions are following the 
> rules.
>  4. We don't have a TCK like Java does to validate new versions are 
> compatible.
>  5. By far the most fair way to ensure compatibility and fairness between 
> companies is that only Apache Hadoop releases may be called Hadoop.
> 
> That said, a package that includes a small number (< 3) of security patches 
> that haven't been released yet doesn't seem unreasonable.
> 

I've spoken with ops teams at many companies, and I am not aware of
anyone who runs an official release (with just 2 security patches). By
this definition many of the most valuable contributors to Hadoop,
including Yahoo!, Cloudera, Facebook, etc are not using Hadoop.  Is
that really the message we want to send? We expect the PMC to enforce
this equally across all parties?

It's a fact of life that companies and ops teams that support Hadoop
need to patch the software before the PMC has time and/or will to vote
on new releases. This is why HTTP and Subversion allow this. Putting a
build of Hadoop that has 4 security patches applied into the same
category as a product that has entirely re-worked the code and not
gotten it checked into trunk does a major disservice to the people who
contribute to and invest in the project.

Thanks,
Eli



Re: [VOTE] Shall we adopt the "Defining Hadoop" page

2011-06-15 Thread Eli Collins
On Wed, Jun 15, 2011 at 9:44 AM, Steve Loughran  wrote:
> On 15/06/11 17:23, Eli Collins wrote:
>>
>> On Tue, Jun 14, 2011 at 7:46 PM, Allen Wittenauer  wrote:
>>>
>>> On Jun 14, 2011, at 6:45 PM, Eli Collins wrote:

 Are we really going to go after all the web companies that patch in an
 enhancement to their current Hadoop build and tell them to stop saying
 that they are using Hadoop?  You've patched Hadoop many times, should
 your employer not be able to say they use Hadoop?  I'm -1 on a
 proposal that does this.
>>>
>>>        I think there is a big difference between some company that uses
>>> Hadoop with some patches internally and a company that puts out a
>>> distribution for others to use, usually for-profit.
>>
>> The wiki makes no such distinction. The PMC will apply the rules
>> equally to all parties.
>>
>> According to Owen's email if you are using a release of Apache Hadoop
>> and have applied more than 2 security patches or any backports you are
>> not using Hadoop.
>>
>> Thanks,
>> Eli
>
> What you do in house is of no concern to the trademarks and PMC people, but
> naming of public redistributables is - and that's where the confusion over
> what "a distribution of Apache Hadoop" is comes in, because it's gone from
> weakly defined to very vague recently, and that needs to be corrected before
> people are left in a world of confusion.
>

Steve, I'm on the PMC and it is a concern.  What happens in house
often gets released on github, documented, blogged about, etc. All
this stuff creates confusion about the product and is therefore a
concern of the PMC.

>
> It's been complicated enough with people posting issues related to the
> Cloudera Distribution including Apache Hadoop; what happens when people
> start posting EMC-enterprise-hadoopish issues, or file bugreps against Brisk's
> "Hadoop built on other things" product on the Apache JIRA?
>

The same thing we do today. We point them to another more appropriate forum.

Isn't your proposal w/ the HTTPD/Subversion policy wrt backporting
effective?  Note that it's pretty strict: you have to get your code
committed to a branch that will be released subject to approval by the
PMC.  It's not saying that anyone can do whatever they want to the
Hadoop source and call it Hadoop.

Thanks,
Eli


Re: [VOTE] Shall we adopt the "Defining Hadoop" page

2011-06-15 Thread Steve Loughran

On 15/06/11 17:23, Eli Collins wrote:

On Tue, Jun 14, 2011 at 7:46 PM, Allen Wittenauer  wrote:


On Jun 14, 2011, at 6:45 PM, Eli Collins wrote:

Are we really going to go after all the web companies that patch in an
enhancement to their current Hadoop build and tell them to stop saying
that they are using Hadoop?  You've patched Hadoop many times, should
your employer not be able to say they use Hadoop?  I'm -1 on a
proposal that does this.


I think there is a big difference between some company that uses Hadoop 
with some patches internally and a company that puts out a distribution for 
others to use, usually for-profit.


The wiki makes no such distinction. The PMC will apply the rules
equally to all parties.

According to Owen's email if you are using a release of Apache Hadoop
and have applied more than 2 security patches or any backports you are
not using Hadoop.

Thanks,
Eli


What you do in house is of no concern to the trademarks and PMC people, 
but naming of public redistributables is - and that's where the confusion 
over what "a distribution of Apache Hadoop" is comes in, because it's gone 
from weakly defined to very vague recently, and that needs to be corrected 
before people are left in a world of confusion.



It's been complicated enough with people posting issues related to the 
Cloudera Distribution including Apache Hadoop; what happens when people 
start posting EMC-enterprise-hadoopish issues, or file bugreps against 
Brisk's "Hadoop built on other things" product on the Apache JIRA?


Re: [VOTE] Shall we adopt the "Defining Hadoop" page

2011-06-15 Thread Eli Collins
On Tue, Jun 14, 2011 at 7:45 PM, Owen O'Malley  wrote:
>
> On Jun 14, 2011, at 5:48 PM, Eli Collins wrote:
>
>> Wrt derivative works, it's not clear from the document, but I think we
>> should explicitly adopt the policy of HTTPD and Subversion that
>> backported patches from trunk and security fixes are permitted.
>
> Actually, the document is extremely clear that only Apache releases may be 
> called Hadoop.
>
> There was a very long thread about why the rapidly expanding Hadoop 
> ecosystem is leading to a lot of customer confusion about the different 
> "versions" of Hadoop. We as the Hadoop project don't have the resources or 
> the necessary compatibility test suite to test compatibility between the 
> different sets of cherry-picked patches. We also don't have time to ensure 
> that all of the thousands of patches applied to 0.20.2 in each of the many 
> (10? 15?) different versions have been committed to trunk. Furthermore, under 
> the Apache license, a company Foo could claim that it is a cherry-picked 
> version of Hadoop without releasing the source code that would enable 
> verification.
>
> In summary,
>  1. Hadoop is very successful.
>  2. There are many different commercial products that are trying to use the 
> Hadoop name.
>  3. We can't check or enforce that the cherry-picked versions are following the 
> rules.
>  4. We don't have a TCK like Java does to validate new versions are 
> compatible.
>  5. By far the most fair way to ensure compatibility and fairness between 
> companies is that only Apache Hadoop releases may be called Hadoop.
>
> That said, a package that includes a small number (< 3) of security patches 
> that haven't been released yet doesn't seem unreasonable.
>

I've spoken with ops teams at many companies, and I am not aware of
anyone who runs an official release (with just 2 security patches). By
this definition many of the most valuable contributors to Hadoop,
including Yahoo!, Cloudera, Facebook, etc are not using Hadoop.  Is
that really the message we want to send? We expect the PMC to enforce
this equally across all parties?

It's a fact of life that companies and ops teams that support Hadoop
need to patch the software before the PMC has time and/or will to vote
on new releases. This is why HTTP and Subversion allow this. Putting a
build of Hadoop that has 4 security patches applied into the same
category as a product that has entirely re-worked the code and not
gotten it checked into trunk does a major disservice to the people who
contribute to and invest in the project.

Thanks,
Eli


Re: [VOTE] Shall we adopt the "Defining Hadoop" page

2011-06-15 Thread Eli Collins
On Tue, Jun 14, 2011 at 7:46 PM, Allen Wittenauer  wrote:
>
> On Jun 14, 2011, at 6:45 PM, Eli Collins wrote:
>> Are we really going to go after all the web companies that patch in an
>> enhancement to their current Hadoop build and tell them to stop saying
>> that they are using Hadoop?  You've patched Hadoop many times, should
>> your employer not be able to say they use Hadoop?  I'm -1 on a
>> proposal that does this.
>
>        I think there is a big difference between some company that uses 
> Hadoop with some patches internally and a company that puts out a 
> distribution for others to use, usually for-profit.

The wiki makes no such distinction. The PMC will apply the rules
equally to all parties.

According to Owen's email if you are using a release of Apache Hadoop
and have applied more than 2 security patches or any backports you are
not using Hadoop.

Thanks,
Eli


Re: [VOTE] Shall we adopt the "Defining Hadoop" page

2011-06-15 Thread Konstantin Boudnik
On Wed, Jun 15, 2011 at 02:52, Steve Loughran  wrote:
> On 15/06/11 03:51, Konstantin Boudnik wrote:
>>
>> On Tue, Jun 14, 2011 at 19:46, Allen Wittenauer  wrote:
>>>
>>> On Jun 14, 2011, at 6:45 PM, Eli Collins wrote:

 Are we really going to go after all the web companies that patch in an
 enhancement to their current Hadoop build and tell them to stop saying
 that they are using Hadoop?  You've patched Hadoop many times, should
 your employer not be able to say they use Hadoop?  I'm -1 on a
 proposal that does this.
>>>
>>>        I think there is a big difference between some company that uses
>>> Hadoop with some patches internally and a company that puts out a
>>> distribution for others to use, usually for-profit.
>>
>> Just as a reminder: this whole conversation started as a result of
>> EMC's announcement of a 100% compatible version of Apache Hadoop. So,
>> Allen's point is right on target here: the above example is simply
>> incorrect.
>
> I seem to recall this discussion starting a bit earlier, with the whole
> notion of compatibility, before EMC got involved.
>
> Regarding the vote, I think the discussion here is interesting and should be
> finalised before the vote. It's worth resolving the issues.
>
> also: banners, stickers and clothing? Can I have T-shirts saying "I broke
> the hadoop build" with the logo on, or should it be "I broke the Apache
> Hadoop build"?

I think such a T-shirt should be forcefully worn on any person who did
just that.


Re: [VOTE] Shall we adopt the "Defining Hadoop" page

2011-06-15 Thread Steve Loughran

On 15/06/11 00:35, Allen Wittenauer wrote:


A minor nit: I'd like to see some cleanup between the first paragraph and the 
fourth paragraph in compatibility.  Or was the re-iteration of our "not a standards 
committee" intentional?  It is sort of awkward as it is currently written.


well it is a wiki...



Also, where can I download Camshaft?


It's a fork of Hadoop 0.15 optimised for Windows ME and FAT32 that 
requires a human to fetch blocks from remote machines using a floppy - a 
process that limits blocksize to 1.44MB and kills your latency. You 
don't really want it.


What you saw on the page was marketing's spin on the harsh truth.


Re: [VOTE] Shall we adopt the "Defining Hadoop" page

2011-06-15 Thread Steve Loughran

On 15/06/11 03:51, Konstantin Boudnik wrote:

On Tue, Jun 14, 2011 at 19:46, Allen Wittenauer  wrote:


On Jun 14, 2011, at 6:45 PM, Eli Collins wrote:

Are we really going to go after all the web companies that patch in an
enhancement to their current Hadoop build and tell them to stop saying
that they are using Hadoop?  You've patched Hadoop many times, should
your employer not be able to say they use Hadoop?  I'm -1 on a
proposal that does this.


I think there is a big difference between some company that uses Hadoop 
with some patches internally and a company that puts out a distribution for 
others to use, usually for-profit.


Just as a reminder: this whole conversation started as a result of
EMC's announcement of a 100% compatible version of Apache Hadoop. So,
Allen's point is right on target here: the above example is simply
incorrect.


I seem to recall this discussion starting a bit earlier, with the whole 
notion of compatibility, before EMC got involved.


Regarding the vote, I think the discussion here is interesting and 
should be finalised before the vote. It's worth resolving the issues.


also: banners, stickers and clothing? Can I have T-shirts saying "I 
broke the hadoop build" with the logo on, or should it be "I broke the 
Apache Hadoop build"?


Re: [VOTE] Shall we adopt the "Defining Hadoop" page

2011-06-15 Thread Steve Loughran

On 15/06/11 02:15, Allen Wittenauer wrote:


 I run out of fingers if I count how many times just the mapred.map.child.java.opts was 
said to be "in 20" prior to the 0.20.203 release...]


yeah, that incident involving Camshaft 3.02 beta and your left hand 
really reduced your counting ability.




Re: [VOTE] Shall we adopt the "Defining Hadoop" page

2011-06-14 Thread Konstantin Boudnik
On Tue, Jun 14, 2011 at 19:46, Allen Wittenauer  wrote:
>
> On Jun 14, 2011, at 6:45 PM, Eli Collins wrote:
>> Are we really going to go after all the web companies that patch in an
>> enhancement to their current Hadoop build and tell them to stop saying
>> that they are using Hadoop?  You've patched Hadoop many times, should
>> your employer not be able to say they use Hadoop?  I'm -1 on a
>> proposal that does this.
>
>        I think there is a big difference between some company that uses 
> Hadoop with some patches internally and a company that puts out a 
> distribution for others to use, usually for-profit.

Just as a reminder: this whole conversation started as a result of
EMC's announcement of a 100% compatible version of Apache Hadoop. So,
Allen's point is right on target here: the above example is simply
incorrect.

Cos


Re: [VOTE] Shall we adopt the "Defining Hadoop" page

2011-06-14 Thread Allen Wittenauer

On Jun 14, 2011, at 6:45 PM, Eli Collins wrote:
> Are we really going to go after all the web companies that patch in an
> enhancement to their current Hadoop build and tell them to stop saying
> that they are using Hadoop?  You've patched Hadoop many times, should
> your employer not be able to say they use Hadoop?  I'm -1 on a
> proposal that does this.

I think there is a big difference between some company that uses Hadoop 
with some patches internally and a company that puts out a distribution for 
others to use, usually for-profit.

Re: [VOTE] Shall we adopt the "Defining Hadoop" page

2011-06-14 Thread Owen O'Malley

On Jun 14, 2011, at 5:48 PM, Eli Collins wrote:

> Wrt derivative works, it's not clear from the document, but I think we
> should explicitly adopt the policy of HTTPD and Subversion that
> backported patches from trunk and security fixes are permitted.

Actually, the document is extremely clear that only Apache releases may be 
called Hadoop.

There was a very long thread about why the rapidly expanding Hadoop ecosystem 
is leading to a lot of customer confusion about the different "versions" of 
Hadoop. We as the Hadoop project don't have the resources or the necessary 
compatibility test suite to test compatibility between the different sets of 
cherry-picked patches. We also don't have time to ensure that all of the 
thousands of patches applied to 0.20.2 in each of the many (10? 15?) different 
versions have been committed to trunk. Furthermore, under the Apache license, a 
company Foo could claim that it is a cherry-picked version of Hadoop without 
releasing the source code that would enable verification.

In summary,
  1. Hadoop is very successful.
  2. There are many different commercial products that are trying to use the 
Hadoop name.
  3. We can't check or enforce that the cherry-picked versions are following the 
rules.
  4. We don't have a TCK like Java does to validate new versions are compatible.
  5. By far the most fair way to ensure compatibility and fairness between 
companies is that only Apache Hadoop releases may be called Hadoop.

That said, a package that includes a small number (< 3) of security patches 
that haven't been released yet doesn't seem unreasonable.

-- Owen



Re: [VOTE] Shall we adopt the "Defining Hadoop" page

2011-06-14 Thread Konstantin Boudnik
On Tue, Jun 14, 2011 at 18:15, Eli Collins  wrote:
> On Tue, Jun 14, 2011 at 3:56 PM, Owen O'Malley  wrote:
>> All,
>>   Steve Loughran has done some great work on defining what can be called 
>> Hadoop at http://wiki.apache.org/hadoop/Defining%20Hadoop. After some 
>> cleanup from Noirin and Shane, I think we've got a really good base. I'd 
>> like a vote to approve the content (at the current revision 12) and put the 
>> content on our web site.
>>
>> Clearly, I'm +1.
>>
>> -- Owen
>
> I'd like to make another suggestion. Currently we call two types of
> things powered by Apache Hadoop:
>
> 1. Something that runs on Hadoop (eg HBase or Karmasphere)

To be completely precise, Karmasphere doesn't 'run on Hadoop'. Their
products "integrate with a variety of Hadoop distributions and related
technologies..." as you can see here:
http://karmasphere.com/Miscellaneous/overview.html. Although in the case
of HBase you're right ;)

Cos

> 2. Something that includes Hadoop artifacts/source code
>
> Shouldn't we distinguish between these two, such that the 2nd is not
> powered by Hadoop? Eg tc server is not powered by Apache Tomcat right?
>
> Apologies for having discussion on a vote thread but this is the first
> time I've seen the current revision and it seems reasonable to have an
> opportunity to discuss a specific revision before voting on it.
>
> Thanks,
> Eli
>


Re: [VOTE] Shall we adopt the "Defining Hadoop" page

2011-06-14 Thread Chris Douglas
+1 on revision 12. Thanks for all your work on this, Steve. -C

On Tue, Jun 14, 2011 at 3:56 PM, Owen O'Malley  wrote:
> All,
>    Steve Loughran has done some great work on defining what can be called
> Hadoop at http://wiki.apache.org/hadoop/Defining%20Hadoop. After some
> cleanup from Noirin and Shane, I think we've got a really good base. I'd
> like a vote to approve the content (at the current revision 12) and put the
> content on our web site.
> Clearly, I'm +1.
> -- Owen


Re: [VOTE] Shall we adopt the "Defining Hadoop" page

2011-06-14 Thread Eli Collins
On Tue, Jun 14, 2011 at 6:15 PM, Allen Wittenauer  wrote:
>
> On Jun 14, 2011, at 5:48 PM, Eli Collins wrote:
>> In short, an Apache Hadoop release with a backport of PMC approved
>> code or critical security fix is not powered by Hadoop, it is Hadoop,
>> while a new product that contains or runs atop Hadoop is powered by
>> Hadoop.
>>
>> Reasonable?
>
>        I'd say: Security, yes.  Features, no.
>
>        The reason I say this is because there have been many, many, many 
> posts in the -user mailing lists where people are confused as to what 
> versions have what features because their local branch has a backported fix. 
>  [I think I run out of fingers if I count how many times just the 
> mapred.map.child.java.opts was said to be "in 20" prior to the 0.20.203 
> release...]
>
>        This also adds pressure to do timely releases. :)
>

I agree this is a problem, but I don't think this is an effective means of
solving it.

Are we really going to go after all the web companies that patch in an
enhancement to their current Hadoop build and tell them to stop saying
that they are using Hadoop?  You've patched Hadoop many times, should
your employer not be able to say they use Hadoop?  I'm -1 on a
proposal that does this.

Thanks,
Eli


Re: [VOTE] Shall we adopt the "Defining Hadoop" page

2011-06-14 Thread Eli Collins
On Tue, Jun 14, 2011 at 3:56 PM, Owen O'Malley  wrote:
> All,
>   Steve Loughran has done some great work on defining what can be called 
> Hadoop at http://wiki.apache.org/hadoop/Defining%20Hadoop. After some cleanup 
> from Noirin and Shane, I think we've got a really good base. I'd like a vote 
> to approve the content (at the current revision 12) and put the content on 
> our web site.
>
> Clearly, I'm +1.
>
> -- Owen

I'd like to make another suggestion. Currently we call two types of
things powered by Apache Hadoop:

1. Something that runs on Hadoop (eg HBase or Karmasphere)
2. Something that includes Hadoop artifacts/source code

Shouldn't we distinguish between these two, such that the 2nd is not
powered by Hadoop? Eg tc server is not powered by Apache Tomcat right?

Apologies for having discussion on a vote thread but this is the first
time I've seen the current revision and it seems reasonable to have an
opportunity to discuss a specific revision before voting on it.

Thanks,
Eli


Re: [VOTE] Shall we adopt the "Defining Hadoop" page

2011-06-14 Thread Allen Wittenauer

On Jun 14, 2011, at 5:48 PM, Eli Collins wrote:
> In short, an Apache Hadoop release with a backport of PMC approved
> code or critical security fix is not powered by Hadoop, it is Hadoop,
> while a new product that contains or runs atop Hadoop is powered by
> Hadoop.
> 
> Reasonable?

I'd say: Security, yes.  Features, no.

The reason I say this is because there have been many, many, many posts 
in the -user mailing lists where people are confused as to what versions have 
what features because their local branch has a backported fix.  [I think I run 
out of fingers if I count how many times just the mapred.map.child.java.opts 
was said to be "in 20" prior to the 0.20.203 release...]

This also adds pressure to do timely releases. :)



Re: [VOTE] Shall we adopt the "Defining Hadoop" page

2011-06-14 Thread Eli Collins
On Tue, Jun 14, 2011 at 3:56 PM, Owen O'Malley  wrote:
> All,
>   Steve Loughran has done some great work on defining what can be called 
> Hadoop at http://wiki.apache.org/hadoop/Defining%20Hadoop. After some cleanup 
> from Noirin and Shane, I think we've got a really good base. I'd like a vote 
> to approve the content (at the current revision 12) and put the content on 
> our web site.
>
> Clearly, I'm +1.
>
> -- Owen

Thanks for putting this together Steve, good stuff!

Wrt derivative works, it's not clear from the document, but I think we
should explicitly adopt the policy of HTTPD and Subversion that
backported patches from trunk and security fixes are permitted.
Specifically, that cherry-picking changes from trunk or release
branches and, in general, any code that's been subject to lazy
consensus approval by the PMC does not make you a derivative work. For
example, RedHat backports [1] to Apache HTTP and of course still calls
it Apache HTTP.

In short, an Apache Hadoop release with a backport of PMC approved
code or critical security fix is not powered by Hadoop, it is Hadoop,
while a new product that contains or runs atop Hadoop is powered by
Hadoop.

Reasonable?

Thanks,
Eli

1. https://access.redhat.com/security/updates/backporting/?sc_cid=3093


Re: [VOTE] Shall we adopt the "Defining Hadoop" page

2011-06-14 Thread Konstantin Boudnik
+1 - makes sense!

--
  Take care,
Konstantin (Cos) Boudnik
2CAC 8312 4870 D885 8616  6115 220F 6980 1F27 E622

Disclaimer: Opinions expressed in this email are those of the author,
and do not necessarily represent the views of any company the author
might be affiliated with at the moment of writing.



On Tue, Jun 14, 2011 at 15:56, Owen O'Malley  wrote:
> All,
>   Steve Loughran has done some great work on defining what can be called 
> Hadoop at http://wiki.apache.org/hadoop/Defining%20Hadoop. After some cleanup 
> from Noirin and Shane, I think we've got a really good base. I'd like a vote 
> to approve the content (at the current revision 12) and put the content on 
> our web site.
>
> Clearly, I'm +1.
>
> -- Owen


Re: [VOTE] Shall we adopt the "Defining Hadoop" page

2011-06-14 Thread Ian Holsman
+1.
great job Steve!

On Jun 15, 2011, at 8:56 AM, Owen O'Malley wrote:

> All,
>Steve Loughran has done some great work on defining what can be called 
> Hadoop at http://wiki.apache.org/hadoop/Defining%20Hadoop. After some cleanup 
> from Noirin and Shane, I think we've got a really good base. I'd like a vote 
> to approve the content (at the current revision 12) and put the content on 
> our web site.
> 
> Clearly, I'm +1.
> 
> -- Owen

--
Ian Holsman
i...@holsman.net
PH: +1-703 879-3128 AOLIM: ianholsman Skype:iholsman

Never explain - your friends do not need it and your enemies will not believe 
you anyway. Elbert Hubbard






Re: [VOTE] Shall we adopt the "Defining Hadoop" page

2011-06-14 Thread Allen Wittenauer

On Jun 14, 2011, at 3:56 PM, Owen O'Malley wrote:

> All,
>   Steve Loughran has done some great work on defining what can be called 
> Hadoop at http://wiki.apache.org/hadoop/Defining%20Hadoop. After some cleanup 
> from Noirin and Shane, I think we've got a really good base. I'd like a vote 
> to approve the content (at the current revision 12) and put the content on 
> our web site.
> 
> Clearly, I'm +1.


This is awesome.  Good job everyone!

A minor nit: I'd like to see some cleanup between the first paragraph 
and the fourth paragraph in compatibility.  Or was the re-iteration of our "not 
a standards committee" intentional?  It is sort of awkward as it is currently 
written.

Also, where can I download Camshaft?