Re: Defining Hadoop Compatibility -revisiting-

2011-05-16 Thread Steve Loughran

On 13/05/11 05:52, Milind Bhandarkar wrote:

Ok, my mistake. They have only asked for documented specifications. I may
have been influenced by all the specifications I have read. All of them
were in English, which is characterized as a natural language.

But then, if you are proposing a specification in a non-natural language,
isn't that called a test suite? Or is there a middle ground?


There are formal specifications in languages like Z. We don't really want 
to go there if we can help it, as all it lets you do is prove 
correctness if you're a mathematician, and I haven't found the 
mathematician plugin for Jenkins yet.


There are also languages like Extended ML, from Sannella et al., who may be 
familiar to Doug from his time in the frozen lands of the north 
(Edinburgh):

http://homepages.inf.ed.ac.uk/dts/eml/
Some of the bits of spec in this language can be executed, as long as 
you don't start declaring things about state over time. Again, though, 
it's hard work, unless your target language is, say, ML or Haskell, as 
there you can jump from specification to implementation fairly rapidly.


What the formal stuff is good for is things like consistency protocols, 
so I'd hope someone did get out the proofs for ZooKeeper, so the rest of 
us can rely on it working.


-Steve




Re: Defining Hadoop Compatibility -revisiting-

2011-05-16 Thread Steve Loughran

On 13/05/11 23:57, Allen Wittenauer wrote:


On May 13, 2011, at 3:53 PM, Ted Dunning wrote:


But "distribution Z includes X" kind of implies the existence of some Y such
that X != Y, Y != empty-set and X+Y = Z, at least in common usage.

Isn't that the same as a non-trunk change?

So doesn't this mean that your question reduces to the question of what
happens when non-Apache changes are made to an Apache release?  And isn't
that the definition of a derived work?



Yup. Which is why I doubt *any* commercial entity can claim "includes Apache 
Hadoop" (including Cloudera).




but they can claim it is a derivative work, which CDH clearly is. 
(Though if we were to come up with a formal declaration of what a 
derivative work is, we'd have to handle the fact that it is a superset. 
Even worse, you may realise a release is the ordered application of a 
sequence of patches, and if the patches are applied in a different order 
you may end up with a different body of source code...)


Something that implements the APIs may not be a derivative work, 
depending on how much of the original code is in there. You could look 
at the base classes and interfaces and produce a clean room 
implementation (relying on the notion that interfaces are a list of 
facts and not copyrightable in the US), but whoever does that may 
encounter the issue that Google's donation of the right to use their MR 
patent may not apply to such implementations.


Re: Defining Hadoop Compatibility -revisiting-

2011-05-16 Thread Steve Loughran

On 13/05/11 23:16, Doug Cutting wrote:

On 05/14/2011 12:13 AM, Allen Wittenauer wrote:

So what do we do about companies that release a product that says "includes Apache 
Hadoop" but includes patches that aren't committed to trunk?


We yell at them to get those patches into trunk already.  This policy
was clarified after that product was shipping.

Doug


I distributed some RPMs with my lifecycle branch in; I can't remember 
what I called them, but I'd better revisit all my .spec files to make 
sure the text is valid. Even with 0.21 JARs, what should I call it?



sf-apache-hadoop-operations.rpm

"This RPM contains the JAR artifacts of Apache Hadoop 0.21 and SmartFrog 
components to manage Hadoop clusters, manipulate the distributed 
filesystems, and submit MapReduce jobs"


Would that work?


Re: Defining Hadoop Compatibility -revisiting-

2011-05-16 Thread Steve Loughran

On 13/05/11 07:16, Doug Cutting wrote:

Certification seems like mission creep.  Our mission is to produce
open-source software.  If we wish to produce testing software, that
seems fine.  But running a certification program for non-open-source
software seems like a different task.



+1

That said, some stricter definition of public interfaces may be useful 
for the related projects, as a consistent open source stack is strongly 
beneficial.



The Hadoop mark should only be used to refer to open-source software
produced by the ASF.  If other folks wish to make factual statements
concerning our software, e.g., that their proprietary software passes
tests that we've created, that may be fine, but I don't think we should
validate those claims by granting certifications to institutions.  That
ventures outside the mission of the ASF.  We are not an accrediting
organization.


+1. Apache is not a standards body, except in the form of "de facto 
standards defined by working code and their test suites".


What it does have is strict rules about naming. We should formalise them 
and publish them on the wiki; then, whenever some product gets 
press-released (it's like a beta release, only earlier in the 
lifecycle), the vendor can be directed to the page and reminded of the 
T&Cs of the license and any trademarks.


What does this mean for T-Shirts and Stickers, incidentally?



Re: Defining Hadoop Compatibility -revisiting-

2011-05-16 Thread Segel, Mike
But Cloudera's release is a bit murky.

The math example is a bit flawed...

X represents the set of stable releases.
Y represents the set of available patches.
C represents the set of Cloudera releases.

So if C contains a release X(n) plus a set of patches that is contained in Y,
Then does it not have the right to be considered Apache Hadoop?
It's my understanding that any enhancement to Hadoop is made available to 
Apache and will eventually make it into a later release...

So while it may not be 'official' release X(z), all of its components are in 
Apache.
(note: I'm talking about the core components and not Cloudera's additional 
toolsets that encompass Hadoop.)

Cloudera is clearly a derivative work.
And IMHO is the only one which can say ... 'Includes Apache Hadoop'.

That doesn't mean that others can't, depending on how they implemented their 
changes.
Based on EMC marketing material, they've done a rip and replace of HDFS.
So it wouldn't be a superset since it doesn't contain a complete subset, but 
contains code that implements the API... So they can't say 'Includes Apache 
Hadoop', but they can say it's a derivative work based on Apache Hadoop and then 
go on to show how and why, in their opinion, their product is better. (That's 
marketing for you...)

Clearly there are others out there... 
Hadoop on Cassandra as an example...

Fragmentation of Hadoop will occur. It's inevitable. Too much money is on the 
table...

But because Apache's licensing is so open, Apache will have a hard time 
controlling derivative works...  
I believe that Steve is incorrect in his assertion concerning potential loss of 
any patent protection. Again Apache's licensing is very open and as long as 
they follow Apache's Ts and Cs, they are covered.

Note: because I am sending this from my email address at my client, I am 
obliged to say that this email is my opinion and does not reflect on the 
opinion of my client...
(you know the rest)

Sent from a remote device. Please excuse any typos...

Mike Segel

On May 16, 2011, at 6:02 AM, "Steve Loughran"  wrote:

> On 13/05/11 23:57, Allen Wittenauer wrote:
>> 
>> On May 13, 2011, at 3:53 PM, Ted Dunning wrote:
>> 
>>> But "distribution Z includes X" kind of implies the existence of some such
>>> that X != Y, Y != empty-set and X+Y = Z, at least in common usage.
>>> 
>>> Isn't that the same as a non-trunk change?
>>> 
>>> So doesn't this mean that your question reduces to the question of what
>>> happens when non-Apache changes are made to an Apache release?  And isn't
>>> that the definition of a derived work?
>> 
>> 
>>Yup. Which is why I doubt *any* commercial entity can claim "includes 
>> Apache Hadoop" (including Cloudera).
>> 
>> 
> 
> but they can claim it is a derivative work, which CDH clearly is, 
> (Though if we were to come up with a formal declaration of what a 
> derivative work is, we'd have to handle the fact that it is a superset. 
> Even worse, you may realise a release is the ordered application of a 
> sequence of patches, and if the patches are applied in a different order 
> you may end up with a different body of source code...)
> 
> Something that implements the APIs may not be a derivative work, 
> depending on how much of the original code is in there. You could look 
> at the base classes and interfaces and produce a clean room 
> implementation (relying on the notion that interfaces are a list of 
> facts and not copyrightable in the US), but whoever does that may 
> encounter the issue that Google's donation of the right to use their MR 
> patent may not apply to such implementations.


The information contained in this communication may be CONFIDENTIAL and is 
intended only for the use of the recipient(s) named above.  If you are not the 
intended recipient, you are hereby notified that any dissemination, 
distribution, or copying of this communication, or any of its contents, is 
strictly prohibited.  If you have received this communication in error, please 
notify the sender and delete/destroy the original message and any copy of it 
from your computer or paper files.


Re: Defining Hadoop Compatibility -revisiting-

2011-05-16 Thread Steve Loughran

On 16/05/11 13:00, Segel, Mike wrote:

But Cloudera's release is a bit murky.

The math example is a bit flawed...

X represents the set of stable releases.
Y represents the set of available patches.
C represents the set of Cloudera releases.

So if C contains a release X(n) plus a set of patches that is contained in Y,
Then does it not have the right to be considered Apache Hadoop?
It's my understanding that any enhancement to Hadoop is made available to 
Apache and will eventually make it into a later release...


It certainly contains it.

Now, if you want to make life more complex:
- view the contributions to the code base as a series of patches P1...Pn, 
each of which changes the code.
- these patches are essentially functions that transform the source S to 
a new state S'.
- the initial state of the source codebase is S0.

Hypothesis: the order in which the patch functions are applied 
determines the final state of the source tree.


If patches P1 and P2 were applied in order, you would get a state

S' = P2(P1(S0))

Applying the patches in a different order, you get a new final state.
S'' = P1(P2(S0))


The question for the maths people, then, is: can you be sure that S' and S'' 
are the same? It would seem to me that it depends on the nature of 
the functions. It could be that the set of functions that SVN supports 
guarantees sameness, but given conflict-resolution problems I've 
encountered in the past, I doubt this.


Assuming that my belief holds (that the order in which a series of SVN 
patches are applied determines the final state of the source tree), then 
saying the patch sets (the set of functions applied to the source) of 
two codebases are equivalent does not mean the final state of the code 
is the same unless the sequence of application is also the same.


That would then define an Apache release as a strictly ordered sequence 
of patches, or at least a sequence of operations that leads to the same 
final code state, such as S0.20.3.


(Oh look, I've just written a formal definition of what a release is, 
though I've avoided defining what a function is. View them as planar 
projections in Cartesian space or something.)
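The hypothesis above (that patch application need not commute) can be illustrated with a toy model. This is a sketch in Python with made-up patch functions, not a model of real SVN merge behaviour:

```python
# Toy model: patches are functions on the source state, and function
# composition is not commutative in general. The patches below are
# hypothetical; they just show textual conflict between two changes.

def p1(source):
    # Patch 1: bump a specific setting.
    return source.replace("BUFFER=4096", "BUFFER=8192")

def p2(source):
    # Patch 2: double any 4096 it finds (textually conflicts with p1).
    return source.replace("4096", "4096 * 2")

s0 = "BUFFER=4096"           # initial source state S0

s_prime = p2(p1(s0))         # S'  = P2(P1(S0)) -> "BUFFER=8192"
s_double_prime = p1(p2(s0))  # S'' = P1(P2(S0)) -> "BUFFER=8192 * 2"

# Same set of patches, different order, different final source tree:
assert s_prime != s_double_prime
```

The same pair of patches yields different trees depending on order, which is exactly why "equivalent patch sets" alone does not establish "equivalent code".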





So while it may not be 'official' release X(z), all of its components are in 
Apache.
(note: I'm talking about the core components and not Cloudera's additional 
toolsets that encompass Hadoop.)

Cloudera is clearly a derivative work.
And IMHO is the only one which can say ... 'Includes Apache Hadoop'.


Once you start thinking about the ordering of the patch functions it 
gets complicated.



That doesn't mean that others can't, depending on how they implemented their 
changes.


Yes, though again it depends on the sequence of functions applied to the 
released source code, such as S0.20.3, to get to the version they ship.



So it wouldn't be a superset since it doesn't contain a complete subset, but 
contains code that implements the API... So they can't say 'Includes Apache 
Hadoop', but they can say it's a derivative work based on Apache Hadoop and then 
go on to show how and why, in their opinion, their product is better. (That's 
marketing for you...)


I agree


Fragmentation of Hadoop will occur. It's inevitable. Too much money is on the 
table...


Clearly, but there are still some questions we can resolve here:
 -what do they call their products?
 -how can they support assertions that their code is compatible if the 
series of patches they have applied to the codebase is not externally 
visible?
 -what are the concerns of the community about naming and branching?




But because Apache's licensing is so open, Apache will have a hard time 
controlling derivative works...


The Apache license permits anyone to fork and take that fork in-house or 
closed source. Most people are considered daft to do this except for 
quick fixes, because any closed-source fork takes on the task of writing 
the functions needed to transform it from the released state to one that 
matches customer needs (i.e. the working state).




I believe that Steve is incorrect in his assertion concerning potential loss of 
any patent protection. Again Apache's licensing is very open and as long as 
they follow Apache's Ts and Cs, they are covered.


Possibly. I avoid such legal issues.

-steve


MAPREDUCE-5

2011-05-16 Thread Evert Lammerts
When reducers start running during a certain job 
(mapred.reduce.slowstart.completed.maps = 0.8), it takes about 20 minutes before 
the DN stops responding. This seems to be due to a number of exceptions in the TT 
- at least, it's the only place I'm seeing errors. The three recurring ones are 
getMapOutput, EOFException and IllegalStateException. It seems related to 
https://issues.apache.org/jira/browse/MAPREDUCE-5. See an excerpt from the logs 
attached.

We're running Hadoop 0.20.2 on a 6 node (test) cluster with:

# java -version
java version "1.6.0_24"
Java(TM) SE Runtime Environment (build 1.6.0_24-b07)
Java HotSpot(TM) 64-Bit Server VM (build 19.1-b02, mixed mode)

Can anybody shed some light on this?

Thanks a bunch,
Evert
	at org.mortbay.io.nio.SelectChannelEndPoint.flush(SelectChannelEndPoint.java:221)
	at org.mortbay.jetty.HttpGenerator.flush(HttpGenerator.java:725)
	... 27 more

2011-05-15 15:50:43,670 INFO org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 192.168.28.214:50060, dest: 192.168.28.214:52313, bytes: 65536, op: MAPRED_SHUFFLE, cliID: attempt_201105142137_0001_m_000694_0, duration: 190241000
2011-05-15 15:50:43,670 ERROR org.mortbay.log: /mapOutput
java.lang.IllegalStateException: Committed
	at org.mortbay.jetty.Response.resetBuffer(Response.java:1023)
	at org.mortbay.jetty.Response.sendError(Response.java:240)
	at org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:3718)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
	at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
	at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
	at org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:824)
	at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
	at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
	at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
	at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
	at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
	at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
	at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
	at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
	at org.mortbay.jetty.Server.handle(Server.java:326)
	at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
	at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
	at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
	at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
	at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
	at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410)
	at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
2011-05-15 15:50:43,731 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201105142137_0001_r_10_0 0.08219343% reduce > copy (685 of 2778 at 3.24 MB/s) > 
2011-05-15 15:50:44,115 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201105142137_0001_m_002268_0 0.978454% 
2011-05-15 15:50:44,134 INFO org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 192.168.28.214:50060, dest: 192.168.28.213:33477, bytes: 6869628, op: MAPRED_SHUFFLE, cliID: attempt_201105142137_0001_m_000647_0, duration: 4328048000
2011-05-15 15:50:44,238 INFO org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 192.168.28.214:50060, dest: 192.168.28.213:33477, bytes: 18, op: MAPRED_SHUFFLE, cliID: attempt_201105142137_0001_m_000670_0, duration: 314000
2011-05-15 15:50:44,299 INFO org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 192.168.28.214:50060, dest: 192.168.28.213:33472, bytes: 7042916, op: MAPRED_SHUFFLE, cliID: attempt_201105142137_0001_m_000671_0, duration: 10883565000
2011-05-15 15:50:44,382 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201105142137_0001_m_002267_0 0.8996846% 
2011-05-15 15:50:44,574 WARN org.apache.hadoop.mapred.TaskTracker: getMapOutput(attempt_201105142137_0001_m_000649_0,1) failed :
org.mortbay.jetty.EofException
	at org.mortbay.jetty.HttpGenerator.flush(HttpGenerator.java:791)
	at org.mortbay.jetty.AbstractGenerator$Output.flush(AbstractGenerator.java:569)
	at org.mortbay.jetty.HttpConnection$Output.flush(HttpConnection.java:1012)
	at org.mortbay.jetty.AbstractGenerator$Output.write(AbstractGenerator.java:651)
	at org.mortbay.jetty.AbstractGenerator$Output.write(AbstractGenerator.java:580)
	at org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:3693)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
	at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
	at 
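For context, mapred.reduce.slowstart.completed.maps is the fraction of a job's map tasks that must finish before its reduce tasks are scheduled. With the 0.8 setting above and the 2778 maps visible in the log excerpt, the threshold works out roughly as follows (a sketch of the arithmetic only, not the scheduler code):

```python
import math

# mapred.reduce.slowstart.completed.maps: fraction of a job's map tasks
# that must complete before its reduce tasks are scheduled.
slowstart = 0.8
num_maps = 2778  # map count taken from the log excerpt above

# Roughly: reducers launch once this many maps have completed.
threshold = math.ceil(slowstart * num_maps)
print(threshold)  # 2223
```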

Re: Defining Hadoop Compatibility -revisiting-

2011-05-16 Thread Allen Wittenauer

On May 16, 2011, at 5:00 AM, Segel, Mike wrote:
> X represents the set of stable releases.
> Y represents the set of available patches.
> C represents the set of Cloudera releases.
> 
> So if C contains a release X(n) plus a set of patches that is contained in Y,
> Then does it not have the right to be considered Apache Hadoop?
> It's my understanding is that any enhancement to Hadoop is made available to 
> Apache and will eventually make it into a later release...

This assumption is probably wrong.  It likely wouldn't be hard to find 
patches made in Cloudera Hadoop that have been rejected from Apache Hadoop.  I 
know some of the code in Cloudera Hadoop 2 was definitely rejected.  If 
Cloudera Hadoop 3's lineage is based upon 2...





Acceptance tests

2011-05-16 Thread Evert Lammerts
Hi all,

What acceptance tests are people using when buying clusters for Hadoop? Any 
pointers to relevant methods?

Thanks,
Evert Lammerts

Re: Acceptance tests

2011-05-16 Thread Allen Wittenauer

On May 16, 2011, at 11:03 AM, Evert Lammerts wrote:

> Hi all,
> 
> What acceptance tests are people using when buying clusters for Hadoop? Any 
> pointers to relevant methods?


We get some test nodes from various manufacturers.  We do some raw I/O 
benchmarking vs. our other nodes.  We add them to our various grids to see how 
they perform in the real world, paying attention to average task turnaround 
time for certain jobs.  Since we know where our current machines are, we can 
look at price-per-performance improvements.

Other random things that I think are important:

a) Unless someone shares their entire *-site.xml data, most 
published benchmarks on the net are mostly useless.  Simple things like block 
size have a big impact.

b) Test your actual workload.  Synthetic benchmarks are just 
that: synthetic.  They may not reflect the particular nuances of your job.

c) Establish a baseline. If you have no hardware today, then at 
least establish something on EC2 to compare.

d) Make sure you talk to multiple vendors.

e) Any advice anyone gives you on config is likely going to be 
wrong.
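The price-per-performance comparison described above can be sketched as a back-of-the-envelope calculation. All the node costs and task times below are hypothetical; your own job timings (point b) are what should feed it:

```python
# Back-of-the-envelope price/performance comparison for candidate nodes,
# using average task turnaround on a known job as the performance metric.
# All figures below are hypothetical placeholders.

candidates = {
    # name: (cost per node in dollars, avg task time in seconds on the test job)
    "current":  (3000, 120.0),
    "vendor_a": (4000, 80.0),
    "vendor_b": (3500, 100.0),
}

def cost_per_throughput(cost, avg_task_seconds):
    # Throughput ~ tasks/hour; lower cost per unit throughput is better.
    tasks_per_hour = 3600.0 / avg_task_seconds
    return cost / tasks_per_hour

# Rank candidates by dollars per (task/hour), cheapest first.
ranked = sorted(candidates, key=lambda n: cost_per_throughput(*candidates[n]))
for name in ranked:
    print(name, round(cost_per_throughput(*candidates[name]), 2))
```

With these made-up numbers, the faster but pricier node still wins on cost per unit of throughput, which is the point of benchmarking against a known baseline rather than comparing sticker prices.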

Re: Defining Hadoop Compatibility -revisiting-

2011-05-16 Thread Eli Collins
On Mon, May 16, 2011 at 10:19 AM, Allen Wittenauer  wrote:
>
> On May 16, 2011, at 5:00 AM, Segel, Mike wrote:
>> X represents the set of stable releases.
>> Y represents the set of available patches.
>> C represents the set of Cloudera releases.
>>
>> So if C contains a release X(n) plus a set of patches that is contained in Y,
>> Then does it not have the right to be considered Apache Hadoop?
>> It's my understanding is that any enhancement to Hadoop is made available to 
>> Apache and will eventually make it into a later release...
>
>        This assumption is probably wrong.  It likely wouldn't be hard to find 
> patches made in Cloudera Hadoop that have been rejected from Apache Hadoop.  
> I know some of the code in Cloudera Hadoop 2 was definitely rejected.  If 
> Cloudera Hadoop 3's lineage is based upon 2...

Allen,

There are a few things in CDH's Hadoop that are not in trunk,
branch-20-security, or branch-20-append.  The stuff in this category
is not major (e.g. HADOOP-6605, better JAVA_HOME detection).

One of the things we and others are busy doing is getting the work
from CDH3 and 20x (formerly YDH) checked into trunk so a future
release won't regress against these 20-based releases.

Most projects in CDH are not heavily patched, btw; they're close to an
upstream Apache release.  Hadoop is the exception.
https://ccp.cloudera.com/display/DOC/Downloading+CDH+Releases

Thanks,
Eli


Re: Defining Hadoop Compatibility -revisiting-

2011-05-16 Thread Matthew Foley
It's important to distinguish between the name "Hadoop", which is protected by 
trademark law,
and the Hadoop implementation, which is licensed as open source under copyright 
law.

The term "derivative work" is, I believe, only relevant under copyright law, 
not trademark law.
(N.B., I'm not a lawyer -- and this email is my opinion, not my employer's.)  
Since the Apache License
explicitly allows derivative works, I don't think it's a useful term for this 
discussion.

However, the ASF, and by delegation the Hadoop PMC, has a lot of control over 
the name,
and how we allow it to be used, under trademark law.  But to keep our rights 
under that
law, we have to enforce the trademark consistently.  So it's good that we're 
having this discussion,
and it's important to reach a conclusion, document it, and enforce it 
consistently.

There are a lot of subtleties; for instance, if I recall correctly from my days 
with Adobe and
PostScript(R), someone who has not licensed a trademark "X" can still claim 
"compatible with X"
as long as they ALSO make clear that the product is NOT, itself, an "X".  But 
you really need
a lawyer to get into that stuff.

--Matt


On May 16, 2011, at 5:00 AM, Segel, Mike wrote:

But Cloudera's release is a bit murky.

The math example is a bit flawed...

X represents the set of stable releases.
Y represents the set of available patches.
C represents the set of Cloudera releases.

So if C contains a release X(n) plus a set of patches that is contained in Y,
Then does it not have the right to be considered Apache Hadoop?
It's my understanding that any enhancement to Hadoop is made available to 
Apache and will eventually make it into a later release...

So while it may not be 'official' release X(z), all of its components are in 
Apache.
(note: I'm talking about the core components and not Cloudera's additional 
toolsets that encompass Hadoop.)

Cloudera is clearly a derivative work.
And IMHO is the only one which can say ... 'Includes Apache Hadoop'.

That doesn't mean that others can't, depending on how they implemented their 
changes.
Based on EMC marketing material, they've done a rip and replace of HDFS.
So it wouldn't be a superset since it doesn't contain a complete subset, but 
contains code that implements the API... So they can't say 'Includes Apache 
Hadoop', but they can say it's a derivative work based on Apache Hadoop and then 
go on to show how and why, in their opinion, their product is better. (That's 
marketing for you...)

Clearly there are others out there...
Hadoop on Cassandra as an example...

Fragmentation of Hadoop will occur. It's inevitable. Too much money is on the 
table...

But because Apache's licensing is so open, Apache will have a hard time 
controlling derivative works...
I believe that Steve is incorrect in his assertion concerning potential loss of 
any patent protection. Again Apache's licensing is very open and as long as 
they follow Apache's Ts and Cs, they are covered.

Note: because I am sending this from my email address at my client, I am 
obliged to say that this email is my opinion and does not reflect on the 
opinion of my client...
(you know the rest)

Sent from a remote device. Please excuse any typos...

Mike Segel

On May 16, 2011, at 6:02 AM, "Steve Loughran" 
mailto:ste...@apache.org>> wrote:

On 13/05/11 23:57, Allen Wittenauer wrote:

On May 13, 2011, at 3:53 PM, Ted Dunning wrote:

But "distribution Z includes X" kind of implies the existence of some Y such
that X != Y, Y != empty-set and X+Y = Z, at least in common usage.

Isn't that the same as a non-trunk change?

So doesn't this mean that your question reduces to the question of what
happens when non-Apache changes are made to an Apache release?  And isn't
that the definition of a derived work?


  Yup. Which is why I doubt *any* commercial entity can claim "includes Apache 
Hadoop" (including Cloudera).



but they can claim it is a derivative work, which CDH clearly is,
(Though if we were to come up with a formal declaration of what a
derivative work is, we'd have to handle the fact that it is a superset.
Even worse, you may realise a release is the ordered application of a
sequence of patches, and if the patches are applied in a different order
you may end up with a different body of source code...)

Something that implements the APIs may not be a derivative work,
depending on how much of the original code is in there. You could look
at the base classes and interfaces and produce a clean room
implementation (relying on the notion that interfaces are a list of
facts and not copyrightable in the US), but whoever does that may
encounter the issue that Google's donation of the right to use their MR
patent may not apply to such implementations.



Re: Defining Hadoop Compatibility -revisiting-

2011-05-16 Thread Allen Wittenauer

On May 16, 2011, at 2:09 PM, Eli Collins wrote:
> 
> Allen,
> 
> There are few things in Hadoop in CDH that are not in trunk,
> branch-20-security, or branch-20-append.  The stuff in this category
> is not major (eg HADOOP-6605, better JAVA_HOME detection).

But that's my point: when is it no longer Apache Hadoop?  How major 
does a change need to be before it crosses the line?  In the case of CDH2 and 3, 
in order to test it out, I actually had to back out some of Cloudera's 
"improvements" in order to even test, whereas I didn't under Apache.  Is this 
another place where we only seem to care about APIs and say to hell with the 
rest of the stack?




Re: Defining Hadoop Compatibility -revisiting-

2011-05-16 Thread Eli Collins
On Mon, May 16, 2011 at 2:25 PM, Allen Wittenauer  wrote:
>
> On May 16, 2011, at 2:09 PM, Eli Collins wrote:
>>
>> Allen,
>>
>> There are few things in Hadoop in CDH that are not in trunk,
>> branch-20-security, or branch-20-append.  The stuff in this category
>> is not major (eg HADOOP-6605, better JAVA_HOME detection).
>
>        But that's my point:  when is it no longer Apache Hadoop?  How major 
> does a change need to be under the line?    In the case of CDH2 and 3, in 
> order to test it out, I actually had to back out some of Cloudera's 
> "improvements" in order to even test whereas I didn't under Apache.  Is this 
> another place where we only seem to care about APIs and say to hell with the 
> rest of the stack?
>

I don't think anyone is saying to hell with the rest of the stack, and
everyone I've spoken to is on board with a future release that
doesn't require lots of backporting from feature branches.

Thanks,
Eli


Re: Defining Hadoop Compatibility -revisiting-

2011-05-16 Thread Allen Wittenauer

On May 16, 2011, at 2:29 PM, Eli Collins wrote:

> On Mon, May 16, 2011 at 2:25 PM, Allen Wittenauer  wrote:
>> 
>> On May 16, 2011, at 2:09 PM, Eli Collins wrote:
>>> 
>>> Allen,
>>> 
>>> There are few things in Hadoop in CDH that are not in trunk,
>>> branch-20-security, or branch-20-append.  The stuff in this category
>>> is not major (eg HADOOP-6605, better JAVA_HOME detection).
>> 
>>But that's my point:  when is it no longer Apache Hadoop?  How major 
>> does a change need to be under the line?In the case of CDH2 and 3, in 
>> order to test it out, I actually had to back out some of Cloudera's 
>> "improvements" in order to even test whereas I didn't under Apache.  Is this 
>> another place where we only seem to care about APIs and say to hell with the 
>> rest of the stack?
>> 
> 
> I don't think anyone is saying to hell with the rest of the stack, and
> everyone I've spoken to is on-board with a future release that
> doesn't require lots of backporting from feature branches.

You've missed my point.

Does "Hadoop compatibility" and the ability to say "includes Apache 
Hadoop" only apply when we're talking about MR and HDFS APIs?  

Re: Defining Hadoop Compatibility -revisiting-

2011-05-16 Thread Ian Holsman

> 
>   Does "Hadoop compatibility" and the ability to say "includes Apache 
> Hadoop" only apply when we're talking about MR and HDFS APIs?  


It is confusing, isn't it?

We could go down the route Java did and say that the APIs are 'Hadoop' and 
ours is just a reference implementation of them. (But as others pointed out, we 
don't want to become a certification group.)

Out of curiosity, how good is our test suite at exercising our APIs? 
Is it sophisticated enough to catch someone adding a functionality-changing 
patch (e.g. the append one) and flag it as a test failure? 
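One way to make such a functionality-changing patch show up is a behavioural contract test run against each implementation. A minimal sketch, with hypothetical classes standing in for filesystem implementations (this is not Hadoop's actual test suite):

```python
# Sketch of a behavioural contract test: two "implementations" of the same
# API, one carrying a functionality-changing patch (append support).
# All class and function names here are hypothetical illustrations.

class BaseFS:
    def __init__(self):
        self.files = {}
    def create(self, path, data):
        self.files[path] = data
    def append(self, path, data):
        raise IOError("append not supported")

class PatchedFS(BaseFS):
    # A non-trunk patch that changes behaviour, not just API surface.
    def append(self, path, data):
        self.files[path] = self.files.get(path, "") + data

def contract_append_unsupported(fs):
    """The 'reference' contract: append must fail on this release line."""
    fs.create("/f", "abc")
    try:
        fs.append("/f", "def")
    except IOError:
        return True   # behaves like the reference release
    return False      # functionality-changing patch detected

print(contract_append_unsupported(BaseFS()))    # True
print(contract_append_unsupported(PatchedFS())) # False -> flag as failure
```

A suite built from such contract checks tests observable behaviour, not just that method signatures compile, which is what it would take to flag the append case.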



Re: Apache Hadoop Hackathon: 5/18 in Palo Alto and San Francisco

2011-05-16 Thread Jeff Hammerbacher
Hey,

We've got a great group coming together again on Wednesday for an Apache
Hadoop Hackathon in Palo Alto and San Francisco. Sign up at
http://hadoophackathon.eventbrite.com.

As a reminder, we'll have Nigel Daley, the release manager for 0.22, present
in Palo Alto. If you have build and release or testing skills and would like
to contribute to Apache Hadoop, your skills will be highly valued. Please
consider coming out to meet Nigel and help out on that front!

Also, there's a Hadoop User Group at Yahoo! in Sunnyvale on Wednesday
evening: http://www.meetup.com/hadoop/events/16805258. There will be a large
caravan leaving from the Palo Alto Hackathon to attend the HUG, so if you
want to Hadoop all day, stop by Palo Alto and join us for the trip south.

Regards,
Jeff

On Thu, May 12, 2011 at 2:40 PM, Jeff Hammerbacher wrote:

> Hey,
>
> Thanks to everyone who came out for the Apache Hadoop Hackathon yesterday
> in Palo Alto and San Francisco. We had 35 people sign up from a great cross
> section of companies: Yahoo!, Cloudera, Facebook, Apple, Twitter,
> Foursquare, AOL, Ngmoco, StumbleUpon, Trend Micro, Conviva, and more. We had
> committers from the HBase, Hive, Pig, and Oozie projects ensuring their
> projects work with the upcoming 0.22 release, and a number of folks got
> their first patches up. The 0.22 branch is now being built by Jenkins and
> we're in good shape to get a release candidate up soon.
>
> We had so much fun that we're going to do it again!
>
> Join us next Wednesday, May 18th, in either San Francisco or Palo Alto, and
> help us continue the march to get feature development back on trunk. We'll
> have a special guest in Palo Alto: Nigel Daley, Apache Hadoop PMC member and
> the release manager for the 0.22 release. If you're a sysadmin or devops
> person and want to help us build and test Apache Hadoop, this Hackathon will
> be a great chance to learn about how it's done. Of course we'd also love to
> have developers looking to get their first patches into Hadoop as well.
>
> Sign up at http://hadoophackathon.eventbrite.com and we'll see you
> Wednesday.
>
> Regards,
> Jeff
>
> p.s. if you're looking to prepare for the Hackathon, check out the issues
> tagged "newbie" at
> https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&jqlQuery=labels+%3D+newbie
> .
>


Re: Defining Hadoop Compatibility -revisiting-

2011-05-16 Thread Segel, Mike
I just checked... TESS said no trademarks for Hadoop.
So... what TM protection? :-)

You are correct about derivative works. It's a moot point as long as the 
derivative work follows the T&Cs...



Sent from a remote device. Please excuse any typos...

Mike Segel

On May 16, 2011, at 4:18 PM, "Matthew Foley"  wrote:

> It's important to distinguish between the name "Hadoop", which is protected 
> by trademark law,
> and the Hadoop implementation, which is licensed as opensource under 
> copyright law.
> 
> The term "derivative work" is, I believe, only relevant under copyright law, 
> not trademark law.
> (N.B., I'm not a lawyer -- and this email is my opinion, not my employer's.)  
> Since the Apache License
> explicitly allows derivative works, I don't think it's a useful term for this 
> discussion.
> 
> However, the ASF, and by delegation the Hadoop PMC, has a lot of control over 
> the name,
> and how we allow it to be used, under trademark law.  But to keep our rights 
> under that
> law, we have to enforce the trademark consistently.  So it's good that we're 
> having this discussion,
> and it's important to reach a conclusion, document it, and enforce it 
> consistently.
> 
> There are a lot of subtleties; for instance, if I recall correctly from my 
> days with Adobe and
> PostScript(R), someone who has not licensed a trademark "X" can still claim 
> "compatible with X"
> as long as they ALSO make clear that the product is NOT, itself, an "X".  But 
> you really need
> a lawyer to get into that stuff.
> 
> --Matt
> 
> 
> On May 16, 2011, at 5:00 AM, Segel, Mike wrote:
> 
> But Cloudera's release is a bit murky.
> 
> The math example is a bit flawed...
> 
> X represents the set of stable releases.
> Y represents the set of available patches.
> C represents the set of Cloudera releases.
> 
> So if C contains a release X(n) plus a set of patches that is contained in Y,
> then does it not have the right to be considered Apache Hadoop?
> It's my understanding that any enhancement to Hadoop is made available to 
> Apache and will eventually make it into a later release...
> 
> So while it may not be 'official' release X(z), all of its components are in 
> Apache.
> (note: I'm talking about the core components and not Cloudera's additional 
> toolsets that encompass Hadoop.)
> 
> Cloudera is clearly a derivative work.
> And IMHO is the only one which can say ... 'Includes Apache Hadoop'.
> 
> That doesn't mean that others can't, depending on how they implemented their 
> changes.
> Based on EMC marketing material, they've done a rip and replace of HDFS.
> So it wouldn't be a superset since it doesn't contain a complete subset, but 
> contains code that implements the API... So they can't say 'Includes Apache 
> Hadoop', but they can say it's a derivative work based on Apache Hadoop and 
> then go on to show how and why, in their opinion, their product is 
> better. (That's marketing for you...)
> 
> Clearly there are others out there...
> Hadoop on Cassandra as an example...
> 
> Fragmentation of Hadoop will occur. It's inevitable. Too much money is on the 
> table...
> 
> But because Apache's licensing is so open, Apache will have a hard time 
> controlling derivative works...
> I believe that Steve is incorrect in his assertion concerning potential loss 
> of any patent protection. Again Apache's licensing is very open and as long 
> as they follow Apache's Ts and Cs, they are covered.
> 
> Note: because I am sending this from my email address at my client, I am 
> obliged to say that this email is my opinion and does not reflect on the 
> opinion of my client...
> (you know the rest)
> 
> Sent from a remote device. Please excuse any typos...
> 
> Mike Segel
> 
> On May 16, 2011, at 6:02 AM, "Steve Loughran" 
> mailto:ste...@apache.org>> wrote:
> 
> On 13/05/11 23:57, Allen Wittenauer wrote:
> 
> On May 13, 2011, at 3:53 PM, Ted Dunning wrote:
> 
> But "distribution Z includes X" kind of implies the existence of some such
> that X != Y, Y != empty-set and X+Y = Z, at least in common usage.
> 
> Isn't that the same as a non-trunk change?
> 
> So doesn't this mean that your question reduces to the question of what
> happens when non-Apache changes are made to an Apache release?  And isn't
> that the definition of a derived work?
> 
> 
>  Yup. Which is why I doubt *any* commercial entity can claim "includes Apache 
> Hadoop" (including Cloudera).
> 
> 
> 
> but they can claim it is a derivative work, which CDH clearly is,
> (Though if we were to come up with a formal declaration of what a
> derivative work is, we'd have to handle the fact that it is a superset.
> Even worse, you may realise a release is the ordered application of a
> sequence of patches, and if the patches are applied in a different order
> you may end up with a different body of source code...)
> 
> Something that implements the APIs may not be a derivative work,
> depending on how much of the original code is in there. You could l

Re: Apache Hadoop Hackathon: 5/18 in Palo Alto and San Francisco

2011-05-16 Thread Joe Stein
Any chance for something in the east (NYC), or do I need to start nagging the
wife and kids that west coast weather is the way to go?

I will post on the NYC HUG list; maybe we can get a hack-together going to
contribute to Hadoop. If a few committers on this list are based in NYC (not
sure if there are any?), they could run a mini hackathon to help get folks
into contrib mode. That would rock!

Maybe even some standard type of "Hack @ HUG" that every HUG can do with
their users (build, patch, build, deploy, test) if they want...

Thanks to all contributors and committers for your hard work and continued
efforts/dedication.

On Mon, May 16, 2011 at 8:09 PM, Jeff Hammerbacher wrote:

> Hey,
>
> We've got a great group coming together again on Wednesday for an Apache
> Hadoop Hackathon in Palo Alto and San Francisco. Sign up at
> http://hadoophackathon.eventbrite.com.
>
> As a reminder, we'll have Nigel Daley, the release manager for 0.22,
> present
> in Palo Alto. If you have build and release or testing skills and would
> like
> to contribute to Apache Hadoop, your skills will be highly valued. Please
> consider coming out to meet Nigel and help out on that front!
>
> Also, there's a Hadoop User Group at Yahoo! in Sunnyvale on Wednesday
> evening: http://www.meetup.com/hadoop/events/16805258. There will be a
> large
> caravan leaving from the Palo Alto Hackathon to attend the HUG, so if you
> want to Hadoop all day, stop by Palo Alto and join us for the trip south.
>
> Regards,
> Jeff
>
> On Thu, May 12, 2011 at 2:40 PM, Jeff Hammerbacher  >wrote:
>
> > Hey,
> >
> > Thanks to everyone who came out for the Apache Hadoop Hackathon yesterday
> > in Palo Alto and San Francisco. We had 35 people sign up from a great
> cross
> > section of companies: Yahoo!, Cloudera, Facebook, Apple, Twitter,
> > Foursquare, AOL, Ngmoco, StumbleUpon, Trend Micro, Conviva, and more. We
> had
> > committers from the HBase, Hive, Pig, and Oozie projects ensuring their
> > projects work with the upcoming 0.22 release, and a number of folks got
> > their first patches up. The 0.22 branch is now being built by Jenkins and
> > we're in good shape to get a release candidate up soon.
> >
> > We had so much fun that we're going to do it again!
> >
> > Join us next Wednesday, May 18th, in either San Francisco or Palo Alto,
> and
> > help us continue the march to get feature development back on trunk.
> We'll
> > have a special guest in Palo Alto: Nigel Daley, Apache Hadoop PMC member
> and
> > the release manager for the 0.22 release. If you're a sysadmin or devops
> > person and want to help us build and test Apache Hadoop, this Hackathon
> will
> > be a great chance to learn about how it's done. Of course we'd also love
> to
> > have developers looking to get their first patches into Hadoop as well.
> >
> > Sign up at http://hadoophackathon.eventbrite.com and we'll see you
> > Wednesday.
> >
> > Regards,
> > Jeff
> >
> > p.s. if you're looking to prepare for the Hackathon, check out the issues
> > tagged "newbie" at
> >
> https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&jqlQuery=labels+%3D+newbie
> > .
> >
>



-- 

/*
Joe Stein
http://www.linkedin.com/in/charmalloc
Twitter: @allthingshadoop
*/


Re: Defining Hadoop Compatibility -revisiting-

2011-05-16 Thread Scott Carey
On trademarks, what about the phrase:  "New distribution for Apache
Hadoop"?  I've seen that used, and it's something that replaces most of the
stack.  I believe "Apache Hadoop" is trademarked in this context, even if
Hadoop alone isn't.
"Compatible with Apache Hadoop" is a smaller issue: defining some rough
guidelines for the various forms of compatibility is useful for the community
(and reputable vendors), and abuse of those guidelines will at least become
obvious.  But "distribution for Apache Hadoop" (not too sure what 'for' means
here)?  Is there any TM protection?  A proprietary derivative work with most
of the guts replaced is not an Apache Hadoop distribution, nor a distribution
for Apache Hadoop.

On 5/16/11 5:40 PM, "Segel, Mike"  wrote:

>I just checked... TESS said no trademarks for Hadoop.
>So... what TM protection? :-)
>
>You are correct about derivative works. It's a moot point as long as the
>derivative work follows the T&Cs...
>
>
>
>Sent from a remote device. Please excuse any typos...
>
>Mike Segel
>
>On May 16, 2011, at 4:18 PM, "Matthew Foley"  wrote:
>
>> It's important to distinguish between the name "Hadoop", which is
>>protected by trademark law,
>> and the Hadoop implementation, which is licensed as opensource under
>>copyright law.
>> 
>> The term "derivative work" is, I believe, only relevant under copyright
>>law, not trademark law.
>> (N.B., I'm not a lawyer -- and this email is my opinion, not my
>>employer's.)  Since the Apache License
>> explicitly allows derivative works, I don't think it's a useful term
>>for this discussion.
>> 
>> However, the ASF, and by delegation the Hadoop PMC, has a lot of
>>control over the name,
>> and how we allow it to be used, under trademark law.  But to keeps our
>>rights under that
>> law, we have to enforce the trademark consistently.  So it's good that
>>we're having this discussion,
>> and it's important to reach a conclusion, document it, and enforce it
>>consistently.
>> 
>> There are a lot of subtleties; for instance, if I recall correctly from
>>my days with Adobe and
>> PostScript(R), someone who has not licensed a trademark "X" can still
>>claim "compatible with X"
>> as long as they ALSO make clear that the product is NOT, itself, an
>>"X".  But you really need
>> a lawyer to get into that stuff.
>> 
>> --Matt
>> 
>> 
>> On May 16, 2011, at 5:00 AM, Segel, Mike wrote:
>> 
>> But Cloudera's release is a bit murky.
>> 
>> The math example is a bit flawed...
>> 
>> X represents the set of stable releases.
>> Y represents the set of available patches.
>> C represents the set of Cloudera releases.
>> 
>> So if C contains a release X(n) plus a set of patches that is contained
>>in Y,
>> Then does it not have the right to be considered Apache Hadoop?
>> It's my understanding is that any enhancement to Hadoop is made
>>available to Apache and will eventually make it into a later release...
>> 
>> So while it may not be 'official' release X(z), all of it's components
>>are in Apache.
>> (note: I'm talking about the core components and not Cloudera's
>>additional toolsets that encompass Hadoop.)
>> 
>> Cloudera is clearly a derivative work.
>> And IMHO is the only one which can say ... 'Includes Apache Hadoop'.
>> 
>> That doesn't mean that others can't, depending on how they implemented
>>their changes.
>> Based on EMC marketing material, they've done a rip and replace of HDFS.
>> So it wouldn't be a superset since it doesn't contain a complete
>>subset, but contains code that implements the API... So they can't say
>>'Includes Apache Hadoop',but they can say it's a derivative work based
>>on Apache Hadoop and then go on to show how and why, in their opinion
>>their product is better.(that's marketing for you...)
>> 
>> Clearly there are others out there...
>> Hadoop on Cassandra as an example...
>> 
>> Fragmentation of Hadoop will occur. It's inevitable. Too much money is
>>on the table...
>> 
>> But because Apache's licensing is so open, Apache will have a hard time
>>controlling derivative works...
>> I believe that Steve is incorrect in his assertion concerning potential
>>loss of any patent protection. Again Apache's licensing is very open and
>>as long as they follow Apache's Ts and Cs, they are covered.
>> 
>> Note: because I am sending this from my email address at my client, I
>>am obliged to say that this email is my opinion and does not reflect on
>>the opinion of my client...
>> (you know the rest)
>> 
>> Sent from a remote device. Please excuse any typos...
>> 
>> Mike Segel
>> 
>> On May 16, 2011, at 6:02 AM, "Steve Loughran"
>>mailto:ste...@apache.org>> wrote:
>> 
>> On 13/05/11 23:57, Allen Wittenauer wrote:
>> 
>> On May 13, 2011, at 3:53 PM, Ted Dunning wrote:
>> 
>> But "distribution Z includes X" kind of implies the existence of some
>>such
>> that X != Y, Y != empty-set and X+Y = Z, at least in common usage.
>> 
>> Isn't that the same as a non-trunk change?
>> 
>> So doesn't this mean that your question reduces to the question of what

Re: Defining Hadoop Compatibility -revisiting-

2011-05-16 Thread Segel, Mike
Let me clarify...
I searched for Hadoop as a term in any TM. 
Nothing came back...

Which means that "Apache Hadoop" didn't show up either.

Note the following: I only did the basic search, so I wouldn't be surprised if 
someone from Apache comes back and says see TM  ...

-Mike

Sent from a remote device. Please excuse any typos...

Mike Segel

On May 16, 2011, at 8:12 PM, Scott Carey  wrote:

> On trademarks, what about the phrase:  "New distribution for Apache
> Hadoop"?  I've seen that used, and its something that replaces most of the
> stack.  I believe "Apache Hadoop" is trademarked in this context, even if
> Hadoop alone isn't.
> "Compatible with Apache Hadoop" is a smaller issue, defining some rough
> guidelines for various forms of compatibility is useful for the community
> (and reputable vendors), abuse of that will at least become obvious.  But
> "distribution for Apache Hadoop" (not too sure what 'for' means here)?  Is
> there any TM protection?  A proprietary derivative work with most of the
> guts replaced is not an Apache Hadoop distribution, nor a distribution for
> Apache Hadoop.
> 
> On 5/16/11 5:40 PM, "Segel, Mike"  wrote:
> 
>> I just checked... TESS said no trademarks for Hadoop.
>> So... what TM protection? :-)
>> 
>> You are correct about derivative works. It's a moot point as long as the
>> derivative work follows the T&Cs...
>> 
>> 
>> 
>> Sent from a remote device. Please excuse any typos...
>> 
>> Mike Segel
>> 
>> On May 16, 2011, at 4:18 PM, "Matthew Foley"  wrote:
>> 
>>> It's important to distinguish between the name "Hadoop", which is
>>> protected by trademark law,
>>> and the Hadoop implementation, which is licensed as opensource under
>>> copyright law.
>>> 
>>> The term "derivative work" is, I believe, only relevant under copyright
>>> law, not trademark law.
>>> (N.B., I'm not a lawyer -- and this email is my opinion, not my
>>> employer's.)  Since the Apache License
>>> explicitly allows derivative works, I don't think it's a useful term
>>> for this discussion.
>>> 
>>> However, the ASF, and by delegation the Hadoop PMC, has a lot of
>>> control over the name,
>>> and how we allow it to be used, under trademark law.  But to keeps our
>>> rights under that
>>> law, we have to enforce the trademark consistently.  So it's good that
>>> we're having this discussion,
>>> and it's important to reach a conclusion, document it, and enforce it
>>> consistently.
>>> 
>>> There are a lot of subtleties; for instance, if I recall correctly from
>>> my days with Adobe and
>>> PostScript(R), someone who has not licensed a trademark "X" can still
>>> claim "compatible with X"
>>> as long as they ALSO make clear that the product is NOT, itself, an
>>> "X".  But you really need
>>> a lawyer to get into that stuff.
>>> 
>>> --Matt
>>> 
>>> 
>>> On May 16, 2011, at 5:00 AM, Segel, Mike wrote:
>>> 
>>> But Cloudera's release is a bit murky.
>>> 
>>> The math example is a bit flawed...
>>> 
>>> X represents the set of stable releases.
>>> Y represents the set of available patches.
>>> C represents the set of Cloudera releases.
>>> 
>>> So if C contains a release X(n) plus a set of patches that is contained
>>> in Y,
>>> Then does it not have the right to be considered Apache Hadoop?
>>> It's my understanding is that any enhancement to Hadoop is made
>>> available to Apache and will eventually make it into a later release...
>>> 
>>> So while it may not be 'official' release X(z), all of it's components
>>> are in Apache.
>>> (note: I'm talking about the core components and not Cloudera's
>>> additional toolsets that encompass Hadoop.)
>>> 
>>> Cloudera is clearly a derivative work.
>>> And IMHO is the only one which can say ... 'Includes Apache Hadoop'.
>>> 
>>> That doesn't mean that others can't, depending on how they implemented
>>> their changes.
>>> Based on EMC marketing material, they've done a rip and replace of HDFS.
>>> So it wouldn't be a superset since it doesn't contain a complete
>>> subset, but contains code that implements the API... So they can't say
>>> 'Includes Apache Hadoop',but they can say it's a derivative work based
>>> on Apache Hadoop and then go on to show how and why, in their opinion
>>> their product is better.(that's marketing for you...)
>>> 
>>> Clearly there are others out there...
>>> Hadoop on Cassandra as an example...
>>> 
>>> Fragmentation of Hadoop will occur. It's inevitable. Too much money is
>>> on the table...
>>> 
>>> But because Apache's licensing is so open, Apache will have a hard time
>>> controlling derivative works...
>>> I believe that Steve is incorrect in his assertion concerning potential
>>> loss of any patent protection. Again Apache's licensing is very open and
>>> as long as they follow Apache's Ts and Cs, they are covered.
>>> 
>>> Note: because I am sending this from my email address at my client, I
>>> am obliged to say that this email is my opinion and does not reflect on
>>> the opinion of my client...
>>> (you know the rest)
>>

Re: Defining Hadoop Compatibility -revisiting-

2011-05-16 Thread Konstantin Boudnik
We have the following method coverage:
Common ~60%
HDFS  ~80%
MR  ~70%
(better analysis will be available after our projects are connected to
Sonar, I think).

While method coverage isn't a completely adequate answer to your
question, I'd say there is a possibility for some semantic and even API
changes to sneak in entirely unvalidated by the test suites. The risk
isn't very high, but it does exist.

A better approach to validating semantics is to run cluster tests (e.g.
system tests), which have better potential to exercise the public APIs
than functional tests do. There's HADOOP-7278 to address this for 0.22
(and potentially other releases).
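As a sketch of what a semantics-level, black-box check looks like (as opposed to counting coverage): the contract below asserts only on externally observable behaviour through the public API, so it can be run unchanged against any implementation that claims compatibility. All names here are invented for illustration; this is not the real Hadoop system-test framework.

```python
class ReferenceFS:
    # Minimal in-memory stand-in for a filesystem implementation.
    def __init__(self):
        self._files = {}

    def create(self, path, data=b""):
        self._files[path] = data

    def append(self, path, data):
        self._files[path] = self._files.get(path, b"") + data

    def read(self, path):
        return self._files[path]

def compatibility_failures(fs_factory):
    # Black-box contract: drive the implementation only through its
    # public API and record every observable deviation.
    failures = []
    fs = fs_factory()
    fs.create("/a", b"x")
    if fs.read("/a") != b"x":
        failures.append("create/read round-trip")
    fs.append("/a", b"y")
    if fs.read("/a") != b"xy":
        failures.append("append must concatenate, not replace")
    return failures
```

The same `compatibility_failures` contract, run against a different implementation, would surface the semantic drift that method-coverage numbers hide.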
--
  Take care,
Konstantin (Cos) Boudnik

Disclaimer: Opinions expressed in this email are those of the author,
and do not necessarily represent the views of any company the author
might be affiliated with at the moment of writing.

On Mon, May 16, 2011 at 14:59, Ian Holsman  wrote:
>
>>
>>       Does "Hadoop compatibility" and the ability to say "includes Apache 
>> Hadoop" only apply when we're talking about MR and HDFS APIs?
>
>
> It is confusing isn't it.
>
> We could go down the route java did and say that the API's are 'hadoop' and 
> ours is just a reference implementation of it. (but others pointed out, we 
> don't want to become a certification group)
>
> Out of curiosity, how good is our test suite in exercising our APIs?
> Is it sophisticated enough to capture someone adding a functionality-changing 
> patch (eg the append one). and have it flag it as a test-failure?
>
>


Re: Defining Hadoop Compatibility -revisiting-

2011-05-16 Thread Eric Baldeschwieler
My understanding is that a history of defending your trademark is more 
important than registration. Apache does defend Hadoop. 

---
E14 - typing on glass

On May 16, 2011, at 6:52 PM, "Segel, Mike"  wrote:

> Let me clarify...
> I searched on Hadoop as a term in any TM. 
> Nothing came back...
> 
> This means that Apache Hadoop didn't show up.
> 
> Note the following: I did the basic search. I wouldn't be surprised that 
> someone from Apache comes back and says see TM  ...
> 
> -Mike
> 
> Sent from a remote device. Please excuse any typos...
> 
> Mike Segel
> 
> On May 16, 2011, at 8:12 PM, Scott Carey  wrote:
> 
>> On trademarks, what about the phrase:  "New distribution for Apache
>> Hadoop"?  I've seen that used, and its something that replaces most of the
>> stack.  I believe "Apache Hadoop" is trademarked in this context, even if
>> Hadoop alone isn't.
>> "Compatible with Apache Hadoop" is a smaller issue, defining some rough
>> guidelines for various forms of compatibility is useful for the community
>> (and reputable vendors), abuse of that will at least become obvious.  But
>> "distribution for Apache Hadoop" (not too sure what 'for' means here)?  Is
>> there any TM protection?  A proprietary derivative work with most of the
>> guts replaced is not an Apache Hadoop distribution, nor a distribution for
>> Apache Hadoop.
>> 
>> On 5/16/11 5:40 PM, "Segel, Mike"  wrote:
>> 
>>> I just checked... TESS said no trademarks for Hadoop.
>>> So... what TM protection? :-)
>>> 
>>> You are correct about derivative works. It's a moot point as long as the
>>> derivative work follows the T&Cs...
>>> 
>>> 
>>> 
>>> Sent from a remote device. Please excuse any typos...
>>> 
>>> Mike Segel
>>> 
>>> On May 16, 2011, at 4:18 PM, "Matthew Foley"  wrote:
>>> 
 It's important to distinguish between the name "Hadoop", which is
 protected by trademark law,
 and the Hadoop implementation, which is licensed as opensource under
 copyright law.
 
 The term "derivative work" is, I believe, only relevant under copyright
 law, not trademark law.
 (N.B., I'm not a lawyer -- and this email is my opinion, not my
 employer's.)  Since the Apache License
 explicitly allows derivative works, I don't think it's a useful term
 for this discussion.
 
 However, the ASF, and by delegation the Hadoop PMC, has a lot of
 control over the name,
 and how we allow it to be used, under trademark law.  But to keeps our
 rights under that
 law, we have to enforce the trademark consistently.  So it's good that
 we're having this discussion,
 and it's important to reach a conclusion, document it, and enforce it
 consistently.
 
 There are a lot of subtleties; for instance, if I recall correctly from
 my days with Adobe and
 PostScript(R), someone who has not licensed a trademark "X" can still
 claim "compatible with X"
 as long as they ALSO make clear that the product is NOT, itself, an
 "X".  But you really need
 a lawyer to get into that stuff.
 
 --Matt
 
 
 On May 16, 2011, at 5:00 AM, Segel, Mike wrote:
 
 But Cloudera's release is a bit murky.
 
 The math example is a bit flawed...
 
 X represents the set of stable releases.
 Y represents the set of available patches.
 C represents the set of Cloudera releases.
 
 So if C contains a release X(n) plus a set of patches that is contained
 in Y,
 Then does it not have the right to be considered Apache Hadoop?
 It's my understanding is that any enhancement to Hadoop is made
 available to Apache and will eventually make it into a later release...
 
 So while it may not be 'official' release X(z), all of it's components
 are in Apache.
 (note: I'm talking about the core components and not Cloudera's
 additional toolsets that encompass Hadoop.)
 
 Cloudera is clearly a derivative work.
 And IMHO is the only one which can say ... 'Includes Apache Hadoop'.
 
 That doesn't mean that others can't, depending on how they implemented
 their changes.
 Based on EMC marketing material, they've done a rip and replace of HDFS.
 So it wouldn't be a superset since it doesn't contain a complete
 subset, but contains code that implements the API... So they can't say
 'Includes Apache Hadoop',but they can say it's a derivative work based
 on Apache Hadoop and then go on to show how and why, in their opinion
 their product is better.(that's marketing for you...)
 
 Clearly there are others out there...
 Hadoop on Cassandra as an example...
 
 Fragmentation of Hadoop will occur. It's inevitable. Too much money is
 on the table...
 
 But because Apache's licensing is so open, Apache will have a hard time
 controlling derivative works...
 I believe that Steve is incorrect in his assertion concerning potential
 loss of 

Re: Defining Hadoop Compatibility -revisiting-

2011-05-16 Thread Andrew Purtell
> On trademarks, what about the phrase:  "New distribution for Apache
> Hadoop"?  I've seen that used, and its something that
> replaces most of the stack. [...] A proprietary derivative work with
> most of the guts replaced is not an Apache Hadoop distribution, nor
> a distribution for Apache Hadoop.

IMHO, this is the key issue. Allowing proprietary derivative works that provide 
Hadoop-compatible APIs to claim they are Hadoop will provoke endless confusion, 
argument, claim, and counter-claim, and poison the well for everyone involved 
with Apache Hadoop.

Best regards,

- Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein (via 
Tom White)


--- On Mon, 5/16/11, Scott Carey  wrote:

> From: Scott Carey 
> Subject: Re: Defining Hadoop Compatibility -revisiting-
> To: "general@hadoop.apache.org" 
> Cc: "Matthew Foley" 
> Date: Monday, May 16, 2011, 6:12 PM
> On trademarks, what about the phrase:  "New distribution for Apache
> Hadoop"?  I've seen that used, and its something that replaces most
> of the stack.  I believe "Apache Hadoop" is trademarked in this
> context, even if Hadoop alone isn't. "Compatible with Apache Hadoop"
> is a smaller issue, defining some rough guidelines for various forms
> of compatibility is useful for the community (and reputable vendors),
> abuse of that will at least become obvious.  But "distribution for
> Apache Hadoop" (not too sure what 'for' means here)?  Is there any
> TM protection?  A proprietary derivative work with most of the
> guts replaced is not an Apache Hadoop distribution, nor a
> distribution for Apache Hadoop.
> 
> On 5/16/11 5:40 PM, "Segel, Mike" 
> wrote:
> 
> >I just checked... TESS said no trademarks for Hadoop.
> >So... what TM protection? :-)
> >
> >You are correct about derivative works. It's a moot
> point as long as the
> >derivative work follows the T&Cs...
> >
> >
> >
> >Sent from a remote device. Please excuse any typos...
> >
> >Mike Segel
> >
> >On May 16, 2011, at 4:18 PM, "Matthew Foley" 
> wrote:
> >
> >> It's important to distinguish between the name
> "Hadoop", which is
> >>protected by trademark law,
> >> and the Hadoop implementation, which is licensed
> as opensource under
> >>copyright law.
> >> 
> >> The term "derivative work" is, I believe, only
> relevant under copyright
> >>law, not trademark law.
> >> (N.B., I'm not a lawyer -- and this email is my
> opinion, not my
> >>employer's.)  Since the Apache License
> >> explicitly allows derivative works, I don't think
> it's a useful term
> >>for this discussion.
> >> 
> >> However, the ASF, and by delegation the Hadoop
> PMC, has a lot of
> >>control over the name,
> >> and how we allow it to be used, under trademark
> law.  But to keeps our
> >>rights under that
> >> law, we have to enforce the trademark
> consistently.  So it's good that
> >>we're having this discussion,
> >> and it's important to reach a conclusion, document
> it, and enforce it
> >>consistently.
> >> 
> >> There are a lot of subtleties; for instance, if I
> recall correctly from
> >>my days with Adobe and
> >> PostScript(R), someone who has not licensed a
> trademark "X" can still
> >>claim "compatible with X"
> >> as long as they ALSO make clear that the product
> is NOT, itself, an
> >>"X".  But you really need
> >> a lawyer to get into that stuff.
> >> 
> >> --Matt
> >> 
> >> 
> >> On May 16, 2011, at 5:00 AM, Segel, Mike wrote:
> >> 
> >> But Cloudera's release is a bit murky.
> >> 
> >> The math example is a bit flawed...
> >> 
> >> X represents the set of stable releases.
> >> Y represents the set of available patches.
> >> C represents the set of Cloudera releases.
> >> 
> >> So if C contains a release X(n) plus a set of
> patches that is contained
> >>in Y,
> >> Then does it not have the right to be considered
> Apache Hadoop?
> >> It's my understanding is that any enhancement to
> Hadoop is made
> >>available to Apache and will eventually make it
> into a later release...
> >> 
> >> So while it may not be 'official' release X(z),
> all of it's components
> >>are in Apache.
> >> (note: I'm talking about the core components and
> not Cloudera's
> >>additional toolsets that encompass Hadoop.)
> >> 
> >> Cloudera is clearly a derivative work.
> >> And IMHO is the only one which can say ...
> 'Includes Apache Hadoop'.
> >> 
> >> That doesn't mean that others can't, depending on
> how they implemented
> >>their changes.
> >> Based on EMC marketing material, they've done a
> rip and replace of HDFS.
> >> So it wouldn't be a superset since it doesn't
> contain a complete
> >>subset, but contains code that implements the
> API... So they can't say
> >>'Includes Apache Hadoop',but they can say it's a
> derivative work based
> >>on Apache Hadoop and then go on to show how and
> why, in their opinion
> >>their product is better.(that's marketing for
> you...)
> >> 
> >> Clearly there are others out there...
> >> Hadoop on Cassandra as an example...
> >> 
> >> Fragmentation of Hadoop will