Re: [DISCUSS] Pulling "Contrib" Modules into Apache

Michael G. Noll Sat, 01 Mar 2014 02:14:24 -0800

Thanks for starting this discussion, Taylor.

As a user of Storm (and a small-scale contributor to storm-starter) as
well as a user of Kafka, here are my $.02.


[Storm and Kafka]
First, I agree with Nathan that bringing in storm-kafka should be
considered.  While various "integrate Storm with X" options exist,
basically everyone I have been talking to is using Kafka in
combination with Storm.  I'm sure this is not a representative sample
of Storm users, and of course one may or may not agree that Kafka is
an important enough technology in Storm's ecosystem.  Still, I do
see the need to make sure Storm and Kafka do work together without
having to go through forks of forks on GitHub and spending days to
figure out how to get data from Kafka (0.8) into Storm.
    Speaking of Kafka spout implementations, please don't forget
https://github.com/HolmesNL/kafka-spout in addition to Wurstmeister's.
 We've been quite happy with the former, so I'd suggest at least
considering both options here (maybe the two projects can even join forces?).

[Storm examples, storm-starter]
Second, IMHO every open source project should have a "1-click starting
experience" for new users.  That's very much related to the project
principles of tools like LogStash [1], which state: "Community: If a newbie
has a bad time, it's a bug."  For this reason I personally would like
to see the equivalent of storm-starter being brought into the "core"
Storm project -- think of an examples/ sub-module.  If the level of
effort is deemed too high to e.g. maintain what's already in
storm-starter, then (say) reduce the scope and remove some of the
examples.  In any case I would personally like to see bundled
examples that are known to work with the latest version of Storm.
storm-starter is often used to show new users how to get started with
Storm (I used that approach in my Storm blog posts, for instance, and
others like Mesosphere.io are even using storm-starter for their
commercial offerings [2]).

[Have Storm up and running faster than you can brew an espresso]
Third, for the same reason (get people up and running in a few
minutes), I do like that other people in this thread have been
bringing up projects like storm-deploy.  For the same reason I have
open sourced puppet-storm [3] (and puppet-kafka, for that matter) a
few days ago, and I'll soon open source another Vagrant/Puppet based
tool that provides you with 1-click local and remote deployments of
Storm and Kafka clusters.  That's way better IMHO than having to
follow long articles or blog posts to deploy your first cluster.  And
there are a number of other people that have been rolling their own
variants.  Now don't get me wrong -- I don't mention this to pitch any
of those tools.  My intention is to say that it would be greatly
helpful to have /something/ like this for Storm, for the same reason
that it's nice to have LocalCluster for unit testing.  I have been
demo'ing both Storm and Kafka by launching clusters with a simple
command line, which always gets people excited.  If they can then rely
on existing examples (see above) to also /run/ an analysis on "their"
cluster then they have a beautiful start.
    Oh, and btw:  Apache Aurora (with Mesos) has such a Vagrant-based
VM cluster setup, too [4] so that people can run the Aurora tutorial
on their machines in a few minutes.

[Storm and YARN]
Fourth, and for similar reasons as #2 and #3, bringing in storm-yarn
would be nice.  It ties into being able to run LocalCluster as well as
to run Storm in local or remote VMs -- but now alongside your existing
Hadoop/YARN infrastructure.  For those preferring Mesos, Storm-on-Mesos
will surely be similarly attractive.


On a related note, bringing the Storm docs up to speed with the quality
of the Storm code would also be great.  I have seen that since Storm
moved to Incubator several new sections have been added such as the
FAQ [5] (btw: nice!).

Similarly, there should be better examples and docs showing users how to
write unit tests for Storm.  Right now people seem to be cobbling
together their test code by figuring out how the 1-year-old code in
[6] actually works, and copy-pasting other people's test code from GitHub.

--

As I said above, these are my personal $.02.  I admit that my comments
go a bit beyond the original question of bringing in contrib modules
-- I think the discussion about the contrib modules also implicitly
means "what do you need to provide a better and more well-rounded
experience", i.e. the question whether to have batteries included or
not. (As you may suspect I'm leaning towards including at least the
most important batteries, though what's really "important" at the
project level is of course up for debate.)

On my side I'd be happy to help with those areas where I am able to
contribute, whether that's code and examples (like storm-starter) or
tutorials/docs (I already wrote e.g. [7] and [8]).

Again, thanks Taylor for starting this discussion.  No matter the
actual outcome I'm sure the state of the project will be improved.

Best,
Michael



[1] https://github.com/elasticsearch/logstash
[2] http://mesosphere.io/learn/run-storm-on-mesos/#step-7
[3] https://github.com/miguno/puppet-storm
[4] https://github.com/apache/incubator-aurora/blob/master/docs/vagrant.md
[5] http://storm.incubator.apache.org/documentation/FAQ.html
[6] https://github.com/xumingming/storm-lib/blob/master/src/jvm/storm/TestingApiDemo.java
[7] https://github.com/nathanmarz/storm/wiki/Understanding-the-parallelism-of-a-Storm-topology
[8] http://www.michael-noll.com/blog/2013/01/18/implementing-real-time-trending-topics-in-storm/



On 02/26/2014 08:21 PM, P. Taylor Goetz wrote:
> Thanks for the feedback Bobby.
> 
> To clarify, I’m mainly talking about spout/bolt/trident state 
> implementations that integrate storm with *Technology X*, where 
> *Technology X* is not a fundamental part of storm.
> 
> Examples would be technologies that are part of or related to the 
> Hadoop/Big Data ecosystem and enable the Lambda Architecture, e.g.:
> Kafka, HDFS, HBase, Cassandra, etc.
> 
> The idea behind having one or more Storm committers act as a
> “sponsor” is to make sure new additions are done carefully and with
> good reason. To add a new module, it would require committer/PPMC
> consensus, and assignment of one or more sponsors. Part of a
> sponsor’s job would be to ensure that a module is maintained, which
> would require enough familiarity with the code to support it long
> term. If a new module was proposed, but no committers were willing
> to act as a sponsor, it would not be added.
> 
> It would be the Committers’/PPMC’s responsibility to make sure things
> don’t get out of hand, and to do something about it if they do.
> 
> Here’s an old Hadoop JIRA thread [1] discussing the addition of
> Hive as a contrib module, similar to what happened with HBase as
> Bobby pointed out. Some interesting points are brought up. The
> difference here is that both HBase and Hive were pretty big
> codebases relative to Hadoop. With spout/bolt/state implementations
> I doubt we’d see anything along that scale.
> 
> - Taylor
> 
> [1] https://issues.apache.org/jira/browse/HADOOP-3601
> 
> 
> On Feb 26, 2014, at 12:35 PM, Bobby Evans <ev...@yahoo-inc.com> wrote:
> 
>> I can see a lot of value in having a distribution of storm that
>> comes with batteries included, everything is tested together and
>> you know it works.  But I don’t see much long term developer
>> benefit in building them all together.  If there is strong
>> coupling between storm and these external projects so that they
>> break when storm changes then we need to understand the coupling
>> and decide if we want to reduce that coupling by stabilizing
>> APIs, improving version numbering and release process, etc.; or
>> if the functionality is something that should be offered as a
>> base service in storm.
>> 
>> I can see politically the value of giving these other projects a
>> home in Apache, and making them sub-projects is the simplest
>> route to that. I’d love to have storm on yarn inside Apache.  I
>> just don’t want to go overboard with it.  There was a time when
>> HBase was a “contrib” module under Hadoop along with a lot of
>> other things, and the Apache board came and told Hadoop to break
>> it up.
>> 
>> Bringing storm-kafka into storm does not sound like it will solve
>> much from a developer’s perspective, because there is at least as
>> much coupling with kafka as there is with storm.  I can see how
>> it is a huge amount of overhead and pain to set up a new project
>> just for a few hundred lines of code, as such I am in favor of
>> pulling in closely related projects, especially those that are
>> spouts and state implementations. I just want to be sure that we
>> do it carefully, with a good reason, and with enough people who
>> are familiar with the code to support it long term.
>> 
>> If it starts to look like we are pulling in too many projects
>> perhaps we should look at something more like the bigtop project 
>> https://bigtop.apache.org/ which produces a tested distribution
>> of Hadoop with many different sub-projects included in it.
>> 
>> I am also a bit concerned about these sub-projects becoming
>> second class citizens, where we break something, but because the
>> build is off by default we don’t know it.  I would prefer that
>> they are built and tested by default.  If the build and test time
>> starts to take too long, to me that means we need to start
>> wondering if we have too many contrib modules.
>> 
>> —Bobby
>> 
>> From: Brian Enochson <brian.enoch...@gmail.com>
>> Reply-To: "user@storm.incubator.apache.org" <user@storm.incubator.apache.org>
>> Date: Tuesday, February 25, 2014 at 9:50 PM
>> To: "user@storm.incubator.apache.org" <user@storm.incubator.apache.org>
>> Cc: "d...@storm.incubator.apache.org" <d...@storm.incubator.apache.org>
>> Subject: Re: [DISCUSS] Pulling "Contrib" Modules into Apache
>> 
>> hi, I am in agreement with Taylor and believe I understand his
>> intent. An incredible tool/framework/application like Storm is
>> only enhanced and gains value from the number of well maintained
>> and vetted modules that can be used for integration and adding
>> further functionality. I am relatively new to the Storm community
>> but have spent quite some time reviewing contributing modules out
>> there, reviewing various duplicates and running into some version
>> incompatibilities. I understand the need to keep Storm itself
>> pure, but do think there needs to be some structure and
>> governance added to the contributing modules. Look at the benefit
>> a tool like npm brings to the node community. I like the idea of
>> sponsorship, vetting and a community vote.  I, as I'm sure many
>> would be, am willing to offer support and time to working through
>> how to set this up and helping with the implementation if it is
>> decided to pursue some solution. I hope these views are taken in
>> the spirit they are made, to make this incredible system even
>> better along with the surrounding eco-system.
>> 
>> Thanks, Brian
>> 
>> 
>> On Tue, Feb 25, 2014 at 9:36 PM, P. Taylor Goetz
>> <ptgo...@gmail.com> wrote:
>>
>> Just to be clear (and play a little Devil’s advocate :) ), I’m not
>> suggesting that whatever a “contrib” project/module/subproject
>> might become, be a clearinghouse for anything Storm-related.
>> 
>> I see it as something that is well-vetted by the Storm
>> community, subject to PPMC review, vote, etc. Entry would require
>> community review, PPMC review, and in some cases ASF IP
>> clearance/legal review. Anything added would require some level
>> of commitment from the PPMC/committers to provide some level of
>> support.
>> 
>> In other words, nothing “willy-nilly”.
>> 
>> One option could be that any module added require (X > 0)  number
>> of committers to volunteer as “sponsor”s for the module, and
>> commit to maintaining it.
>> 
>> That being said, I don’t see storm-kafka being any different
>> from anything else that provides integration points for Storm.
>> 
>> -Taylor
>> 
>> 
>> On Feb 25, 2014, at 7:53 PM, Nathan Marz <nat...@nathanmarz.com>
>> wrote:
>> 
>> I'm only +1 for pulling in storm-kafka and updating it. Other
>> projects put these contrib modules in a "contrib" folder and keep
>> them managed as completely separate codebases. As it's not
>> actually a "module" necessary for Storm, there's an argument
>> there for doing it that way rather than via the multi-module
>> route.
>> 
>> 
>> On Tue, Feb 25, 2014 at 4:39 PM, Milinda Pathirage
>> <mpath...@umail.iu.edu> wrote:
>>
>> Hi Taylor,
>> 
>> I'm +1 for pulling these external libraries into the Apache
>> codebase. This will certainly benefit the Storm community. I would
>> also like to contribute to this process.
>> 
>> Thanks Milinda
>> 
>> On Tue, Feb 25, 2014 at 5:28 PM, P. Taylor Goetz
>> <ptgo...@gmail.com> wrote:
>>> A while back I opened STORM-206 [1] to capture ideas for
>>> pulling in "contrib" modules to the Apache codebase.
>>> 
>>> In the past, we had the storm-contrib github project [2] which 
>>> subsequently got broken up into individual projects hosted on
>>> the stormprocessor github group [3] and elsewhere.
>>> 
>>> The problem with this approach is that in certain cases it led
>>> to code rot (modules not being updated in step with Storm's
>>> API), fragmentation (multiple similar modules with the same
>>> name), and confusion.
>>> 
>>> A good example of this is the storm-kafka module [4], since it
>>> is a widely used component. Because storm-contrib wasn't being
>>> tagged in github, a lot of users had trouble working out which
>>> versions of storm it was compatible with. Some users built off
>>> specific commit hashes, some forked, and a few even pushed
>>> custom builds to repositories such as clojars. With kafka 0.8
>>> now available, there are two main storm-kafka projects, the
>>> original (compatible with kafka 0.7) and an updated fork [5]
>>> (compatible with kafka 0.8).
>>> 
>>> My intention is not to find fault in any way, but rather to
>>> point out the resulting pain, and work toward a better
>>> solution.
>>> 
>>> I think it would be beneficial to the Storm user community to
>>> have certain commonly used modules like storm-kafka brought
>>> into the Apache Storm project. Another benefit worth
>>> considering is the licensing/legal oversight that the ASF
>>> provides, which is important to many users.
>>> 
>>> If this is something we want to do, then the big question
>>> becomes what sort of governance process needs to be established to
>>> ensure that such things are properly maintained.
>>> 
>>> Some random thoughts, questions, etc. that jump to mind
>>> include:
>>> 
>>> What to call these things: "contrib modules", "connectors",
>>> "integration modules", etc.?
>>>
>>> Build integration: I imagine they would be a multi-module
>>> submodule of the main maven build. Probably turned off by default
>>> and enabled by a maven profile.
>>>
>>> Governance: Have one or more committer volunteers responsible for
>>> maintenance, merging patches, etc.? Proposal process for pulling
>>> in new modules?
>>> 
>>> 
>>> I look forward to hearing others' opinions.
>>> 
>>> - Taylor
>>> 
>>> 
>>> [1] https://issues.apache.org/jira/browse/STORM-206
>>> [2] https://github.com/nathanmarz/storm-contrib
>>> [3] https://github.com/stormprocessor
>>> [4] https://github.com/nathanmarz/storm-contrib/tree/master/storm-kafka
>>> [5] https://github.com/wurstmeister/storm-kafka-0.8-plus
