[DISCUSS] Hop proposal

Matt Casters Tue, 08 Sep 2020 02:57:09 -0700

Hello Apache,

Our community would like to propose Hop for the Apache Incubator.
The Hop Orchestration Platform aims to help people with complex data and
metadata orchestration problems.


Below is the complete text of the proposal but you can also find it here:
https://cwiki.apache.org/confluence/display/INCUBATOR/HopProposal

Any help with the incubation is appreciated, including help from a few
more mentors to set us on the right track.  On behalf of my community
I'd be happy to answer any questions you might have regarding Hop.  Our
thanks go out to Max, Julian and Tom for helping us set up this proposal.

Thanks in advance for your time!

Best regards,

Matt - Hop co-founder
www.project-hop.org
---

Abstract
=========
Hop is short for the Hop Orchestration Platform. Written completely in Java,
it aims to provide a wide range of data orchestration tools, including a
visual development environment, servers, metadata analysis, auditing
services and so on. As a platform, Hop also aims to be a reusable library
that other software can easily embed.

Proposal
=========
Hop provides all the tools to build, maintain and deploy data
orchestration, ETL and data integration solutions. For example, Hop allows
you to diagram a data flow that propagates changes from a database via
Apache Kafka to a data warehouse and deploy it as an Apache Beam pipeline.
The core concepts of Hop are Pipelines and Workflows.
* Pipelines do the core data manipulation work (read, manipulate, write
data). The main items of work in pipelines are transforms. A pipeline
consists of two or more (usually many) transforms that each perform a
granular piece of work. The transforms in a pipeline run in parallel, and
together create a powerful data processing tool.
* Workflows take care of the orchestration of actions: execute pipelines,
run child workflows, environment checks, preparation, problem alerting and
so on.
If these terms sound familiar it’s because they are taken from the Apache
Beam and Apache Airflow projects.
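
To make the idea of transforms running in parallel concrete, here is a
minimal sketch in plain Java. It is not the Hop API: the class
PipelineSketch, the helper transform(), the END marker and the two example
transforms are all hypothetical names chosen for illustration. Each
transform runs in its own thread and passes rows to the next one through a
queue, which is one simple way to model the behavior described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Function;

// Hypothetical illustration only: none of these names come from Hop.
public class PipelineSketch {
    // Sentinel row telling the next transform that no more rows will arrive.
    private static final String END = "\u0000END";

    // Starts one thread that reads rows from 'in', applies 'work' to each
    // row and writes the result to 'out'. The sentinel is passed along.
    static Thread transform(BlockingQueue<String> in, BlockingQueue<String> out,
                            Function<String, String> work) {
        Thread t = new Thread(() -> {
            try {
                while (true) {
                    String row = in.take();
                    if (row.equals(END)) {
                        out.put(END);
                        return;
                    }
                    out.put(work.apply(row));
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        t.start();
        return t;
    }

    // Wires two transforms together (trim, then upper-case) and runs them
    // concurrently over the given input rows.
    public static List<String> run(List<String> input) {
        BlockingQueue<String> q1 = new LinkedBlockingQueue<>();
        BlockingQueue<String> q2 = new LinkedBlockingQueue<>();
        BlockingQueue<String> q3 = new LinkedBlockingQueue<>();
        Thread trim = transform(q1, q2, String::trim);
        Thread upper = transform(q2, q3, String::toUpperCase);
        List<String> result = new ArrayList<>();
        try {
            for (String row : input) {
                q1.put(row);
            }
            q1.put(END);
            for (String row = q3.take(); !row.equals(END); row = q3.take()) {
                result.add(row);
            }
            trim.join();
            upper.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Pipeline interrupted", e);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of(" kafka ", " beam ")));
    }
}
```

Because every transform has a single thread and the queues are FIFO, row
order is preserved in this sketch. A real engine would add error handling,
back-pressure and row metadata.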


The main components of the Hop platform are:
* hop-gui: a visual data orchestration IDE
* hop-run: a CLI tool to run workflows or pipelines
* hop-config: a CLI tool to configure Hop and its components
* hop-server: a light-weight web server to run and monitor workflows and
pipelines
* hop-translator: a tool for translating the various parts of the Hop tools
(i18n).
* hop-web: a thin client version of hop-gui for web browsers and mobile
devices


The cornerstone of the Hop platform is extensibility: all major components
of the platform are designed to be pluggable. This allows any possible
missing functionality to be created in a short amount of time.
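
As a rough illustration of what a pluggable design can look like, the
following Java sketch registers transform implementations by id and looks
them up at run time. It is an assumption-laden example, not the Hop plugin
system: PluginSketch, TransformPlugin, Registry and the plugin ids
"uppercase" and "reverse" are hypothetical names invented for this sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical illustration only: this is not the Hop plugin API.
public class PluginSketch {
    // Hypothetical plugin contract: a transform turns one row into another.
    interface TransformPlugin {
        String apply(String row);
    }

    // Hypothetical registry that maps a plugin id to a factory.
    static final class Registry {
        private final Map<String, Supplier<TransformPlugin>> factories =
            new LinkedHashMap<>();

        // Adds a plugin; a duplicate id is rejected.
        void register(String id, Supplier<TransformPlugin> factory) {
            if (factories.putIfAbsent(id, factory) != null) {
                throw new IllegalArgumentException("Duplicate plugin id: " + id);
            }
        }

        // Creates a fresh instance of the plugin registered under 'id'.
        TransformPlugin create(String id) {
            Supplier<TransformPlugin> factory = factories.get(id);
            if (factory == null) {
                throw new IllegalArgumentException("Unknown plugin id: " + id);
            }
            return factory.get();
        }
    }

    // Registers two plugins and chains them: upper-case first, then reverse.
    public static String demo(String row) {
        Registry registry = new Registry();
        registry.register("uppercase", () -> r -> r.toUpperCase());
        registry.register("reverse",
            () -> r -> new StringBuilder(r).reverse().toString());
        String upper = registry.create("uppercase").apply(row);
        return registry.create("reverse").apply(upper);
    }

    public static void main(String[] args) {
        System.out.println(demo("hop"));
    }
}
```

New functionality is added here by registering another factory, without
touching the code that runs the transforms.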

Background
===========
The Hop Orchestration Platform has its origins in the Kettle community.
Kettle was acquired by Pentaho and, after Pentaho’s acquisition by Hitachi
in 2015, the community struck out to solve problems less aligned with
Hitachi’s interests.

Rationale
==========
In the Hop community, we have always aimed to function as a meritocracy,
where contributions are accepted based on merit, and individuals gain
status in the community based on their contributions (coding and
otherwise). We’re proud to have a diverse group of people doing all the
required things in a project: development, documentation, tutorials,
architecture, testing, graphics design and much more. Bringing the project
under the Apache Software Foundation would allow us to continue and grow,
but also give our users confidence about the governance, IP status, and
future of the project.

ASF Preparation Phase
======================
The very first goal of project Hop is to find a good way to cooperate on
the development across wide geographical, economical and social spectra. To
make this possible, real changes were needed to a codebase which is
essentially 20 years old. Most of these changes have been tackled by now.
We think it’s fair to say that Hop is now a new platform, even though it
shares a common background and partly started from the Kettle code base.
Here are a few of the key focus areas we’re trying to safeguard going
forward:
* Plugins: lightweight plugins for all major functionality. This makes it
possible to extend Hop or reduce Hop in size.  It also allows people to
implement or change functionality with minimal coding.  In other words it
makes it easier to contribute.
* Maintain an open and responsive community where every concern, feedback
and contribution is welcome.
* Maintain a clear focus on data orchestration user requirements, not on
“industry trends”
* Documentation: we set up a version controlled “adoc” system with
automated builds which is open, controlled and reviewed.  This is
incredibly important for every Hop user and developer.
* Testing and stability: we want to massively increase stability by
implementing integration tests beyond the standard Java unit testing
because of the dynamic nature of data orchestration work.  We still have a
long way to go.  This work will never be finished.  It’s a clear and
important goal nevertheless.
* Simplicity: things are complex enough.  We follow the example of projects
like Apache Spark and Flink: for instance, “hop-run.sh” does exactly what
the name says without the need to dive into documentation.  As much as
possible we make things self-evident and re-use existing terminology.


For a list of the changes you can look at the monthly roundup which has
been compiled since February 2020.  It documents the hard work of our
community so far:


        http://www.project-hop.org/news/roundup-2020-02/
        http://www.project-hop.org/news/roundup-2020-03/
        http://www.project-hop.org/news/roundup-2020-04/
        http://www.project-hop.org/news/roundup-2020-05/
        http://www.project-hop.org/news/roundup-2020-06/
        http://www.project-hop.org/news/roundup-2020-08/


Goals
======
Here are a few more details and specifics of things we still want to take
on going forward:
* Add more plugin metadata to Transforms and Action plugins as well as
their supported engines.  This will make it easier to refine the user
interface and make the user experience better by giving to-the-point
feedback on what operations are supported and required.  Example metadata
to add: extra version and build information, dependencies, tags and labels
(replacing categories), documentation links, input and output capabilities,
engine capabilities and so on.
* SWT:  While the Eclipse SWT project is still supported we want to make a
list of all the commonly used API calls and stick to those with our own
API. This will help the development of hop-web and may make it easier to
migrate to different user interfaces later on.
* Integration testing: every transform and action should have an
integration test before it is released to ensure quality.  Java unit
testing has proven to be insufficient in guarding against problems with
backward compatibility, stability and functionality.  We need to do better.
* Apache VFS: Hop makes extensive use of this API to handle files.  As such
we want to implement the various drivers for gs://, hdfs://, s3:// through
standard Kettle plugins making it easier to choose which protocols to
support.
* Variables & Parameters:  make this experience more intuitive, clean up
the underlying API and add more options to the various user interfaces
responsible for setting and passing variables and parameters.
* Make Hop-Web an integral part of the Apache Hop project removing the code
duplication (fork) we’re dealing with now.  This includes the need to
improve various user interfaces which were designed for non-web clients.
* Make best practices and governance functionality an integral part of the
API of the project:
   * Data sets and unit testing (already done)
   * Environments and lifecycle management (partly done)
   * Git support (partly done)
   * Auditing and lineage
   * Software policies and enforcement thereof
   * Configuration management (partly done)


Current Status
===============

Meritocracy
------------
With Project Hop, we actively work to foster the existing community and
encourage community contributions. As of September 1st 2020 we have
received over 250 pull requests and have around 600 tickets in our JIRA
platform (a lot of which were created by community members), and we have
active discussions in our Mattermost chat platform with over 80 members.


Over the last half year we have started to ask users on our chat server for
specific feedback on terminology, features and so on.  It’s been a
wonderfully positive experience to have in-depth discussions on complex
issues with industry experts. We look forward to moving these discussions
and votes to an Apache mailing list.

Community
------------
Hop is developed, extended and maintained by a global community of users
and developers. The Hop community is what has driven its development and
growth.
The particular history of Hop has led to a lot of interest in the project
and has already resulted in a number of contributions, documentation and
translations.

Core Developers
----------------
We have a diverse group of core developers with people joining on a regular
basis.  Matt Casters, Rodrigo Haces and David Rosenblum are part-time
developers on Hop, salaried by Neo Solutions.  Bart Maertens, Hans Van
Akelyen and Yannick Mols are part-time Hop developers paid by the company
know.bi.  Doug and Gretchen Moran were Pentaho employees; along with
Rafael Valenzuela, Dan Keeley, Jason Chu, Sergio Ramazzina and many others,
they have been consultants and community members for over a decade and
joined the Hop community in the last year or two.


Alignment
----------
We want to anchor and safeguard our development and community building
efforts for the future. We strongly believe that as an Apache project this
can be achieved in the best possible way. The Hop project also started to
align with projects like Apache Beam, Spark and Flink in its use of
terminology, tools, manner of configuration and so on.  As mentioned
elsewhere in this document Hop is a large user of other Apache projects and
libraries and we believe that becoming an Apache project is beneficial.
Specifically for Apache Beam we believe that providing a visual pipeline
development tool can be of great value.

Known Risks
============
While the Kettle code base from which we started is already released under
the Apache License 2.0, proper attribution to Hitachi Vantara is still
required.
We have no knowledge of existing patents on any part of the Kettle codebase.
To further reduce the risk of any dispute over naming, the Hop team decided
to rename the project, its tools (to be more self-evident as well), the
Java API and even the main concepts (Transformations are now called
Pipelines, in line with Apache Beam naming conventions).

Orphaned products
------------------
There is little risk that the project will become orphaned. The list of
active developers is large, and consists of a mix of developers who have
been working on the code for several years and recent arrivals in the
community.

Inexperience with Open Source
------------------------------
The project team has a long history in open source and has contributed to
Apache licensed open source projects, mostly in the Kettle ecosystem such
as Kettle itself and the many plugins and projects surrounding it. The
experience gained there has allowed us to quickly set up all required build
tools and processes.  In its fairly short history, Hop has been advocating
open source in all aspects of the project. Our submission to the Apache
Software Foundation is a logical extension of our commitment to open source
software.

Licensing
----------
The original source code we started from (see below) has been open source
since December 2005, initially under the Lesser GPL but since January 2012
all under the Apache License version 2.0. All Hop code has been scanned for
compliance with the Apache License 2.0. We integrated Apache Rat with our
build process.

Heterogeneous Developers
-------------------------
Hop is built, developed and maintained by a global community of
developers.  Input comes from a large group of developers and users from
all over the world.  At this moment over 7 companies contribute to Hop
through their developers, along with a list of individuals and consultants.

Reliance on Salaried Developers
--------------------------------
Hop developers are a mix of volunteers, enthusiasts and people working for
an employer. There is also a group of consultants who want to be involved
in Hop because it allows them to do projects with it.  They are in fact our
most important users and developers since they provide valuable feedback
from the trenches.

Relationships with Other Apache Products
-----------------------------------------
Hop is a heavy user of Apache software libraries.

Apache Commons usage:
* commons-beanutils
* commons-cli
* commons-codec
* commons-collections
* commons-collections4
* commons-compiler
* commons-compress
* commons-configuration
* commons-database-model
* commons-dbcp
* commons-digester
* commons-el
* commons-httpclient
* commons-io
* commons-lang and commons-lang3
* commons-logging
* commons-math and commons-math3-3.5.jar
* commons-net
* commons-pool
* commons-validator
* commons-vfs2


Other libraries:
* Apache Batik : for the front-end SVG drawing
* Apache Xerces (XSLT, XML processing)


Other usage of Apache projects related to Hop (plugins):
* Apache Avro
* Apache Beam w/ Apache Spark, Apache Flink, …
* Apache Cassandra
* Apache CouchDB
* Apache Derby
* Apache Flume
* Apache Hadoop
* Apache Hive
* Apache Kafka
* Apache Solr
* Apache Subversion
* Apache Zookeeper


For the build process
* Apache Maven
* Jenkins

An Excessive Fascination with the Apache Brand
-----------------------------------------------
With this proposal we are not seeking attention or publicity. Rather, we
firmly believe in Hop, visual data pipeline development and the ability to
treat the developed data pipelines (ETL) as software code. While the
original Hop code has been open source for about 15 years, we believe
putting code on GitHub can only go so far. We see the Apache community,
processes, and mission as critical for ensuring Hop is truly
community-driven, positively impactful, and innovative open source
software. We believe Hop is a great fit for the Apache Software Foundation
due to its focus on visual data processing and its relationships to
existing ASF projects.

Documentation
==============
Over the years, the community has contributed extensive documentation to
wiki.pentaho.com. Over time, areas of the available information have become
incomplete or outdated. Most of this documentation has been reviewed,
updated and will be contributed to the Apache foundation with the Hop
source code. Documentation for the extensive new functionality that was
added to Hop in recent months is being written.
We consider documentation to be a core piece of the Hop platform and will
treat documentation as any other item of code.

Initial Source
===============
While there isn’t a Java class in Hop which is unchanged from its origins,
we should mention we selected this source code to form the base of Apache
Hop:
https://github.com/pentaho/pentaho-kettle/tree/8.2.0.7-R

We merged various changes from the WebSpoon fork found over here:
https://github.com/HiromuHota/pentaho-kettle


Various community-driven Kettle plugins were written to bypass bugs, slow
down code-rot and implement missing features.  They were merged into Hop
from these locations:
https://github.com/mattcasters/kettle-debug-plugin (better debugging)
https://github.com/mattcasters/kettle-beam (Apache Beam support)
https://github.com/mattcasters/pentaho-pdi-dataset (Unit Testing)
https://github.com/mattcasters/kettle-needful-things (Bug fixes &
workarounds)
https://github.com/mattcasters/kettle-environment (Environment management)


The Hop repositories are currently hosted at:
https://github.com/project-hop/
* Hop: source code for the Hop project
* Hop-doc: technical documentation for the Hop project
* Hop-website: Hop website and content repository
* Hop-docker: Docker containers, Kubernetes

Source and Intellectual Property Submission Plan
=================================================
The originating source code is already licensed under an Apache 2 license:
* https://github.com/pentaho/pentaho-kettle/blob/8.2.0.7-R/LICENSE.txt
* https://github.com/HiromuHota/pentaho-kettle/blob/webspoon-8.3/LICENSE.txt
* https://github.com/mattcasters/kettle-debug-plugin/blob/master/LICENSE
* https://github.com/mattcasters/kettle-beam/blob/master/LICENSE
* https://github.com/mattcasters/pentaho-pdi-dataset/blob/master/LICENSE.txt
* https://github.com/mattcasters/kettle-needful-things/blob/master/LICENSE
* https://github.com/mattcasters/kettle-environment/blob/master/LICENSE


For all contributions we have an agreement in place:
https://cla-assistant.io/project-hop/hop

External Dependencies
======================
Over the course of the last year we removed non-essential dependencies as
much as possible and replaced them by interfaces and plugin types. We did
this to simplify the architecture.
It’s important to note all external dependencies are licensed under an
Apache 2.0 or Apache-compatible license. As we grow the Hop community we
will configure our build process to require and validate all contributions
and dependencies are licensed under the Apache 2.0 license or are under an
Apache-compatible license.

Cryptography
=============

Required Resources
===================

Mailing lists
--------------
We currently use a mix of email and Mattermost. We will migrate our
existing mailing lists to the following:

d...@hop.incubator.apache.org
u...@hop.incubator.apache.org
priv...@hop.incubator.apache.org
comm...@hop.incubator.apache.org

Git Repository
---------------
The Hop code is currently in git and we’d like to keep it that way. We
request a git repository for incubator-hop with mirroring to GitHub.

Issue Tracking
---------------
We request the creation of an Apache-hosted JIRA.

Jira ID: HOP


Other Resources
----------------
To allow other projects to use Hop as a library we would love to publish
artifacts on a Maven server like maven.apache.org.

Initial Committers
===================
* Nicholas Adment <nadm...@gmail.com>
* Hans Van Akelyen <hans.van.akel...@know.bi>
* Lokke Bruyndonckx <lokke.bruyndon...@know.bi>
* Matt Casters <matt.cast...@neo4j.com>
* Jason Chu <jianjun...@gmail.com>
* Peter Fabricius <i...@peter-fabricius.de>
* Rodrigo Haces <rodrigo.ha...@neo4j.com>
* Dave Henry <dshenr...@gmail.com>
* Hiromu Hota <hiromu.h...@gmail.com>
* Brandon Jackson <usbran...@gmail.com>
* Dan Keeley <d...@dankeeley.co.uk>
* Bart Maertens <bart.maert...@know.bi>
* Yannick Mols <yannick.m...@know.bi>
* Doug Moran <d...@dougandgretchen.com>
* Gretchen Moran <gretc...@dougandgretchen.com>
* Sergio Ramazzina <sergio.ramazz...@serasoft.it>
* Maria Carina Roldan <maria.carina.rol...@gmail.com>
* David Rosenblum <david.rosenb...@neo4j.com>
* Rafael Valenzuela <rav...@gmail.com>

Affiliations
=============
* Neo4J
   * Matt Casters
   * Rodrigo Haces
   * David Rosenblum
* Know.bi
   * Bart Maertens
   * Hans Van Akelyen
   * Lokke Bruyndonckx
   * Yannick Mols
* eHealth Africa
   * Doug & Gretchen Moran
* Schemetrica
   * Dave Henry
* Beijing Auphi Data Co
   * Jason Chu
* Serasoft Italy
   * Sergio Ramazzina
* Hitachi Research
   * Hiromu Hota


Sponsors
=========
Champion
---------
Maximilian Michels (m...@apache.org)

Nominated Mentors
------------------
Tom Barber (magicaltr...@apache.org)
Julian Hyde (jh...@apache.org)
Maximilian Michels (m...@apache.org)

Sponsoring Entity
==================
The Apache Incubator
