Hi Peter, Thank you for trying out Hop, I will answer as much as I can inline. We really value the feedback. On a general note, you are totally right: our architecture and infrastructure docs are lacking or missing at this point. We have mainly focused on getting all features working and should now move on to providing more how-tos and blueprints. Feedback on how users plan to use Hop will also help us focus on what to document first.
On Mon, 1 Nov 2021 at 07:36, Peter Janz <[email protected]> wrote:

> Hi,
>
> first of all THANK YOU FOR APACHE HOP!
>
> For all of the following questions:
> - I haven't found information in the documentation, so please point me to it if already available.
> - using client/UI env Windows Server 2012 R2
> - using execution env Docker container based on doc setup on CentOS 7
>
> *1) It would be nice to have some kind of intended infrastructure/environment when using Apache Hop*
> why: It is not clear to me how you intended Hop workflows to be executed. Something like:
> - UI only: run the Apache Hop GUI and execute with the default local run configuration
> - UI and Hop Server: run the Apache Hop GUI, run Hop Server, add a remote execution run configuration, use it as the run configuration, and execute workflows via the UI
>
> In my point of view, missing (and somewhat mentioned in the videos) are scenarios where the dev and execution process is split, meaning execution is not triggered via the UI but via some external trigger like a cron job:
> - UI + Docker as execution platform: develop in the Hop GUI, copy/push project files into a file repository like Git, ?? CI/CD to rebuild the Docker container with the current project files ??
> -> as Hop Run is changing project files in the project folder, we have to provide a pure copy of the project files; if Git is used it won't be possible to pull changes on top of local changes to files!!! (this seems to me more like an architecture failure: why is Hop Run changing project files? maybe switch to using a working directory for Hop Run)

As you noticed, Hop does not have a built-in way of scheduling workflows and pipelines. We have left this out of our current scope because we believe there are other projects with a pure focus on this that are better suited. We try to make Hop as flexible as possible so it can be integrated into these other solutions. The most basic setup would be a server containing Hop and a cron task per workflow.
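To make that concrete, such a cron setup could look roughly like the sketch below. The paths, project name, and workflow file are invented for illustration, and the hop-run options shown should be verified against `hop-run.sh --help` for your Hop version:

```shell
#!/bin/sh
# nightly_load.sh -- illustrative cron wrapper for a Hop workflow.
# HOP_DIR, project and workflow names are examples, not Hop defaults.
HOP_DIR=/opt/hop
PROJECT=its_dv_etl
WORKFLOW='${PROJECT_HOME}/loads/nightly-load.hwf'

"$HOP_DIR/hop-run.sh" \
  --project="$PROJECT" \
  --file="$WORKFLOW" \
  --runconfig=local \
  >> /var/log/hop/nightly-load.log 2>&1

# Example crontab entry to run it every night at 02:00:
# 0 2 * * * /opt/hop/scripts/nightly_load.sh
```

Logging to a file per workflow keeps the setup transparent; anything fancier (retries, dependencies, alerting) is where a real scheduler starts to pay off.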
You could also create a watchdog workflow that triggers other workflows/pipelines. More advanced setups would integrate pipelines into Apache Airflow (an Airflow operator is something we are working on). A colleague of mine has also tried using an AWS Lambda function to launch a pipeline when a new file is added to S3. In the end we want to provide a large toolbox of ways to integrate Hop into your modern data stack and trigger it where needed in a local, hybrid, or cloud environment. Unfortunately none of this has reached the public documentation yet, but parts of it should arrive in the coming weeks.

Docker as an execution platform is something we actively use. As developers we are not fans of unit tests in our code, because they do not test against real-life scenarios; instead we run our own daily testing platform to see if Hop is still working as intended. You can take a look at our daily tests here: https://ci-builds.apache.org/job/Hop/job/Hop-integration-tests/ Those tests are based on Docker Compose files that are included in our repository (https://github.com/apache/incubator-hop/tree/master/docker/integration-tests).

As a project and developer you would want your code to follow a CI/CD pipeline, and all the tools needed to create this are present. We are only missing docs, but the specifics also depend on the platform you use (GitHub Actions, Jenkins, GitLab, Jenkins X, ...), so I fear we will not be able to provide samples/code for every available platform.

Hop Run does not and should not change project files. A project should be self-contained. I will point back to our integration tests (https://github.com/apache/incubator-hop/tree/master/integration-tests): each folder there is a self-contained project (the environment and hop-config files are included too, but we do not recommend doing that in a real-life scenario), and anyone can clone the repository and add these projects to their Hop GUI without any of those files changing.
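As a sketch of the Docker option: a single workflow run against the Hop image could look roughly like the command below. The environment variable names follow what load-and-execute.sh consumed at the time of writing, and the image name, tag, mount paths, and project/environment names are all illustrative; check that script and the Docker README in the repository for the authoritative list:

```shell
# Sketch: run one workflow in a container and exit (all names illustrative).
# The image entrypoint registers the mounted project and calls hop-run for us.
docker run --rm \
  -e HOP_LOG_LEVEL=Basic \
  -e HOP_PROJECT_NAME=its_dv_etl \
  -e HOP_PROJECT_FOLDER=/files \
  -e HOP_ENVIRONMENT_NAME=prod_docker_exec \
  -e HOP_FILE_PATH='${PROJECT_HOME}/loads/nightly-load.hwf' \
  -v /srv/hop/its_dv_etl:/files \
  apache/hop
```

Because the project is mounted rather than baked in, a CI job can simply `git pull` the project folder before each run and keep the image itself generic.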
> - UI + Hop Server as execution platform
> as the server is started with a set of project files which will not be updated throughout its lifetime, we would have to restart the server every time changes occur
> -> is there a graceful kill that waits for any running workflow/pipeline to finish before the server is killed?

No, it is a simple Jetty server; there is no graceful shutdown command.

> -> will the server lose execution information on restarts?

Not 100% sure, but I think it does.

> -> there is currently no API documentation for Hop Server, which makes usage somewhat hard

Documentation on the available endpoints can be found here: https://hop.apache.org/manual/latest/hop-server/rest-api.html

We see Hop Server mainly as a way to provide additional resources or to offload work somewhere else. E.g. you are developing on a test set with limited memory available, but when needed you can launch your pipeline on a beefy server to do the heavy lifting. Or you have one central server that launches pipelines on several machines; this can also be done using Airflow. We do not really see a need for a Hop server in every scenario. It can also be used as a web service if you need a pipeline to provide a JSON/XML endpoint for other services.

> advanced scenarios and how they would look:
> - UI + Hop Server > Hop Server
> - UI + Hop Server & Beam
> - ...
>
> *conclusion:* it would be nice to have a section of possible Hop infrastructure examples for development and execution as a blueprint or starting point, maybe with some kind of picture of the infrastructure showing which components are needed and how they relate to each other

Totally agree; we will start working on this very soon. We can provide a blueprint and a list of ways to integrate into existing stacks and infrastructure.
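As a quick illustration of the server API mentioned above: assuming a Hop server started on localhost port 8080 with the default cluster/cluster credentials (change those in any real deployment), a status call could look like the sketch below; see the rest-api documentation page for the full and authoritative endpoint list:

```shell
# Query the server status over the REST API (host, port and credentials
# are examples); add xml=Y to get machine-readable output.
curl -u cluster:cluster "http://localhost:8080/hop/status/?xml=Y"
```

The same style of call applies to the other documented endpoints, which is what makes driving a Hop server from an external scheduler or script practical.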
> *possible bug:*
> if an environment is derived from another and both are available in the Docker image, the base env metadata files are not used/copied automatically but must be added manually to the metadata of the configured execution project
> How should the project be set up if you want to use inheritance?

I have created a ticket to add the option to include a parent project (https://issues.apache.org/jira/browse/HOP-3470), but we have to keep our image as generic as possible, so we won't be able to cover all scenarios with it; in more complex cases custom images will be needed.

> *2) intended usage of Git and hop-config.json*
> as of now, hop-config.json defines the relationship between a project and an environment:
> "lifecycleEnvironments" : [ {
>   "name" : "prod_docker_exec",
>   "purpose" : "Production",
>   "projectName" : "its_dv_etl",
>   "configurationFiles" : [ "${PROJECT_HOME}/environment\\prod_docker_exec-config.json" ]
> } ]
>
> This would mean that hop-config.json must be part of the project files and therefore part of the Git repository. In conclusion you may only have one Git repository, and therefore one "project", per Hop folder. How are multi-Git projects intended to work? Under which folder should we place our Git repository in that case?
> -> why is project-related configuration done in a UI config file? maybe move the env config from hop-config.json to project-config.json
>
> *possible bug:* the UI is not filtering the environments based on the selected project but switching projects based on the selected environment

hop-config.json should not be part of your Git repository, as it depends on the system you are working on; this is why our load-and-execute script registers a new project when running (https://github.com/apache/incubator-hop/blob/master/docker/resources/load-and-execute.sh).
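As an aside, an environment config file such as the prod_docker_exec-config.json referenced above is itself just a small JSON file holding variables, along the lines of the sketch below (the variable names and values are invented, and the exact schema may differ per Hop version):

```json
{
  "variables" : [ {
    "name" : "DB_HOSTNAME",
    "value" : "prod-db.internal",
    "description" : "source system database host"
  }, {
    "name" : "SOURCE_FILES",
    "value" : "/tmp/source/files",
    "description" : "landing folder for incoming files"
  } ]
}
```

Keeping files like this outside the project repository lets the same pipelines run unchanged across dev/test/prod.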
Your environment files should also not be part of the same Git repository, as they could contain sensitive information (passwords, server names, ...). An environment could also be valid for multiple projects; this is why we have split them. It makes using them a bit more complex but allows for greater flexibility. If, for example, you decide to create a project per sub-module in your warehouse/lake/data platform, they could still use the same X environments. You are right, the project should not switch when changing the environment (https://issues.apache.org/jira/browse/HOP-3471).

> *3) trivial tests and current node state*
> I'm sorry if these things are already known, but you may know a workaround; and if unknown by now, you may add a bug in Jira
>
> *# the script nodes do not seem to support all internally defined types*
> examples are:
> - Java node returning Decimal, where even a BigDecimal type is not registered at Rhino
> - JavaScript node can't convert numbers at all (from my last test)
> -> maybe set up a policy that each node should have a trivial test verifying that all supported types can be used as input and can be generated and used as output

Can you create tickets with samples for these? I do not think these issues are known.

> *# not working in the UI by now in my env*
> Execute SQL script does not allow adding parameters; current workaround: Java node and Execute row SQL
> -> is this intended, as newer frameworks like Spring would support adding all the fields, making the use of dedicated parameters obsolete? If so, how can we use the parameters? Please add an example to the doc.

Have you selected the variable substitution checkbox? You should even be able to mix parameters and arguments in that transform.
(Note: when using string parameters, you are responsible for correct quoting in the query.)

> - it's not possible to set the database connection via an env variable; as a workaround we can set the database connection fields via variables, but this seems more like a bug than intended

Some of the transforms allow you to use a variable in the connection name, and there are tickets to add this to all related transforms (parent: https://issues.apache.org/jira/browse/HOP-3180). But we do recommend using variables inside a connection, as you would want to use environment files to change database settings rather than include passwords in the metadata files (e.g. have only one "source system" connection and switch to the correct server name/user/password depending on the environment file).

> # possible improvements
> if we are working with env variables (multi-env like win/linux), the workflow/pipeline target path should support some kind of easy use; by now we have to choose a file and replace the leading part with the variable placeholder on all nodes referencing a file, like Pipeline or Workflow nodes

I fear I do not fully understand this question. For pipelines and workflows you can use the ${PROJECT_HOME} variable; it is even auto-replaced when using the file browser. For source files or file paths you would always have to use some variable, as the location depends on the context where it is running; it could be mounted in a different location depending on the system. Or do you mean: if you have an env variable "SOURCE_FILES" pointing to "/tmp/source/files/" and you use the browser to select a file in that location, it would auto-replace "/tmp/source/files/" with ${SOURCE_FILES} as it does with PROJECT_HOME? That would be a nice feature (created a ticket: https://issues.apache.org/jira/browse/HOP-3472).

> maybe add the possibility to have a multiline window for the Java expression node, even if single-line (short) statements are intended.
> By now the field is resizing to its content and line breaks are not allowed, making more complex statements somewhat hard to use.
>
> *4) documentation*
> some of the tables in the doc are wrongly built, as field names and descriptions are mixed and occur on both sides of the table; see Pipeline > Parameter as an example

Could not find your example, but feel free to create tickets and we will take action to fix them.

> some UI links to the docs are incorrect, like the HTTP client node directing to Get variables

https://issues.apache.org/jira/browse/HOP-3473 -> nice catch, solved

> Thank you and sorry for the long mail.
>
> Br,
> Peter

We really appreciate the effort you put into this mail, and I hope these answers help you a bit. Where possible I added actionable items. Writing code is the easy part; providing documentation that covers most needs is the hard part. If you have any further questions or remarks, let us know.

Cheers,
Hans
