GitHub user hansva added a comment to the discussion: Question About Database 
Connections in Apache Hop Server Mode

I'll try to write a full and cohesive answer.
Let's start by rephrasing the original question:

> I want to run a pipeline/workflow on Hop Server using the Rest API directly, 
> without using Hop GUI or hop-run. How do I do this?

## Intro
### Hop Server
**What it is**
Hop Server is a stateless server. Its main purpose is to be used as an 
extension to Hop GUI to run a pipeline or workflow in a remote environment.

**What it isn't**
Hop Server isn't the typical server that you would use for 
scheduling/monitoring. It does not retain state or history, and it does not 
store this information (except in memory for a short period). After a restart, 
all previous information is lost.

Circling back to what it is, we can also discuss why it is poorly 
documented. We do **not** want it to be used in a stand-alone way; it wasn't 
made for this. We did add the endpoints to our 
[documentation](https://hop.apache.org//manual/latest/hop-server/rest-api.html) 
because people were asking for them, but honestly, it was never designed to be 
used without the GUI or Hop Run. There are better ways, e.g. short-lived 
containers, which provide more flexibility; in combination with Airflow 
([tutorial](https://hop.apache.org//manual/latest/how-to-guides/run-hop-in-apache-airflow.html))
 you can also use webhooks to start things.

Shameless plug: We (know.bi) are working on something better which we hope to 
showcase soon.

That covers the disclaimer; let's get back to the subject.

## Running a single pipeline

The process of starting something on Hop Server is split into three 
categories:

- A single pipeline
- A single workflow
- A workflow with other pipelines/workflows

Let's discuss starting a single pipeline first. Three steps need to be taken 
to start a pipeline on Hop Server.

### registerPipeline
The first step is to send the pipeline and all needed environment information 
to the server. As stated before, the server is stateless, so it knows nothing; 
it needs all information to create a successful execution.
The XML format of the request:

```
<pipeline_configuration>
  <pipeline>
  </pipeline>
  <pipeline_execution_configuration>
    <variables></variables>
    <parameters></parameters>
    <pass_export>N</pass_export>
    <log_level>Basic</log_level>
    <log_file>N</log_file>
    <log_filename/>
    <log_file_append>N</log_file_append>
    <create_parent_folder>N</create_parent_folder>
    <clear_log>Y</clear_log>
    <show_subcomponents>Y</show_subcomponents>
    <run_configuration>local</run_configuration>
  </pipeline_execution_configuration>
  <metastore_json>
  </metastore_json>
</pipeline_configuration>
```
Three blocks of information need to be included in this request:

**pipeline:**
This one is simple: it's the `.hpl` file that you wish to execute on the server.

**pipeline_execution_configuration:**
This block contains an export of all Hop variables in the `<variables>` section, 
as well as the parameters/variables you have defined in the Run Options dialog.
![image](https://github.com/user-attachments/assets/fb9ae198-06b1-45a3-aee7-c2bf3206085b)
The variables section will also contain all variables you have defined in your 
environment; if you have put a database username/password and so on in an 
environment file, they get added there.

Each variable looks like
`<variable><name>VARIABLE_NAME</name><value>VALUE</value></variable>`
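
If you're building the request programmatically, these elements are easy to 
generate. A small sketch with Python's standard library (names and values here 
are just placeholders):

```
import xml.etree.ElementTree as ET

def variables_block(variables: dict) -> str:
    """Render a name -> value mapping as the <variables> section."""
    root = ET.Element("variables")
    for name, value in variables.items():
        var = ET.SubElement(root, "variable")
        ET.SubElement(var, "name").text = name
        ET.SubElement(var, "value").text = value
    return ET.tostring(root, encoding="unicode")

# Placeholder variables for illustration only.
print(variables_block({"DB_HOSTNAME": "localhost", "DB_USERNAME": "postgres"}))
```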

**metastore_json:**
This is the part where it gets hard. The metastore_json is a Base64-encoded 
gzip stream.
To get a fast/simple preview of what's inside, you can take the example from 
our docs and throw it in [this](https://www.bugdays.com/gzip-base64) website.

It boils down to a JSON document containing all the objects you have defined in 
the metadata perspective/metadata folder.

Here is an example with only a PostgreSQL connection; in practice it also needs 
to contain your run targets and all other objects that are available in your 
metadata folder. Another note: each database type can have different fields 
(just like in the UI); most of them are shared, but e.g. MSSQL Server has more 
fields.

```
{
  "rdbms": [
    {
      "rdbms": {
        "POSTGRESQL": {
          "databaseName": "postgres",
          "pluginId": "POSTGRESQL",
          "indexTablespace": null,
          "dataTablespace": null,
          "accessType": 0,
          "hostname": "localhost",
          "password": "",
          "pluginName": "PostgreSQL",
          "port": "5432",
          "servername": null,
          "attributes": {
            "SUPPORTS_TIMESTAMP_DATA_TYPE": "N",
            "QUOTE_ALL_FIELDS": "N",
            "SUPPORTS_BOOLEAN_DATA_TYPE": "Y",
            "FORCE_IDENTIFIERS_TO_LOWERCASE": "N",
            "PRESERVE_RESERVED_WORD_CASE": "Y",
            "SQL_CONNECT": "",
            "FORCE_IDENTIFIERS_TO_UPPERCASE": "N",
            "PREFERRED_SCHEMA_NAME": ""
          },
          "manualUrl": "",
          "username": "postgres"
        }
      },
      "name": "pg"
    }
  ]
}
```
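
To produce the `metastore_json` value from such a document (or to inspect one 
coming back), all you need is gzip plus Base64. A minimal sketch in Python; 
`metadata.json` is a hypothetical file holding the JSON above:

```
import base64
import gzip

def encode_metastore(metadata_json: str) -> str:
    """Gzip-compress the metadata JSON, then Base64-encode it."""
    return base64.b64encode(gzip.compress(metadata_json.encode("utf-8"))).decode("ascii")

def decode_metastore(metastore_b64: str) -> str:
    """The reverse: Base64-decode, then gunzip back to readable JSON."""
    return gzip.decompress(base64.b64decode(metastore_b64)).decode("utf-8")

with open("metadata.json", encoding="utf-8") as f:  # hypothetical file name
    print(encode_metastore(f.read())[:60], "...")
```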

After building this request and sending it to the server (POST), you will get a 
response:
```
<webresult>
  <result>OK</result>
  <message>Pipeline &#39;variables&#39; was added to HopServer with id 
08bdff17-0d75-43a3-b890-05783376cbb2</message>
  <id>08bdff17-0d75-43a3-b890-05783376cbb2</id>
</webresult>
```
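
Putting this together, here is a minimal sketch of the registration call in 
Python. Assumptions: the endpoint is `/hop/registerPipeline/?xml=Y` (following 
the same `/hop/...` pattern as the calls below), the server runs on 
`localhost:8080` with the default `cluster`/`cluster` credentials, and 
`pipeline_configuration.xml` is a hypothetical file holding the request body 
shown above:

```
import requests
import xml.etree.ElementTree as ET

HOP = "http://localhost:8080"   # assumption: local Hop Server
AUTH = ("cluster", "cluster")   # assumption: default credentials

# The pipeline_configuration document built in the previous step.
with open("pipeline_configuration.xml", "rb") as f:  # hypothetical file name
    payload = f.read()

resp = requests.post(
    f"{HOP}/hop/registerPipeline/?xml=Y",  # assumption: path per the REST API docs
    data=payload,
    headers={"Content-Type": "text/xml"},
    auth=AUTH,
)
resp.raise_for_status()

# Parse the <webresult> and keep the id for prepareExec/startExec.
webresult = ET.fromstring(resp.text)
pipeline_id = webresult.findtext("id")
print(webresult.findtext("result"), pipeline_id)
```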

### prepareExec
After you get back the id, you have to call prepareExec with a GET request:

`GET /hop/prepareExec/?name=variables&xml=Y&id=08bdff17-0d75-43a3-b890-05783376cbb2`

response:
```
<webresult>
  <result>OK</result>
  <message/>
  <id/>
</webresult>
```

This will prepare the pipeline for execution, and it will enter a "waiting" state.

### startExec
The final step is a GET to startExec to start the actual execution:

`GET /hop/startExec/?name=variables&xml=Y&id=08bdff17-0d75-43a3-b890-05783376cbb2`

response:
```
<webresult>
  <result>OK</result>
  <message/>
  <id/>
</webresult>
```

You can follow how everything is going with the pipelineStatus endpoint.
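
For completeness, here are the two GET calls plus a status poll in Python; the 
`pipelineStatus` query parameters are an assumption based on the same pattern 
as prepareExec/startExec:

```
import time
import requests

HOP = "http://localhost:8080"   # assumption: local Hop Server
AUTH = ("cluster", "cluster")   # assumption: default credentials
NAME = "variables"
PIPELINE_ID = "08bdff17-0d75-43a3-b890-05783376cbb2"  # id from registerPipeline

def call(endpoint: str) -> str:
    """GET one of the /hop/* endpoints for this pipeline, return the XML body."""
    r = requests.get(
        f"{HOP}/hop/{endpoint}/",
        params={"name": NAME, "xml": "Y", "id": PIPELINE_ID},
        auth=AUTH,
    )
    r.raise_for_status()
    return r.text

call("prepareExec")   # pipeline enters the "waiting" state
call("startExec")     # execution starts

# Poll until the status XML reports a finished state; parse it properly in real code.
for _ in range(30):
    status_xml = call("pipelineStatus")
    if "Finished" in status_xml:
        break
    time.sleep(2)
print(status_xml)
```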

## Closing note

These steps should help you use the REST API directly to start a pipeline; 
running a single workflow is a similar process.
Running a combination of workflows and pipelines requires more work, as it 
involves a specially crafted zip file that is sent to the server.

Happy coding,
Hans

GitHub link: 
https://github.com/apache/hop/discussions/4634#discussioncomment-11422350
