Hi Fabian,
I've been digging into this a bit and it seems we will need some code
changes to make this work.
As far as I can tell, you have to use one of the Docker templates Google
provides to start a pipeline from a template.
The issue we have is that our MainBeam class requires 3 arguments to work
(filename/metadata/run configuration name).
These must be the first 3 arguments passed to the class; we have no named
parameters implemented.
When the template launches it calls java in the following way:
Executing: java -cp /template/* org.apache.hop.beam.run.MainBeam
--pipelineLocation=test --runner=DataflowRunner --project=xxx
--templateLocation=gs://dataflow-staging-us-central1-xxxx/staging/template_launches/2022-08-18_02_34_17-10288166777030254520/job_object
--stagingLocation=gs://dataflow-staging-us-central1-xxxx/staging --labels={
"goog-data-pipelines" : "test" } --jobName=test-mp--1660815257
--region=us-central1 --serviceAccount=
[email protected]
--tempLocation=gs://dataflow-staging-us-central1-xxxx/tmp
In this case it simply takes the first 3 arguments it sees, which are the
launcher's own flags rather than ours.
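To make the problem concrete, here is a minimal shell sketch (not Hop code; the flag values are taken from the launcher output above) of what positional selection ends up picking:

```shell
# Simulate MainBeam's positional behaviour: take the first three arguments.
# With the launcher's own flags in front, the "filename", "metadata" and
# "run configuration" slots get filled with Dataflow options instead.
set -- --pipelineLocation=test --runner=DataflowRunner --project=xxx \
       --region=us-central1
echo "filename=$1"      # --pipelineLocation=test
echo "metadata=$2"      # --runner=DataflowRunner
echo "run_config=$3"    # --project=xxx
```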
As I cannot find a way to force those 3 arguments in there, we will need to
implement named parameters in that class. I tried a bit of a hack and
changed the Docker template to the following, but the Google launcher
script then throws an error:
ENV FLEX_TEMPLATE_JAVA_MAIN_CLASS="org.apache.hop.beam.run.MainBeam
gs://xxx/0004-rest-client-get.hpl gs://xxx/hop-metadata.json Dataflow"
As I think this will add great value, I will work on it ASAP. Once the
work is done we can even supply the required image from our Docker Hub
account, and you should be able to run Hop pipelines on Dataflow using a
simple template.
My idea is to add support for the following 3 named parameters:
- HopPipelinePath -> location of the pipeline (can be Google Storage)
- HopMetadataPath -> location of the metadata file (can be Google storage)
- HopRunConfigurationName -> name of the pipeline run configuration to use
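Once those exist, MainBeam could scan the whole argument list for these names instead of relying on position. A rough sketch of the intended selection logic (in shell for illustration only; the parameter names are the proposal above and the real implementation will be in Java):

```shell
# Pick the three Hop parameters out of a mixed argument list, ignoring
# the Dataflow launcher's own flags (--runner, --project, ...).
parse_hop_args() {
  for arg in "$@"; do
    case "$arg" in
      --HopPipelinePath=*)         HOP_PIPELINE="${arg#*=}" ;;
      --HopMetadataPath=*)         HOP_METADATA="${arg#*=}" ;;
      --HopRunConfigurationName=*) HOP_RUN_CONFIG="${arg#*=}" ;;
    esac
  done
}

# Mixed argument list, as the template launcher would pass it:
parse_hop_args --runner=DataflowRunner --project=xxx \
  --HopPipelinePath=gs://xxx/0004-rest-client-get.hpl \
  --HopMetadataPath=gs://xxx/hop-metadata.json \
  --HopRunConfigurationName=Dataflow
echo "$HOP_PIPELINE / $HOP_METADATA / $HOP_RUN_CONFIG"
```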
I'll post updates here on the progress.
Cheers,
Hans
On Tue, 16 Aug 2022 at 11:36, Fabian Peters <[email protected]> wrote:
> Hi Hans,
>
> No, I didn't yet have another go. The hints from Matt (didn't see that
> mail on the list?) do look quite useful in the context of Dataflow templates.
> I'll try to see whether I can get a bit further, but if you have time to
> have a look at it, I'd much appreciate!
>
> cheers
>
> Fabian
>
> Am 16.08.2022 um 11:09 schrieb Hans Van Akelyen <
> [email protected]>:
>
> Hi Fabian,
>
> Did you get this working and are you willing to share the final results?
> If not I will see what I can do, and we can add it to our documentation.
>
> Cheers,
> Hans
>
> On Thu, 11 Aug 2022 at 13:14, Matt Casters <[email protected]> wrote:
>
>> When you run class org.apache.hop.beam.run.MainBeam you need to provide 3
>> arguments to run:
>>
>> 1. The filename of the pipeline to run
>> 2. The filename which contains Hop metadata
>> 3. The name of the pipeline run configuration to use
>>
>> See also for example:
>> https://hop.apache.org/manual/latest/pipeline/pipeline-run-configurations/beam-flink-pipeline-engine.html#_running_with_flink_run
>>
>> Good luck,
>> Matt
>>
>>
>> On Thu, Aug 11, 2022 at 11:08 AM Fabian Peters <[email protected]> wrote:
>>
>>> Hello Hans,
>>>
>>> I went through the flex-template process yesterday but the generated
>>> template does not work. The main piece that's missing for me is how to pass
>>> the actual pipeline that should be run. My test boiled down to:
>>>
>>> gcloud dataflow flex-template build gs://foo_ag_dataflow/tmp/todays-directories.json \
>>>   --image-gcr-path "europe-west1-docker.pkg.dev/dashboard-foo/dataflow/hop:latest" \
>>>   --sdk-language "JAVA" \
>>>   --flex-template-base-image JAVA11 \
>>>   --metadata-file "/Users/fabian/Documents/src/foo/fooDataEngineering/hop/dataflow/todays-directories.json" \
>>>   --jar "/Users/fabian/tmp/fat-hop.jar" \
>>>   --env FLEX_TEMPLATE_JAVA_MAIN_CLASS="org.apache.hop.beam.run.MainBeam"
>>>
>>> gcloud dataflow flex-template run "todays-directories-`date +%Y%m%d-%H%M%S`" \
>>>   --template-file-gcs-location "gs://foo_ag_dataflow/tmp/todays-directories.json" \
>>>   --region "europe-west1"
>>>
>>> With Dockerfile:
>>>
>>> FROM gcr.io/dataflow-templates-base/java11-template-launcher-base
>>>
>>> ARG WORKDIR=/dataflow/template
>>> RUN mkdir -p ${WORKDIR}
>>> WORKDIR ${WORKDIR}
>>>
>>> ENV FLEX_TEMPLATE_JAVA_MAIN_CLASS="org.apache.hop.beam.run.MainBeam"
>>> ENV FLEX_TEMPLATE_JAVA_CLASSPATH="/dataflow/template/*"
>>>
>>> ENTRYPOINT ["/opt/google/dataflow/java_template_launcher"]
>>>
>>>
>>> And "todays-directories.json":
>>>
>>> {
>>>   "defaultEnvironment": {},
>>>   "image": "europe-west1-docker.pkg.dev/dashboard-foo/dataflow/hop:latest",
>>>   "metadata": {
>>>     "description": "Test templates creation with Apache Hop",
>>>     "name": "Todays directories"
>>>   },
>>>   "sdkInfo": {
>>>     "language": "JAVA"
>>>   }
>>> }
>>>
>>> Thanks for having a look at it!
>>>
>>> cheers
>>>
>>> Fabian
>>>
>>> Am 10.08.2022 um 16:03 schrieb Hans Van Akelyen <
>>> [email protected]>:
>>>
>>> Hi Fabian,
>>>
>>> You have indeed found something we have not yet documented, mainly
>>> because we have not yet tried it out ourselves.
>>> The main class that gets called when running Beam pipelines is
>>> "org.apache.hop.beam.run.MainBeam".
>>>
>>> I was hoping the "Import as pipeline" button on a job would give you
>>> everything you need to execute this but it does not.
>>> I'll take a closer look the following days to see what is needed to use
>>> this functionality, could be that we need to export the template based on a
>>> pipeline.
>>>
>>> Kr,
>>> Hans
>>>
>>> On Wed, 10 Aug 2022 at 15:46, Fabian Peters <[email protected]> wrote:
>>>
>>>> Hi all!
>>>>
>>>> Thanks to Hans' work on the REST transform, I can now deploy my jobs to
>>>> Dataflow.
>>>>
>>>> Next, I'd like to schedule a batch job
>>>> <https://cloud.google.com/community/tutorials/schedule-dataflow-jobs-with-cloud-scheduler>,
>>>> but for this I need to create a template
>>>> <https://cloud.google.com/dataflow/docs/concepts/dataflow-templates>.
>>>> I've searched the Hop documentation but haven't found anything on this. I'm
>>>> guessing that flex-templates
>>>> <https://cloud.google.com/dataflow/docs/guides/templates/using-flex-templates#create_a_flex_template>
>>>> are
>>>> the way to go, due to the fat-jar, but I'm wondering what to pass as
>>>> the FLEX_TEMPLATE_JAVA_MAIN_CLASS.
>>>>
>>>> cheers
>>>>
>>>> Fabian
>>>>
>>>
>>>
>>
>> --
>> Neo4j Chief Solutions Architect
>> *✉ *[email protected]
>>
>>
>>
>>
>