Hello Hans,

Just catching up on the day's mails. I'm really grateful to you for looking 
into this in depth and even coming up with a working setup! I've been swamped 
with other work and would never have gotten this far on my own anyway. Thanks 
again! ;)

I'll try to do a test run tomorrow.

cheers

Fabian

> On 18.08.2022 at 15:53, Hans Van Akelyen <[email protected]> wrote:
> 
> Hi Fabian,
> 
> So I played around a bit more with the pipelines and I was able to launch 
> Dataflow jobs, but it's not completely working as expected.
> The documentation around this is also a bit scattered, so I'm not sure I'll 
> be able to figure out the final solution in a short period of time.
> 
> Steps taken to get this working:
> - Modify the code a bit; these changes will be merged soon [1]
> - Generate a hop-fatjar.jar
> - Upload a pipeline and the Hop metadata to Google Storage
>   - Modify the run configuration to take the fat jar from the following 
> location: /dataflow/template/hop-fatjar.jar (its location in the Docker image)
> - Modify the default Dockerfile to include the fat jar:
>  
>   FROM gcr.io/dataflow-templates-base/java11-template-launcher-base
> 
>   ARG WORKDIR=/dataflow/template
>   RUN mkdir -p ${WORKDIR}
>   WORKDIR ${WORKDIR}
> 
>   COPY hop-fatjar.jar .
> 
>   ENV FLEX_TEMPLATE_JAVA_MAIN_CLASS="org.apache.hop.beam.run.MainBeam"
>   ENV FLEX_TEMPLATE_JAVA_CLASSPATH="${WORKDIR}/*"
> 
>   ENTRYPOINT ["/opt/google/dataflow/java_template_launcher"]
> 
> - Save the image in the container registry (gcloud builds submit --tag 
> <image_location>:latest .)
> - Create a new pipeline using the following template:
> 
> {
>     "defaultEnvironment": {},
>     "image": "<your image location>:latest",
>     "metadata": {
>         "description": "This template allows you to start Hop pipelines on 
> dataflow",
>         "name": "Template to start a hop pipeline",
>         "parameters": [
>             {
>                 "helpText": "Google storage location pointing to the pipeline 
> you wish to start",
>                 "label": "Google storage location pointing to the pipeline 
> you wish to start",
>                 "name": "HopPipelinePath",
>                 "regexes": [
>                     ".*"
>                 ]
>             },
>             {
>                 "helpText": "Google storage location pointing to the Hop 
> Metadata you wish to use",
>                 "label": "Google storage location pointing to the Hop 
> Metadata you wish to use",
>                 "name": "HopMetadataPath",
>                 "regexes": [
>                     ".*"
>                 ]
>             },
>             {
>                 "helpText": "Run configuration used to launch the pipeline",
>                 "label": "Run configuration used to launch the pipeline",
>                 "name": "HopRunConfigurationName",
>                 "regexes": [
>                     ".*"
>                 ]
>             }
>         ]
>     },
>     "sdkInfo": {
>         "language": "JAVA"
>     }
> }
> 
> - Fill in the parameters with the Google Storage locations and the run 
> configuration name
> - Run the pipeline
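> 
> For reference, the same template can also be launched straight from the CLI 
> instead of the console. The sketch below assumes the template JSON above has 
> been uploaded to GCS; the bucket paths and job name are placeholders:
> 
>   gcloud dataflow flex-template run "hop-pipeline-`date +%Y%m%d-%H%M%S`" \
>     --template-file-gcs-location "gs://<your-bucket>/templates/hop-template.json" \
>     --region "us-central1" \
>     --parameters HopPipelinePath="gs://<your-bucket>/pipelines/example.hpl",HopMetadataPath="gs://<your-bucket>/hop-metadata.json",HopRunConfigurationName="Dataflow"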
> 
> Now we enter the point where things get a bit strange. When you follow all 
> these steps you will notice that a Dataflow job gets started.
> This Dataflow job then spawns another Dataflow job that contains the actual 
> pipeline; the original job started via the pipeline will fail, but the other 
> job will run fine.
> [screenshot attached]
> The pipeline job expects a job file to be generated at a specific location, 
> and it then picks up this file to execute the actual job.
> This is the part where we would probably have to change our code a bit, to 
> save the job specification to that location instead of starting another job 
> via the Beam API.
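> 
> To make that concrete, below is a rough sketch of the kind of change I mean, 
> not the actual implementation. If MainBeam let Beam parse the launcher's own 
> arguments, the --templateLocation option the launcher passes (see the java 
> command further down in this thread) would make the Dataflow runner write the 
> job specification to that location instead of submitting a second job. The 
> Beam classes are real; the class name and the wiring into Hop are hypothetical:
> 
>   import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
>   import org.apache.beam.sdk.Pipeline;
>   import org.apache.beam.sdk.options.PipelineOptionsFactory;
> 
>   public class TemplateStagingSketch {
>     public static void main(String[] args) {
>       // `args` are the arguments the template launcher passes to the main class.
>       DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args)
>           .withoutStrictParsing()
>           .as(DataflowPipelineOptions.class);
>       // When --templateLocation is set, the DataflowRunner stages the job
>       // specification at that location and does not launch a job itself.
>       Pipeline pipeline = Pipeline.create(options);
>       // ... translate the Hop pipeline into Beam transforms here ...
>       pipeline.run();
>     }
>   }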
> 
> Until we get that sorted out you will have 2 jobs, one of which will fail on 
> every run. I hope this is acceptable for now.
> 
> Cheers,
> Hans
> 
> [1] https://github.com/apache/hop/pull/1644
> 
> 
> On Thu, 18 Aug 2022 at 13:00, Hans Van Akelyen <[email protected]> wrote:
> Hi Fabian,
> 
> I've been digging into this a bit and it seems we will need some code changes 
> to make this work.
> As far as I can tell you have to use one of the Docker templates Google 
> provides to start a pipeline from a template.
> The issue we have is that our MainBeam class requires 3 arguments to work 
> (filename/metadata/run configuration name).
> These 3 arguments need to be the first 3 arguments passed to the class; we 
> have no named parameters implemented.
> 
> When the template launches it calls java in the following way:
> 
> Executing: java -cp /template/* org.apache.hop.beam.run.MainBeam 
> --pipelineLocation=test --runner=DataflowRunner --project=xxx 
> --templateLocation=gs://dataflow-staging-us-central1-xxxx/staging/template_launches/2022-08-18_02_34_17-10288166777030254520/job_object
>  --stagingLocation=gs://dataflow-staging-us-central1-xxxx/staging --labels={ 
> "goog-data-pipelines" : "test" } --jobName=test-mp--1660815257 
> --region=us-central1 
> [email protected] 
> <mailto:[email protected]> 
> --tempLocation=gs://dataflow-staging-us-central1-xxxx/tmp
> 
> In this case it will see the first 3 arguments and select them.
> [screenshot attached]
> 
> As I cannot find a way to force those 3 arguments in there, we will need to 
> implement named parameters in that class. I tried a bit of a hack but it did 
> not work: I changed the Docker template to the following, but the Google 
> script then throws an error:
> 
> ENV FLEX_TEMPLATE_JAVA_MAIN_CLASS="org.apache.hop.beam.run.MainBeam 
> gs://xxx/0004-rest-client-get.hpl gs://xxx/hop-metadata.json Dataflow"
> 
> As I think this will have great added value, I will work on it ASAP. When 
> the work has been done we can even supply the required image from our 
> Docker Hub account, and you should be able to run Hop pipelines in Dataflow 
> using a simple template.
> 
> My idea is to add support for the following 3 named parameters (sketch below):
>  - HopPipelinePath -> location of the pipeline (can be Google Storage)
>  - HopMetadataPath -> location of the metadata file (can be Google storage)
>  - HopRunConfigurationName 
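> 
> To sketch what I mean (nothing of this exists yet; only the parameter names 
> match the template metadata), the argument handling could look roughly like 
> the hypothetical helper below:
> 
>   import java.util.ArrayList;
>   import java.util.List;
> 
>   // Hypothetical helper for MainBeam: accept --Name=value style arguments and
>   // fall back to the current positional behaviour; anything else would be
>   // passed on to Beam unchanged.
>   public class NamedParameterSketch {
>     public static String[] resolve(String[] args) {
>       String pipelinePath = null, metadataPath = null, runConfigName = null;
>       List<String> remaining = new ArrayList<>();
>       for (String arg : args) {
>         if (arg.startsWith("--HopPipelinePath=")) {
>           pipelinePath = arg.substring("--HopPipelinePath=".length());
>         } else if (arg.startsWith("--HopMetadataPath=")) {
>           metadataPath = arg.substring("--HopMetadataPath=".length());
>         } else if (arg.startsWith("--HopRunConfigurationName=")) {
>           runConfigName = arg.substring("--HopRunConfigurationName=".length());
>         } else {
>           remaining.add(arg);
>         }
>       }
>       // Fall back to the current behaviour: the first three positional arguments.
>       if (pipelinePath == null && remaining.size() >= 3) {
>         pipelinePath = remaining.get(0);
>         metadataPath = remaining.get(1);
>         runConfigName = remaining.get(2);
>       }
>       return new String[] { pipelinePath, metadataPath, runConfigName };
>     }
>   }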
> 
> I'll post updates here on the progress.
> 
> Cheers,
> Hans
> 
> On Tue, 16 Aug 2022 at 11:36, Fabian Peters <[email protected]> wrote:
> Hi Hans,
> 
> No, I didn't yet have another go. The hints from Matt (didn't see that mail 
> on the list?) do look quite useful in the context of Dataflow templates. I'll 
> try to see whether I can get a bit further, but if you have time to have a 
> look at it, I'd much appreciate it!
> 
> cheers
> 
> Fabian
> 
>> On 16.08.2022 at 11:09, Hans Van Akelyen <[email protected]> wrote:
>> 
>> Hi Fabian,
>> 
>> Did you get this working and are you willing to share the final results?
>> If not, I will see what I can do, and we can add it to our documentation.
>> 
>> Cheers,
>> Hans
>> 
>> On Thu, 11 Aug 2022 at 13:14, Matt Casters <[email protected]> wrote:
>> When you run the class org.apache.hop.beam.run.MainBeam you need to provide 
>> 3 arguments:
>> 
>> 1. The filename of the pipeline to run
>> 2. The filename which contains Hop metadata
>> 3. The name of the pipeline run configuration to use
>> 
>> See also for example: 
>> https://hop.apache.org/manual/latest/pipeline/pipeline-run-configurations/beam-flink-pipeline-engine.html#_running_with_flink_run
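>> 
>> For a concrete picture, an invocation would look roughly like this (jar name 
>> and paths are placeholders; "Dataflow" is the run configuration name):
>> 
>>   java -cp hop-fatjar.jar org.apache.hop.beam.run.MainBeam \
>>     /path/to/pipeline.hpl \
>>     /path/to/hop-metadata.json \
>>     Dataflow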
>> 
>> Good luck,
>> Matt
>> 
>> 
>> On Thu, Aug 11, 2022 at 11:08 AM Fabian Peters <[email protected]> wrote:
>> Hello Hans,
>> 
>> I went through the flex-template process yesterday but the generated 
>> template does not work. The main piece that's missing for me is how to pass 
>> the actual pipeline that should be run. My test boiled down to:
>> 
>> gcloud dataflow flex-template build gs://foo_ag_dataflow/tmp/todays-directories.json \
>>       --image-gcr-path "europe-west1-docker.pkg.dev/dashboard-foo/dataflow/hop:latest" \
>>       --sdk-language "JAVA" \
>>       --flex-template-base-image JAVA11 \
>>       --metadata-file "/Users/fabian/Documents/src/foo/fooDataEngineering/hop/dataflow/todays-directories.json" \
>>       --jar "/Users/fabian/tmp/fat-hop.jar" \
>>       --env FLEX_TEMPLATE_JAVA_MAIN_CLASS="org.apache.hop.beam.run.MainBeam"
>> 
>> gcloud dataflow flex-template run "todays-directories-`date +%Y%m%d-%H%M%S`" \
>>     --template-file-gcs-location "gs://foo_ag_dataflow/tmp/todays-directories.json" \
>>     --region "europe-west1"
>> 
>> With Dockerfile:
>> 
>> FROM gcr.io/dataflow-templates-base/java11-template-launcher-base
>> 
>> ARG WORKDIR=/dataflow/template
>> RUN mkdir -p ${WORKDIR}
>> WORKDIR ${WORKDIR}
>> 
>> ENV FLEX_TEMPLATE_JAVA_MAIN_CLASS="org.apache.hop.beam.run.MainBeam"
>> ENV FLEX_TEMPLATE_JAVA_CLASSPATH="/dataflow/template/*"
>> 
>> ENTRYPOINT ["/opt/google/dataflow/java_template_launcher"]
>> 
>> 
>> And "todays-directories.json":
>> 
>> {
>>     "defaultEnvironment": {},
>>     "image": "europe-west1-docker.pkg.dev/dashboard-foo/dataflow/hop:latest",
>>     "metadata": {
>>         "description": "Test templates creation with Apache Hop",
>>         "name": "Todays directories"
>>     },
>>     "sdkInfo": {
>>         "language": "JAVA"
>>     }
>> }
>> 
>> Thanks for having a look at it!
>> 
>> cheers
>> 
>> Fabian
>> 
>>> On 10.08.2022 at 16:03, Hans Van Akelyen <[email protected]> wrote:
>>> 
>>> Hi Fabian,
>>> 
>>> You have indeed found something we have not yet documented, mainly because 
>>> we have not yet tried it out ourselves.
>>> The main class that gets called when running Beam pipelines is 
>>> "org.apache.hop.beam.run.MainBeam".
>>> 
>>> I was hoping the "Import as pipeline" button on a job would give you 
>>> everything you need to execute this, but it does not.
>>> I'll take a closer look in the coming days to see what is needed to use 
>>> this functionality; it could be that we need to export the template based 
>>> on a pipeline.
>>> 
>>> Kr,
>>> Hans
>>> 
>>> On Wed, 10 Aug 2022 at 15:46, Fabian Peters <[email protected]> wrote:
>>> Hi all!
>>> 
>>> Thanks to Hans' work on the REST transform, I can now deploy my jobs to 
>>> Dataflow.
>>> 
>>> Next, I'd like to schedule a batch job 
>>> (https://cloud.google.com/community/tutorials/schedule-dataflow-jobs-with-cloud-scheduler), 
>>> but for this I need to create a template 
>>> (https://cloud.google.com/dataflow/docs/concepts/dataflow-templates). I've 
>>> searched the Hop documentation but haven't found anything on this. I'm 
>>> guessing that flex templates 
>>> (https://cloud.google.com/dataflow/docs/guides/templates/using-flex-templates#create_a_flex_template) 
>>> are the way to go, due to the fat jar, but I'm wondering what to pass as 
>>> the FLEX_TEMPLATE_JAVA_MAIN_CLASS.
>>> 
>>> cheers
>>> 
>>> Fabian
>> 
>> 
>> 
>> -- 
>> Neo4j Chief Solutions Architect
>> ✉   [email protected]
>> 
>> 
>> 
> 
