Hello Hans,

Just catching up on the day's mails. I'm really grateful to you for looking into this in depth and even coming up with a working setup! I've been swamped with other work and would never have gotten this far on my own anyway. Thanks again! ;)
I'll try to do a test run tomorrow.

cheers

Fabian

> On 18.08.2022, at 15:53, Hans Van Akelyen <[email protected]> wrote:
>
> Hi Fabian,
>
> So I played around a bit more with the pipelines. I was able to launch Dataflow jobs, but it's not completely working as expected.
> The documentation around this is also rather scattered, so I'm not sure I'll be able to figure out the final solution in a short period of time.
>
> Steps taken to get this working:
> - Modified the code a bit; these changes will be merged soon [1]
> - Generate a hop-fatjar.jar
> - Upload a pipeline and the Hop metadata to Google Storage
> - Modify the run configuration to take the fat jar from the following location: /dataflow/template/hop-fatjar.jar (the location inside the Docker image)
> - Modified the default Dockerfile to include the fat jar:
>
> FROM gcr.io/dataflow-templates-base/java11-template-launcher-base
>
> ARG WORKDIR=/dataflow/template
> RUN mkdir -p ${WORKDIR}
> WORKDIR ${WORKDIR}
>
> COPY hop-fatjar.jar .
>
> ENV FLEX_TEMPLATE_JAVA_MAIN_CLASS="org.apache.hop.beam.run.MainBeam"
> ENV FLEX_TEMPLATE_JAVA_CLASSPATH="${WORKDIR}/*"
>
> ENTRYPOINT ["/opt/google/dataflow/java_template_launcher"]
>
> - Save the image in the container registry (gcloud builds submit --tag <image_location>:latest .)
> - Create a new pipeline using the following template:
>
> {
>   "defaultEnvironment": {},
>   "image": "<your image location>:latest",
>   "metadata": {
>     "description": "This template allows you to start Hop pipelines on Dataflow",
>     "name": "Template to start a Hop pipeline",
>     "parameters": [
>       {
>         "helpText": "Google Storage location pointing to the pipeline you wish to start",
>         "label": "Google Storage location pointing to the pipeline you wish to start",
>         "name": "HopPipelinePath",
>         "regexes": [
>           ".*"
>         ]
>       },
>       {
>         "helpText": "Google Storage location pointing to the Hop metadata you wish to use",
>         "label": "Google Storage location pointing to the Hop metadata you wish to use",
>         "name": "HopMetadataPath",
>         "regexes": [
>           ".*"
>         ]
>       },
>       {
>         "helpText": "Run configuration used to launch the pipeline",
>         "label": "Run configuration used to launch the pipeline",
>         "name": "HopRunConfigurationName",
>         "regexes": [
>           ".*"
>         ]
>       }
>     ]
>   },
>   "sdkInfo": {
>     "language": "JAVA"
>   }
> }
>
> - Fill in the parameters with the Google Storage locations and the run configuration name
> - Run the pipeline
>
> Now we get to the point where things become a bit strange: when you follow all these steps, you will notice that a Dataflow job gets started.
> This Dataflow job then spawns another Dataflow job that contains the actual pipeline. The original job started via the pipeline will fail, but the other job will run fine.
> <image.png>
> The pipeline job expects a job file to be generated in a specific location; it then picks up this file to execute the actual job.
> This is the part where we would probably have to change our code a bit, so that it saves the job specification to that location instead of starting another job via the Beam API.
>
> Until we get that sorted out you will have two jobs, one of which will fail on every run. I hope this is acceptable for now.
>
> Cheers,
> Hans
>
> [1] https://github.com/apache/hop/pull/1644
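> A rough command-line sketch of the upload and launch steps described above, assuming the template JSON has been saved to Google Storage; the bucket, region and file names are placeholders:
>
> # Upload the pipeline and the exported Hop metadata (placeholder bucket and paths)
> gsutil cp 0004-rest-client-get.hpl gs://my-bucket/hop/
> gsutil cp hop-metadata.json gs://my-bucket/hop/
>
> # Launch the flex template, passing the three parameters declared in the template spec
> gcloud dataflow flex-template run "hop-pipeline-`date +%Y%m%d-%H%M%S`" \
>   --template-file-gcs-location "gs://my-bucket/templates/hop-template.json" \
>   --region "us-central1" \
>   --parameters HopPipelinePath="gs://my-bucket/hop/0004-rest-client-get.hpl" \
>   --parameters HopMetadataPath="gs://my-bucket/hop/hop-metadata.json" \
>   --parameters HopRunConfigurationName="Dataflow"
>
> The same launch can of course be triggered from the Data Pipelines UI by filling in the parameters there, as in the steps above.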
>
> On Thu, 18 Aug 2022 at 13:00, Hans Van Akelyen <[email protected]> wrote:
> Hi Fabian,
>
> I've been digging into this a bit, and it seems we will need some code changes to make this work.
> As far as I can tell, you have to use one of the Docker templates Google provides to start a pipeline from a template.
> The issue we have is that our MainBeam class requires 3 arguments to work (filename / metadata / run configuration name).
> These 3 arguments need to be the first 3 arguments passed to the class; we have no named parameters implemented.
>
> When the template launches, it calls java in the following way:
>
> Executing: java -cp /template/* org.apache.hop.beam.run.MainBeam --pipelineLocation=test --runner=DataflowRunner --project=xxx --templateLocation=gs://dataflow-staging-us-central1-xxxx/staging/template_launches/2022-08-18_02_34_17-10288166777030254520/job_object --stagingLocation=gs://dataflow-staging-us-central1-xxxx/staging --labels={ "goog-data-pipelines" : "test" } --jobName=test-mp--1660815257 --region=us-central1 [email protected] --tempLocation=gs://dataflow-staging-us-central1-xxxx/tmp
>
> In this case it will see the first 3 arguments and select them.
> <image.png>
>
> As I cannot find a way to force those 3 arguments in there, we will need to implement named parameters in that class. I tried a bit of a hack, but it did not work: I changed the Docker template to the following, but the Google script then throws an error:
>
> ENV FLEX_TEMPLATE_JAVA_MAIN_CLASS="org.apache.hop.beam.run.MainBeam gs://xxx/0004-rest-client-get.hpl gs://xxx/hop-metadata.json Dataflow"
>
> As I think this will have great added value, I will work on this ASAP. Once the work has been done we can even supply the required image from our Docker Hub account, and you should be able to run Hop pipelines on Dataflow by using a simple template.
>
> My idea is to add support for the following 3 named parameters:
> - HopPipelinePath -> location of the pipeline (can be Google Storage)
> - HopMetadataPath -> location of the metadata file (can be Google Storage)
> - HopRunConfigurationName
>
> I'll post updates here on the progress.
>
> Cheers,
> Hans
>
> On Tue, 16 Aug 2022 at 11:36, Fabian Peters <[email protected]> wrote:
> Hi Hans,
>
> No, I didn't yet have another go. The hints from Matt (didn't see that mail on the list?) do look quite useful in the context of Dataflow templates. I'll try to see whether I can get a bit further, but if you have time to have a look at it, I'd much appreciate it!
>
> cheers
>
> Fabian
>
>> On 16.08.2022, at 11:09, Hans Van Akelyen <[email protected]> wrote:
>>
>> Hi Fabian,
>>
>> Did you get this working, and are you willing to share the final results?
>> If not, I will see what I can do, and we can add it to our documentation.
>>
>> Cheers,
>> Hans
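>> To make the named-parameter proposal above concrete: once MainBeam understands them, the launcher's java invocation could simply carry the three values alongside Dataflow's own flags. This is purely hypothetical until that change lands; the project, region and gs:// paths are placeholders:
>>
>> # Hypothetical launcher invocation after named-parameter support is added to MainBeam
>> java -cp /template/* org.apache.hop.beam.run.MainBeam \
>>   --HopPipelinePath=gs://my-bucket/hop/0004-rest-client-get.hpl \
>>   --HopMetadataPath=gs://my-bucket/hop/hop-metadata.json \
>>   --HopRunConfigurationName=Dataflow \
>>   --runner=DataflowRunner --project=my-project --region=us-central1
>> # ...plus the staging/temp locations and other flags the launcher adds automatically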
>>
>> On Thu, 11 Aug 2022 at 13:14, Matt Casters <[email protected]> wrote:
>> When you run class org.apache.hop.beam.run.MainBeam you need to provide 3 arguments:
>>
>> 1. The filename of the pipeline to run
>> 2. The filename which contains the Hop metadata
>> 3. The name of the pipeline run configuration to use
>>
>> See also, for example:
>> https://hop.apache.org/manual/latest/pipeline/pipeline-run-configurations/beam-flink-pipeline-engine.html#_running_with_flink_run
>>
>> Good luck,
>> Matt
>>
>> On Thu, Aug 11, 2022 at 11:08 AM Fabian Peters <[email protected]> wrote:
>> Hello Hans,
>>
>> I went through the flex-template process yesterday, but the generated template does not work. The main piece that's missing for me is how to pass the actual pipeline that should be run. My test boiled down to:
>>
>> gcloud dataflow flex-template build gs://foo_ag_dataflow/tmp/todays-directories.json \
>>   --image-gcr-path "europe-west1-docker.pkg.dev/dashboard-foo/dataflow/hop:latest" \
>>   --sdk-language "JAVA" \
>>   --flex-template-base-image JAVA11 \
>>   --metadata-file "/Users/fabian/Documents/src/foo/fooDataEngineering/hop/dataflow/todays-directories.json" \
>>   --jar "/Users/fabian/tmp/fat-hop.jar" \
>>   --env FLEX_TEMPLATE_JAVA_MAIN_CLASS="org.apache.hop.beam.run.MainBeam"
>>
>> gcloud dataflow flex-template run "todays-directories-`date +%Y%m%d-%H%M%S`" \
>>   --template-file-gcs-location "gs://foo_ag_dataflow/tmp/todays-directories.json" \
>>   --region "europe-west1"
>>
>> With Dockerfile:
>>
>> FROM gcr.io/dataflow-templates-base/java11-template-launcher-base
>>
>> ARG WORKDIR=/dataflow/template
>> RUN mkdir -p ${WORKDIR}
>> WORKDIR ${WORKDIR}
>>
>> ENV FLEX_TEMPLATE_JAVA_MAIN_CLASS="org.apache.hop.beam.run.MainBeam"
>> ENV FLEX_TEMPLATE_JAVA_CLASSPATH="/dataflow/template/*"
>>
>> ENTRYPOINT ["/opt/google/dataflow/java_template_launcher"]
>>
>> And "todays-directories.json":
>>
>> {
>>   "defaultEnvironment": {},
>>   "image": "europe-west1-docker.pkg.dev/dashboard-foo/dataflow/hop:latest",
>>   "metadata": {
>>     "description": "Test templates creation with Apache Hop",
>>     "name": "Todays directories"
>>   },
>>   "sdkInfo": {
>>     "language": "JAVA"
>>   }
>> }
>>
>> Thanks for having a look at it!
>>
>> cheers
>>
>> Fabian
>>
>>> On 10.08.2022, at 16:03, Hans Van Akelyen <[email protected]> wrote:
>>>
>>> Hi Fabian,
>>>
>>> You have indeed found something we have not yet documented, mainly because we have not yet tried it out ourselves.
>>> The main class that gets called when running Beam pipelines is "org.apache.hop.beam.run.MainBeam".
>>>
>>> I was hoping the "Import as pipeline" button on a job would give you everything you need to execute this, but it does not.
>>> I'll take a closer look over the following days to see what is needed to use this functionality; it could be that we need to export the template based on a pipeline.
>>>
>>> Kr,
>>> Hans
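>>> For reference, a plain (non-template) invocation of MainBeam with the fat jar and the three positional arguments Matt describes might look roughly like this; the gs:// paths and the run configuration name are placeholders, and local paths should work as well:
>>>
>>> # The three arguments are positional and must come in exactly this order:
>>> # 1) pipeline file, 2) Hop metadata JSON, 3) pipeline run configuration name
>>> java -cp hop-fatjar.jar org.apache.hop.beam.run.MainBeam \
>>>   gs://my-bucket/hop/0004-rest-client-get.hpl \
>>>   gs://my-bucket/hop/hop-metadata.json \
>>>   Dataflow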
>>>
>>> On Wed, 10 Aug 2022 at 15:46, Fabian Peters <[email protected]> wrote:
>>> Hi all!
>>>
>>> Thanks to Hans' work on the REST transform, I can now deploy my jobs to Dataflow.
>>>
>>> Next, I'd like to schedule a batch job (https://cloud.google.com/community/tutorials/schedule-dataflow-jobs-with-cloud-scheduler), but for this I need to create a template (https://cloud.google.com/dataflow/docs/concepts/dataflow-templates). I've searched the Hop documentation but haven't found anything on this. I'm guessing that flex templates (https://cloud.google.com/dataflow/docs/guides/templates/using-flex-templates#create_a_flex_template) are the way to go, due to the fat jar, but I'm wondering what to pass as the FLEX_TEMPLATE_JAVA_MAIN_CLASS.
>>>
>>> cheers
>>>
>>> Fabian
>>
>> --
>> Neo4j Chief Solutions Architect
>> ✉ [email protected]
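>>> Coming back to the scheduling goal raised in Fabian's first mail above: once a working flex template spec exists, a Cloud Scheduler job can POST to the Dataflow flexTemplates:launch endpoint, along the lines of the linked tutorial. A rough sketch only; the project, region, schedule, service account and bucket paths are placeholders, and the parameter names follow Hans's proposal:
>>>
>>> # Create a Cloud Scheduler job that launches the flex template every morning at 06:00
>>> gcloud scheduler jobs create http hop-todays-directories \
>>>   --schedule="0 6 * * *" \
>>>   --http-method=POST \
>>>   --uri="https://dataflow.googleapis.com/v1b3/projects/my-project/locations/europe-west1/flexTemplates:launch" \
>>>   --oauth-service-account-email="[email protected]" \
>>>   --message-body='{
>>>     "launchParameter": {
>>>       "jobName": "todays-directories",
>>>       "containerSpecGcsPath": "gs://my-bucket/templates/hop-template.json",
>>>       "parameters": {
>>>         "HopPipelinePath": "gs://my-bucket/hop/todays-directories.hpl",
>>>         "HopMetadataPath": "gs://my-bucket/hop/hop-metadata.json",
>>>         "HopRunConfigurationName": "Dataflow"
>>>       }
>>>     }
>>>   }'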
