Great, thanks! That was a huge improvement.
On Thu, Jan 5, 2023 at 12:52 PM Luke Cwik <lc...@google.com> wrote:

> By default Beam Java only uploads artifacts that have changed but it looks
> like this is not the case for Beam Python and you need to explicitly opt in
> with the --enable_artifact_caching flag[1].
>
> It looks like this feature was added 1 year ago[2], should we make this on
> by default?
>
> 1: https://github.com/apache/beam/blob/3070160203c6734da0eb04b440e08b43f9fd33f3/sdks/python/apache_beam/options/pipeline_options.py#L794
> 2: https://github.com/apache/beam/pull/16229
>
> On Thu, Jan 5, 2023 at 11:43 AM Lina Mårtensson <lina@camus.energy> wrote:
>
>> Thanks! I have now successfully written a beautiful string of protobuf
>> bytes into a file via Python. 🎉
>>
>> Two issues though:
>>
>> 1. Robert said the Python direct runner would just work with this - but
>> it's not working. After about half an hour of these messages repeated over
>> and over again I interrupted the job:
>>
>> E0105 07:25:48.170601677 58210 fork_posix.cc:76] Other threads are
>> currently calling into gRPC, skipping fork() handlers
>>
>> INFO:apache_beam.runners.portability.fn_api_runner.worker_handlers:b'2023/01/05
>> 06:57:10 Failed to obtain provisioning information: failed to dial server
>> at localhost:41087\n\tcaused by:\ncontext deadline exceeded\n'
>>
>> 2. I (unsurprisingly) get back to the issue I had when I tested out the
>> Spanner x-lang transform on Dataflow - the overhead for starting a job is
>> unbearably slow, the time mainly spent in transferring the expansion
>> service jar (115 MB) + my jar (105 MB) with my new code and its
>> dependencies:
>>
>> INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to
>> gs://hce-mimo-inbox/beam_temp/beamapp-builder-0105191153-992959-3fhktuyb.1672945913.993243/beam-sdks-java-io-google-cloud-platform-expansion-service-2.39.0-uBMB6BRMpxmYFg1PPu1yUxeoyeyX_lYX1NX0LVL7ZcM.jar...
>> INFO:apache_beam.runners.dataflow.internal.apiclient:Completed GCS upload to
>> gs://hce-mimo-inbox/beam_temp/beamapp-builder-0105191153-992959-3fhktuyb.1672945913.993243/beam-sdks-java-io-google-cloud-platform-expansion-service-2.39.0-uBMB6BRMpxmYFg1PPu1yUxeoyeyX_lYX1NX0LVL7ZcM.jar
>> in 321 seconds.
>>
>> INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to
>> gs://hce-mimo-inbox/beam_temp/beamapp-builder-0105191153-992959-3fhktuyb.1672945913.993243/java_bigtable_deploy-Ed1r7YOeLKLTmg2RGNktkym9sVYciCiielpk61r6CJ4.jar...
>>
>> INFO:apache_beam.runners.dataflow.internal.apiclient:Completed GCS upload to
>> gs://hce-mimo-inbox/beam_temp/beamapp-builder-0105191153-992959-3fhktuyb.1672945913.993243/java_bigtable_deploy-Ed1r7YOeLKLTmg2RGNktkym9sVYciCiielpk61r6CJ4.jar
>> in 295 seconds.
>>
>> I have a total of 13 minutes until any workers have started on Dataflow,
>> then another 4.5 minutes before the job actually does anything (which
>> eventually is to read a whopping 3 cells from Bigtable ;).
>>
>> How could this be improved?
>> For one, it seems to me like the upload of
>> sdks:java:io:google-cloud-platform:expansion-service:shadowJar from my
>> computer shouldn't be necessary - shouldn't Dataflow have that already,
>> or could it be fetched by Dataflow rather than having to upload it over
>> slow internet?
>> And what about my own jar - it's not bound to change very often, so would
>> it be possible to upload it somewhere and then fetch it from there?
>>
>> Thanks!
>> -Lina
>>
>> On Tue, Jan 3, 2023 at 1:23 PM Luke Cwik <lc...@google.com> wrote:
>>
>>> I would suggest using BigtableIO, which also returns a protobuf
>>> com.google.bigtable.v2.Row. This should allow you to replicate what
>>> SpannerIO is doing.
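An aside on the upload logs quoted above: the staged jar names end in what looks like a URL-safe base64 digest of the file contents. Content-addressed names like that are what make client-side artifact caching possible, since identical bytes always produce an identical name and the upload can be skipped. The sketch below illustrates the idea only; it is not Beam's actual staging code, and the function name is made up.

```python
import base64
import hashlib

def staged_name(contents: bytes, basename: str) -> str:
    """Content-addressed artifact name: the same bytes always map to the
    same name, so an uploader can skip files whose name already exists
    at the destination."""
    digest = hashlib.sha256(contents).digest()
    suffix = base64.urlsafe_b64encode(digest).decode().rstrip("=")
    return f"{basename}-{suffix}.jar"
```

A flag like --enable_artifact_caching can then, in principle, check for an object with the derived name before re-uploading a large unchanged jar.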
>>> Alternatively, you could provide a way to convert the HBase Result into a
>>> Beam Row by specifying a converter and a schema for it, and then you could
>>> use the already well-known Beam Schema type:
>>> https://github.com/apache/beam/blob/0b8f0b4db7a0de4977e30bcfeb50b5c14c7c1572/model/pipeline/src/main/proto/org/apache/beam/model/pipeline/v1/beam_runner_api.proto#L1068
>>>
>>> Otherwise you'll have to register the HBase Result coder with a well-known
>>> name so that the runner API coder URN is something that you know, and then
>>> on the Python side you would need a coder for that URN as well to allow
>>> you to understand the bytes being sent across from the Java portion of the
>>> pipeline.
>>>
>>> On Fri, Dec 30, 2022 at 12:59 AM Lina Mårtensson <lina@camus.energy> wrote:
>>>
>>>> And next issue... I'm getting KeyError: 'beam:coders:javasdk:0.1', which
>>>> I learned
>>>> <https://cwiki.apache.org/confluence/display/BEAM/Multi-language+Pipelines+Tips>
>>>> is because the transform is trying to return something that there isn't a
>>>> standard Beam coder for
>>>> <https://github.com/apache/beam/blob/05428866cdbf1ea8e4c1789dd40327673fd39451/model/pipeline/src/main/proto/beam_runner_api.proto#L784>.
>>>> Makes sense, but... how do I fix this? The documentation talks about how
>>>> to do this for the input, but not for the output.
>>>>
>>>> Comparing to Spanner, it looks like Spanner returns a protobuf, which
>>>> I'm guessing somehow gets converted to bytes... But CloudBigtableIO
>>>> <https://github.com/googleapis/java-bigtable-hbase/blob/main/bigtable-dataflow-parent/bigtable-hbase-beam/src/main/java/com/google/cloud/bigtable/beam/CloudBigtableIO.java>
>>>> returns org.apache.hadoop.hbase.client.Result.
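To make the "well-known coder URN" suggestion above concrete: the Java and Python sides only need to agree on a URN string and on the byte layout that coder produces. The fragment below is a hypothetical illustration (the URN, the length-prefixed layout, and all names are made up, and a real Python coder would be registered with the SDK rather than written as free functions); it shows just the contract that whatever the Java coder writes, the Python side must parse back.

```python
import struct

# Hypothetical URN, registered under the same name by the Java coder.
HBASE_RESULT_URN = "camus:coder:hbase_result:v1"

def encode(payload: bytes) -> bytes:
    # What the (hypothetical) Java coder writes: a big-endian 4-byte
    # length prefix followed by the raw serialized payload.
    return struct.pack(">I", len(payload)) + payload

def decode(data: bytes) -> bytes:
    # The Python side parses exactly the layout the Java coder wrote.
    (length,) = struct.unpack(">I", data[:4])
    return data[4:4 + length]
```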
>>>> My buildExternal method looks as follows:
>>>>
>>>> @Override
>>>> public PTransform<PBegin, PCollection<Result>> buildExternal(
>>>>     BigtableReadBuilder.Configuration configuration) {
>>>>   return Read.from(CloudBigtableIO.read(
>>>>       new CloudBigtableScanConfiguration.Builder()
>>>>           .withProjectId(configuration.projectId)
>>>>           .withInstanceId(configuration.instanceId)
>>>>           .withTableId(configuration.tableId)
>>>>           .build()));
>>>> }
>>>>
>>>> I also got a warning, which I *believe* is unrelated (but also an issue):
>>>>
>>>> INFO:apache_beam.utils.subprocess_server:b"WARNING: Configuration class
>>>> 'energy.camus.beam.BigtableRegistrar$BigtableReadBuilder$Configuration' has
>>>> no schema registered. Attempting to construct with setter approach."
>>>>
>>>> INFO:apache_beam.utils.subprocess_server:b'Dec 30, 2022 7:46:14 AM
>>>> org.apache.beam.sdk.expansion.service.ExpansionService$ExternalTransformRegistrarLoader
>>>> payloadToConfig'
>>>>
>>>> What is this schema and what should it look like?
>>>>
>>>> Thanks!
>>>> -Lina
>>>>
>>>> On Fri, Dec 30, 2022 at 12:28 AM Lina Mårtensson <lina@camus.energy> wrote:
>>>>
>>>>> Thanks! This was really helpful. It took a while to figure out the
>>>>> details - a section in the docs on what's required of these jars for
>>>>> non-Java users would be a great addition.
>>>>>
>>>>> But once I did, the Bazel config was actually quite straightforward
>>>>> and makes sense.
>>>>> I pasted the first section from here
>>>>> <https://github.com/bazelbuild/rules_jvm_external/blob/master/README.md#usage>
>>>>> into my WORKSPACE file and changed the artifacts to the ones I needed.
>>>>> (How to find the right ones remains confusing.)
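On the "no schema registered. Attempting to construct with setter approach" warning quoted above: the expansion service falls back to matching payload field names to setters on the configuration class. A rough Python reconstruction of what that fallback amounts to (names are illustrative; the real logic lives in Java inside the ExpansionService):

```python
def construct_with_setters(cls, payload: dict):
    """Fallback construction: for each payload field "foo", call the
    configuration object's setFoo(value) setter."""
    obj = cls()
    for field, value in payload.items():
        setter_name = "set" + field[0].upper() + field[1:]
        getattr(obj, setter_name)(value)
    return obj

class Configuration:
    # Stand-in for a transform configuration class exposing setters.
    def __init__(self):
        self.project_id = None

    def setProjectId(self, value):
        self.project_id = value
```

So the warning is benign as long as every configuration field has a matching setter; registering a schema is the alternative that avoids the reflective fallback.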
>>>>> After that I updated my BUILD rules, and Bazel had easy and
>>>>> straightforward configs for it; all I needed was this:
>>>>>
>>>>> # From
>>>>> # https://github.com/google/bazel-common/blob/master/third_party/java/auto/BUILD.
>>>>> # The auto service is what registers our Registrar class, and it needs
>>>>> # to be a plugin, which makes it run at compile time.
>>>>> java_plugin(
>>>>>     name = "auto_service_processor",
>>>>>     processor_class = "com.google.auto.service.processor.AutoServiceProcessor",
>>>>>     deps = [
>>>>>         "@maven//:com_google_auto_service_auto_service",
>>>>>         "@maven//:com_google_auto_service_auto_service_annotations",
>>>>>         "@maven//:org_apache_beam_beam_vendor_guava_26_0_jre",
>>>>>     ],
>>>>> )
>>>>>
>>>>> java_binary(
>>>>>     name = "java_hbase",
>>>>>     main_class = "energy.camus.beam.BigtableRegistrar",
>>>>>     plugins = [":auto_service_processor"],
>>>>>     srcs = ["src/main/java/energy/camus/beam/BigtableRegistrar.java"],
>>>>>     deps = [
>>>>>         "@maven//:com_google_auto_service_auto_service",
>>>>>         "@maven//:com_google_auto_service_auto_service_annotations",
>>>>>         "@maven//:com_google_cloud_bigtable_bigtable_hbase_beam",
>>>>>         "@maven//:org_apache_beam_beam_sdks_java_core",
>>>>>         "@maven//:org_apache_beam_beam_vendor_guava_26_0_jre",
>>>>>         "@maven//:org_apache_hbase_hbase_shaded_client",
>>>>>     ],
>>>>> )
>>>>>
>>>>> On Thu, Dec 29, 2022 at 2:43 PM Luke Cwik <lc...@google.com> wrote:
>>>>>
>>>>>> AutoService relies on Java's compiler annotation processor.
>>>>>> https://github.com/google/auto/tree/main/service#getting-started
>>>>>> shows that you need to configure Java's compiler to use the annotation
>>>>>> processors within AutoService.
>>>>>> I saw this public gist that seemed to enable using the AutoService
>>>>>> annotation processor with Bazel:
>>>>>> https://gist.github.com/jart/5333824b94cd706499a7bfa1e086ee00
>>>>>>
>>>>>> On Thu, Dec 29, 2022 at 2:27 PM Lina Mårtensson via dev
>>>>>> <dev@beam.apache.org> wrote:
>>>>>>
>>>>>>> That's good news about the direct runner, thanks!
>>>>>>>
>>>>>>> On Thu, Dec 29, 2022 at 2:02 PM Robert Bradshaw <rober...@google.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> On Thu, Jul 28, 2022 at 5:37 PM Chamikara Jayalath via dev
>>>>>>>> <dev@beam.apache.org> wrote:
>>>>>>>> >
>>>>>>>> > On Thu, Jul 28, 2022 at 4:51 PM Lina Mårtensson <lina@camus.energy>
>>>>>>>> > wrote:
>>>>>>>> >>
>>>>>>>> >> Thanks for the detailed answers!
>>>>>>>> >>
>>>>>>>> >> I totally get the points about development & maintenance cost, and,
>>>>>>>> >> from a user perspective, about getting the performance right.
>>>>>>>> >>
>>>>>>>> >> I decided to try out the Spanner connector to get a sense of how well
>>>>>>>> >> the x-language approach works in our world, since that's an existing
>>>>>>>> >> x-language connector.
>>>>>>>> >> Overall, it works and with minimal intervention as you say - it is
>>>>>>>> >> very slow, though.
>>>>>>>> >> I'm a little confused about "portable runners" - if I understand this
>>>>>>>> >> correctly, this means we couldn't run with the DirectRunner anymore if
>>>>>>>> >> using an x-language connector? (At least it didn't work when I tried
>>>>>>>> >> it.)
>>>>>>>> >
>>>>>>>> > You'll have to use the portable DirectRunner -
>>>>>>>> > https://github.com/apache/beam/tree/master/sdks/python/apache_beam/runners/portability
>>>>>>>> >
>>>>>>>> > The job service for this can be started using the following command:
>>>>>>>> > python apache_beam/runners/portability/local_job_service_main.py -p <port>
>>>>>>>>
>>>>>>>> Note that the Python direct runner is already a portable runner, so
>>>>>>>> you shouldn't have to do anything special (like start up a separate
>>>>>>>> job service and pass extra options) to run locally. Just use the
>>>>>>>> cross-language transforms as you would any normal Python transform.
>>>>>>>>
>>>>>>>> The goal is to make this as smooth and transparent as possible; please
>>>>>>>> keep coming back to us if you find rough edges.