I would suggest using BigtableIO which also returns a protobuf com.google.bigtable.v2.Row. This should allow you to replicate what SpannerIO is doing.
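For context on why the element type matters: the KeyError quoted below appears when the Python side is handed a coder URN it has no decoder for. Here is a toy model of that lookup in plain Python. It is not Beam's actual implementation: DECODERS, decode, register and try_decode are hypothetical names, the payload handling ignores Beam's real wire format, and example:coder:hbase_result:v1 is a made-up URN used only for illustration.

```python
# Toy model only: NOT Beam's implementation. It mimics a URN -> decoder lookup
# to show why an unknown coder URN surfaces as a KeyError in Python.

# Decoders for two standard coder URNs (wire format simplified: no length prefixes).
DECODERS = {
    "beam:coder:bytes:v1": lambda payload: payload,
    "beam:coder:string_utf8:v1": lambda payload: payload.decode("utf-8"),
}


def decode(urn, payload):
    """Decode payload with the decoder registered for urn (KeyError if none)."""
    return DECODERS[urn](payload)


def register(urn, decoder):
    """Register a decoder under a name both sides of the pipeline agree on."""
    DECODERS[urn] = decoder


def try_decode(urn, payload):
    """Like decode(), but report a missing URN instead of raising."""
    try:
        return decode(urn, payload)
    except KeyError as err:
        return f"KeyError: {err}"


if __name__ == "__main__":
    # A standard coder is understood by every SDK.
    print(try_decode("beam:coder:string_utf8:v1", b"row-1"))
    # A Java-only coder has no Python counterpart in the table.
    print(try_decode("beam:coders:javasdk:0.1", b"\x00\x01"))
    # Made-up URN standing in for a custom, explicitly registered coder.
    register("example:coder:hbase_result:v1", lambda payload: payload.hex())
    print(try_decode("example:coder:hbase_result:v1", b"\x00\x01"))
```

In this simplified picture, returning a type that already has a standard coder (bytes, or a Beam Row with a schema) avoids the failed lookup without any custom registration.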
Alternatively, you could provide a way to convert the HBase Result into a Beam Row by specifying a converter and a schema for it. You could then use the already well-known Beam Schema type:
https://github.com/apache/beam/blob/0b8f0b4db7a0de4977e30bcfeb50b5c14c7c1572/model/pipeline/src/main/proto/org/apache/beam/model/pipeline/v1/beam_runner_api.proto#L1068

Otherwise, you'll have to register the HBase Result coder under a well-known name, so that the runner API coder URN is one you know. On the Python side you would then need a coder for that URN as well, so that Python can understand the bytes sent across from the Java portion of the pipeline.

On Fri, Dec 30, 2022 at 12:59 AM Lina Mårtensson <lina@camus.energy> wrote:

> And next issue... I'm getting KeyError: 'beam:coders:javasdk:0.1', which I
> learned
> <https://cwiki.apache.org/confluence/display/BEAM/Multi-language+Pipelines+Tips>
> is because the transform is trying to return something that there isn't a
> standard Beam coder for
> <https://github.com/apache/beam/blob/05428866cdbf1ea8e4c1789dd40327673fd39451/model/pipeline/src/main/proto/beam_runner_api.proto#L784>.
> Makes sense, but... how do I fix this? The documentation talks about how
> to do this for the input, but not for the output.
>
> Comparing to Spanner, it looks like Spanner returns a protobuf, which I'm
> guessing somehow gets converted to bytes... But CloudBigtableIO
> <https://github.com/googleapis/java-bigtable-hbase/blob/main/bigtable-dataflow-parent/bigtable-hbase-beam/src/main/java/com/google/cloud/bigtable/beam/CloudBigtableIO.java>
> returns org.apache.hadoop.hbase.client.Result.
>
> My buildExternal method looks as follows:
>
>   @Override
>   public PTransform<PBegin, PCollection<Result>> buildExternal(
>       BigtableReadBuilder.Configuration configuration) {
>     return Read.from(CloudBigtableIO.read(
>         new CloudBigtableScanConfiguration.Builder()
>             .withProjectId(configuration.projectId)
>             .withInstanceId(configuration.instanceId)
>             .withTableId(configuration.tableId)
>             .build()
>     ));
>   }
>
> I also got a warning, which I *believe* is unrelated (but also an issue):
>
> INFO:apache_beam.utils.subprocess_server:b"WARNING: Configuration class
> 'energy.camus.beam.BigtableRegistrar$BigtableReadBuilder$Configuration' has
> no schema registered. Attempting to construct with setter approach."
> INFO:apache_beam.utils.subprocess_server:b'Dec 30, 2022 7:46:14 AM
> org.apache.beam.sdk.expansion.service.ExpansionService$ExternalTransformRegistrarLoader
> payloadToConfig'
>
> What is this schema and what should it look like?
>
> Thanks!
> -Lina
>
> On Fri, Dec 30, 2022 at 12:28 AM Lina Mårtensson <lina@camus.energy>
> wrote:
>
>> Thanks! This was really helpful. It took a while to figure out the
>> details - a section in the docs on what's required of these jars for
>> non-Java users would be a great addition.
>>
>> But once I did, the Bazel config was actually quite straightforward and
>> makes sense.
>> I pasted the first section from here
>> <https://github.com/bazelbuild/rules_jvm_external/blob/master/README.md#usage>
>> into my WORKSPACE file and changed the artifacts to the ones I needed.
>> (How to find the right ones remains confusing.)
>>
>> After that I updated my BUILD rules and Blaze had easy and
>> straightforward configs for it, all I needed was this:
>>
>> # From https://github.com/google/bazel-common/blob/master/third_party/java/auto/BUILD.
>> # The auto service is what registers our Registrar class, and it needs to
>> # be a plugin, which makes it run at compile-time.
>> java_plugin(
>>     name = "auto_service_processor",
>>     processor_class = "com.google.auto.service.processor.AutoServiceProcessor",
>>     deps = [
>>         "@maven//:com_google_auto_service_auto_service",
>>         "@maven//:com_google_auto_service_auto_service_annotations",
>>         "@maven//:org_apache_beam_beam_vendor_guava_26_0_jre",
>>     ],
>> )
>>
>> java_binary(
>>     name = "java_hbase",
>>     main_class = "energy.camus.beam.BigtableRegistrar",
>>     plugins = [":auto_service_processor"],
>>     srcs = ["src/main/java/energy/camus/beam/BigtableRegistrar.java"],
>>     deps = [
>>         "@maven//:com_google_auto_service_auto_service",
>>         "@maven//:com_google_auto_service_auto_service_annotations",
>>         "@maven//:com_google_cloud_bigtable_bigtable_hbase_beam",
>>         "@maven//:org_apache_beam_beam_sdks_java_core",
>>         "@maven//:org_apache_beam_beam_vendor_guava_26_0_jre",
>>         "@maven//:org_apache_hbase_hbase_shaded_client",
>>     ],
>> )
>>
>> On Thu, Dec 29, 2022 at 2:43 PM Luke Cwik <lc...@google.com> wrote:
>>
>>> AutoService relies on Java's compiler annotation processor.
>>> https://github.com/google/auto/tree/main/service#getting-started shows
>>> that you need to configure Java's compiler to use the annotation processors
>>> within AutoService.
>>>
>>> I saw this public gist that seemed to enable using the AutoService
>>> annotation processor with Bazel
>>> https://gist.github.com/jart/5333824b94cd706499a7bfa1e086ee00
>>>
>>> On Thu, Dec 29, 2022 at 2:27 PM Lina Mårtensson via dev <
>>> dev@beam.apache.org> wrote:
>>>
>>>> That's good news about the direct runner, thanks!
>>>>
>>>> On Thu, Dec 29, 2022 at 2:02 PM Robert Bradshaw <rober...@google.com>
>>>> wrote:
>>>>
>>>>> On Thu, Jul 28, 2022 at 5:37 PM Chamikara Jayalath via dev
>>>>> <dev@beam.apache.org> wrote:
>>>>> >
>>>>> > On Thu, Jul 28, 2022 at 4:51 PM Lina Mårtensson <lina@camus.energy>
>>>>> wrote:
>>>>> >>
>>>>> >> Thanks for the detailed answers!
>>>>> >>
>>>>> >> I totally get the points about development & maintenance cost, and,
>>>>> >> from a user perspective, about getting the performance right.
>>>>> >>
>>>>> >> I decided to try out the Spanner connector to get a sense of how well
>>>>> >> the x-language approach works in our world, since that's an existing
>>>>> >> x-language connector.
>>>>> >> Overall, it works and with minimal intervention as you say - it is
>>>>> >> very slow, though.
>>>>> >> I'm a little confused about "portable runners" - if I understand this
>>>>> >> correctly, this means we couldn't run with the DirectRunner anymore if
>>>>> >> using an x-language connector? (At least it didn't work when I tried
>>>>> >> it.)
>>>>> >
>>>>> > You'll have to use the portable DirectRunner -
>>>>> > https://github.com/apache/beam/tree/master/sdks/python/apache_beam/runners/portability
>>>>> >
>>>>> > Job service for this can be started using the following command:
>>>>> > python apache_beam/runners/portability/local_job_service_main.py -p <port>
>>>>>
>>>>> Note that the Python direct runner is already a portable runner, so
>>>>> you shouldn't have to do anything special (like start up a separate
>>>>> job service and pass extra options) to run locally. Just use the
>>>>> cross-language transforms as you would any normal Python transform.
>>>>>
>>>>> The goal is to make this as smooth and transparent as possible; please
>>>>> keep coming back to us if you find rough edges.
>>>>>
>>>>