I would suggest using BigtableIO, which also returns a protobuf
(com.google.bigtable.v2.Row). That should allow you to replicate what
SpannerIO is doing.

Alternatively, you could convert the HBase Result into a Beam Row by
specifying a converter and a schema for it; then you could use the
already well-known Beam Schema type:
https://github.com/apache/beam/blob/0b8f0b4db7a0de4977e30bcfeb50b5c14c7c1572/model/pipeline/src/main/proto/org/apache/beam/model/pipeline/v1/beam_runner_api.proto#L1068
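As a rough, stdlib-only illustration of the payoff on the Python side
(every name below is invented for this sketch): once the Java transform
emits elements with a registered Beam schema, the Python SDK hands you
rows with named fields, so downstream code can treat them like plain
named tuples:

```python
from typing import NamedTuple

# Hypothetical flat schema for an HBase cell; all field names here are
# invented for illustration -- the real schema would mirror whatever the
# Java-side converter extracts from org.apache.hadoop.hbase.client.Result.
class BigtableCell(NamedTuple):
    row_key: bytes
    family: str
    qualifier: bytes
    value: bytes
    timestamp_micros: int

# With a schema registered on the Java side, downstream Python code can
# access fields by name instead of decoding opaque bytes:
def extract_value(cell: BigtableCell) -> bytes:
    return cell.value

cell = BigtableCell(b"key-1", "cf", b"col", b"hello", 1672387200000000)
print(extract_value(cell))  # b'hello'
```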

Otherwise you'll have to register the HBase Result coder under a
well-known name so that the runner API coder URN is something you know;
then, on the Python side, you would need a coder for that URN as well,
to let Python understand the bytes being sent across from the Java
portion of the pipeline.
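The key contract is that the Java and Python coders must produce
identical bytes for the same element. A stdlib-only sketch of what such
an encode/decode pair looks like (the URN and the byte layout below are
assumptions for illustration only):

```python
import struct

# Sketch of the matching encode/decode pair two coders must agree on.
# The URN "my:hbase:result:v1" and the layout (4-byte big-endian length
# prefix followed by the payload) are invented for illustration; the
# real Python coder registered for the shared URN must reproduce, byte
# for byte, whatever the Java coder emits.

def encode(payload: bytes) -> bytes:
    # Length-prefix the payload so the decoder knows where it ends.
    return struct.pack(">I", len(payload)) + payload

def decode(data: bytes) -> bytes:
    (length,) = struct.unpack(">I", data[:4])
    return data[4:4 + length]

# In a real pipeline, encode/decode would live in an
# apache_beam.coders.Coder subclass registered under the same URN on
# both sides of the language boundary.
print(decode(encode(b"hbase-result-bytes")))  # b'hbase-result-bytes'
```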

On Fri, Dec 30, 2022 at 12:59 AM Lina Mårtensson <lina@camus.energy> wrote:

> And next issue... I'm getting KeyError: 'beam:coders:javasdk:0.1' which I
> learned
> <https://cwiki.apache.org/confluence/display/BEAM/Multi-language+Pipelines+Tips>
> is because the transform is trying to return something that there isn't a 
> standard
> Beam coder for
> <https://github.com/apache/beam/blob/05428866cdbf1ea8e4c1789dd40327673fd39451/model/pipeline/src/main/proto/beam_runner_api.proto#L784>
> .
> Makes sense, but... how do I fix this? The documentation talks about how
> to do this for the input, but not for the output.
>
> Comparing to Spanner, it looks like Spanner returns a protobuf, which I'm
> guessing somehow gets converted to bytes... But CloudBigtableIO
> <https://github.com/googleapis/java-bigtable-hbase/blob/main/bigtable-dataflow-parent/bigtable-hbase-beam/src/main/java/com/google/cloud/bigtable/beam/CloudBigtableIO.java>
> returns org.apache.hadoop.hbase.client.Result.
>
> My buildExternal method looks as follows:
>
>         @Override
>         public PTransform<PBegin, PCollection<Result>> buildExternal(
>                 BigtableReadBuilder.Configuration configuration) {
>             return Read.from(CloudBigtableIO.read(
>                 new CloudBigtableScanConfiguration.Builder()
>                     .withProjectId(configuration.projectId)
>                     .withInstanceId(configuration.instanceId)
>                     .withTableId(configuration.tableId)
>                     .build()));
>         }
>
> I also got a warning, which I *believe* is unrelated (but also an issue):
>
> INFO:apache_beam.utils.subprocess_server:b"WARNING: Configuration class
> 'energy.camus.beam.BigtableRegistrar$BigtableReadBuilder$Configuration' has
> no schema registered. Attempting to construct with setter approach."
>
> INFO:apache_beam.utils.subprocess_server:b'Dec 30, 2022 7:46:14 AM
> org.apache.beam.sdk.expansion.service.ExpansionService$ExternalTransformRegistrarLoader
> payloadToConfig'
> What is this schema and what should it look like?
>
> Thanks!
> -Lina
>
> On Fri, Dec 30, 2022 at 12:28 AM Lina Mårtensson <lina@camus.energy>
> wrote:
>
>> Thanks! This was really helpful. It took a while to figure out the
>> details - a section in the docs on what's required of these jars for
>> non-Java users would be a great addition.
>>
>> But once I did, the Bazel config was actually quite straightforward and
>> makes sense.
>> I pasted the first section from here
>> <https://github.com/bazelbuild/rules_jvm_external/blob/master/README.md#usage>
>>  into
>> my WORKSPACE file and changed the artifacts to the ones I needed. (How to
>> find the right ones remains confusing.)
>>
>> After that I updated my BUILD rules, and Bazel had easy and
>> straightforward configs for it; all I needed was this:
>>
>> # From
>> # https://github.com/google/bazel-common/blob/master/third_party/java/auto/BUILD.
>> # The auto service is what registers our Registrar class, and it
>> # needs to be a plugin, which makes it run at compile-time.
>> java_plugin(
>>     name = "auto_service_processor",
>>     processor_class = "com.google.auto.service.processor.AutoServiceProcessor",
>>     deps = [
>>         "@maven//:com_google_auto_service_auto_service",
>>         "@maven//:com_google_auto_service_auto_service_annotations",
>>         "@maven//:org_apache_beam_beam_vendor_guava_26_0_jre",
>>     ],
>> )
>>
>> java_binary(
>>     name = "java_hbase",
>>     main_class = "energy.camus.beam.BigtableRegistrar",
>>     plugins = [":auto_service_processor"],
>>     srcs = ["src/main/java/energy/camus/beam/BigtableRegistrar.java"],
>>     deps = [
>>         "@maven//:com_google_auto_service_auto_service",
>>         "@maven//:com_google_auto_service_auto_service_annotations",
>>         "@maven//:com_google_cloud_bigtable_bigtable_hbase_beam",
>>         "@maven//:org_apache_beam_beam_sdks_java_core",
>>         "@maven//:org_apache_beam_beam_vendor_guava_26_0_jre",
>>         "@maven//:org_apache_hbase_hbase_shaded_client",
>>     ],
>> )
>>
>>
>> On Thu, Dec 29, 2022 at 2:43 PM Luke Cwik <lc...@google.com> wrote:
>>
>>> AutoService relies on Java's compiler annotation processor.
>>> https://github.com/google/auto/tree/main/service#getting-started shows
>>> that you need to configure Java's compiler to use the annotation processors
>>> within AutoService.
>>>
>>> I saw this public gist that seemed to enable using the AutoService
>>> annotation processor with Bazel
>>> https://gist.github.com/jart/5333824b94cd706499a7bfa1e086ee00
>>>
>>>
>>>
>>> On Thu, Dec 29, 2022 at 2:27 PM Lina Mårtensson via dev <
>>> dev@beam.apache.org> wrote:
>>>
>>>> That's good news about the direct runner, thanks!
>>>>
>>>> On Thu, Dec 29, 2022 at 2:02 PM Robert Bradshaw <rober...@google.com>
>>>> wrote:
>>>>
>>>>> On Thu, Jul 28, 2022 at 5:37 PM Chamikara Jayalath via dev
>>>>> <dev@beam.apache.org> wrote:
>>>>> >
>>>>> > On Thu, Jul 28, 2022 at 4:51 PM Lina Mårtensson <lina@camus.energy>
>>>>> wrote:
>>>>> >>
>>>>> >> Thanks for the detailed answers!
>>>>> >>
>>>>> >> I totally get the points about development & maintenance cost, and,
>>>>> >> from a user perspective, about getting the performance right.
>>>>> >>
>>>>> >> I decided to try out the Spanner connector to get a sense of how
>>>>> well
>>>>> >> the x-language approach works in our world, since that's an existing
>>>>> >> x-language connector.
>>>>> >> Overall, it works and with minimal intervention as you say - it is
>>>>> >> very slow, though.
>>>>> >> I'm a little confused about "portable runners" - if I understand
>>>>> this
>>>>> >> correctly, this means we couldn't run with the DirectRunner anymore
>>>>> if
>>>>> >> using an x-language connector? (At least it didn't work when I tried
>>>>> >> it.)
>>>>> >
>>>>> >
>>>>> > You'll have to use the portable DirectRunner -
>>>>> https://github.com/apache/beam/tree/master/sdks/python/apache_beam/runners/portability
>>>>> >
>>>>> > Job service for this can be started using following command:
>>>>> > python apache_beam/runners/portability/local_job_service_main.py -p
>>>>> <port>
>>>>>
>>>>> Note that the Python direct runner is already a portable runner, so
>>>>> you shouldn't have to do anything special (like start up a separate
>>>>> job service and pass extra options) to run locally. Just use the
>>>>> cross-language transforms as you would any normal Python transform.
>>>>>
>>>>> The goal is to make this as smooth and transparent as possible; please
>>>>> keep coming back to us if you find rough edges.
>>>>>
>>>>
