Re: BigTable reader for Python?

2023-01-19 Thread Lina Mårtensson via dev


Re: BigTable reader for Python?

2023-01-10 Thread Lina Mårtensson via dev


Re: BigTable reader for Python?

2023-01-06 Thread Lina Mårtensson via dev


Re: BigTable reader for Python?

2023-01-05 Thread Lina Mårtensson via dev

Re: BigTable reader for Python?

2023-01-05 Thread Lina Mårtensson via dev

Re: BigTable reader for Python?

2022-12-30 Thread Lina Mårtensson via dev
And the next issue: I'm getting KeyError: 'beam:coders:javasdk:0.1', which I
learned
<https://cwiki.apache.org/confluence/display/BEAM/Multi-language+Pipelines+Tips>
happens because the transform is trying to return something that there isn't
a standard Beam coder for
<https://github.com/apache/beam/blob/05428866cdbf1ea8e4c1789dd40327673fd39451/model/pipeline/src/main/proto/beam_runner_api.proto#L784>.
That makes sense, but how do I fix it? The documentation talks about how to
do this for the input, but not for the output.

Compared to Spanner, it looks like Spanner returns a protobuf, which I'm
guessing somehow gets converted to bytes. But CloudBigtableIO
<https://github.com/googleapis/java-bigtable-hbase/blob/main/bigtable-dataflow-parent/bigtable-hbase-beam/src/main/java/com/google/cloud/bigtable/beam/CloudBigtableIO.java>
returns org.apache.hadoop.hbase.client.Result.
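
To illustrate why the error surfaces as a KeyError, here is a toy model in plain Python. It is not Beam's actual implementation: the KNOWN_CODERS dict and lookup_coder function are hypothetical stand-ins for the URN registry the Python SDK consults. The standard coder URNs listed are the ones I believe are defined in beam_runner_api.proto. The point is only that a Java-specific coder URN has no entry on the Python side, so the Java transform has to emit elements that use a standard coder (bytes, strings, KVs, rows and so on).

```python
# A few standard (cross-language) coder URNs; the list is illustrative, not exhaustive.
STANDARD_CODER_URNS = {
    "beam:coder:bytes:v1",
    "beam:coder:string_utf8:v1",
    "beam:coder:varint:v1",
    "beam:coder:kv:v1",
    "beam:coder:iterable:v1",
    "beam:coder:row:v1",
}

# Hypothetical stand-in for the SDK's registry: URN -> short label.
KNOWN_CODERS = {urn: urn.split(":")[2] for urn in STANDARD_CODER_URNS}


def lookup_coder(urn):
    """Return the label registered for a coder URN; unknown URNs raise KeyError."""
    return KNOWN_CODERS[urn]


if __name__ == "__main__":
    print(lookup_coder("beam:coder:bytes:v1"))
    try:
        # The URN Java attaches to its own, non-portable coders.
        lookup_coder("beam:coders:javasdk:0.1")
    except KeyError as err:
        print("no Python-side coder for", err)
```

In this picture, an HBase Result would first have to be mapped to a standard type inside the Java transform before the Python side can decode it.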

My buildExternal method looks as follows:

@Override
public PTransform<PBegin, PCollection<Result>> buildExternal(
    BigtableReadBuilder.Configuration configuration) {
  return Read.from(CloudBigtableIO.read(
      new CloudBigtableScanConfiguration.Builder()
          .withProjectId(configuration.projectId)
          .withInstanceId(configuration.instanceId)
          .withTableId(configuration.tableId)
          .build()));
}


I also got a warning, which I *believe* is unrelated (but also an issue):

INFO:apache_beam.utils.subprocess_server:b"WARNING: Configuration class 'energy.camus.beam.BigtableRegistrar$BigtableReadBuilder$Configuration' has no schema registered. Attempting to construct with setter approach."

INFO:apache_beam.utils.subprocess_server:b'Dec 30, 2022 7:46:14 AM org.apache.beam.sdk.expansion.service.ExpansionService$ExternalTransformRegistrarLoader payloadToConfig'

What is this schema and what should it look like?
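
A rough illustration of what the "setter approach" in the warning could mean, under an assumption not confirmed in this thread: when no Beam schema is registered for the configuration class, the expansion service falls back to calling one setter per payload field, converting snake_case field names to camelCase. The setter_name helper below is hypothetical and only models that naming convention in Python; it does not call any Beam API.

```python
def setter_name(field_name: str) -> str:
    """Map a payload field such as 'project_id' to a Java-style setter name."""
    head, *rest = field_name.split("_")
    # snake_case -> lowerCamelCase; names that are already camelCase pass through.
    camel = head + "".join(part.capitalize() for part in rest)
    return "set" + camel[0].upper() + camel[1:]


if __name__ == "__main__":
    # Field names assumed to mirror the Configuration fields used in buildExternal.
    for field in ("project_id", "instance_id", "table_id"):
        print(field, "->", setter_name(field))
```

If that assumption holds, the Configuration class would need public setters with these names, one for each field sent from Python.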

Thanks!
-Lina






Re: BigTable reader for Python?

2022-12-30 Thread Lina Mårtensson via dev
Thanks! This was really helpful. It took a while to figure out the details;
a section in the docs on what's required of these jars for non-Java users
would be a great addition.

But once I did, the Bazel config was actually quite straightforward and
makes sense.
I pasted the first section from here
<https://github.com/bazelbuild/rules_jvm_external/blob/master/README.md#usage>
into my WORKSPACE file and changed the artifacts to the ones I needed. (How to
find the right ones remains confusing.)

After that I updated my BUILD rules. Bazel had easy and straightforward
configs for it; all I needed was this:

# From https://github.com/google/bazel-common/blob/master/third_party/java/auto/BUILD.
# The auto service is what registers our Registrar class, and it needs to
# be a plugin, which makes it run at compile time.
java_plugin(
    name = "auto_service_processor",
    processor_class = "com.google.auto.service.processor.AutoServiceProcessor",
    deps = [
        "@maven//:com_google_auto_service_auto_service",
        "@maven//:com_google_auto_service_auto_service_annotations",
        "@maven//:org_apache_beam_beam_vendor_guava_26_0_jre",
    ],
)

java_binary(
    name = "java_hbase",
    main_class = "energy.camus.beam.BigtableRegistrar",
    plugins = [":auto_service_processor"],
    srcs = ["src/main/java/energy/camus/beam/BigtableRegistrar.java"],
    deps = [
        "@maven//:com_google_auto_service_auto_service",
        "@maven//:com_google_auto_service_auto_service_annotations",
        "@maven//:com_google_cloud_bigtable_bigtable_hbase_beam",
        "@maven//:org_apache_beam_beam_sdks_java_core",
        "@maven//:org_apache_beam_beam_vendor_guava_26_0_jre",
        "@maven//:org_apache_hbase_hbase_shaded_client",
    ],
)
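
For the Python side, a minimal sketch of how a jar built from the java_binary rule above might be wired into a pipeline. Several things here are assumptions and do not come from this thread: the URN string, the jar path (Bazel's implicit java_hbase_deploy.jar output, assuming the BUILD file sits at the workspace root), the payload field names, and that the jar starts Beam's expansion service when launched with java -jar. ExternalTransform, ImplicitSchemaPayloadBuilder and JavaJarExpansionService are imported from apache_beam.transforms.external.

```python
def bigtable_read_payload(project_id, instance_id, table_id):
    """Build the configuration dict sent to the Java builder (field names assumed)."""
    return {
        "project_id": project_id,
        "instance_id": instance_id,
        "table_id": table_id,
    }


def read_from_bigtable(pipeline, jar_path, urn, **config):
    """Apply the Java read transform to a pipeline through an expansion service."""
    # Imported lazily so the payload helper works without Beam installed.
    import apache_beam as beam
    from apache_beam.transforms.external import (
        ImplicitSchemaPayloadBuilder,
        JavaJarExpansionService,
    )

    return pipeline | "ReadFromBigtable" >> beam.ExternalTransform(
        urn,  # must match the URN registered by the Java Registrar class
        ImplicitSchemaPayloadBuilder(bigtable_read_payload(**config)),
        JavaJarExpansionService(jar_path),
    )


if __name__ == "__main__":
    print(bigtable_read_payload("my-project", "my-instance", "my-table"))
```

A call might look like read_from_bigtable(p, "bazel-bin/java_hbase_deploy.jar", "my:hypothetical:urn:v1", project_id=..., instance_id=..., table_id=...), with every one of those values being a placeholder.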


On Thu, Dec 29, 2022 at 2:43 PM Luke Cwik  wrote:

> AutoService relies on Java's compiler annotation processor.
> https://github.com/google/auto/tree/main/service#getting-started shows
> that you need to configure Java's compiler to use the annotation processors
> within AutoService.
>
> I saw this public gist that seemed to enable using the AutoService
> annotation processor with Bazel
> https://gist.github.com/jart/5333824b94cd706499a7bfa1e086ee00
>
>
>
> On Thu, Dec 29, 2022 at 2:27 PM Lina Mårtensson via dev <
> dev@beam.apache.org> wrote:
>
>> That's good news about the direct runner, thanks!
>>
>> On Thu, Dec 29, 2022 at 2:02 PM Robert Bradshaw 
>> wrote:
>>
>>> On Thu, Jul 28, 2022 at 5:37 PM Chamikara Jayalath via dev
>>>  wrote:
>>> >
>>> > On Thu, Jul 28, 2022 at 4:51 PM Lina Mårtensson 
>>> wrote:
>>> >>
>>> >> Thanks for the detailed answers!
>>> >>
>>> >> I totally get the points about development & maintenance cost, and,
>>> >> from a user perspective, about getting the performance right.
>>> >>
>>> >> I decided to try out the Spanner connector to get a sense of how well
>>> >> the x-language approach works in our world, since that's an existing
>>> >> x-language connector.
>>> >> Overall, it works and with minimal intervention as you say - it is
>>> >> very slow, though.
>>> >> I'm a little confused about "portable runners" - if I understand this
>>> >> correctly, this means we couldn't run with the DirectRunner anymore if
>>> >> using an x-language connector? (At least it didn't work when I tried
>>> >> it.)
>>> >
>>> >
>>> > You'll have to use the portable DirectRunner -
>>> https://github.com/apache/beam/tree/master/sdks/python/apache_beam/runners/portability
>>> >
>>> > The job service for this can be started using the following command:
>>> > python apache_beam/runners/portability/local_job_service_main.py -p
>>> 
>>>
>>> Note that the Python direct runner is already a portable runner, so
>>> you shouldn't have to do anything special (like start up a separate
>>> job service and pass extra options) to run locally. Just use the
>>> cross-language transforms as you would any normal Python transform.
>>>
>>> The goal is to make this as smooth and transparent as possible; please
>>> keep coming back to us if you find rough edges.
>>>
>>


Re: BigTable reader for Python?

2022-12-29 Thread Lina Mårtensson via dev
That's good news about the direct runner, thanks!

On Thu, Dec 29, 2022 at 2:02 PM Robert Bradshaw  wrote:

> On Thu, Jul 28, 2022 at 5:37 PM Chamikara Jayalath via dev
>  wrote:
> >
> > On Thu, Jul 28, 2022 at 4:51 PM Lina Mårtensson 
> wrote:
> >>
> >> Thanks for the detailed answers!
> >>
> >> I totally get the points about development & maintenance cost, and,
> >> from a user perspective, about getting the performance right.
> >>
> >> I decided to try out the Spanner connector to get a sense of how well
> >> the x-language approach works in our world, since that's an existing
> >> x-language connector.
> >> Overall, it works and with minimal intervention as you say - it is
> >> very slow, though.
> >> I'm a little confused about "portable runners" - if I understand this
> >> correctly, this means we couldn't run with the DirectRunner anymore if
> >> using an x-language connector? (At least it didn't work when I tried
> >> it.)
> >
> >
> > You'll have to use the portable DirectRunner -
> https://github.com/apache/beam/tree/master/sdks/python/apache_beam/runners/portability
> >
> > The job service for this can be started using the following command:
> > python apache_beam/runners/portability/local_job_service_main.py -p
> 
>
> Note that the Python direct runner is already a portable runner, so
> you shouldn't have to do anything special (like start up a separate
> job service and pass extra options) to run locally. Just use the
> cross-language transforms as you would any normal Python transform.
>
> The goal is to make this as smooth and transparent as possible; please
> keep coming back to us if you find rough edges.
>


Re: BigTable reader for Python?

2022-12-29 Thread Lina Mårtensson via dev
Thanks Luke!

That makes sense. Are there any instructions/do you have any further
pointers on how to build this JAR? Not working with Java normally, I don't
know anything about what to expect here. I'll have to figure out how to
build it with Bazel eventually of course, but instructions for how to build
it with Maven or whatever would help me as well.

-Lina

On Thu, Dec 29, 2022 at 1:58 PM Luke Cwik  wrote:

> I would have expected
> a META-INF/services/org.apache.beam.sdk.expansion.ExternalTransformRegistrar
> file in the jar containing the fully qualified class name
> of BigtableRegistrar in it. See
> https://repo1.maven.org/maven2/org/apache/beam/beam-sdks-java-io-kafka/2.43.0/beam-sdks-java-io-kafka-2.43.0.jar
> for an example of how Java's ServiceLoader expects the jar to be laid out.
>
> It looks like the bazel build is not generating the META-INF/ file that
> the `@AutoService` annotation is responsible for or the way that the bazel
> build is taking the output files from the build process and generating the
> jar is forgetting to take that file as well.
>
> On Wed, Dec 28, 2022 at 11:30 PM Lina Mårtensson via dev <
> dev@beam.apache.org> wrote:
>
>> I kept working with an ExternalTransformRegistrar solution (although if
>> there's an easier way, I'm all ears), and I have Java code that builds, and
>> a Python connector that tries to use it.
>>
>> My current issue is that the expansion service that's started up doesn't
>> find my transform using the URN provided:
>> RuntimeError: java.lang.UnsupportedOperationException: Unknown urn:
>> beam:external:CAMUS:bigtable_read:v1
>>
>> And I can see that my transform wasn't registered:
>>
>> INFO:apache_beam.utils.subprocess_server:b'INFO: Registering external
>> transforms: [beam:transform:org.apache.beam:pubsub_read:v1,
>> beam:transform:org.apache.beam:pubsub_write:v1,
>> beam:transform:org.apache.beam:pubsublite_write:v1,
>> beam:transform:org.apache.beam:pubsublite_read:v1,
>> beam:transform:org.apache.beam:spanner_insert:v1,
>> beam:transform:org.apache.beam:spanner_update:v1,
>> beam:transform:org.apache.beam:spanner_replace:v1,
>> beam:transform:org.apache.beam:spanner_insert_or_update:v1,
>> beam:transform:org.apache.beam:spanner_delete:v1,
>> beam:transform:org.apache.beam:spanner_read:v1,
>> beam:transform:org.apache.beam:schemaio_bigquery_read:v1,
>> beam:transform:org.apache.beam:schemaio_bigquery_write:v1,
>> beam:transform:org.apache.beam:schemaio_datastoreV1_read:v1,
>> beam:transform:org.apache.beam:schemaio_datastoreV1_write:v1,
>> beam:transform:org.apache.beam:schemaio_pubsub_read:v1,
>> beam:transform:org.apache.beam:schemaio_pubsub_write:v1,
>> beam:transform:org.apache.beam:schemaio_jdbc_read:v1,
>> beam:transform:org.apache.beam:schemaio_jdbc_write:v1,
>> beam:transform:org.apache.beam:schemaio_avro_read:v1,
>> beam:transform:org.apache.beam:schemaio_avro_write:v1,
>> beam:external:java:generate_sequence:v1]'
>>
>> I'm creating the expansion service in code like this:
>>
>> expansion_service = BeamJarExpansionService(
>>     'sdks:java:io:google-cloud-platform:expansion-service:shadowJar',
>>     extra_args=["{{PORT}}", '--javaClassLookupAllowlistFile=*'],
>>     classpath=[
>>         "/home/builder/xlang/bando/bazel-bin/bigtable/libjava_hbase.jar"])
>>
>> where libjava_hbase.jar was built by Bazel and contains my code:
>>
>> $ jar tf libjava_hbase.jar
>> META-INF/
>> META-INF/MANIFEST.MF
>> energy/
>> energy/camus/
>> energy/camus/beam/
>> energy/camus/beam/BigtableRegistrar$BigtableReadBuilder$Configuration.class
>> energy/camus/beam/BigtableRegistrar$BigtableReadBuilder.class
>> energy/camus/beam/BigtableRegistrar$CrossLanguageConfiguration.class
>> energy/camus/beam/BigtableRegistrar.class
>>
>> The relevant part of my code that does the registration looks like this:
>>
>> @AutoService(ExternalTransformRegistrar.class)
>> public class BigtableRegistrar implements ExternalTransformRegistrar {
>>
>>   static final String READ_URN = "beam:external:CAMUS:bigtable_read:v1";
>>
>>   @Override
>>   public Map<String, ExternalTransformBuilder<?, ?, ?>> knownBuilderInstances() {
>>     return ImmutableMap.of(READ_URN, new BigtableReadBuilder());
>>   }
>>
>> What am I missing that prevents my transform from being registered?
>>

Re: BigTable reader for Python?

2022-12-28 Thread Lina Mårtensson via dev
>>> issues/feature requests are adequately handled. In the past, we have had
>>> cases where sources/sinks are available for multiple SDKs but one
>>> > is significantly better than others when it comes to the feature set
>>> (for example, BigQuery). Cross-language will make this easier and will
>>> allow us to maintain key logic in a single place.
>>> >
>>> >>
>>> >>
>>> >> If I look at the instructions for using the x-language Spanner
>>> >> connector, then using this - from the user's perspective - would
>>> >> involve installing a Java runtime.
>>> >> That's not terrible, but I fear that getting this to work with bazel
>>> >> might end up being more trouble than expected. (That has often
>>> >> happened here, and we have enough trouble with getting Python 3.9 and
>>> >> 3.10 to co-exist.)
>>> >
>>> >
>>> > From an end user perspective, all they should have to do is make sure
>>> that Java is available in the machine where the job is submitted from. Beam
>>> has features to allow starting up cross-language expansion services (that
>>> is needed during job submission) automatically so users should not have to
>>> do anything other than that.
>>> >
>>> > At job execution, Beam (portable) uses Docker-based SDK harness
>>> containers and we already release appropriate containers for each SDK. The
>>> runners should seamlessly download containers needed to execute the job.
>>> >
>>> > That said, the main downside of cross-language today is runner
>>> support. Cross-language transform support is only available for portable
>>> Beam runners (for example, Dataflow Runner v2) but this is the direction
>>> Beam runners are going anyway.
>>> >
>>> >>
>>> >>
>>> >> There are a few of us at our small start-up that have written
>>> >> MapReduces and similar in the past and are completely convinced by the
>>> >> Beam/Dataflow model. But many others have no previous experience and
>>> >> are skeptical, and see this new tool we're introducing as something
>>> >> that's more trouble than it's worth, and something they'd rather avoid
>>> >> - even when we see how lots of their use cases could be made much
>>> >> easier using Beam. I'm worried that every extra hoop to jump through
>>> >> will make it less likely to be widely used for us. Because of that, my
>>> >> bias would be towards having a Python connector rather than
>>> >> x-language, and I would find it really helpful to learn about why you
>>> >> both favor the x-language option.
>>> >
>>> >
>>> > I understand your concerns. It's certainly possible to develop the
>>> same connector in multiple SDKs (and we provide SDF source framework
>>> support in all SDK languages). But hopefully my comments above will give
>>> you an idea of the downsides of this approach :).
>>> >
>>> > Thanks,
>>> > Cham
>>> >
>>> > [1]
>>> https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigtable/BigtableIO.java
>>> > [2] https://cloud.google.com/bigtable/docs/hbase-dataflow-java
>>> >
>>> >>
>>> >>
>>> >> Thanks!
>>> >> -Lina
>>> >>
>>> >> On Tue, Jul 26, 2022 at 6:11 PM Chamikara Jayalath <
>>> chamik...@google.com> wrote:
>>> >> >
>>> >> >
>>> >> >
>>> >> > On Mon, Jul 25, 2022 at 12:53 PM Lina Mårtensson via dev <
>>> dev@beam.apache.org> wrote:
>>> >> >>
>>> >> >> Hi dev,
>>> >> >>
>>> >> >> We're starting to incorporate BigTable in our stack and I've
>>> delighted
>>> >> >> my co-workers with how easy it was to create some BigTables with
>>> >> >> Beam... but there doesn't appear to be a reader for BigTable in
>>> >> >> Python.
>>> >> >>
>>> >> >> First off, is there a good reason why not/any reason why it would
>>> be difficult?
>>> >> >
>>> >> >
>>> >> > There was a previous effort to implement a Python BT source but
>>> that was not completed:
>>> https://github.com/apache/beam/pull/11295#issuecomment-646378304
>>> >> >
>>> >> >>
>>> >> >>
>>> >> >> I could write one, but before I start, I'd love some input to make
>>> it easier.
>>> >> >>
>>> >> >> It appears that there would be two options: either write one in
>>> >> >> Python, or try to set one up with x-language from Java which I see
>>> is
>>> >> >> done e.g. with the Spanner IO Connector.
>>> >> >> Any recommendation on which one to pick or potential pitfalls in
>>> either choice?
>>> >> >>
>>> >> >> If I write one in Python, what should I think about?
>>> >> >> It is not obvious to me how to achieve parallelization, so any tips
>>> >> >> here would be welcome.
>>> >> >
>>> >> >
>>> >> > I would strongly prefer developing a Python wrapper for the
>>> existing Java BT source using Beam's Multi-language Pipelines framework
>>> over developing a new Python source.
>>> >> >
>>> https://beam.apache.org/documentation/programming-guide/#multi-language-pipelines
>>> >> >
>>> >> > Thanks,
>>> >> > Cham
>>> >> >
>>> >> >
>>> >> >>
>>> >> >>
>>> >> >> Thanks!
>>> >> >> -Lina
>>>
>>


Re: BigTable reader for Python?

2022-12-27 Thread Lina Mårtensson via dev
lication related issues.
>> > Additionally, based on my experience, it's very hard to get a source
>> implementation correct and performant on the first try. It could take
>> additional benchmarks/user feedback over time to get the source production
>> ready.
>> > Java BT source is already battle tested well (actually we have two Java
>> implementations [1][2] currently). So I would rather use a Java BT
>> connector as a cross-language transform than re-implementing sources for
>> other SDKs.
>> >
>> > * Minimal maintenance cost
>> >
>> > Developing a source/sink is just a part of the story. We (as a
>> community) have to maintain it over time and make sure that ongoing
>> issues/feature requests are adequately handled. In the past, we have had
>> cases where sources/sinks are available for multiple SDKs but one
>> > is significantly better than others when it comes to the feature set
>> (for example, BigQuery). Cross-language will make this easier and will
>> allow us to maintain key logic in a single place.
>> >
>> >>
>> >>
>> >> If I look at the instructions for using the x-language Spanner
>> >> connector, then using this - from the user's perspective - would
>> >> involve installing a Java runtime.
>> >> That's not terrible, but I fear that getting this to work with bazel
>> >> might end up being more trouble than expected. (That has often
>> >> happened here, and we have enough trouble with getting Python 3.9 and
>> >> 3.10 to co-exist.)
>> >
>> >
>> > From an end user perspective, all they should have to do is make sure
>> that Java is available in the machine where the job is submitted from. Beam
>> has features to allow starting up cross-language expansion services (that
>> is needed during job submission) automatically so users should not have to
>> do anything other than that.
>> >
>> > At job execution, Beam (portable) uses Docker-based SDK harness
>> containers and we already release appropriate containers for each SDK. The
>> runners should seamlessly download containers needed to execute the job.
>> >
>> > That said, the main downside of cross-language today is runner support.
>> Cross-language transform support is only available for portable Beam
>> runners (for example, Dataflow Runner v2) but this is the direction Beam
>> runners are going anyway.
>> >
>> >>
>> >>
>> >> There are a few of us at our small start-up that have written
>> >> MapReduces and similar in the past and are completely convinced by the
>> >> Beam/Dataflow model. But many others have no previous experience and
>> >> are skeptical, and see this new tool we're introducing as something
>> >> that's more trouble than it's worth, and something they'd rather avoid
>> >> - even when we see how lots of their use cases could be made much
>> >> easier using Beam. I'm worried that every extra hoop to jump through
>> >> will make it less likely to be widely used for us. Because of that, my
>> >> bias would be towards having a Python connector rather than
>> >> x-language, and I would find it really helpful to learn about why you
>> >> both favor the x-language option.
>> >
>> >
>> > I understand your concerns. It's certainly possible to develop the same
>> connector in multiple SDKs (and we provide SDF source framework support in
>> all SDK languages). But hopefully my comments above will give you an idea
>> of the downsides of this approach :).
>> >
>> > Thanks,
>> > Cham
>> >
>> > [1]
>> https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigtable/BigtableIO.java
>> > [2] https://cloud.google.com/bigtable/docs/hbase-dataflow-java
>> >
>> >>
>> >>
>> >> Thanks!
>> >> -Lina
>> >>
>> >> On Tue, Jul 26, 2022 at 6:11 PM Chamikara Jayalath <
>> chamik...@google.com> wrote:
>> >> >
>> >> >
>> >> >
>> >> > On Mon, Jul 25, 2022 at 12:53 PM Lina Mårtensson via dev <
>> dev@beam.apache.org> wrote:
>> >> >>
>> >> >> Hi dev,
>> >> >>
>> >> >> We're starting to incorporate BigTable in our stack and I've
>> delighted
>> >> >> my co-workers with how easy it was to create some BigTables with
>> >> >> Beam... but there doesn't appear to be a reader for BigTable in
>> >> >> Python.
>> >> >>
>> >> >> First off, is there a good reason why not/any reason why it would
>> be difficult?
>> >> >
>> >> >
>> >> > There was a previous effort to implement a Python BT source but
>> that was not completed:
>> https://github.com/apache/beam/pull/11295#issuecomment-646378304
>> >> >
>> >> >>
>> >> >>
>> >> >> I could write one, but before I start, I'd love some input to make
>> it easier.
>> >> >>
>> >> >> It appears that there would be two options: either write one in
>> >> >> Python, or try to set one up with x-language from Java which I see
>> is
>> >> >> done e.g. with the Spanner IO Connector.
>> >> >> Any recommendation on which one to pick or potential pitfalls in
>> either choice?
>> >> >>
>> >> >> If I write one in Python, what should I think about?
>> >> >> It is not obvious to me how to achieve parallelization, so any tips
>> >> >> here would be welcome.
>> >> >
>> >> >
>> >> > I would strongly prefer developing a Python wrapper for the
>> existing Java BT source using Beam's Multi-language Pipelines framework
>> over developing a new Python source.
>> >> >
>> https://beam.apache.org/documentation/programming-guide/#multi-language-pipelines
>> >> >
>> >> > Thanks,
>> >> > Cham
>> >> >
>> >> >
>> >> >>
>> >> >>
>> >> >> Thanks!
>> >> >> -Lina
>>
>


Re: BigTable reader for Python?

2022-07-28 Thread Lina Mårtensson via dev
> going anyway.
>
>>
>>
>> There are a few of us at our small start-up that have written
>> MapReduces and similar in the past and are completely convinced by the
>> Beam/Dataflow model. But many others have no previous experience and
>> are skeptical, and see this new tool we're introducing as something
>> that's more trouble than it's worth, and something they'd rather avoid
>> - even when we see how lots of their use cases could be made much
>> easier using Beam. I'm worried that every extra hoop to jump through
>> will make it less likely to be widely used for us. Because of that, my
>> bias would be towards having a Python connector rather than
>> x-language, and I would find it really helpful to learn about why you
>> both favor the x-language option.
>
>
> I understand your concerns. It's certainly possible to develop the same 
> connector in multiple SDKs (and we provide SDF source framework support in 
> all SDK languages). But hopefully my comments above will give you an idea of 
> the downsides of this approach :).
>
> Thanks,
> Cham
>
> [1] 
> https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigtable/BigtableIO.java
> [2] https://cloud.google.com/bigtable/docs/hbase-dataflow-java
>
>>
>>
>> Thanks!
>> -Lina
>>
>> On Tue, Jul 26, 2022 at 6:11 PM Chamikara Jayalath  
>> wrote:
>> >
>> >
>> >
>> > On Mon, Jul 25, 2022 at 12:53 PM Lina Mårtensson via dev 
>> >  wrote:
>> >>
>> >> Hi dev,
>> >>
>> >> We're starting to incorporate BigTable in our stack and I've delighted
>> >> my co-workers with how easy it was to create some BigTables with
>> >> Beam... but there doesn't appear to be a reader for BigTable in
>> >> Python.
>> >>
>> >> First off, is there a good reason why not/any reason why it would be 
>> >> difficult?
>> >
>> >
>> > There was a previous effort to implement a Python BT source but that was
>> > not completed: 
>> > https://github.com/apache/beam/pull/11295#issuecomment-646378304
>> >
>> >>
>> >>
>> >> I could write one, but before I start, I'd love some input to make it 
>> >> easier.
>> >>
>> >> It appears that there would be two options: either write one in
>> >> Python, or try to set one up with x-language from Java which I see is
>> >> done e.g. with the Spanner IO Connector.
>> >> Any recommendation on which one to pick or potential pitfalls in either 
>> >> choice?
>> >>
>> >> If I write one in Python, what should I think about?
>> >> It is not obvious to me how to achieve parallelization, so any tips
>> >> here would be welcome.
>> >
>> >
>> > I would strongly prefer developing a Python wrapper for the existing Java
>> > BT source using Beam's Multi-language Pipelines framework over developing 
>> > a new Python source.
>> > https://beam.apache.org/documentation/programming-guide/#multi-language-pipelines
>> >
>> > Thanks,
>> > Cham
>> >
>> >
>> >>
>> >>
>> >> Thanks!
>> >> -Lina


Re: BigTable reader for Python?

2022-07-27 Thread Lina Mårtensson via dev
Thanks Cham!

Could you provide some more detail on your preference for developing a
Python wrapper rather than implementing a source purely in Python?

If I look at the instructions for using the x-language Spanner
connector, then using this - from the user's perspective - would
involve installing a Java runtime.
That's not terrible, but I fear that getting this to work with bazel
might end up being more trouble than expected. (That has often
happened here, and we have enough trouble with getting Python 3.9 and
3.10 to co-exist.)

There are a few of us at our small start-up that have written
MapReduces and similar in the past and are completely convinced by the
Beam/Dataflow model. But many others have no previous experience and
are skeptical, and see this new tool we're introducing as something
that's more trouble than it's worth, and something they'd rather avoid
- even when we see how lots of their use cases could be made much
easier using Beam. I'm worried that every extra hoop to jump through
will make it less likely to be widely used for us. Because of that, my
bias would be towards having a Python connector rather than
x-language, and I would find it really helpful to learn about why you
both favor the x-language option.

Thanks!
-Lina

On Tue, Jul 26, 2022 at 6:11 PM Chamikara Jayalath  wrote:
>
>
>
> On Mon, Jul 25, 2022 at 12:53 PM Lina Mårtensson via dev 
>  wrote:
>>
>> Hi dev,
>>
>> We're starting to incorporate BigTable in our stack and I've delighted
>> my co-workers with how easy it was to create some BigTables with
>> Beam... but there doesn't appear to be a reader for BigTable in
>> Python.
>>
>> First off, is there a good reason why not/any reason why it would be 
>> difficult?
>
>
> There was a previous effort to implement a Python BT source but that was
> not completed: 
> https://github.com/apache/beam/pull/11295#issuecomment-646378304
>
>>
>>
>> I could write one, but before I start, I'd love some input to make it easier.
>>
>> It appears that there would be two options: either write one in
>> Python, or try to set one up with x-language from Java which I see is
>> done e.g. with the Spanner IO Connector.
>> Any recommendation on which one to pick or potential pitfalls in either 
>> choice?
>>
>> If I write one in Python, what should I think about?
>> It is not obvious to me how to achieve parallelization, so any tips
>> here would be welcome.
>
>
> I would strongly prefer developing a Python wrapper for the existing Java BT
> source using Beam's Multi-language Pipelines framework over developing a new 
> Python source.
> https://beam.apache.org/documentation/programming-guide/#multi-language-pipelines
>
> Thanks,
> Cham
>
>
>>
>>
>> Thanks!
>> -Lina


BigTable reader for Python?

2022-07-25 Thread Lina Mårtensson via dev
Hi dev,

We're starting to incorporate BigTable in our stack and I've delighted
my co-workers with how easy it was to create some BigTables with
Beam... but there doesn't appear to be a reader for BigTable in
Python.

First off, is there a good reason why not/any reason why it would be difficult?

I could write one, but before I start, I'd love some input to make it easier.

It appears that there would be two options: either write one in
Python, or try to set one up with x-language from Java which I see is
done e.g. with the Spanner IO Connector.
Any recommendation on which one to pick or potential pitfalls in either choice?

If I write one in Python, what should I think about?
It is not obvious to me how to achieve parallelization, so any tips
here would be welcome.

Thanks!
-Lina