You can also manually designate a replacement jar to be used instead
of fetching the jar from Maven, either as a pipeline option or (as
of the next release) as an environment variable. The format is a JSON
mapping from Gradle targets (which is how we identify these jars) to
local files (or URLs). For example, pass
--beam_services='{":sdks:java:extensions:sql:expansion-service:shadowJar":
"/path/to/your/copy.jar"}'
to use the local jar to automatically expand your SQL transforms.
See the docs at
https://github.com/apache/beam/blob/7e95776a8d08ef738be49ef47842029c306f2bf5/sdks/python/apache_beam/options/pipeline_options.py#L587
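If you would rather build that argument in Python than hand-quote the
JSON on a command line, here is a minimal sketch using only the standard
library. The helper names (beam_services_flag, missing_local_jars) are
made up for illustration and are not part of Beam; the Gradle target and
the jar path are the ones from the example above. I'm assuming the
resulting string is added to the list of arguments your pipeline already
hands to its pipeline options.

```python
import json
import os

# Gradle target that identifies the SQL expansion service jar (from the example above).
SQL_EXPANSION_TARGET = ":sdks:java:extensions:sql:expansion-service:shadowJar"


def beam_services_flag(overrides):
    """Return a --beam_services=<json> argument for a {gradle_target: jar} mapping.

    Hypothetical helper: it only serializes the mapping to JSON.
    """
    return "--beam_services=" + json.dumps(overrides, sort_keys=True)


def missing_local_jars(overrides):
    """Return override values that look like local paths but do not exist on disk.

    Values containing "://" are treated as URLs and skipped.
    """
    return [
        path
        for path in overrides.values()
        if "://" not in path and not os.path.exists(path)
    ]


overrides = {SQL_EXPANSION_TARGET: "/path/to/your/copy.jar"}
flag = beam_services_flag(overrides)
print(flag)

# Catching a wrong path early (e.g. in CI) avoids a confusing failure later.
for path in missing_local_jars(overrides):
    print("warning: replacement jar not found:", path)
```

Since the flag is passed as a single list element rather than through a
shell, no extra quoting of the JSON should be needed.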
On Tue, Jan 23, 2024 at 5:59 PM Chamikara Jayalath via user
<[email protected]> wrote:
>
> The expansion service jar is needed since sql.py includes cross-language
> transforms that use the Java implementation under the hood.
>
> Once downloaded, the jar is cached, and subsequent jobs should use the jar
> from that location.
>
> If you want to use a locally available jar, you can manually start up an
> expansion service [1] and point the Python SQL transform to that [2].
>
> Thanks,
> Cham
>
> [1]
> https://beam.apache.org/documentation/sdks/python-multi-language-pipelines/#choose-an-expansion-service
> [2]
> https://github.com/apache/beam/blob/7ff25d896250508570b27683bc76523ac2fe3210/sdks/python/apache_beam/transforms/sql.py#L84
>
> On Tue, Jan 23, 2024 at 3:57 PM Mark Striebeck <[email protected]>
> wrote:
>>
>> Hi,
>>
>> Sorry, this question seems so obvious that I'm sure it came up before. But I
>> couldn't find anything in the docs or the mail archives. Feel free to point
>> me in the right direction...
>>
>> We are using the Python API for Beam. Recently we started using Beam SQL -
>> which apparently needs a jar file that is not provided with the Python pip
>> package. When I run tests, I can see that Beam downloads
>> beam-sdks-java-extensions-sql-expansion-service-2.52.0.jar and unpacks it
>> into ~/.apache_beam and uses it to start an RPC server.
>>
>> While this works for local testing, I am trying to figure out how to work
>> this into our CI and deployment process.
>>
>> Preferably, we would download a pip package that has this jar (and others)
>> in it and just use it.
>>
>> If that doesn't exist (I couldn't find it), then we'd need to check this jar
>> file into our source tree, so that we can use it for CI but then also make
>> it part of the docker image that we use to run our Beam pipelines on GCP
>> Dataflow. How could I tell Beam to use that file instead of downloading it?
>> I tried obvious settings like the CLASSPATH environment variable - but nothing
>> works. Beam always tries to fetch the file from Maven.
>>
>> Again, feel free to point me to any relevant mail discussion or web page.
>>
>> Thanks
>> Mark