jonathaningram opened a new issue, #34343:
URL: https://github.com/apache/beam/issues/34343
### What happened?
Beam version: at least v2.63.0.
The `--yaml_pipeline` flag takes the pipeline definition itself as an inline string.
The `--yaml_pipeline_file` flag takes a path to a file containing the pipeline.
We can successfully use the `--yaml_pipeline_file` flag locally to run our
YAML pipeline. As soon as we switch to `--yaml_pipeline`, it fails with
`TypeError: a bytes-like object is required, not 'str'` (full stack trace
below). We tried both the `--yaml-pipeline` and `--yaml-pipeline-file` flags of
`gcloud dataflow yaml run`, and both seem to have the same issue.
**Note: We haven't been able to run any YAML pipeline with a Java provider
successfully in Dataflow, so we'd like to know whether a patch could be
applied to Dataflow or whether a workaround exists.**
<details>
<summary>Stack trace</summary>
```
<snip>
INFO:apache_beam.yaml.yaml_transform:Expanding "Create" at line 4
INFO:apache_beam.yaml.yaml_transform:Expanding "Identity" at line 18
Traceback (most recent call last):
File
"/usr/local/lib/python3.11/site-packages/apache_beam/yaml/yaml_transform.py",
line 371, in create_ptransform
ptransform = provider.create_transform(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File
"/usr/local/lib/python3.11/site-packages/apache_beam/yaml/yaml_provider.py",
line 192, in create_transform
self._service = self._service()
^^^^^^^^^^^^^^^
File
"/usr/local/lib/python3.11/site-packages/apache_beam/yaml/yaml_provider.py",
line 328, in <lambda>
urns, lambda: external.JavaJarExpansionService(jar_provider()))
^^^^^^^^^^^^^^
File
"/usr/local/lib/python3.11/site-packages/apache_beam/yaml/yaml_provider.py",
line 260, in <lambda>
urns, lambda: _join_url_or_filepath(provider_base_path, jar))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File
"/usr/local/lib/python3.11/site-packages/apache_beam/yaml/yaml_provider.py",
line 1282, in _join_url_or_filepath
path_scheme = urllib.parse.urlparse(path, base_scheme).scheme
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/urllib/parse.py", line 395, in urlparse
splitresult = urlsplit(url, scheme, allow_fragments)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/urllib/parse.py", line 478, in urlsplit
scheme = scheme.strip(_WHATWG_C0_CONTROL_OR_SPACE)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: a bytes-like object is required, not 'str'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/usr/local/lib/python3.11/site-packages/apache_beam/yaml/main.py",
line 154, in <module>
run()
File "/usr/local/lib/python3.11/site-packages/apache_beam/yaml/main.py",
line 143, in run
yaml_transform.expand_pipeline(
File
"/usr/local/lib/python3.11/site-packages/apache_beam/yaml/yaml_transform.py",
line 1077, in expand_pipeline
providers or {})).expand(beam.pvalue.PBegin(pipeline))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File
"/usr/local/lib/python3.11/site-packages/apache_beam/yaml/yaml_transform.py",
line 1042, in expand
result = expand_transform(
^^^^^^^^^^^^^^^^^
File
"/usr/local/lib/python3.11/site-packages/apache_beam/yaml/yaml_transform.py",
line 442, in expand_transform
return expand_composite_transform(spec, scope)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File
"/usr/local/lib/python3.11/site-packages/apache_beam/yaml/yaml_transform.py",
line 520, in expand_composite_transform
return CompositePTransform.expand(None)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File
"/usr/local/lib/python3.11/site-packages/apache_beam/yaml/yaml_transform.py",
line 508, in expand
inner_scope.compute_all()
File
"/usr/local/lib/python3.11/site-packages/apache_beam/yaml/yaml_transform.py",
line 196, in compute_all
self.compute_outputs(transform_id)
File
"/usr/local/lib/python3.11/site-packages/apache_beam/yaml/yaml_transform.py",
line 97, in wrapper
self._cache[key] = func(self, *args)
^^^^^^^^^^^^^^^^^
File
"/usr/local/lib/python3.11/site-packages/apache_beam/yaml/yaml_transform.py",
line 232, in compute_outputs
return expand_transform(self._transforms_by_uuid[transform_id], self)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File
"/usr/local/lib/python3.11/site-packages/apache_beam/yaml/yaml_transform.py",
line 444, in expand_transform
return expand_leaf_transform(spec, scope)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File
"/usr/local/lib/python3.11/site-packages/apache_beam/yaml/yaml_transform.py",
line 466, in expand_leaf_transform
ptransform = scope.create_ptransform(spec, inputs_dict.values())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File
"/usr/local/lib/python3.11/site-packages/apache_beam/yaml/yaml_transform.py",
line 413, in create_ptransform
raise ValueError(
ValueError: Invalid transform specification at "Identity" at line 18: a
bytes-like object is required, not 'str'
Building pipeline...
```
</details>
I've made a repro here:
https://github.com/jonathaningram/beam-starter-java-provider-repro which
contains much of the same info as I've put in this ticket.
The issue seems to be an encoding one (a `str`/`bytes` mismatch).
The following patch works locally, but I haven't verified how suitable the
fix is, so I haven't proposed a PR.
Inside the `beam` repo:
```
➜ beam git:(v2.63.0) ✗ gb
* (HEAD detached at sdks/v2.63.0)
master
➜ beam git:(v2.63.0) ✗ gd
diff --git a/sdks/python/apache_beam/yaml/yaml_provider.py b/sdks/python/apache_beam/yaml/yaml_provider.py
index aa3c5d90515..f9d1bcf914c 100755
--- a/sdks/python/apache_beam/yaml/yaml_provider.py
+++ b/sdks/python/apache_beam/yaml/yaml_provider.py
@@ -1279,7 +1279,7 @@ def _as_list(func):
 def _join_url_or_filepath(base, path):
   base_scheme = urllib.parse.urlparse(base, '').scheme
-  path_scheme = urllib.parse.urlparse(path, base_scheme).scheme
+  path_scheme = urllib.parse.urlparse(path.encode(), base_scheme).scheme
   if path_scheme != base_scheme:
     return path
   elif base_scheme and base_scheme in urllib.parse.uses_relative:
```
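The traceback is consistent with `base` arriving as `bytes` (so `base_scheme` is `b''`) while `path` is a `str`, though I haven't confirmed where the `bytes` value comes from. As an alternative to encoding `path`, a minimal sketch could normalize both arguments to `str` before parsing. The helper names `_to_str` and `scheme_of` are hypothetical and not part of Beam; this only illustrates the idea and is not a verified fix.

```python
import urllib.parse


def _to_str(value):
  """Hypothetical helper: decode bytes to str, leave str unchanged."""
  return value.decode() if isinstance(value, bytes) else value


def scheme_of(path, base):
  """Hypothetical helper: scheme of `path`, defaulting to the scheme of `base`.

  Both arguments are normalized to str first, so urlparse never receives
  a mix of str and bytes.
  """
  base_scheme = urllib.parse.urlparse(_to_str(base), '').scheme
  return urllib.parse.urlparse(_to_str(path), base_scheme).scheme
```

With this approach, `scheme_of('my.jar', b'')` and `scheme_of('my.jar', '')` return the same value, whichever type the caller happens to pass.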
You can mount the `beam` source code in the container in my repro and
observe that it now works:
```
docker run -v "$(pwd):/app" \
  -v "$BEAM_PYTHON_SRC:/usr/local/lib/python3.11/site-packages/apache_beam/yaml" \
  -v ~/.config/gcloud:/root/.config/gcloud \
  -w /app \
  --entrypoint /bin/bash beam_python3.11_sdk_with_java:2.63.0 \
  -c "python -m apache_beam.yaml.main --yaml_pipeline='$(yq -o=json '.' "$PIPELINE_FILE")' --runner=DataflowRunner"
```
### Issue Priority
Priority: 2 (default / most bugs should be filed as P2)
### Issue Components
- [ ] Component: Python SDK
- [ ] Component: Java SDK
- [ ] Component: Go SDK
- [ ] Component: Typescript SDK
- [ ] Component: IO connector
- [ ] Component: Beam YAML
- [ ] Component: Beam examples
- [ ] Component: Beam playground
- [ ] Component: Beam katas
- [ ] Component: Website
- [ ] Component: Infrastructure
- [ ] Component: Spark Runner
- [ ] Component: Flink Runner
- [ ] Component: Samza Runner
- [ ] Component: Twister2 Runner
- [ ] Component: Hazelcast Jet Runner
- [ ] Component: Google Cloud Dataflow Runner