This is an automated email from the ASF dual-hosted git repository.

damccorm pushed a commit to branch release-2.70
in repository https://gitbox.apache.org/repos/asf/beam.git


The following commit(s) were added to refs/heads/release-2.70 by this push:
     new 8c36858af03 Cherrypick in extras changes (#36917)
8c36858af03 is described below

commit 8c36858af039dc2c2dad1433e9c8233f3d1e100c
Author: Danny McCormick <[email protected]>
AuthorDate: Wed Nov 26 20:36:55 2025 -0500

    Cherrypick in extras changes (#36917)
    
    * split hdfs into extra (#36773)
    
    * split hdfs into extra
    
    * CHANGES
    
    * tox
    
    * try/catch
    
    * test fixes
    
    * add to coverage tasks
    
    * Update CHANGES to mention extras changes (#36875)
---
 CHANGES.md                                          |  7 +++++--
 sdks/python/apache_beam/io/hadoopfilesystem.py      | 11 +++++++++--
 sdks/python/apache_beam/io/hadoopfilesystem_test.py |  7 +++++++
 sdks/python/setup.py                                |  2 +-
 sdks/python/tox.ini                                 |  9 +++++----
 5 files changed, 27 insertions(+), 9 deletions(-)

diff --git a/CHANGES.md b/CHANGES.md
index 68af5a342d7..33cf7070a5f 100644
--- a/CHANGES.md
+++ b/CHANGES.md
@@ -81,7 +81,7 @@ Now Beam has full support for Milvus integration including Milvus enrichment and
 
 ## Breaking Changes
 
-* X behavior was changed ([#X](https://github.com/apache/beam/issues/X)).
+* (Python) Some Python dependencies have been split out into extras. To ensure all previously installed dependencies are installed, when installing Beam you can `pip install apache-beam[gcp,interactive,yaml,redis,hadoop,tfrecord]`, though most users will not need all of these extras ([#34554](https://github.com/apache/beam/issues/34554)).
 
 ## Deprecations
 
@@ -123,7 +123,7 @@ Now Beam has full support for Milvus integration including Milvus enrichment and
   - This change only affects pipelines that explicitly use the `pickle_library=dill` pipeline option.
   - While `dill==0.3.1.1` is still pre-installed on the official Beam SDK base images, it is no longer a direct dependency of the apache-beam Python package. This means it can be overridden by other dependencies in your environment.
   - If your pipeline uses `pickle_library=dill`, you must manually ensure `dill==0.3.1.1` is installed in both your submission and runtime environments.
-    - Submission environment: Install the dill extra in your local environment `pip install apache-beam[gcpdill]`.
+    - Submission environment: Install the dill extra in your local environment `pip install apache-beam[gcp,dill]`.
     - Runtime (worker) environment: Your action depends on how you manage your worker's environment.
       - If using default containers or custom containers with the official Beam base image e.g. `FROM apache/beam_python3.10_sdk:2.69.0`
         - Add `dill==0.3.1.1` to your worker's requirements file (e.g., requirements.txt)
@@ -137,6 +137,9 @@ Now Beam has full support for Milvus integration including Milvus enrichment and
 * (Python) The deterministic fallback coder for complex types like NamedTuple, Enum, and dataclasses now normalizes filepaths for better determinism guarantees. This affects streaming pipelines updating from 2.68 to 2.69 that utilize this fallback coder. If your pipeline is affected, you may see a warning like: "Using fallback deterministic coder for type X...". To update safely, specify the pipeline option `--update_compatibility_version=2.68.0` ([#36345](https://github.com/apache/beam/p [...]
 * (Python) Fixed transform naming conflict when executing DataTransform on a dictionary of PColls ([#30445](https://github.com/apache/beam/issues/30445)).
   This may break update compatibility if you don't provide a `--transform_name_mapping`.
+* (Python) Split some extras out from the core Beam package ([#34554](https://github.com/apache/beam/issues/34554)).
+  - If you use Enrichment with redis, the Hadoop FileSystem, TFRecord, or some other packages, you may need to install extras.
+  - To retain identical behavior to before, instead of `pip install apache-beam`, use `pip install apache-beam[hadoop,gcp,interactive,redis,test,tfrecord]`.
 * Removed deprecated Hadoop versions (2.10.2 and 3.2.4) that are no longer supported for [Iceberg](https://github.com/apache/iceberg/issues/10940) from IcebergIO ([#36282](https://github.com/apache/beam/issues/36282)).
 * (Go) Coder construction on the SDK side is more faithful to the specs from runners, without stripping the length-prefix. This may break streaming pipeline update as the underlying coder could be changed ([#36387](https://github.com/apache/beam/issues/36387)).
 * Minimum Go version for Beam Go updated to 1.25.2 ([#36461](https://github.com/apache/beam/issues/36461)).
diff --git a/sdks/python/apache_beam/io/hadoopfilesystem.py b/sdks/python/apache_beam/io/hadoopfilesystem.py
index cf488c228a2..3287644eed8 100644
--- a/sdks/python/apache_beam/io/hadoopfilesystem.py
+++ b/sdks/python/apache_beam/io/hadoopfilesystem.py
@@ -26,8 +26,6 @@ import posixpath
 import re
 from typing import BinaryIO  # pylint: disable=unused-import
 
-import hdfs
-
 from apache_beam.io import filesystemio
 from apache_beam.io.filesystem import BeamIOError
 from apache_beam.io.filesystem import CompressedFile
@@ -37,6 +35,11 @@ from apache_beam.io.filesystem import FileSystem
 from apache_beam.options.pipeline_options import HadoopFileSystemOptions
 from apache_beam.options.pipeline_options import PipelineOptions
 
+try:
+  import hdfs
+except ImportError:
+  hdfs = None
+
 __all__ = ['HadoopFileSystem']
 
 _HDFS_PREFIX = 'hdfs:/'
@@ -108,6 +111,10 @@ class HadoopFileSystem(FileSystem):
     See :class:`~apache_beam.options.pipeline_options.HadoopFileSystemOptions`.
     """
     super().__init__(pipeline_options)
+    if hdfs is None:
+      raise ImportError(
+          'Failed to import hdfs. You can ensure it is '
+          'installed by installing the hadoop beam extra')
     logging.getLogger('hdfs.client').setLevel(logging.WARN)
     if pipeline_options is None:
       raise ValueError('pipeline_options is not set')
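
The guard added to hadoopfilesystem.py above is the standard optional-dependency pattern: attempt the import at module load, stash `None` on failure, and raise only when the feature is actually used, so the rest of the package stays importable without the extra. A minimal standalone sketch (the module name `fakedep` and the package/extra names in the error message are illustrative, not real):

```python
# Sketch of the optional-dependency pattern: import if available,
# defer the failure from import time to the point of use.
try:
    import fakedep  # hypothetical optional dependency
except ImportError:
    fakedep = None


class NeedsFakedep:
    """Raises at construction, not at package import, when the extra is missing."""
    def __init__(self):
        if fakedep is None:
            raise ImportError(
                'Failed to import fakedep. You can ensure it is installed '
                'by installing the matching extra, e.g. '
                'pip install mypackage[fakedep].')
```

Callers that never construct the class pay no cost; only actual users of the feature see the actionable error.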
diff --git a/sdks/python/apache_beam/io/hadoopfilesystem_test.py b/sdks/python/apache_beam/io/hadoopfilesystem_test.py
index 8c21effc882..eb0925224dd 100644
--- a/sdks/python/apache_beam/io/hadoopfilesystem_test.py
+++ b/sdks/python/apache_beam/io/hadoopfilesystem_test.py
@@ -32,6 +32,11 @@ from apache_beam.io.filesystem import BeamIOError
 from apache_beam.options.pipeline_options import HadoopFileSystemOptions
 from apache_beam.options.pipeline_options import PipelineOptions
 
+try:
+  import hdfs as actual_hdfs
+except ImportError:
+  actual_hdfs = None
+
 
 class FakeFile(io.BytesIO):
   """File object for FakeHdfs"""
@@ -201,6 +206,7 @@ class FakeHdfs(object):
 
 
 @parameterized_class(('full_urls', ), [(False, ), (True, )])
[email protected](actual_hdfs is None, "hdfs extra not installed")
 class HadoopFileSystemTest(unittest.TestCase):
   def setUp(self):
     self._fake_hdfs = FakeHdfs()
@@ -607,6 +613,7 @@ class HadoopFileSystemTest(unittest.TestCase):
     self.assertFalse(self.fs.exists(url2))
 
 
[email protected](actual_hdfs is None, "hdfs extra not installed")
 class HadoopFileSystemRuntimeValueProviderTest(unittest.TestCase):
   """Tests pipeline_options, in the form of a
   RuntimeValueProvider.runtime_options object."""
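
The test-side half of the change gates whole test classes with `unittest.skipIf` so the suite still passes (reporting skips rather than errors) when the extra is not installed. A hedged sketch of the same pattern, again with an illustrative `fakedep` module:

```python
# Sketch of gating a TestCase on an optional dependency.
import unittest

try:
    import fakedep  # hypothetical optional dependency
except ImportError:
    fakedep = None


@unittest.skipIf(fakedep is None, "fakedep extra not installed")
class FakedepTest(unittest.TestCase):
    def test_uses_dep(self):
        # Only runs when the optional dependency imported successfully.
        self.assertIsNotNone(fakedep)
```

When `fakedep` is absent, the runner records the class's tests as skipped with the given reason instead of failing on an ImportError at collection time.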
diff --git a/sdks/python/setup.py b/sdks/python/setup.py
index 289433f9ea5..b700d796983 100644
--- a/sdks/python/setup.py
+++ b/sdks/python/setup.py
@@ -379,7 +379,6 @@ if __name__ == '__main__':
          # TODO(https://github.com/grpc/grpc/issues/37710): Unpin grpc
          'grpcio>=1.33.1,<2,!=1.48.0,!=1.59.*,!=1.60.*,!=1.61.*,!=1.62.0,!=1.62.1,<1.66.0; python_version <= "3.12"',  # pylint: disable=line-too-long
           'grpcio>=1.67.0; python_version >= "3.13"',
-          'hdfs>=2.1.0,<3.0.0',
           'httplib2>=0.8,<0.23.0',
           'jsonpickle>=3.0.0,<4.0.0',
           # numpy can have breaking changes in minor versions.
@@ -563,6 +562,7 @@ if __name__ == '__main__':
               # `--update` / `-U` flag to replace the dask release brought in
               # by distributed.
           ],
+          'hadoop': ['hdfs>=2.1.0,<3.0.0'],
           'yaml': [
               'docstring-parser>=0.15,<1.0',
               'jinja2>=3.0,<3.2',
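
The setup.py hunk above demotes a required dependency to a named extra: the pin leaves `install_requires` and reappears under a key in `extras_require`, making it opt-in via `pip install package[extra]`. A minimal sketch of that shape (the package name `mypackage` is illustrative; the `hdfs` pin mirrors the diff):

```python
# Sketch: moving a dependency out of the required set into a named extra.
install_requires = [
    'httplib2>=0.8,<0.23.0',  # unchanged, always installed
    # 'hdfs>=2.1.0,<3.0.0',   # removed from the required set by the change
]

extras_require = {
    # Opt in with: pip install mypackage[hadoop]
    'hadoop': ['hdfs>=2.1.0,<3.0.0'],
}

# In a real setup.py these dicts would be passed to
# setuptools.setup(install_requires=..., extras_require=...).
```

The extra's name (`hadoop`) is what users write inside the brackets; it need not match the distribution name of the package it pulls in (`hdfs`).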
diff --git a/sdks/python/tox.ini b/sdks/python/tox.ini
index da0932728b2..431cd186c1b 100644
--- a/sdks/python/tox.ini
+++ b/sdks/python/tox.ini
@@ -33,7 +33,7 @@ pip_pre = True
 # allow apps that support color to use it.
 passenv=TERM,CLOUDSDK_CONFIG,DOCKER_*,TESTCONTAINERS_*,TC_*,ALLOYDB_PASSWORD
 # Set [] options for pip installation of apache-beam tarball.
-extras = test,dataframe,redis,tfrecord,yaml
+extras = test,dataframe,hadoop,redis,tfrecord,yaml
 # Don't warn that these commands aren't installed.
 allowlist_externals =
   false
@@ -97,8 +97,8 @@ install_command = {envbindir}/python.exe {envbindir}/pip.exe install --retries 1
 list_dependencies_command = {envbindir}/python.exe {envbindir}/pip.exe freeze
 
 [testenv:py{310,311,312,313}-cloud]
-; extras = test,gcp,interactive,dataframe,aws,azure,redis
-extras = test,gcp,interactive,dataframe,aws,azure
+; extras = test,gcp,interactive,dataframe,aws,azure
+extras = test,hadoop,gcp,interactive,dataframe,aws,azure
 commands =
   python apache_beam/examples/complete/autocomplete_test.py
   bash {toxinidir}/scripts/run_pytest.sh {envname} "{posargs}"
@@ -173,7 +173,7 @@ setenv =
   TC_SLEEP_TIME = {env:TC_SLEEP_TIME:1}
 
# NOTE: we could add ml_test to increase the collected code coverage metrics, but it would make the suite slower.
-extras = test,gcp,interactive,dataframe,aws,redis
+extras = test,hadoop,gcp,interactive,dataframe,aws,redis
 commands =
  bash {toxinidir}/scripts/run_pytest.sh {envname} "{posargs}" "--cov-report=xml --cov=. --cov-append"
 
@@ -228,6 +228,7 @@ deps =
   holdup==1.8.0
 extras =
   gcp
+  hadoop
 allowlist_externals =
   bash
   echo
