creste opened a new issue, #24210: URL: https://github.com/apache/beam/issues/24210
### What would you like to happen?

# Problem

Currently, the Azure Filesystem for the Python SDK only supports authenticating via the [`AZURE_STORAGE_CONNECTION_STRING`](https://github.com/apache/beam/blob/b952b41788acc20edbe5b75b2196f30dbf8fdeb0/sdks/python/apache_beam/io/azure/blobstorageio.py#L109) environment variable. That approach has several limitations:

- The `AZURE_STORAGE_CONNECTION_STRING` environment variable must be defined on every system where the pipeline executes. This is difficult to configure when using Beam worker-pool sidecar containers with the FlinkRunner, because Flink may be running in session mode with different Beam pipelines needing different connection strings.
- The call to [`BlobServiceClient.from_connection_string()`](https://github.com/apache/beam/blob/b952b41788acc20edbe5b75b2196f30dbf8fdeb0/sdks/python/apache_beam/io/azure/blobstorageio.py#L111) does not support all of the authentication methods supported by [`DefaultAzureCredential`](https://learn.microsoft.com/en-us/python/api/overview/azure/identity-readme?view=azure-python#defaultazurecredential). For my use case in particular, it does not support [Managed Identity](https://learn.microsoft.com/en-us/azure/active-directory/managed-identities-azure-resources/overview) credentials.

# Solution

I plan to address the above limitations in a PR by adding the new Azure-specific pipeline options described below.

## `--azure_blob_storage_connection_string`

Specifies the [Azure Storage Connection String](https://learn.microsoft.com/en-us/azure/storage/common/storage-configure-connection-string). Can be used instead of the `AZURE_STORAGE_CONNECTION_STRING` environment variable or the new `--azure_blob_storage_account_url` pipeline option described below.

Example:

```bash
python -m apache_beam.examples.wordcount \
  --input azfs://devstoreaccount1/container/* \
  --output azfs://devstoreaccount1/container/py-wordcount-integration \
  --azure_blob_storage_connection_string "DefaultEndpointsProtocol=https;AccountName=devstoreaccount1;AccountKey=Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==;BlobEndpoint=https://azurite:10000/devstoreaccount1;"
```

## `--azure_blob_storage_account_url`

Specifies the [Azure Blob Storage Account Endpoint URL](https://learn.microsoft.com/en-us/azure/storage/common/storage-account-overview#standard-endpoints). Can be used instead of the `AZURE_STORAGE_CONNECTION_STRING` environment variable or the new `--azure_blob_storage_connection_string` pipeline option described above. This pipeline option uses [`DefaultAzureCredential()`](https://learn.microsoft.com/en-us/python/api/overview/azure/identity-readme?view=azure-python#authenticate-with-defaultazurecredential) to authenticate.

Example:

```bash
python -m apache_beam.examples.wordcount \
  --input azfs://devstoreaccount1/container/* \
  --output azfs://devstoreaccount1/container/py-wordcount-integration \
  --azure_blob_storage_account_url https://mystorageaccount.blob.core.windows.net/
```

## `--azure_managed_identity_client_id`

Specifies the Managed Identity Client ID. Can only be used together with `--azure_blob_storage_account_url`. This pipeline option uses [`DefaultAzureCredential(managed_identity_client_id=client_id)`](https://learn.microsoft.com/en-us/python/api/overview/azure/identity-readme?view=azure-python#specify-a-user-assigned-managed-identity-for-defaultazurecredential) to authenticate.
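For illustration, here is a rough sketch of how the three proposed options could map onto the Azure SDK clients inside `blobstorageio`. The helper function and its signature are hypothetical and only show the intended wiring, not the actual implementation:

```python
# Illustrative sketch only: the helper name and signature are hypothetical,
# not the actual blobstorageio implementation.
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient


def make_blob_service_client(connection_string=None,
                             account_url=None,
                             managed_identity_client_id=None):
  """Builds a BlobServiceClient from the proposed Azure pipeline options."""
  if connection_string:
    # --azure_blob_storage_connection_string (or the existing
    # AZURE_STORAGE_CONNECTION_STRING environment variable).
    return BlobServiceClient.from_connection_string(connection_string)
  if account_url:
    # --azure_blob_storage_account_url, optionally combined with
    # --azure_managed_identity_client_id to select a user-assigned managed
    # identity; otherwise DefaultAzureCredential tries its usual chain of
    # credential types.
    credential = DefaultAzureCredential(
        managed_identity_client_id=managed_identity_client_id)
    return BlobServiceClient(account_url=account_url, credential=credential)
  raise ValueError('Either a connection string or an account URL is required.')
```

The CLI example below then exercises the managed-identity path end to end.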
Example:

```bash
python -m apache_beam.examples.wordcount \
  --input azfs://devstoreaccount1/container/* \
  --output azfs://devstoreaccount1/container/py-wordcount-integration \
  --azure_blob_storage_account_url https://devstoreaccount1.blob.core.windows.net/ \
  --azure_managed_identity_client_id ca6cc1a3-4b82-48bd-97ca-8e799c0abff6
```

# Testing

Per https://github.com/apache/beam/issues/20511, the Azure Filesystem does not have integration tests against Azure or [Azurite](https://learn.microsoft.com/en-us/azure/storage/common/storage-use-azurite?tabs=visual-studio). I plan to add integration tests for the new pipeline options that run against Azurite, similar to how [HDFS does its integration tests](https://github.com/apache/beam/tree/master/sdks/python/apache_beam/io/hdfs_integration_test). A rough sketch of what such a test could look like appears at the end of this issue.

### Issue Priority

Priority: 2

### Issue Component

Component: io-py-ideas
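As referenced in the Testing section above, below is a hypothetical sketch of an Azurite-backed integration test for the proposed connection-string option. It assumes the new option has been implemented and that an Azurite instance is reachable at `http://localhost:10000` (for example, started from a docker-compose file in the style of the HDFS integration tests); the container and output names are placeholders, and the credentials are the well-known Azurite development-account values:

```python
# Hypothetical integration test sketch, assuming the proposed
# --azure_blob_storage_connection_string option exists and Azurite is running
# locally on port 10000 with the default development account.
import unittest

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from azure.storage.blob import BlobServiceClient

AZURITE_CONNECTION_STRING = (
    'DefaultEndpointsProtocol=http;'
    'AccountName=devstoreaccount1;'
    'AccountKey=Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==;'
    'BlobEndpoint=http://localhost:10000/devstoreaccount1;')


class AzureBlobStorageOptionsTest(unittest.TestCase):

  def test_write_with_connection_string_option(self):
    # Create the target container directly through the Azure SDK.
    service = BlobServiceClient.from_connection_string(
        AZURITE_CONNECTION_STRING)
    service.create_container('container')

    # Run a tiny pipeline that writes through azfs:// using the proposed
    # pipeline option instead of the AZURE_STORAGE_CONNECTION_STRING
    # environment variable.
    options = PipelineOptions([
        '--azure_blob_storage_connection_string', AZURITE_CONNECTION_STRING,
    ])
    with beam.Pipeline(options=options) as p:
      _ = (
          p
          | beam.Create(['hello', 'world'])
          | beam.io.WriteToText('azfs://devstoreaccount1/container/output'))

    # Verify that the pipeline actually produced output blobs in Azurite.
    blobs = list(
        service.get_container_client('container').list_blobs(
            name_starts_with='output'))
    self.assertTrue(blobs)


if __name__ == '__main__':
  unittest.main()
```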
