Hi-
I am attempting to use the new Azure filesystem object in C++ with Arrow/Parquet
version 16.0.0. I already have code that works for GCS and AWS/S3, and I have
been waiting for quite a while for the Azure filesystem object to be released.
Now that it has been, I have been trying to use it, without success. I presumed
that it would work the same way as the GCS and S3/AWS filesystem objects: you
create the object, then use it as you would any other filesystem object. Note
that I am not using the Arrow methods to read/write the data but rather the
Parquet methods. This works for local, GCS and S3/AWS. However, I cannot open a
file on Azure. No matter which authentication method I try, it doesn't work, and
I get different results depending on which auth approach I take (client secret
versus account key, etc.). Here is a code summary of what I am trying to do:

    arrow::fs::AzureOptions azureOptions;
    arrow::Status configureStatus = arrow::Status::OK();

    // exact values obfuscated
    azureOptions.account_name = "mytest";
    azureOptions.dfs_storage_authority = ".dfs.core.windows.net";
    // If I don't do this, then blob.core.windows.net is used; I want dfs,
    // not blob, so... not certain why that happens either
    azureOptions.blob_storage_authority = ".dfs.core.windows.net";

    std::string client_id = "3f061894-blah";
    std::string client_secret = "2c796e9eblah";
    std::string tenant_id = "b1c14d5c-blah";
    //std::string account_key = "flMhWgNts+i/blah==";

    //configureStatus = azureOptions.ConfigureAccountKeyCredential( account_key );
    configureStatus = azureOptions.ConfigureClientSecretCredential(
        tenant_id, client_id, client_secret );
    //configureStatus = azureOptions.ConfigureManagedIdentityCredential( client_id );
    if( false == configureStatus.ok() )
    {
        // Uh-oh, throw
    }

    std::shared_ptr<arrow::fs::AzureFileSystem> azureFileSystem;
    arrow::Result<std::shared_ptr<arrow::fs::AzureFileSystem>> azureFileSystemResult =
        arrow::fs::AzureFileSystem::Make( azureOptions );
    if( true == azureFileSystemResult.ok() )
    {
        azureFileSystem = azureFileSystemResult.ValueOrDie();
    }
    else
    {
        // Uh-oh, throw
    }

    const std::string path( "parquet/ParquetFiles/plain.parquet" );
    std::shared_ptr<arrow::io::RandomAccessFile> arrowFile;
    std::cout << "1\n";
    arrow::Result<std::shared_ptr<arrow::io::RandomAccessFile>> openResult =
        azureFileSystem->OpenInputFile( path );
    std::cout << "2\n";

And that is where things run off the rails. At this point, all I want to do is
open the input file and create a Parquet file reader, like so:

    std::unique_ptr<parquet::ParquetFileReader> parquet_reader =
        parquet::ParquetFileReader::Open( arrowFile );

After that I would go about my business of reading/writing Parquet data as
normal, just as I do for the other filesystem objects. But the OpenInputFile()
method fails for Azure. With the account key configuration, the error I see is:

    adls_read
    Parquet file read commencing...
    1
    Parquet read error: map::at

The "1" is just a marker (the std::cout shown above) to show how far I got in
reading a pre-existing Parquet file from the Azure server; a low-brow means of
debugging. I never get to "2".
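
For what it's worth, a bare "map::at" looks to me like the what() text of a
std::out_of_range thrown by std::map::at() on a missing key (that is the wording
libstdc++ uses; other standard libraries word it differently). So my guess is
that some key lookup is failing somewhere below OpenInputFile(), though I have
not confirmed where. A minimal standalone sketch of that behavior, with no Arrow
involved (the helper name lookup_error is made up for illustration):

```cpp
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical helper: look up a key that may be missing and return the
// exception text, or an empty string if the key was found.
std::string lookup_error(const std::map<std::string, std::string>& m,
                         const std::string& key) {
  try {
    (void)m.at(key);  // throws std::out_of_range when the key is absent
    return "";
  } catch (const std::out_of_range& e) {
    // The exact wording is implementation-defined; libstdc++ uses "map::at".
    return e.what();
  }
}
```

Calling lookup_error() with a key that is not in the map returns the library's
message instead of letting the exception escape, which is roughly what my own
catch block is printing after "Parquet read error:".
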
When attempting the client secret credential auth, I see the following failure:

    adls_read
    Parquet file read commencing...
    1
    Parquet read error: GetToken(): error response: 401 Unauthorized

Then when attempting the Managed Identity auth configuration, I get the
following:

    adls_read
    Parquet file read commencing...
    1
    ^C

Here the process just hangs and I have to interrupt it. Note that I didn't have
this level of difficulty when I implemented our support for GCS and S3/AWS;
those were relatively straightforward. Azure has been more difficult, and I
expected it to just work: you create the filesystem object, then you are
supposed to be able to use it in the same manner as any other Arrow filesystem
object. However, I can't open a file, and I suspect it is due to some type of
handshaking issue with Azure. Azure has all of these moving parts: tenant ID,
application/client ID, client secret, object ID (which we don't use in this
case), and the list goes on. Finally, I saw this in the azurefs.h header at
line 102:

    // TODO(GH-38598): Add support for more auth methods.
    // std::string connection_string;
    // std::string sas_token;

But it seemed clear to me that this refers to auth methods other than those
that have been implemented thus far (i.e. client secret, account key, etc.).
Am I correct?
So my questions are:
1. Any ideas where I am going wrong here?
2. Has anyone else used the Azure filesystem object?
3. Has it worked for you?
4. If so, what was your approach?
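
One thing I have not ruled out is the path itself. As far as I can tell from the
sources, the Azure filesystem treats the first segment of a path as the
container (the ADLS "filesystem") and the remainder as the path inside it, so
"parquet/ParquetFiles/plain.parquet" would mean a container named "parquet". A
minimal sketch of that interpretation, assuming my reading is right
(ContainerPath and split_container_path are hypothetical names of mine, not
Arrow API):

```cpp
#include <string>

// Hypothetical illustration type, not part of Arrow.
struct ContainerPath {
  std::string container;  // first path segment
  std::string path;       // everything after the first '/'
};

// Split "container/dir/file" at the first '/'. Assumes no leading slash
// and no URI scheme; a string without '/' is taken as a bare container.
ContainerPath split_container_path(const std::string& full) {
  const std::string::size_type sep = full.find('/');
  if (sep == std::string::npos) {
    return {full, ""};
  }
  return {full.substr(0, sep), full.substr(sep + 1)};
}
```

If that is how it works, the container on my account would have to be named
"parquet" for the path above to resolve.
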
Note that I did peruse the azurefs_test.cc for examples. I did see various
approaches. One involved invoking the MakeDataLakeServiceClient() method. It
wasn't clear if I needed to do that or not, but then I saw that this is done
during the private implementation of the AzureFileSystem's Make() method, thus:

    static Result<std::unique_ptr<AzureFileSystem::Impl>> Make(AzureOptions options,
                                                                io::IOContext io_context) {
      auto self = std::unique_ptr<AzureFileSystem::Impl>(
          new AzureFileSystem::Impl(std::move(options), std::move(io_context)));
      ARROW_ASSIGN_OR_RAISE(self->blob_service_client_,
                            self->options_.MakeBlobServiceClient());
      ARROW_ASSIGN_OR_RAISE(self->datalake_service_client_,
                            self->options_.MakeDataLakeServiceClient());
      return self;
    }

So it seemed like I wouldn't need to do it separately.
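
Since Make() builds both a Blob client and a Data Lake client, my understanding
is that each one gets its own endpoint, composed from the account name and the
corresponding authority option. A rough sketch of that composition, assuming an
https scheme and the simple "account name + authority" form (build_endpoint is a
hypothetical helper of mine, not Arrow's code):

```cpp
#include <string>

// Hypothetical helper: compose a service endpoint from an account name and
// an authority suffix such as ".blob.core.windows.net". Assumes https and
// an authority that starts with '.'.
std::string build_endpoint(const std::string& account_name,
                           const std::string& authority) {
  return "https://" + account_name + authority + "/";
}
```

With account "mytest" this gives https://mytest.blob.core.windows.net/ for the
Blob client and https://mytest.dfs.core.windows.net/ for the Data Lake client.
If that is right, overriding blob_storage_authority with the dfs suffix, as I
did above, points the Blob client at the dfs endpoint, which may be part of my
problem.
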
Anyway, I need to get this working ASAP, so I am open to feedback. I'll
continue plugging away.
Thanks!
Jerry