Hi-

I am attempting to use the new Azure filesystem object in C++ with Arrow/Parquet 
version 16.0.0.  I already have code that works for GCS and AWS/S3, and I have 
been waiting quite a while for the Azure filesystem object to be released.  
Now that it has been (in 16.0.0), I have been trying to use it, without 
success.  I presumed that it would work the same way as the GCS and S3/AWS 
filesystem objects: you create the object, then use it exactly as you would 
the other filesystem objects.  Note that I am not using the Arrow methods to 
read/write the data but rather the Parquet methods.  This works for local, GCS 
and S3/AWS.  However, I cannot open a file on Azure.  No matter which 
authentication method I try, it doesn't work, and I get different results 
depending on which auth approach I take (client secret versus account key, 
etc.).  Here is a code summary of what I am trying to do:

      arrow::fs::AzureOptions   azureOptions;
      arrow::Status             configureStatus = arrow::Status::OK();

      // exact values obfuscated
      azureOptions.account_name = "mytest";
      azureOptions.dfs_storage_authority = ".dfs.core.windows.net";
      // If I don't do this, then blob.core.windows.net is used;
      // I want dfs not blob, so... not certain why that happens either
      azureOptions.blob_storage_authority = ".dfs.core.windows.net";
      std::string  client_id  = "3f061894-blah";
      std::string  client_secret  = "2c796e9eblah";
      std::string  tenant_id  = "b1c14d5c-blah";
      //std::string  account_key  = "flMhWgNts+i/blah==";


      //configureStatus = azureOptions.ConfigureAccountKeyCredential( account_key );
      configureStatus = azureOptions.ConfigureClientSecretCredential(
         tenant_id, client_id, client_secret );
      //configureStatus = azureOptions.ConfigureManagedIdentityCredential( client_id );
      if( false == configureStatus.ok() )
      {
         // Uh-oh, throw

      }

      std::shared_ptr<arrow::fs::AzureFileSystem>   azureFileSystem;
      arrow::Result<std::shared_ptr<arrow::fs::AzureFileSystem>>   azureFileSystemResult =
         arrow::fs::AzureFileSystem::Make( azureOptions );
      if( true == azureFileSystemResult.ok() )
      {
         azureFileSystem = azureFileSystemResult.ValueOrDie();

      }
      else
      {
         // Uh-oh, throw

      }

         //const std::string path( "parquet/ParquetFiles/plain.parquet" );
         const std::string path( "parquet/ParquetFiles/plain.parquet" );
         std::shared_ptr<arrow::io::RandomAccessFile> arrowFile;
std::cout << "1\n";
         arrow::Result<std::shared_ptr<arrow::io::RandomAccessFile>> openResult =
            azureFileSystem->OpenInputFile( path );
std::cout << "2\n";

And that is where things run off the rails.  At this point, all I want to do is 
open the input file and create a Parquet file reader, like so:

         std::unique_ptr<parquet::ParquetFileReader> parquet_reader =
            parquet::ParquetFileReader::Open( arrowFile );

Then go about my business of reading/writing Parquet data as normal, i.e. 
just as I do for the other filesystem objects.  But the OpenInputFile() method 
fails in the Azure case.  If I attempt the account key configuration, then the 
error I see is:

adls_read
Parquet file read commencing...
1
Parquet read error: map::at

Where the "1" is just a marker to show how far I got in the process of reading 
a pre-existing Parquet file from the Azure server, i.e. a low-brow means of 
debugging.  The cout is shown above.  I don't get to "2", obviously.
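
One observation on the "map::at" text itself: it matches the message that std::map::at() attaches to the std::out_of_range exception it throws for a missing key (with GNU libstdc++ the what() string is exactly "map::at"; other standard libraries word it differently).  That suggests, but does not prove, that the failure surfaces as a C++ exception from a map lookup somewhere in the call chain rather than as an arrow::Status.  Where that lookup happens cannot be determined from the output above.  The following is a minimal sketch in plain standard C++, with no Arrow or Azure calls, showing where such a message comes from; LookupOrError is a hypothetical helper introduced only for illustration.

```cpp
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of Arrow or the Azure SDK): returns the value
// stored under `key`, or the text of the exception thrown by std::map::at()
// when the key is absent.
std::string LookupOrError(const std::map<std::string, std::string>& table,
                          const std::string& key) {
  try {
    // std::map::at() throws std::out_of_range if `key` is not in the map.
    return table.at(key);
  } catch (const std::out_of_range& e) {
    // The exact what() text is implementation-defined.
    return std::string("error: ") + e.what();
  }
}
```

Calling LookupOrError with a key that is not in the map returns a string starting with "error: " followed by the library's message.  Catching std::exception around the OpenInputFile() call and printing what() in the same way would at least confirm whether the failure is an exception of this kind.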

When attempting the client secret credential auth, I see the following failure:

adls_read
Parquet file read commencing...
1
Parquet read error: GetToken(): error response: 401 Unauthorized

Then when attempting the Managed Identity auth configuration, I get the 
following:

adls_read
Parquet file read commencing...
1
^C

Where the process just hangs and I have to interrupt it.  Note that I didn't 
have this level of difficulty when I implemented our support for GCS and 
S3/AWS; those were relatively straightforward.  Azure, however, has been more 
difficult, and this should just work: you create the filesystem object, then 
you are supposed to be able to use it in the same manner as any other Arrow 
filesystem object.  However, I can't open a file, and I suspect it is due to 
some type of handshaking issue with Azure.  Azure has all of these moving 
parts: tenant ID, application/client ID, client secret, object ID (which we 
don't use in this case), and the list goes on.  Finally, I saw this in the 
azurefs.h header at line 102:

  // TODO(GH-38598): Add support for more auth methods.
  // std::string connection_string;
  // std::string sas_token;

It seemed clear to me that this refers to auth methods other than those that 
have been implemented thus far (i.e. client secret, account key, etc.).  Am I 
correct?

So my questions are:

  1.  Any ideas where I am going wrong here?
  2.  Has anyone else used the Azure filesystem object?
  3.  Has it worked for you?
  4.  If so, what was your approach?

Note that I did peruse azurefs_test.cc for examples and saw various 
approaches.  One involved invoking the MakeDataLakeServiceClient() method.  It 
wasn't clear whether I needed to do that, but then I saw that it is done in 
the private implementation of AzureFileSystem's Make() method, thus:

  static Result<std::unique_ptr<AzureFileSystem::Impl>> Make(AzureOptions options,
                                                             io::IOContext io_context) {
    auto self = std::unique_ptr<AzureFileSystem::Impl>(
        new AzureFileSystem::Impl(std::move(options), std::move(io_context)));
    ARROW_ASSIGN_OR_RAISE(self->blob_service_client_,
                          self->options_.MakeBlobServiceClient());
    ARROW_ASSIGN_OR_RAISE(self->datalake_service_client_,
                          self->options_.MakeDataLakeServiceClient());
    return self;
  }

So it seemed like I wouldn't need to do it separately.

Anyway, I need to get this working ASAP, so I am open to feedback.  I'll 
continue plugging away.

Thanks!
Jerry
