Hi Kou!

Well, I thought it was strange too.  I was not aware that if data lake storage 
is available then AzureFS will use it automatically.  Thank you for that 
information, it helps.  With that in mind, I commented out both of those lines 
and just let the default values be assigned (which occurs in azurefs.h).
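
For reference, a minimal sketch of how the two endpoints are presumably derived once the defaults are left alone. The struct and BuildEndpoint() below are hypothetical stand-ins, not Arrow code, and the default strings are an assumption based on the azurefs.h header, so check them against your Arrow version:

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the two authority fields of arrow::fs::AzureOptions.
// The default values are an assumption taken from azurefs.h; verify them
// against the header that ships with your Arrow version.
struct StorageAuthorities {
  std::string blob_storage_authority = ".blob.core.windows.net";
  std::string dfs_storage_authority = ".dfs.core.windows.net";
};

// Illustrative helper (not an Arrow function): joins the account name and an
// authority suffix into "https://<account><authority>/".
std::string BuildEndpoint(const std::string& account_name,
                          const std::string& authority) {
  assert(!account_name.empty());  // an empty account name cannot form a host
  return "https://" + account_name + authority + "/";
}
```

With account_name = "mytest", the blob authority yields a blob.core.windows.net host and the dfs authority yields a dfs.core.windows.net host, so pointing blob_storage_authority at the dfs suffix would send Blob API calls to the Data Lake endpoint.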

With that modification, if I attempt an account key configuration, thus:

      configureStatus = azureOptions.ConfigureAccountKeyCredential( account_key );

Then it works!  I can read the Parquet file via the methods in the Parquet 
library!

However if I use the client secret configuration, thus:

      configureStatus = azureOptions.ConfigureClientSecretCredential( tenant_id, client_id, client_secret );

Then I see the unauthorized error, thus:

adls_read
Parquet file read commencing...
configureStatus = OK
1
Parquet read error: GetToken(): error response: 401 Unauthorized

And if I use the managed identity configuration, thus:

      configureStatus = azureOptions.ConfigureManagedIdentityCredential( client_id );

Then I see the hang, thus:

adls_read
Parquet file read commencing...
configureStatus = OK
1
^C

So I'm not sure what to make of those two configuration attempts.  I have 
double-checked the values against the Azure portal that we use and they are 
correct.  So perhaps some other type of limitation is being imposed here?  I'd 
like to offer the user several ways to authenticate, ergo they could use client 
secret or account key or managed identity, etc.  However at the moment only 
account key is working.  I'll continue to see what I can figure out.  If you've 
seen this type of behavior in the past and recognize the error that is at play, 
I'd appreciate any feedback.
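
A minimal sketch of what such a user-facing choice could look like. Everything here is hypothetical: the setting names ("account_key" and so on), the AuthMethod enum and both helper functions are made up for illustration, and the sketch only names the AzureOptions call that each choice would be routed to rather than calling Arrow itself:

```cpp
#include <map>
#include <string>

// Hypothetical enum covering the three auth methods tried in this thread.
enum class AuthMethod { kAccountKey, kClientSecret, kManagedIdentity };

// Hypothetical helper: maps a user-supplied setting name to an auth method.
// Returns false (and leaves *out untouched) when the name is not recognized.
bool ParseAuthMethod(const std::string& name, AuthMethod* out) {
  static const std::map<std::string, AuthMethod> kMethods = {
      {"account_key", AuthMethod::kAccountKey},
      {"client_secret", AuthMethod::kClientSecret},
      {"managed_identity", AuthMethod::kManagedIdentity}};
  const auto it = kMethods.find(name);
  if (it == kMethods.end()) return false;
  *out = it->second;
  return true;
}

// Names the AzureOptions method each choice would be dispatched to. Real code
// would call that method and check the returned arrow::Status.
std::string ConfigureCallFor(AuthMethod method) {
  switch (method) {
    case AuthMethod::kAccountKey:
      return "ConfigureAccountKeyCredential";
    case AuthMethod::kClientSecret:
      return "ConfigureClientSecretCredential";
    case AuthMethod::kManagedIdentity:
      return "ConfigureManagedIdentityCredential";
  }
  return "";
}
```

Rejecting unknown names up front keeps a typo in the setting from silently falling through to a default credential.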

Thanks!
Jerry


-----Original Message-----
From: Sutou Kouhei <[email protected]> 
Sent: Wednesday, July 10, 2024 4:34 PM
To: [email protected]
Subject: Re: Using the new Azure filesystem object (C++)

Hi,

>       azureOptions.blob_storage_authority = ".dfs.core.windows.net";
>       // If I don't do this, then the blob.core.windows.net is used;
>       // I want dfs not blob, so... not certain why that happens either

This is strange. In general, you should not do this.
AzureFS uses both the Blob Storage API and the Data Lake Storage API. If the 
Data Lake Storage API is available, AzureFS uses it automatically. So you 
should not change blob_storage_authority.

What happens if you don't have this line?


Thanks,
--
kou

In <dm3pr05mb1054334eeaeae4a95805de322f3...@dm3pr05mb10543.namprd05.prod.outlook.com>
  "Using the new Azure filesystem object (C++)" on Wed, 10 Jul 2024 16:58:52 +0000,
  "Jerry Adair via user" <[email protected]> wrote:

> Hi-
>
> I am attempting to use the new Azure filesystem object in C++.  Arrow/Parquet 
> version 16.0.0.  I already have code that works for GCS and AWS/S3.  I have 
> been waiting for quite a while to see the new Azure filesystem object 
> released.  Now that it has in this version (16.0.0) I have been trying to use 
> it.  Without success.  I presumed that it would work in the same manner in 
> which the GCS and S3/AWS filesystem objects work.  You create the object, 
> then you can use it in the same manner that you used the other filesystem 
> objects.  Note that I am not using Arrow methods to read/write the data but 
> rather the Parquet methods.  This works for local, GCS and S3/AWS.  However I 
> cannot open a file on Azure.  It seems like no matter which authentication 
> method I try to use, it doesn't work.  And I get different results depending 
> on which auth approach I take (client secret versus account key, etc.).  Here 
> is a code summary of what I am trying to do:
>
>       arrow::fs::AzureOptions   azureOptions;
>       arrow::Status             configureStatus = arrow::Status::OK();
>
>      // exact values obfuscated
>       azureOptions.account_name = "mytest";
>       azureOptions.dfs_storage_authority = ".dfs.core.windows.net";
>       azureOptions.blob_storage_authority = ".dfs.core.windows.net";
>       // If I don't do this, then the blob.core.windows.net is used;
>       // I want dfs not blob, so... not certain why that happens either
>       std::string  client_id  = "3f061894-blah";
>       std::string  client_secret  = "2c796e9eblah";
>       std::string  tenant_id  = "b1c14d5c-blah";
>       //std::string  account_key  = "flMhWgNts+i/blah==";
>
>
>       //configureStatus = azureOptions.ConfigureAccountKeyCredential( account_key );
>       configureStatus = azureOptions.ConfigureClientSecretCredential( tenant_id, client_id, client_secret );
>       //configureStatus = azureOptions.ConfigureManagedIdentityCredential( client_id );
>       if( false == configureStatus.ok() )
>       {
>          // Uh-oh, throw
>
>       }
>
>       std::shared_ptr<arrow::fs::AzureFileSystem>   azureFileSystem;
>       arrow::Result<std::shared_ptr<arrow::fs::AzureFileSystem>>   azureFileSystemResult = arrow::fs::AzureFileSystem::Make( azureOptions );
>       if( true == azureFileSystemResult.ok() )
>       {
>          azureFileSystem = azureFileSystemResult.ValueOrDie();
>
>       }
>       else
>       {
>          // Uh-oh, throw
>
>       }
>
>          //const std::string path( "parquet/ParquetFiles/plain.parquet" );
>          const std::string path( "parquet/ParquetFiles/plain.parquet" );
>          std::shared_ptr<arrow::io::RandomAccessFile> arrowFile;
>          std::cout << "1\n";
>          arrow::Result<std::shared_ptr<arrow::io::RandomAccessFile>> openResult = azureFileSystem->OpenInputFile( path );
>          std::cout << "2\n";
>
> And that is where things run off the rails.  At this point, all I want to do 
> is open the input file, create a Parquet file reader like so:
>
>          std::unique_ptr<parquet::ParquetFileReader> parquet_reader = parquet::ParquetFileReader::Open( arrowFile );
>
> Then go about my business of reading/writing Parquet data as per normal.  
> Ergo, just as I do for the other filesystem objects.  But the OpenInputFile() 
> method fails for the Azure use case scenario.  If I attempt the account key 
> configuration, then the error I see is:
>
> adls_read
> Parquet file read commencing...
> 1
> Parquet read error: map::at
>
> Where the "1" is just a marker to show how far I got in the process of 
> reading a pre-existing Parquet file from the Azure server.  Ergo, a low-brow 
> means of debugging.  The cout is shown above.  I don't get to "2", obviously.
>
> When attempting the client secret credential auth, I see the following 
> failure:
>
> adls_read
> Parquet file read commencing...
> 1
> Parquet read error: GetToken(): error response: 401 Unauthorized
>
> Then when attempting the Managed Identity auth configuration, I get the 
> following:
>
> adls_read
> Parquet file read commencing...
> 1
> ^C
>
> Where the process just hangs and I have to interrupt out of it.  Note that I 
> didn't have this level of difficulty when I implemented our support for GCS 
> and S3/AWS.  Those were relatively straightforward.  Azure however has been 
> more difficult;  this should just work.  I mean, you create the filesystem 
> object, then you are supposed to be able to use it in the same manner that 
> you use any other Arrow filesystem object.  However I can't open a file and I 
> suspect it is due to some type of handshaking issue with Azure.  Azure has 
> all of these moving parts; tenant ID, application/client ID, client secret, 
> object ID (which we don't use in this case) and the list goes on.  Finally, I 
> saw this in the azurefs.h header at line 102:
>
>   // TODO(GH-38598): Add support for more auth methods.
>   // std::string connection_string;
>   // std::string sas_token;
>
> But it seemed clear to me that this was referring to other auth methods than 
> those that have been implemented thus far (ergo client secret, account key, 
> etc.).  Am I correct?
>
> So my questions are:
>
>   1.  Any ideas where I am going wrong here?
>   2.  Has anyone else used the Azure filesystem object?
>   3.  Has it worked for you?
>   4.  If so, what was your approach?
>
> Note that I did peruse the azurefs_test.cc for examples.  I did see various 
> approaches.  One involved invoking the MakeDataLakeServiceClient() method.  
> It wasn't clear if I needed to do that or not, but then I saw that this is 
> done during the private implementation of the AzureFileSystem's Make() 
> method, thus:
>
>   static Result<std::unique_ptr<AzureFileSystem::Impl>> Make(AzureOptions options,
>                                                              io::IOContext io_context) {
>     auto self = std::unique_ptr<AzureFileSystem::Impl>(
>         new AzureFileSystem::Impl(std::move(options), std::move(io_context)));
>     ARROW_ASSIGN_OR_RAISE(self->blob_service_client_,
>                           self->options_.MakeBlobServiceClient());
>     ARROW_ASSIGN_OR_RAISE(self->datalake_service_client_,
>                           self->options_.MakeDataLakeServiceClient());
>     return self;
>   }
>
> So it seemed like I wouldn't need to do it separately.
>
> Anyway, I need to get this working ASAP, so I am open to feedback.  I'll 
> continue plugging away.
>
> Thanks!
> Jerry
