Hi Kou,

Thank you for the help.  After enough digging, I figured it out.  The answer 
is that the code in the library works as expected.  As I suspected, the issue 
was permissions related and lay on the Azure side.  Specifically, for the 
client secret method of authentication to work, you must assign the Storage 
Blob Data Contributor role on the storage account that you want to access.  
Once I made that role assignment, I was able to run the standalone sample 
program that uses the Arrow C++ library to access Parquet data on an ADLS 
server.
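For anyone who hits the same wall, here is a minimal sketch of the flow that 
worked for me once the role assignment was in place.  The account name, 
tenant/client IDs, secret value, and blob path below are placeholders, not 
real values:

```cpp
#include <iostream>
#include <memory>

#include <arrow/filesystem/azurefs.h>
#include <arrow/io/interfaces.h>
#include <parquet/file_reader.h>

int main() {
  arrow::fs::AzureOptions options;
  options.account_name = "mystorageaccount";  // placeholder

  // Client secret auth: pass the secret's *value*, not its secret ID.
  // Requires the Storage Blob Data Contributor role assigned on the account.
  arrow::Status st = options.ConfigureClientSecretCredential(
      "my-tenant-id", "my-client-id", "my-client-secret-value");  // placeholders
  if (!st.ok()) {
    std::cerr << "Configure failed: " << st.ToString() << "\n";
    return 1;
  }

  // Let AzureFS pick blob vs. data lake endpoints; don't override the
  // *_storage_authority fields.
  auto fs_result = arrow::fs::AzureFileSystem::Make(options);
  if (!fs_result.ok()) {
    std::cerr << "Make failed: " << fs_result.status().ToString() << "\n";
    return 1;
  }
  std::shared_ptr<arrow::fs::AzureFileSystem> fs = *fs_result;

  // Path is "container/blob-path" relative to the storage account.
  auto file_result = fs->OpenInputFile("parquet/ParquetTestData/plain.parquet");
  if (!file_result.ok()) {
    std::cerr << "Open failed: " << file_result.status().ToString() << "\n";
    return 1;
  }

  // Hand the RandomAccessFile to the Parquet reader as usual.
  std::unique_ptr<parquet::ParquetFileReader> reader =
      parquet::ParquetFileReader::Open(*file_result);
  std::cout << "num rows: " << reader->metadata()->num_rows() << "\n";
  return 0;
}
```

Build against Arrow/Parquet 16.x with Azure filesystem support enabled 
(link `arrow` and `parquet`).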

Thanks again!
Jerry


-----Original Message-----
From: Sutou Kouhei <[email protected]>
Sent: Wednesday, July 24, 2024 3:42 AM
To: [email protected]
Subject: Re: Using the new Azure filesystem object (C++)

EXTERNAL

Hi,

Sorry for not responding to this. I haven't had enough time to try it yet... I 
hope that I can try it tomorrow...

(If anyone can help with this, please do.)

Thanks,
--
kou

In
 
<dm3pr05mb10543135ae1d24029ca133277f3...@dm3pr05mb10543.namprd05.prod.outlook.com>
  "RE: Using the new Azure filesystem object (C++)" on Wed, 24 Jul 2024 
05:08:52 +0000,
  "Jerry Adair via user" <[email protected]> wrote:

> Hi Kou,
>
> Alright, I have made it past the 401 error, which means that the recipient 
> doesn't know who you are.  I did this by creating a new storage account 
> within our tenant in the Azure portal.  Because I was the owner of the new 
> account, I could create a client secret for it.  I also learned that you need 
> the value of that client secret and not the secret ID when invoking the 
> ConfigureClientSecretCredential() method within the AzureOptions object.  
> However, I now encounter a 403 error code:
>
> Parquet read error: Unable to retrieve information for the file named 
> parquet/ParquetTestData/plain.parquet on the Azure server.  Status = IOError: 
> GetProperties for 
> 'https://ecmtest4.blob.core.windows.net/parquet/ParquetTestData/plain.parquet'
>  failed. GetFileInfo is unable to determine whether the path exists. Azure 
> Error: [] 403 This request is not authorized to perform this operation using 
> this permission.
>
> The 403 error code means that the recipient knows who you are but you don't 
> have permission to complete the task that you are attempting.  So now I am 
> down to a permissions issue, or so it would seem.  Therefore I have been 
> experimenting within the Azure portal, enabling all types of permissions and 
> such to get this to work.  However, none of that experimentation has resulted 
> in successful access of the resource on the Azure server (ADLS).
>
> Do you have any feedback on this?  What type of permission setting would 
> enable access?  What is preventing my test program from accessing the 
> resource?
>
> Thanks,
> Jerry
>
>
> -----Original Message-----
> From: Sutou Kouhei <[email protected]>
> Sent: Thursday, July 11, 2024 2:56 AM
> To: [email protected]
> Subject: Re: Using the new Azure filesystem object (C++)
>
> Hi,
>
> Could you share how you generated the values for the client secret 
> configuration and the managed identity configuration?
> I'll try them.
>
> Thanks,
> --
> kou
>
> In
>  
> <dm3pr05mb1054325d88f8a46fd0b169c92f3...@dm3pr05mb10543.namprd05.prod.outlook.com>
>   "RE: Using the new Azure filesystem object (C++)" on Thu, 11 Jul 2024 
> 06:37:42 +0000,
>   "Jerry Adair via user" <[email protected]> wrote:
>
>> Hi Kou!
>>
>> Well, I thought it was strange too.  I was not aware that if data lake 
>> storage is available then AzureFS will use it automatically.  Thank you for 
>> that information, it helps.  With that in mind, I commented out both of 
>> those lines and just let the default values be assigned (which occurs in 
>> azurefs.h).
>>
>> With that modification, if I attempt an account key configuration, thus:
>>
>>       configureStatus = azureOptions.ConfigureAccountKeyCredential( account_key );
>>
>> Then it works!  I can read the Parquet file via the methods in the Parquet 
>> library!
>>
>> However if I use the client secret configuration, thus:
>>
>>       configureStatus = azureOptions.ConfigureClientSecretCredential( tenant_id, client_id, client_secret );
>>
>> Then I see the unauthorized error, thus:
>>
>> adls_read
>> Parquet file read commencing...
>> configureStatus = OK
>> 1
>> Parquet read error: GetToken(): error response: 401 Unauthorized
>>
>> And if I use the managed identity configuration, thus:
>>
>>       configureStatus = azureOptions.ConfigureManagedIdentityCredential( client_id );
>>
>> Then I see the hang, thus:
>>
>> adls_read
>> Parquet file read commencing...
>> configureStatus = OK
>> 1
>> ^C
>>
>> So I don't know about those configuration attempts.  I have double-checked 
>> the values via the Azure portal that we use, and those values are correct.  
>> So perhaps some other type of limitation is being imposed here?  I'd like to 
>> offer the user different means of authenticating to get their credentials, 
>> ergo they could use client secret, account key, managed identity, etc.  
>> However, at the moment only account key is working.  I'll continue to see 
>> what I can figure out.  If you've seen this type of phenomenon in the past 
>> and recognize the error at play, I'd appreciate any feedback.
>>
>> Thanks!
>> Jerry
>>
>>
>> -----Original Message-----
>> From: Sutou Kouhei <[email protected]>
>> Sent: Wednesday, July 10, 2024 4:34 PM
>> To: [email protected]
>> Subject: Re: Using the new Azure filesystem object (C++)
>>
>> Hi,
>>
>>>       azureOptions.blob_storage_authority = ".dfs.core.windows.net";
>>>       // If I don't do this, then blob.core.windows.net is used;
>>>       // I want dfs not blob, so... not certain why that happens either
>>
>> This is strange. In general, you should not do this.
>> AzureFS uses both the blob storage API and the data lake storage API. If the 
>> data lake storage API is available, AzureFS uses it automatically. So you 
>> should not change blob_storage_authority.
>>
>> What happens if you remove this line?
>>
>>
>> Thanks,
>> --
>> kou
>>
>> In
>>  
>> <dm3pr05mb1054334eeaeae4a95805de322f3...@dm3pr05mb10543.namprd05.prod.outlook.com>
>>   "Using the new Azure filesystem object (C++)" on Wed, 10 Jul 2024 16:58:52 
>> +0000,
>>   "Jerry Adair via user" <[email protected]> wrote:
>>
>>> Hi-
>>>
>>> I am attempting to use the new Azure filesystem object in C++, with 
>>> Arrow/Parquet version 16.0.0.  I already have code that works for GCS and 
>>> AWS/S3.  I have been waiting for quite a while to see the new Azure 
>>> filesystem object released.  Now that it has been, in version 16.0.0, I 
>>> have been trying to use it, without success.  I presumed that it would work 
>>> in the same manner as the GCS and S3/AWS filesystem objects: you create the 
>>> object, then you can use it the same way you used the other filesystem 
>>> objects.  Note that I am not using Arrow methods to read/write the data but 
>>> rather the Parquet methods.  This works for local, GCS and S3/AWS.  
>>> However, I cannot open a file on Azure.  It seems that no matter which 
>>> authentication method I try, it doesn't work, and I get different results 
>>> depending on which auth approach I take (client secret versus account key, 
>>> etc.).  Here is a code summary of what I am trying to do:
>>>
>>>       arrow::fs::AzureOptions   azureOptions;
>>>       arrow::Status             configureStatus = arrow::Status::OK();
>>>
>>>       // exact values obfuscated
>>>       azureOptions.account_name = "mytest";
>>>       azureOptions.dfs_storage_authority = ".dfs.core.windows.net";
>>>       azureOptions.blob_storage_authority = ".dfs.core.windows.net"; // If I don't do this, then
>>>                                                                      // blob.core.windows.net is used;
>>>                                                                      // I want dfs not blob, so... not
>>>                                                                      // certain why that happens either
>>>       std::string  client_id  = "3f061894-blah";
>>>       std::string  client_secret  = "2c796e9eblah";
>>>       std::string  tenant_id  = "b1c14d5c-blah";
>>>       //std::string  account_key  = "flMhWgNts+i/blah==";
>>>
>>>       //configureStatus = azureOptions.ConfigureAccountKeyCredential( account_key );
>>>       configureStatus = azureOptions.ConfigureClientSecretCredential( tenant_id, client_id, client_secret );
>>>       //configureStatus = azureOptions.ConfigureManagedIdentityCredential( client_id );
>>>       if( false == configureStatus.ok() )
>>>       {
>>>          // Uh-oh, throw
>>>       }
>>>
>>>       std::shared_ptr<arrow::fs::AzureFileSystem>   azureFileSystem;
>>>       arrow::Result<std::shared_ptr<arrow::fs::AzureFileSystem>>   azureFileSystemResult =
>>>          arrow::fs::AzureFileSystem::Make( azureOptions );
>>>       if( true == azureFileSystemResult.ok() )
>>>       {
>>>          azureFileSystem = azureFileSystemResult.ValueOrDie();
>>>       }
>>>       else
>>>       {
>>>          // Uh-oh, throw
>>>       }
>>>
>>>       const std::string path( "parquet/ParquetFiles/plain.parquet" );
>>>       std::shared_ptr<arrow::io::RandomAccessFile> arrowFile;
>>> std::cout << "1\n";
>>>       arrow::Result<std::shared_ptr<arrow::io::RandomAccessFile>> openResult =
>>>          azureFileSystem->OpenInputFile( path );
>>> std::cout << "2\n";
>>>
>>> And that is where things run off the rails.  At this point, all I want to 
>>> do is open the input file, create a Parquet file reader like so:
>>>
>>>          std::unique_ptr<parquet::ParquetFileReader> parquet_reader =
>>>             parquet::ParquetFileReader::Open( arrowFile );
>>>
>>> Then go about my business of reading/writing Parquet data as per normal.  
>>> Ergo, just as I do for the other filesystem objects.  But the 
>>> OpenInputFile() method fails for the Azure use case scenario.  If I attempt 
>>> the account key configuration, then the error I see is:
>>>
>>> adls_read
>>> Parquet file read commencing...
>>> 1
>>> Parquet read error: map::at
>>>
>>> Where the "1" is just a marker to show how far I got in the process of 
>>> reading a pre-existing Parquet file from the Azure server;  ergo, a 
>>> low-brow means of debugging.  The cout statements are shown above.  I 
>>> don't get to "2", obviously.
>>>
>>> When attempting the client secret credential auth, I see the following 
>>> failure:
>>>
>>> adls_read
>>> Parquet file read commencing...
>>> 1
>>> Parquet read error: GetToken(): error response: 401 Unauthorized
>>>
>>> Then when attempting the Managed Identity auth configuration, I get the 
>>> following:
>>>
>>> adls_read
>>> Parquet file read commencing...
>>> 1
>>> ^C
>>>
>>> Where the process just hangs and I have to interrupt it.  Note that I 
>>> didn't have this level of difficulty when I implemented our support for 
>>> GCS and S3/AWS;  those were relatively straightforward.  Azure, however, 
>>> has been more difficult, and this should just work.  I mean, you create 
>>> the filesystem object, then you are supposed to be able to use it in the 
>>> same manner as any other Arrow filesystem object.  However, I can't open a 
>>> file, and I suspect it is due to some type of handshaking issue with 
>>> Azure.  Azure has all of these moving parts:  tenant ID, 
>>> application/client ID, client secret, object ID (which we don't use in 
>>> this case), and the list goes on.  Finally, I saw this in the azurefs.h 
>>> header at line 102:
>>>
>>>   // TODO(GH-38598): Add support for more auth methods.
>>>   // std::string connection_string;
>>>   // std::string sas_token;
>>>
>>> But it seemed clear to me that this was referring to auth methods other 
>>> than those that have been implemented thus far (ergo client secret, 
>>> account key, etc.).  Am I correct?
>>>
>>> So my questions are:
>>>
>>>   1.  Any ideas where I am going wrong here?
>>>   2.  Has anyone else used the Azure filesystem object?
>>>   3.  Has it worked for you?
>>>   4.  If so, what was your approach?
>>>
>>> Note that I did peruse azurefs_test.cc for examples and saw various 
>>> approaches.  One involved invoking the MakeDataLakeServiceClient() method.  
>>> It wasn't clear whether I needed to do that or not, but then I saw that it 
>>> is done in the private implementation of the AzureFileSystem's Make() 
>>> method, thus:
>>>
>>>   static Result<std::unique_ptr<AzureFileSystem::Impl>> Make(AzureOptions options,
>>>                                                               io::IOContext io_context) {
>>>     auto self = std::unique_ptr<AzureFileSystem::Impl>(
>>>         new AzureFileSystem::Impl(std::move(options), std::move(io_context)));
>>>     ARROW_ASSIGN_OR_RAISE(self->blob_service_client_,
>>>                           self->options_.MakeBlobServiceClient());
>>>     ARROW_ASSIGN_OR_RAISE(self->datalake_service_client_,
>>>                           self->options_.MakeDataLakeServiceClient());
>>>     return self;
>>>   }
>>>
>>> So it seemed like I wouldn't need to do it separately.
>>>
>>> Anyway, I need to get this working ASAP, so I am open to feedback.  I'll 
>>> continue plugging away.
>>>
>>> Thanks!
>>> Jerry
