Hi Kou,

Thank you for the help. Well, after enough digging, I figured it out. The answer is that the code in the library works as expected. And as I suspected, the issue was permissions-related and lay on the Azure side. Specifically, for the client secret method of authentication to work, you must assign the Storage Blob Data Contributor role (to the service principal behind the client ID) on the storage account that you want to access. Once I made that role assignment, I was able to run the sample, standalone program that uses the Arrow C++ library to access Parquet data on an ADLS server.
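For reference, the working flow boils down to something like the sketch below. The account name, tenant/client IDs, secret value and blob path are placeholders, and it assumes Arrow 16.0.0 built with Azure filesystem support plus the role assignment described above (roughly `az role assignment create --role "Storage Blob Data Contributor" --assignee <client-id> --scope <storage-account-resource-id>`; verify the exact command against the Azure docs). It needs live credentials to actually run:

```cpp
#include <iostream>
#include <memory>

#include <arrow/filesystem/azurefs.h>
#include <parquet/file_reader.h>

int main() {
  arrow::fs::AzureOptions options;
  options.account_name = "mytest";  // placeholder; do NOT override the
                                    // blob/dfs storage authorities

  // Client secret auth only succeeds once the service principal has the
  // Storage Blob Data Contributor role on this storage account. Note that
  // the third argument is the secret *value*, not the secret ID.
  arrow::Status st = options.ConfigureClientSecretCredential(
      "<tenant-id>", "<client-id>", "<client-secret-value>");  // placeholders
  if (!st.ok()) {
    std::cerr << "Configure failed: " << st.ToString() << std::endl;
    return 1;
  }

  auto fs_result = arrow::fs::AzureFileSystem::Make(options);
  if (!fs_result.ok()) {
    std::cerr << "Make failed: " << fs_result.status().ToString() << std::endl;
    return 1;
  }
  std::shared_ptr<arrow::fs::AzureFileSystem> fs = *fs_result;

  // Path is container-relative: <container>/<blob path>
  auto file_result = fs->OpenInputFile("parquet/ParquetTestData/plain.parquet");
  if (!file_result.ok()) {
    std::cerr << "Open failed: " << file_result.status().ToString() << std::endl;
    return 1;
  }

  // Hand the RandomAccessFile to the Parquet reader as usual.
  std::unique_ptr<parquet::ParquetFileReader> reader =
      parquet::ParquetFileReader::Open(*file_result);
  std::cout << "Row groups: " << reader->metadata()->num_row_groups()
            << std::endl;
  return 0;
}
```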
Thanks again!
Jerry

-----Original Message-----
From: Sutou Kouhei <[email protected]>
Sent: Wednesday, July 24, 2024 3:42 AM
To: [email protected]
Subject: Re: Using the new Azure filesystem object (C++)

EXTERNAL

Hi,

Sorry for not responding to this. I don't have enough time to try this yet...
I hope that I can try this tomorrow...
(If anyone can help with this, please do.)

Thanks,
--
kou

In <dm3pr05mb10543135ae1d24029ca133277f3...@dm3pr05mb10543.namprd05.prod.outlook.com>
  "RE: Using the new Azure filesystem object (C++)" on Wed, 24 Jul 2024 05:08:52 +0000,
  "Jerry Adair via user" <[email protected]> wrote:

> Hi Kou,
>
> Alright, I have made it past the 401 error, which means that the recipient
> doesn't know who you are. I did this by creating a new storage account
> within our tenant in the Azure portal. Because I was the owner of the new
> account, I could create a client secret for it. I also learned that you need
> the value of that client secret, not the secret ID, when invoking the
> ConfigureClientSecretCredential() method on the AzureOptions object.
> However, I now encounter a 403 error code:
>
> Parquet read error: Unable to retrieve information for the file named
> parquet/ParquetTestData/plain.parquet on the Azure server. Status = IOError:
> GetProperties for
> 'https://ecmtest4.blob.core.windows.net/parquet/ParquetTestData/plain.parquet'
> failed. GetFileInfo is unable to determine whether the path exists. Azure
> Error: [] 403 This request is not authorized to perform this operation using
> this permission.
>
> The 403 error code means that the recipient knows who you are, but you don't
> have permission to complete the task that you are attempting. So now I am
> down to a permissions issue, or so it would seem.
> Therefore I have been
> experimenting within the Azure portal, enabling all types of permissions and
> such to get this to work. However, none of that experimentation has resulted
> in successful access of the resource on the Azure server (ADLS).
>
> Do you have any feedback on this? What type of permission setting would
> enable access? What is preventing my test program from accessing the
> resource?
>
> Thanks,
> Jerry
>
>
> -----Original Message-----
> From: Sutou Kouhei <[email protected]>
> Sent: Thursday, July 11, 2024 2:56 AM
> To: [email protected]
> Subject: Re: Using the new Azure filesystem object (C++)
>
> EXTERNAL
>
> Hi,
>
> Could you share how you generated the values for the client secret
> configuration and the managed identity configuration?
> I'll try them.
>
> Thanks,
> --
> kou
>
> In
> <dm3pr05mb1054325d88f8a46fd0b169c92f3...@dm3pr05mb10543.namprd05.prod.outlook.com>
>   "RE: Using the new Azure filesystem object (C++)" on Thu, 11 Jul 2024 06:37:42 +0000,
>   "Jerry Adair via user" <[email protected]> wrote:
>
>> Hi Kou!
>>
>> Well, I thought it was strange too. I was not aware that if data lake
>> storage is available then AzureFS will use it automatically. Thank you for
>> that information; it helps. With that in mind, I commented out both of
>> those lines and just let the default values be assigned (which occurs in
>> azurefs.h).
>>
>> With that modification, if I attempt an account key configuration, thus:
>>
>> configureStatus = azureOptions.ConfigureAccountKeyCredential( account_key );
>>
>> Then it works! I can read the Parquet file via the methods in the Parquet
>> library!
>>
>> However, if I use the client secret configuration, thus:
>>
>> configureStatus = azureOptions.ConfigureClientSecretCredential( tenant_id, client_id, client_secret );
>>
>> Then I see the unauthorized error, thus:
>>
>> adls_read
>> Parquet file read commencing...
>> configureStatus = OK
>> 1
>> Parquet read error: GetToken(): error response: 401 Unauthorized
>>
>> And if I use the managed identity configuration, thus:
>>
>> configureStatus = azureOptions.ConfigureManagedIdentityCredential( client_id );
>>
>> Then I see the hang, thus:
>>
>> adls_read
>> Parquet file read commencing...
>> configureStatus = OK
>> 1
>> ^C
>>
>> So I dunno about those configuration attempts. I have double-checked the
>> values via the Azure portal that we use, and those values are correct. So
>> perhaps there is some other type of limitation being imposed here?
>> I'd like to offer the user different means of authenticating to get their
>> credentials, ergo they could use client secret or account key or managed
>> identity, etc. However, at the moment only account key is working. I'll
>> continue to see what I can figure out. If you've seen this type of
>> phenomenon in the past and recognize the error at play, I'd appreciate
>> any feedback.
>>
>> Thanks!
>> Jerry
>>
>>
>> -----Original Message-----
>> From: Sutou Kouhei <[email protected]>
>> Sent: Wednesday, July 10, 2024 4:34 PM
>> To: [email protected]
>> Subject: Re: Using the new Azure filesystem object (C++)
>>
>> EXTERNAL
>>
>> Hi,
>>
>>> azureOptions.blob_storage_authority = ".dfs.core.windows.net";
>>>     // If I don't do this, then blob.core.windows.net is used;
>>>     // I want dfs not blob, so... not certain why that happens either
>>
>> This is strange. In general, you should not do this.
>> AzureFS uses both the blob storage API and the data lake storage API. If
>> the data lake storage API is available, AzureFS uses it automatically. So
>> you should not change blob_storage_authority.
>>
>> If you don't have this line, what happens?
>>
>>
>> Thanks,
>> --
>> kou
>>
>> In
>> <dm3pr05mb1054334eeaeae4a95805de322f3...@dm3pr05mb10543.namprd05.prod.outlook.com>
>>   "Using the new Azure filesystem object (C++)" on Wed, 10 Jul 2024 16:58:52 +0000,
>>   "Jerry Adair via user" <[email protected]> wrote:
>>
>>> Hi-
>>>
>>> I am attempting to use the new Azure filesystem object in C++,
>>> Arrow/Parquet version 16.0.0. I already have code that works for GCS and
>>> AWS/S3. I have been waiting for quite a while to see the new Azure
>>> filesystem object released. Now that it has been, in this version
>>> (16.0.0), I have been trying to use it. Without success. I presumed that
>>> it would work in the same manner as the GCS and S3/AWS filesystem
>>> objects: you create the object, then you can use it in the same manner
>>> that you used the other filesystem objects. Note that I am not using
>>> Arrow methods to read/write the data but rather the Parquet methods.
>>> This works for local, GCS and S3/AWS. However, I cannot open a file on
>>> Azure. It seems like no matter which authentication method I try, it
>>> doesn't work. And I get different results depending on which auth
>>> approach I take (client secret versus account key, etc.). Here is a code
>>> summary of what I am trying to do:
>>>
>>> arrow::fs::AzureOptions azureOptions;
>>> arrow::Status configureStatus = arrow::Status::OK();
>>>
>>> // exact values obfuscated
>>> azureOptions.account_name = "mytest";
>>> azureOptions.dfs_storage_authority = ".dfs.core.windows.net";
>>> azureOptions.blob_storage_authority = ".dfs.core.windows.net";
>>>     // If I don't do this, then blob.core.windows.net is used;
>>>     // I want dfs not blob, so... not certain why that happens either
>>>
>>> std::string client_id = "3f061894-blah";
>>> std::string client_secret = "2c796e9eblah";
>>> std::string tenant_id = "b1c14d5c-blah";
>>> //std::string account_key = "flMhWgNts+i/blah==";
>>>
>>> //configureStatus = azureOptions.ConfigureAccountKeyCredential( account_key );
>>> configureStatus = azureOptions.ConfigureClientSecretCredential( tenant_id, client_id, client_secret );
>>> //configureStatus = azureOptions.ConfigureManagedIdentityCredential( client_id );
>>> if( false == configureStatus.ok() )
>>> {
>>>     // Uh-oh, throw
>>> }
>>>
>>> std::shared_ptr<arrow::fs::AzureFileSystem> azureFileSystem;
>>> arrow::Result<std::shared_ptr<arrow::fs::AzureFileSystem>> azureFileSystemResult = arrow::fs::AzureFileSystem::Make( azureOptions );
>>> if( true == azureFileSystemResult.ok() )
>>> {
>>>     azureFileSystem = azureFileSystemResult.ValueOrDie();
>>> }
>>> else
>>> {
>>>     // Uh-oh, throw
>>> }
>>>
>>> const std::string path( "parquet/ParquetFiles/plain.parquet" );
>>> std::shared_ptr<arrow::io::RandomAccessFile> arrowFile;
>>> std::cout << "1\n";
>>> arrow::Result<std::shared_ptr<arrow::io::RandomAccessFile>> openResult = azureFileSystem->OpenInputFile( path );
>>> std::cout << "2\n";
>>>
>>> And that is where things run off the rails. At this point, all I want to
>>> do is open the input file, then create a Parquet file reader like so:
>>>
>>> std::unique_ptr<parquet::ParquetFileReader> parquet_reader = parquet::ParquetFileReader::Open( arrowFile );
>>>
>>> Then go about my business of reading/writing Parquet data as per normal.
>>> Ergo, just as I do for the other filesystem objects. But the
>>> OpenInputFile() method fails for the Azure use case scenario. If I
>>> attempt the account key configuration, then the error I see is:
>>>
>>> adls_read
>>> Parquet file read commencing...
>>> 1
>>> Parquet read error: map::at
>>>
>>> Where the "1" is just a marker to show how far I got in the process of
>>> reading a pre-existing Parquet file from the Azure server. Ergo, a
>>> low-brow means of debugging. The cout is shown above. I don't get to
>>> "2", obviously.
>>>
>>> When attempting the client secret credential auth, I see the following
>>> failure:
>>>
>>> adls_read
>>> Parquet file read commencing...
>>> 1
>>> Parquet read error: GetToken(): error response: 401 Unauthorized
>>>
>>> Then when attempting the managed identity auth configuration, I get the
>>> following:
>>>
>>> adls_read
>>> Parquet file read commencing...
>>> 1
>>> ^C
>>>
>>> Where the process just hangs and I have to interrupt out of it. Note
>>> that I didn't have this level of difficulty when I implemented our
>>> support for GCS and S3/AWS. Those were relatively straightforward.
>>> Azure, however, has been more difficult; this should just work. I mean,
>>> you create the filesystem object, then you are supposed to be able to
>>> use it in the same manner that you use any other Arrow filesystem
>>> object. However, I can't open a file, and I suspect it is due to some
>>> type of handshaking issue with Azure. Azure has all of these moving
>>> parts: tenant ID, application/client ID, client secret, object ID (which
>>> we don't use in this case), and the list goes on. Finally, I saw this in
>>> the azurefs.h header at line 102:
>>>
>>> // TODO(GH-38598): Add support for more auth methods.
>>> // std::string connection_string;
>>> // std::string sas_token;
>>>
>>> But it seemed clear to me that this was referring to auth methods other
>>> than those that have been implemented thus far (ergo client secret,
>>> account key, etc.). Am I correct?
>>>
>>> So my questions are:
>>>
>>> 1. Any ideas where I am going wrong here?
>>> 2. Has anyone else used the Azure filesystem object?
>>> 3. Has it worked for you?
>>> 4. If so, what was your approach?
>>>
>>> Note that I did peruse azurefs_test.cc for examples. I did see various
>>> approaches. One involved invoking the MakeDataLakeServiceClient()
>>> method. It wasn't clear whether I needed to do that or not, but then I
>>> saw that this is done inside the private implementation of the
>>> AzureFileSystem's Make() method, thus:
>>>
>>> static Result<std::unique_ptr<AzureFileSystem::Impl>> Make(AzureOptions options,
>>>                                                            io::IOContext io_context) {
>>>   auto self = std::unique_ptr<AzureFileSystem::Impl>(
>>>       new AzureFileSystem::Impl(std::move(options), std::move(io_context)));
>>>   ARROW_ASSIGN_OR_RAISE(self->blob_service_client_,
>>>                         self->options_.MakeBlobServiceClient());
>>>   ARROW_ASSIGN_OR_RAISE(self->datalake_service_client_,
>>>                         self->options_.MakeDataLakeServiceClient());
>>>   return self;
>>> }
>>>
>>> So it seemed like I wouldn't need to do it separately.
>>>
>>> Anyway, I need to get this working ASAP, so I am open to feedback. I'll
>>> continue plugging away.
>>>
>>> Thanks!
>>> Jerry
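As the later messages in this thread note, the account-key configuration worked as soon as the blob/dfs storage authority overrides were removed and the defaults from azurefs.h were left in place. A minimal sketch of that path follows; the account name, key, and blob path are placeholders, it assumes Arrow built with Azure filesystem support, and it needs live credentials to actually run:

```cpp
#include <iostream>
#include <memory>

#include <arrow/filesystem/azurefs.h>
#include <parquet/file_reader.h>

int main() {
  arrow::fs::AzureOptions options;
  options.account_name = "mytest";  // placeholder; leave the storage
                                    // authorities at their defaults

  // Account-key auth was the one method in this thread that worked without
  // any extra role assignment.
  arrow::Status st = options.ConfigureAccountKeyCredential("<account-key>");
  if (!st.ok()) {
    std::cerr << st.ToString() << std::endl;
    return 1;
  }

  auto fs = arrow::fs::AzureFileSystem::Make(options).ValueOrDie();
  auto file = fs->OpenInputFile("parquet/ParquetFiles/plain.parquet").ValueOrDie();

  auto reader = parquet::ParquetFileReader::Open(file);
  std::cout << "Rows: " << reader->metadata()->num_rows() << std::endl;
  return 0;
}
```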
