TomasRachunek opened a new issue, #701:
URL: https://github.com/apache/arrow-rs-object-store/issues/701

   **Describe the bug**
   When a bulk delete is attempted for a Fabric Lakehouse over abfss://, the 
request is sent to a Blob URL that contains only the Workspace ID and omits the 
Artifact ID, which triggers a 400 Bad Request error:
   `{"error":{"code":"BadRequest","message":"Either WorkspaceId or ArtifactId 
are missing in the request"}}`
   The incorrect URL format: 
`https://onelake.blob.fabric.microsoft.com/<WORKSPACE_ID>?restype=container&comp=batch`
   Expected URL format: 
`https://onelake.blob.fabric.microsoft.com/<WORKSPACE_ID>/<ARTIFACT_ID>?restype=container&comp=batch`
   
   **To Reproduce**
   Trigger a bulk delete while writing to a Fabric Lakehouse through an 
abfss:// URI, for example by overwriting a table with `deltalake`.
   
   **Expected behavior**
   Bulk delete for a Fabric Lakehouse succeeds, with the request sent to a URL 
that includes both the Workspace ID and the Artifact ID.
   
   **Additional context**
   I have a custom Python data processing pipeline in MS Fabric using 
`deltalake` which overwrites a table in a Lakehouse using an abfss:// URI.
   Recently, writes to some tables have started failing because a malformed 
bulk delete URL is used during the write.
   I have traced this error to URL generation within arrow-rs-object-store.
   In the abfss:// case the Artifact/Lakehouse ID is in the first part of the 
path, but `bulk_delete_request` starts building the URL from the root, 
stripping the Artifact ID: 
https://github.com/apache/arrow-rs-object-store/blob/main/src/azure/client.rs#L678
   The resulting URL then only contains the Workspace ID and is missing the 
Artifact ID for the Lakehouse.
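
   To make the difference concrete, here is a minimal Python sketch of the two URL shapes. It is only an illustration, not the library's Rust code: `split_abfss` and `batch_url` are hypothetical helpers, the `abfss://<WORKSPACE_ID>@onelake.dfs.fabric.microsoft.com/<ARTIFACT_ID>/...` layout is my assumption about how the URI is structured, and `WS` / `ART` are placeholder IDs.

```python
from typing import Optional
from urllib.parse import urlencode, urlsplit

# Blob endpoint taken from the URLs in this report.
ONELAKE_BLOB_ENDPOINT = "https://onelake.blob.fabric.microsoft.com"


def split_abfss(uri: str) -> tuple[str, str, str]:
    """Split an abfss:// URI into (workspace, artifact, remaining path).

    Assumes the workspace is the part before '@' and the artifact is the
    first path segment (hypothetical helper, not part of any library).
    """
    parts = urlsplit(uri)
    if parts.scheme != "abfss":
        raise ValueError(f"not an abfss URI: {uri}")
    workspace = parts.username or ""
    segments = [s for s in parts.path.split("/") if s]
    if not workspace or not segments:
        raise ValueError(f"workspace or artifact missing in: {uri}")
    return workspace, segments[0], "/".join(segments[1:])


def batch_url(workspace: str, artifact: Optional[str] = None) -> str:
    """Build the batch endpoint URL, optionally including the artifact."""
    path = f"/{workspace}"
    if artifact:
        path += f"/{artifact}"
    query = urlencode({"restype": "container", "comp": "batch"})
    return f"{ONELAKE_BLOB_ENDPOINT}{path}?{query}"


ws, art, rest = split_abfss(
    "abfss://WS@onelake.dfs.fabric.microsoft.com/ART/Tables/my_table"
)
observed = batch_url(ws)       # built from the root: Artifact ID is lost
expected = batch_url(ws, art)  # Artifact ID kept as the first path segment
print(observed)
print(expected)
```

   The first printed URL has the shape that appears in the error log; the second has the shape I would expect the request to use.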
   
   Error log:
   ```
     File "/home/trusted-service-user/jupyter-env/python3.11/lib/python3.11/site-packages/deltalake/writer/writer.py", line 131, in write_deltalake
       table._table.write(
   OSError: Generic MicrosoftAzure error
             ↳ Error performing bulk delete request
              ↳ Error performing POST https://onelake.blob.fabric.microsoft.com/<WORKSPACE_ID>?restype=container&comp=batch in 7.561661ms - Server returned non-2xx status code
               ↳ 400 Bad Request
                ↳ {"error":{"code":"BadRequest","message":"Either WorkspaceId or ArtifactId are missing in the request"}}
   ```
   
   Disclosure: I have used AI to trace this issue from the Python code to 
arrow-rs-object-store, though I have done my best to independently verify the 
bug in the library and write this report.

