djouallah opened a new issue, #695: URL: https://github.com/apache/arrow-rs-object-store/issues/695
# `MicrosoftAzure::list_with_offset` returns empty on OneLake since 0.13.0 (regression from #623)

## Describe the bug

Against Microsoft Fabric OneLake (`*.dfs.fabric.microsoft.com`), `ObjectStore::list_with_offset(prefix, offset)` returns **zero** entries even when the prefix contains files strictly greater than `offset`. The equivalent `list(prefix)` on the same store returns the correct files, so the data is reachable; only the offset-based listing is broken.

This regressed in `object_store` 0.13.0 via #623, which replaced the default fallback with an Azure-specific implementation that uses the ADLS Gen2 `startFrom` URI parameter. OneLake's REST surface does not handle `startFrom` the same way the standard ADLS Gen2 endpoint does.

## Impact

Every downstream that uses `list_with_offset` against OneLake is broken on `object_store >= 0.13.0`:

- `delta-kernel-rs` (used by DuckDB's delta extension, delta-rs): loading a Delta table with a `_last_checkpoint` hint fails with `Invalid Checkpoint: Had a _last_checkpoint hint but didn't find any checkpoints`. See [delta-io/delta-kernel-rs#2433](https://github.com/delta-io/delta-kernel-rs/issues/2433) and the (now-closed) workaround attempt [#2437](https://github.com/delta-io/delta-kernel-rs/pull/2437).
- `lakehq/sail` (does NOT use delta-kernel-rs; independently hits the same bug): [lakehq/sail#1730](https://github.com/lakehq/sail/issues/1730).

## To Reproduce

Minimal, no-delta-kernel reproducer below. The only thing swapped between the two runs is the `object_store` pin.

`Cargo.toml`:

```toml
[package]
name = "onelake-repro"
version = "0.0.1"
edition = "2021"

[dependencies]
# Swap between "=0.12.5" (works) and "=0.13.2" (broken)
object_store = { version = "=0.13.2", features = ["azure"] }
futures = "0.3"
tokio = { version = "1", features = ["rt-multi-thread", "macros"] }
url = "2"
anyhow = "1"
```

`src/main.rs`:

```rust
use std::env;

use anyhow::{anyhow, Context, Result};
use futures::stream::StreamExt;
use object_store::azure::{AzureConfigKey, MicrosoftAzureBuilder};
use object_store::path::Path;
use object_store::{ObjectMeta, ObjectStore};

#[tokio::main(flavor = "multi_thread", worker_threads = 2)]
async fn main() -> Result<()> {
    let args: Vec<String> = env::args().collect();
    if args.len() != 5 {
        return Err(anyhow!(
            "usage: onelake-repro <workspace> <lakehouse> <table> <checkpoint_version>"
        ));
    }
    let workspace = &args[1];
    let lakehouse = &args[2];
    let table = &args[3];
    let ckpt_version: u64 = args[4].parse()?;

    let token = env::var("AZURE_STORAGE_TOKEN").context("AZURE_STORAGE_TOKEN not set")?;

    let url = format!(
        "abfss://{workspace}@onelake.dfs.fabric.microsoft.com/{lakehouse}.Lakehouse/Tables/{table}/"
    );
    let store = MicrosoftAzureBuilder::new()
        .with_url(url.as_str())
        .with_config(AzureConfigKey::Token, token)
        .build()?;

    let prefix_str = format!("{lakehouse}.Lakehouse/Tables/{table}/_delta_log");
    let prefix = Path::from(prefix_str.as_str());
    let offset = Path::from(format!("{prefix_str}/{ckpt_version:020}").as_str());

    let a = collect(store.list(Some(&prefix))).await?;
    println!("A) list(prefix): {} entries", a.len());
    for loc in &a {
        println!("  {loc}");
    }

    let b = collect(store.list_with_offset(Some(&prefix), &offset)).await?;
    println!("\nB) list_with_offset(prefix, offset): {} entries", b.len());
    for loc in &b {
        println!("  {loc}");
    }

    Ok(())
}

async fn collect<S>(mut s: S) -> Result<Vec<String>>
where
    S: futures::Stream<Item = object_store::Result<ObjectMeta>> + Unpin,
{
    let mut out = vec![];
    while let Some(m) = s.next().await {
        out.push(m?.location.to_string());
    }
    out.sort();
    Ok(out)
}
```

Run:

```bash
export AZURE_STORAGE_TOKEN=$(az account get-access-token --resource https://storage.azure.com/ --query accessToken -o tsv)
cargo run --release -- <workspace> <lakehouse> <table> <checkpoint_version>
```

## Expected behavior

`list_with_offset(prefix, offset)` should return exactly the files in `list(prefix)` whose location is lexicographically greater than `offset`.

## Actual behavior

Against the same OneLake table (a Delta table with `_last_checkpoint` at v10):

**With `object_store = "=0.12.5"`** (works):

```
A) list(prefix): 11 entries
  _delta_log/00000000000000000005.json
  _delta_log/00000000000000000006.json
  _delta_log/00000000000000000007.json
  _delta_log/00000000000000000008.json
  _delta_log/00000000000000000009.json
  _delta_log/00000000000000000010.checkpoint.parquet
  _delta_log/00000000000000000010.json
  _delta_log/00000000000000000011.json
  _delta_log/00000000000000000012.json
  _delta_log/00000000000000000013.json
  _delta_log/_last_checkpoint

B) list_with_offset(prefix, _delta_log/00000000000000000010): 6 entries
  _delta_log/00000000000000000010.checkpoint.parquet
  _delta_log/00000000000000000010.json
  _delta_log/00000000000000000011.json
  _delta_log/00000000000000000012.json
  _delta_log/00000000000000000013.json
  _delta_log/_last_checkpoint
```

**With `object_store = "=0.13.2"`** (broken):

```
A) list(prefix): 11 entries    <-- identical to above
  ...

B) list_with_offset(prefix, _delta_log/00000000000000000010): 0 entries
```

(Only `list_with_offset` differs between the two runs.)

## Suspected cause

[#623](https://github.com/apache/arrow-rs-object-store/pull/623) added a direct `list_with_offset` implementation for Azure that sends `startFrom=<offset>` per the [ADLS Gen2 list-blobs API](https://learn.microsoft.com/en-us/rest/api/storageservices/list-blobs?view=rest-storageservices-datalakestoragegen2-2019-12-12&tabs=microsoft-entra-id#uri-parameters).
OneLake's endpoint apparently does not implement `startFrom` compatibly: it returns an empty list regardless of the offset value. This matches `lonless9`'s analysis on [lakehq/sail#1730](https://github.com/lakehq/sail/issues/1730) and the related [Azurite#2619](https://github.com/Azure/Azurite/issues/2619#issuecomment-3660701055).

## Environment

- `object_store` 0.13.2 (and 0.13.0, 0.13.1; all contain #623)
- OneLake endpoint `onelake.dfs.fabric.microsoft.com`
- Service-principal / Azure CLI bearer token (same auth in both runs; auth is not the issue)
- Observed on Windows 11 / rustc 1.95.0, but not platform-dependent

---

Repro and report co-drafted with [Claude Code](https://claude.com/claude-code) (Claude Opus 4.7).
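
## Expected semantics as a client-side filter (sketch)

To make the expected behavior concrete, here is a minimal, dependency-free sketch of the filter that `list_with_offset` should be equivalent to: keep the entries of `list(prefix)` whose location sorts strictly after `offset`. The helper names `entries_after_offset` and `sample_listing` are hypothetical and not part of `object_store`; plain byte-wise string comparison is assumed to match how `Path` values are ordered. The sample data is the run A listing from this report.

```rust
/// Hypothetical helper: keep only the locations that sort strictly after
/// `offset`, using byte-wise string comparison (assumed to match the
/// ordering `object_store` applies to `Path`).
fn entries_after_offset(listing: &[&str], offset: &str) -> Vec<String> {
    let mut out: Vec<String> = listing
        .iter()
        .filter(|loc| **loc > offset)
        .map(|loc| (*loc).to_owned())
        .collect();
    out.sort();
    out
}

/// The `_delta_log` listing from run A in this report (abbreviated paths).
fn sample_listing() -> Vec<&'static str> {
    vec![
        "_delta_log/00000000000000000005.json",
        "_delta_log/00000000000000000006.json",
        "_delta_log/00000000000000000007.json",
        "_delta_log/00000000000000000008.json",
        "_delta_log/00000000000000000009.json",
        "_delta_log/00000000000000000010.checkpoint.parquet",
        "_delta_log/00000000000000000010.json",
        "_delta_log/00000000000000000011.json",
        "_delta_log/00000000000000000012.json",
        "_delta_log/00000000000000000013.json",
        "_delta_log/_last_checkpoint",
    ]
}
```

Calling `entries_after_offset(&sample_listing(), "_delta_log/00000000000000000010")` yields the six entries shown under run B for 0.12.5. A caller affected on OneLake could, as a stopgap, apply the same filter to the results of `list(prefix)`, at the cost of listing the whole prefix on every call.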
