GitHub user kunwp1 created a discussion: Supporting Directory-Based Dataset 
Access in UDFs

## The Problem
Currently, Texera's `DatasetFileDocument` API allows users to fetch individual 
files as streams. However, some bioinformatics and data science libraries 
require a local filesystem directory path to function. 

For example, R’s `Read10X(data.dir=...)` (Seurat) and Python’s 
`scanpy.read_10x_mtx()` expect a folder containing a specific set of files 
(e.g., `matrix.mtx.gz`, `barcodes.tsv.gz`). These functions cannot operate on 
isolated file streams; they need a real directory path on the local filesystem.

## Root Cause
Texera's backend (LakeFS) stores data as flat objects whose names merely 
contain path separators. Consequently, there is no native "folder" object to 
pass to a UDF. Our 
current Python SDK and backend endpoints lack a mechanism to materialize a 
specific path prefix (a "folder") as a local directory on a computing unit.
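
To make the flat-namespace point concrete, here is a minimal sketch of how a "folder" can only be derived by prefix matching over object keys. The function name `keys_under_prefix` and the sample keys are illustrative assumptions, not part of Texera or LakeFS.

```python
from typing import Iterable, List


def keys_under_prefix(keys: Iterable[str], prefix: str) -> List[str]:
    """Return the keys below `prefix`, relative to it.

    Hypothetical helper: a "folder" is nothing more than the set of object
    keys that share a prefix ending in the path separator.
    """
    # Normalize to exactly one trailing slash so that "v1/Counts" does not
    # also match a sibling such as "v1/CountsOld/...".
    normalized = prefix.rstrip("/") + "/"
    return [key[len(normalized):] for key in keys if key.startswith(normalized)]


sample_keys = [
    "v1/Counts/matrix.mtx.gz",
    "v1/Counts/barcodes.tsv.gz",
    "v1/CountsOld/x.tsv",
    "v1/README.md",
]
print(keys_under_prefix(sample_keys, "v1/Counts"))
# ['matrix.mtx.gz', 'barcodes.tsv.gz']
```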

---

## Design Choice 1: Client-Side File-by-File Materialization
Introduce a `DatasetFolderDocument` class in the Python SDK that simulates a 
directory by downloading all objects with a matching prefix.

```python
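import scanpy as sc

# DatasetFolderDocument is the proposed class; it does not exist in the SDK yet.
with DatasetFolderDocument("/[email protected]/myDataset/v1/Counts") as local_path:
    adata = sc.read_10x_mtx(local_path)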
```

Workflow:

1. SDK queries the existing file-tree API to enumerate all files under the 
specified prefix.
2. SDK downloads each file individually via presigned URLs.
3. SDK recreates the sub-directory structure in a local `/tmp` directory.
4. The context manager returns the path and handles cleanup on exit.
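
The steps above could be sketched roughly as follows. This is not the proposed SDK implementation: `materialize_folder`, `relative_paths`, and `fetch_bytes` are hypothetical names, with `relative_paths` standing in for the file-tree listing and `fetch_bytes` for a presigned-URL download, so that the sketch stays independent of any real endpoint.

```python
import os
import tempfile
from contextlib import contextmanager
from typing import Callable, Iterable, Iterator


@contextmanager
def materialize_folder(
    relative_paths: Iterable[str],
    fetch_bytes: Callable[[str], bytes],
) -> Iterator[str]:
    """Download every listed object into a temp directory and yield its path.

    Both parameters are assumptions made for illustration; the real SDK
    would obtain them from the file-tree API and presigned URLs.
    """
    with tempfile.TemporaryDirectory(prefix="texera_folder_") as root:
        for rel in relative_paths:
            # Refuse names that would resolve outside the temp directory.
            target = os.path.normpath(os.path.join(root, rel))
            if not target.startswith(root + os.sep):
                raise ValueError(f"unsafe path: {rel!r}")
            # Recreate the sub-directory structure before writing the file.
            os.makedirs(os.path.dirname(target), exist_ok=True)
            with open(target, "wb") as out:
                out.write(fetch_bytes(rel))
        # The directory is deleted when the caller leaves the `with` block.
        yield root
```

Because all downloads happen inside the `TemporaryDirectory` block, a failure partway through removes the files fetched so far instead of leaving them on the worker; the caller still sees the exception and has to retry the whole folder.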

Pros:
- Requires no backend changes; purely an SDK-level implementation.

Cons:
- Requires N HTTP round-trips for a folder of N files, in addition to the listing request.
- Potential for partial state failures (e.g., download fails halfway through).
- Temp disk usage on the worker is proportional to folder size.

## Design Choice 2: Server-Side Archiving

Add a backend REST endpoint that accepts a folder path and streams a single 
archive (e.g., ZIP or Tar) containing the requested files. The user experience 
is the same as in Design Choice 1.

```python
import scanpy as sc

# Same call as in Design Choice 1; only the transport behind it changes.
with DatasetFolderDocument("/[email protected]/myDataset/v1/Counts") as local_path:
    adata = sc.read_10x_mtx(local_path)
```

Workflow:
1. SDK makes a single request to the new endpoint.
2. The backend filters the LakeFS objects by prefix and streams a ZIP on the fly.
3. The SDK downloads and extracts this single archive to a local temp directory.
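
On the SDK side, step 3 might look like the sketch below. It assumes the endpoint returns a ZIP and that the response body is available as a binary file-like object; `extract_folder_archive` is a hypothetical name, and the HTTP request itself is left out because the endpoint does not exist yet.

```python
import os
import shutil
import tempfile
import zipfile
from contextlib import contextmanager
from typing import BinaryIO, Iterator


@contextmanager
def extract_folder_archive(archive: BinaryIO) -> Iterator[str]:
    """Extract a ZIP read from `archive` into a temp directory and yield its path."""
    with tempfile.TemporaryDirectory(prefix="texera_folder_") as root:
        # A ZIP's central directory sits at the end of the file, so the
        # stream is first spooled to a seekable temporary file.
        with tempfile.TemporaryFile() as spool:
            shutil.copyfileobj(archive, spool)
            spool.seek(0)
            with zipfile.ZipFile(spool) as zf:
                # Reject member names that would resolve outside `root`.
                for member in zf.namelist():
                    target = os.path.normpath(os.path.join(root, member))
                    if not target.startswith(root + os.sep):
                        raise ValueError(f"unsafe archive member: {member!r}")
                zf.extractall(root)
        # The spool file is already gone here; `root` is removed on exit.
        yield root
```

Note that spooling means the worker briefly holds both the archive and the extracted files. If that matters, a Tar stream is one alternative to consider, since Python's `tarfile` can read a non-seekable stream (mode `"r|"`) and extract members as they arrive.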

Pros:
- Reduces N HTTP round-trips to 1
- Atomic download: the single request either succeeds or fails as a whole

Cons:
- Requires a new backend endpoint
- Temp disk usage on the worker is proportional to folder size.

GitHub link: https://github.com/apache/texera/discussions/4352
