GitHub user kunwp1 created a discussion: Supporting Directory-Based Dataset Access in UDFs
## The Problem
Currently, Texera's `DatasetFileDocument` API allows users to fetch individual
files as streams. However, some bioinformatics and data science libraries
require a local filesystem directory path to function.
For example, R’s `Read10X(data.dir=...)` (from Seurat) and Python’s
`scanpy.read_10x_mtx()` expect a folder containing a specific set of files
(e.g., `matrix.mtx.gz`, `barcodes.tsv.gz`). These functions cannot operate on
isolated file streams; they need the path of a real directory on disk.
## Root Cause
Texera's storage backend (LakeFS) stores data as flat objects whose names
merely contain path separators, so there is no native "folder" object to pass
to a UDF. The current Python SDK and backend endpoints have no mechanism to
materialize a specific path prefix (a "folder") as a local directory on a
computing unit.
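To make the flat-namespace point concrete, here is a minimal sketch in plain Python. The object keys and the `list_folder` helper are hypothetical and do not call LakeFS; they only illustrate that a "folder" is nothing more than a shared key prefix.

```python
# Hypothetical object keys, as a flat object store might hold them.
OBJECT_KEYS = [
    "Counts/matrix.mtx.gz",
    "Counts/barcodes.tsv.gz",
    "CountsBackup/matrix.mtx.gz",
    "README.md",
]


def list_folder(keys, folder):
    """Return the paths, relative to `folder`, of every key stored under it."""
    # The trailing slash keeps "Counts" from also matching "CountsBackup/...".
    prefix = folder.strip("/") + "/"
    return [key[len(prefix):] for key in keys if key.startswith(prefix)]


relative_paths = list_folder(OBJECT_KEYS, "Counts")
print(relative_paths)  # ['matrix.mtx.gz', 'barcodes.tsv.gz']
```

Anything that wants a directory therefore has to rebuild one from the keys that share the prefix.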
---
## Design Choice 1: Client-Side File-by-File Materialization
Introduce a `DatasetFolderDocument` class in the Python SDK that simulates a
directory by downloading all objects with a matching prefix.
```python
with DatasetFolderDocument("/[email protected]/myDataset/v1/Counts") as local_path:
    adata = sc.read_10x_mtx(local_path)
```
Workflow:
1. SDK queries the existing file-tree API to enumerate all files under the
specified prefix.
2. SDK downloads each file individually via presigned URLs.
3. SDK recreates the sub-directory structure in a local `/tmp` directory.
4. The context manager returns the path and handles cleanup on exit.
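The four steps above could be sketched roughly as follows. This is an illustration under stated assumptions, not the actual SDK design: `materialize_folder`, `list_files`, `presign`, and `download` are hypothetical names standing in for the file-tree API, the presigned-URL lookup, and the HTTP download. The demonstration at the bottom swaps in in-memory fakes so the block runs without a network.

```python
import contextlib
import os
import shutil
import tempfile
from urllib.request import urlopen


def http_download(url, dest_path):
    """Stream the body of `url` into the file at `dest_path`."""
    with urlopen(url) as response, open(dest_path, "wb") as out:
        shutil.copyfileobj(response, out)


@contextlib.contextmanager
def materialize_folder(folder, list_files, presign, download=http_download):
    """Yield a temp directory mirroring `folder`; remove it on exit."""
    tmp_dir = tempfile.mkdtemp(prefix="dataset-folder-")
    try:
        # Step 1: enumerate relative paths under the prefix (hypothetical callable).
        for rel_path in list_files(folder):
            dest = os.path.join(tmp_dir, *rel_path.split("/"))
            # Step 3: recreate the sub-directory structure locally.
            os.makedirs(os.path.dirname(dest), exist_ok=True)
            # Step 2: one download per file.
            download(presign(folder, rel_path), dest)
        yield tmp_dir
    finally:
        # Step 4: clean up, also when a download fails halfway through.
        shutil.rmtree(tmp_dir, ignore_errors=True)


# Offline demonstration with in-memory stand-ins (all hypothetical).
FAKE_STORE = {"matrix.mtx.gz": b"matrix", "nested/barcodes.tsv.gz": b"barcodes"}


def fake_list(folder):
    return sorted(FAKE_STORE)


def fake_presign(folder, rel_path):
    return rel_path  # the "URL" is simply the key in this demo


def fake_download(url, dest_path):
    with open(dest_path, "wb") as out:
        out.write(FAKE_STORE[url])


with materialize_folder("Counts", fake_list, fake_presign, fake_download) as local_path:
    found = sorted(
        os.path.relpath(os.path.join(root, name), local_path).replace(os.sep, "/")
        for root, _dirs, names in os.walk(local_path)
        for name in names
    )
print(found)
```

Because the path is only yielded after every file has been written, a UDF never sees a half-populated directory; a failed download raises before the `with` body runs.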
Pros:
- Requires no backend changes; purely an SDK-level implementation.
Cons:
- Requires N HTTP round-trips for a folder of N files, in addition to the file-tree query.
- Potential for partial state failures (e.g., download fails halfway through).
- Temp disk usage on the worker is proportional to folder size.
## Design Choice 2: Server-Side Archiving
Add a backend REST endpoint that accepts a folder path and streams a single
archive (e.g., ZIP or tar) containing the requested files. The user experience
is the same as in Design Choice 1.
```python
with DatasetFolderDocument("/[email protected]/myDataset/v1/Counts") as local_path:
    adata = sc.read_10x_mtx(local_path)
```
Workflow:
1. SDK makes a single request to the new endpoint.
2. The backend filters the LakeFS objects and streams a ZIP on-the-fly.
3. The SDK downloads and extracts this single archive to a local temp directory.
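On the SDK side, step 3 might look like the sketch below. It assumes the endpoint returns a ZIP archive; `extract_folder_archive` is a hypothetical helper, and the archive here is built in memory instead of being fetched from a backend. Python's `zipfile` needs a seekable file, so the sketch first spools the incoming stream to a temporary file.

```python
import contextlib
import io
import os
import shutil
import tempfile
import zipfile


@contextlib.contextmanager
def extract_folder_archive(archive_stream):
    """Spool a ZIP byte stream, extract it to a temp directory, clean up on exit."""
    tmp_dir = tempfile.mkdtemp(prefix="dataset-archive-")
    try:
        # An HTTP response body is not seekable, so copy it to disk first.
        with tempfile.TemporaryFile() as spool:
            shutil.copyfileobj(archive_stream, spool)
            spool.seek(0)
            with zipfile.ZipFile(spool) as archive:
                archive.extractall(tmp_dir)
        yield tmp_dir
    finally:
        shutil.rmtree(tmp_dir, ignore_errors=True)


# Offline demonstration: build a small archive in memory (hypothetical contents).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as demo_archive:
    demo_archive.writestr("matrix.mtx.gz", b"matrix")
    demo_archive.writestr("nested/barcodes.tsv.gz", b"barcodes")
buffer.seek(0)

with extract_folder_archive(buffer) as local_path:
    extracted = sorted(
        os.path.relpath(os.path.join(root, name), local_path).replace(os.sep, "/")
        for root, _dirs, names in os.walk(local_path)
        for name in names
    )
print(extracted)
```

Note that spooling plus extraction briefly holds both the archive and the extracted files on disk, which is worth keeping in mind alongside the temp disk usage listed under Cons.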
Pros:
- Reduces N HTTP round-trips to 1.
- Atomic download: the folder arrives as one archive rather than file by file.
Cons:
- Requires a new backend endpoint.
- Temp disk usage on the worker is proportional to folder size.
GitHub link: https://github.com/apache/texera/discussions/4352