GitHub user kunwp1 added a comment to the discussion: Supporting 
Directory-Based Dataset Access in UDFs

Following our offline discussion, we've decided to integrate new operators 
being developed by @aglinxinyuan with the Python UDF. The proposed workflow is 
as follows:

- Dataset Selector (Source): This operator accepts a dataset URI (e.g., 
/ownerEmail/datasetName/versionName) and flattens its structure into a table of 
file URIs. For instance, a nested structure like /folder1/file1 and 
/folder1/folder2/file2 will be returned as individual string rows.
- Text to File Scan: This downstream operator resolves the URIs into 
file contents. Users can toggle an option to include the original URI as an 
attribute, resulting in tuples of (file_uri, file_content).
- Python UDF: This operator consumes these tuples, providing users with the raw 
paths and data.
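
To make the flattening step concrete, here is a minimal Python sketch. It is not the Dataset Selector's actual implementation: the nested dict standing in for a dataset version and the helper name `flatten_dataset` are assumptions made only for illustration.

```python
def flatten_dataset(tree, prefix=""):
    """Yield one path string per file in a nested dict (hypothetical stand-in
    for a dataset version: dict values are folders, bytes values are files)."""
    for name in sorted(tree):
        node = tree[name]
        path = f"{prefix}/{name}"
        if isinstance(node, dict):
            # Folder: recurse and keep extending the path prefix.
            yield from flatten_dataset(node, path)
        else:
            # File: emit its full path as a single string row.
            yield path


# Illustrative data mirroring the example structure above.
tree = {"folder1": {"file1": b"a", "folder2": {"file2": b"b"}}}
rows = list(flatten_dataset(tree))
```

Each element of `rows` corresponds to one string row of the kind the source operator is described as producing.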

We initially considered using `io.BytesIO` to provide a folder-like interface 
within the UDF. However, `io.BytesIO` models only a single in-memory file, not 
a folder-like file system. The responsibility for reconstructing or mimicking 
a file-tree structure (e.g., with `tempfile`) will therefore rest with the UDF 
logic itself.
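
As one possible way for UDF logic to rebuild the tree, the sketch below writes each (file_uri, file_content) pair under a temporary directory. It uses only the Python standard library and does not touch the Texera UDF API. The helper name `materialize`, the use of plain Python tuples, and the assumption that `file_content` arrives as `bytes` are all hypothetical.

```python
import tempfile
from pathlib import Path, PurePosixPath


def materialize(tuples, root):
    """Write (file_uri, file_content) pairs under root, preserving folders.

    Assumes file_uri is a POSIX-style path and file_content is bytes.
    """
    root = Path(root)
    for file_uri, file_content in tuples:
        # Drop leading slashes so the path is resolved relative to root.
        rel = PurePosixPath(file_uri.lstrip("/"))
        if ".." in rel.parts:
            raise ValueError(f"refusing to write outside root: {file_uri}")
        target = root.joinpath(*rel.parts)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(file_content)


# Illustrative input mirroring the example structure above.
rows = [
    ("/folder1/file1", b"hello"),
    ("/folder1/folder2/file2", b"world"),
]

with tempfile.TemporaryDirectory() as tmp:
    materialize(rows, tmp)
    # Inside this block, ordinary file-system code can walk the rebuilt tree.
    found = sorted(
        p.relative_to(tmp).as_posix() for p in Path(tmp).rglob("*") if p.is_file()
    )
```

The temporary directory is deleted when the `with` block exits, so any processing that needs the folder layout has to happen inside it.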

Please feel free to add your comments or suggestions on this approach.

CC: @aglinxinyuan @chenlica @xuang7

GitHub link: 
https://github.com/apache/texera/discussions/4352#discussioncomment-16495961
