eugenegujing opened a new issue, #5634:
URL: https://github.com/apache/texera/issues/5634

   ### Task Summary
   
   ## Background
   
   Sub-task of #5242 (external data source import and export). Per the 
discussion in #4240, the import direction is handled as a separate effort from 
the Google Drive export work (#5250 / #5251 / #5252 by @Sentiaus).
   
   This task adds the **import** direction for the first provider, Google 
Drive, along with a minimal provider abstraction so additional providers 
(Dropbox — see sibling sub-issue; Box later) can plug in, as suggested by 
@xuang7.
   
   ## Design principles (following @aicam's review decisions on the export flow in #4240)
   
   1. **No token persistence** — the user authorizes each import; the frontend 
obtains a one-time OAuth access token, passes it to the backend with the import 
request, and the backend discards it after the transfer. Nothing is ever stored 
in the DB.
   2. **Backend streaming** — the backend streams the file directly from the 
Google Drive API into dataset storage (LakeFS/S3), reusing the existing 
multipart upload pipeline in `DatasetResource`. The file never round-trips 
through the browser.
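
As a rough illustration of principle 2, the sketch below regroups arbitrarily sized download chunks into fixed-size parts, so that only one part is buffered at a time. It is written in TypeScript for brevity. `toParts` and its parameters are hypothetical names, not existing `DatasetResource` code, and a real backend version would presumably read from an asynchronous HTTP stream and hand each part to the existing multipart upload pipeline.

```typescript
// Hypothetical helper (not part of Texera): regroup incoming chunks into
// fixed-size parts while holding at most one part in memory.
function* toParts(
  chunks: Iterable<Uint8Array>,
  partSize: number
): Generator<Uint8Array> {
  if (!Number.isInteger(partSize) || partSize <= 0) {
    throw new Error("partSize must be a positive integer");
  }
  let buffer = new Uint8Array(partSize);
  let filled = 0; // number of bytes currently held in `buffer`
  for (const chunk of chunks) {
    let offset = 0;
    while (offset < chunk.length) {
      // Copy as much of the chunk as still fits into the current part.
      const n = Math.min(partSize - filled, chunk.length - offset);
      buffer.set(chunk.subarray(offset, offset + n), filled);
      filled += n;
      offset += n;
      if (filled === partSize) {
        yield buffer; // a full part is ready to upload
        buffer = new Uint8Array(partSize);
        filled = 0;
      }
    }
  }
  if (filled > 0) {
    yield buffer.subarray(0, filled); // final, possibly short, part
  }
}
```

With `partSize = 3`, input chunks of 3, 4, and 1 bytes come out as parts of 3, 3, and 2 bytes.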
   
   ## Scope choice: Google Picker + `drive.file` only
   
   The frontend uses the official **Google Picker** for file selection, 
requesting only the **`drive.file`** scope:
   
   - `drive.file` is a **non-sensitive** scope, so deployments need **no Google 
restricted-scope security verification**, whatever the OAuth app's publishing 
status (Testing or Production).
   - Google enforces at the permission layer that the app can only access files 
the user explicitly picked in the Picker — the app never lists or sees the rest 
of the user's Drive.
   - The consent screen reduces to a single grant ("access only the specific 
files you use with this app"), which also avoids partial-consent failure modes 
(e.g., an in-app file browser turning up empty when a user denies the broad 
scope).
   
   The alternative — rendering a Drive file tree inside Texera — would require 
the restricted `drive.readonly` scope ("see and download all your Google Drive 
files") and a weeks-long Google security review per public deployment, for a UX 
that is effectively identical in the single-file import case. The provider 
interface still exposes a `listFiles` capability so in-app browsing can be 
added later if the community accepts that cost (the Dropbox provider will use 
it, since Dropbox has no comparable review process).
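
One possible shape for that abstraction, sketched in TypeScript purely for illustration: `listFiles` is an optional capability, so the Google Drive provider can omit it while a later provider supplies it. Apart from `CloudStorageImportProvider` and `listFiles`, every name here (`DownloadRequest`, `buildDownloadRequest`, `supportsBrowsing`, the `"google-drive"` identifier) is an assumption, and the actual backend interface may look quite different.

```typescript
// Hypothetical description of the HTTP request that streams one file.
interface DownloadRequest {
  url: string;
  headers: Record<string, string>;
}

// Sketch of the provider interface suggested in this issue.
interface CloudStorageImportProvider {
  readonly providerName: string;
  buildDownloadRequest(fileId: string, accessToken: string): DownloadRequest;
  // Optional: only providers that support in-app browsing implement this.
  listFiles?(accessToken: string, folderId?: string): Promise<string[]>;
}

// Google Drive: Picker-only selection, so no listFiles implementation.
const googleDriveProvider: CloudStorageImportProvider = {
  providerName: "google-drive",
  buildDownloadRequest(fileId, accessToken) {
    return {
      // Drive v3 media download; the token is used once and never stored.
      url: `https://www.googleapis.com/drive/v3/files/${encodeURIComponent(fileId)}?alt=media`,
      headers: { Authorization: `Bearer ${accessToken}` },
    };
  },
};

// Callers can check the capability before offering an in-app browser.
function supportsBrowsing(provider: CloudStorageImportProvider): boolean {
  return typeof provider.listFiles === "function";
}
```

Keeping `listFiles` optional means the Drive provider never needs a scope broader than `drive.file`, while the frontend can still decide per provider whether to show a file browser.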
   
   ## Proposed changes
   
   - **Backend (file-service):**
     - A small provider interface (e.g. `CloudStorageImportProvider`: list 
files / open download stream) with a Google Drive implementation: a streaming 
download via `GET /drive/v3/files/{fileId}?alt=media`, authorized with the 
one-time bearer token.
     - A new endpoint on `DatasetResource`, e.g. 
`POST /dataset/{did}/import-from-cloud`, taking 
`{provider, accessToken, fileId, fileName}`. It streams the file into the 
dataset's LakeFS repo and stages/commits it like a normal upload, with the same 
dataset write-permission checks as the existing upload endpoint.
     - Provider configuration (OAuth client ID, Picker API key) via env vars, 
mirroring how Google login is configured today 
(`UserSystemConfig.googleClientId`).
   - **Frontend:**
     - An "Import from cloud" entry next to the existing file uploader on the 
dataset page.
     - One-time authorization via the Google Identity Services token client 
(`drive.file` scope), then the Google Picker for selection, restricted with 
`setSelectableMimeTypes` to formats Texera datasets accept (csv, json, parquet, 
text, etc.), so unsupported types (videos, native Google Docs) are filtered out 
at selection time instead of failing after import.
     - A simple in-progress indicator while the backend streams the file into 
storage.
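
To make the request contract concrete, here is a minimal TypeScript sketch of validating the proposed `{provider, accessToken, fileId, fileName}` body and of building a MIME-type filter for the Picker. The field names come from this issue. The validation rules, the `"google-drive"` identifier, the helper names, and the MIME list are assumptions for illustration only. The comma-separated string format is what `setSelectableMimeTypes` is understood to accept, which should be confirmed against the Picker reference.

```typescript
// Body of the proposed import endpoint (field names from this issue).
interface CloudImportRequest {
  provider: string;
  accessToken: string;
  fileId: string;
  fileName: string;
}

const KNOWN_PROVIDERS = ["google-drive"]; // hypothetical identifier

function requireString(body: Record<string, unknown>, key: string): string {
  const value = body[key];
  if (typeof value !== "string" || value.length === 0) {
    throw new Error(`${key} must be a non-empty string`);
  }
  return value;
}

// Assumed validation rules; the real endpoint may enforce different ones.
function parseCloudImportRequest(body: unknown): CloudImportRequest {
  if (typeof body !== "object" || body === null) {
    throw new Error("request body must be a JSON object");
  }
  const fields = body as Record<string, unknown>;
  const provider = requireString(fields, "provider");
  const accessToken = requireString(fields, "accessToken");
  const fileId = requireString(fields, "fileId");
  const fileName = requireString(fields, "fileName");
  if (KNOWN_PROVIDERS.indexOf(provider) === -1) {
    throw new Error(`unsupported provider: ${provider}`);
  }
  // Reject path separators so the name cannot escape the dataset root.
  if (fileName.indexOf("/") !== -1 || fileName.indexOf("\\") !== -1) {
    throw new Error("fileName must not contain path separators");
  }
  return { provider, accessToken, fileId, fileName };
}

// Illustrative subset of accepted formats (csv, json, plain text).
const ACCEPTED_MIME_TYPES = ["text/csv", "application/json", "text/plain"];
// Intended to be passed to the Picker's setSelectableMimeTypes.
const selectableMimeTypes = ACCEPTED_MIME_TYPES.join(",");
```

Validating on the backend as well as filtering in the Picker keeps the endpoint safe even if it is called outside the frontend flow.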
   
   ## Out of scope (future iterations)
   
   - **Folder / bulk import** — folder-level access would require the 
restricted `drive.readonly` scope and Google's security verification 
(weeks-to-months review); deferred until the need is validated.
   - **In-app Drive file browsing** — same `drive.readonly` requirement; see 
scope-choice section above.
   - **Background/async job handling for very large files** (the 100 GB–1 TB 
scale discussed in #4240); the MVP imports synchronously within the request.
   - **Native Google Docs/Sheets/Slides export conversion** — only 
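binary/regular files (those with stored content) are imported in the MVP; 
native Google Workspace documents would need a separate `files.export` call.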
   
   ### Task Type
   
   - [ ] Refactor / Cleanup
   - [ ] DevOps / Deployment / CI
   - [ ] Testing / QA
   - [ ] Documentation
   - [ ] Performance
   - [x] Other


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
