kunwp1 opened a new pull request, #5112: URL: https://github.com/apache/texera/pull/5112
## Summary

The dataset file previewer (`UserDatasetFileRendererComponent`) previously identified files purely by extension and showed *"Preview of the file type is currently not supported"* for anything outside a small allow-list. This PR makes it identify and describe a much wider set of formats and surface rich per-format metadata.

### What changed

- **Magic-byte detection**: replaces extension-only guessing. Uses the [`file-type`](https://www.npmjs.com/package/file-type) library (MIT) for ~100 common formats, plus hand-rolled signatures for Parquet (`PAR1`), Arrow (`ARROW1`), HDF5 (`\x89HDF\r\n\x1a\n`), NumPy `.npy` (`\x93NUMPY`), GGUF (`GGUF`), and Python pickle (`\x80\x02..\x05`). Extension-based refinement disambiguates ZIP containers (PyTorch `.pt`/`.pth`, Keras `.keras`, NumPy `.npz`) and gzipped R `.rds`. Text sniffing adds FASTA, FASTQ, VCF on top of the existing JSON / CSV / Markdown heuristics.
- **Lightweight header parsing for ML formats**:
  - NumPy `.npy` → dtype, shape, byte-order, Fortran/C order
  - Safetensors → tensor count, total parameters, dtype breakdown, largest tensor, `__metadata__`
  - GGUF → version, tensor count, metadata KV count
- **Rich metadata per type** displayed as a metadata strip above the preview:
  - **CSV / XLSX**: inferred column types (`integer` / `double` / `boolean` / `date` / `string`) and null counts shown directly under each column header in the data table; row & column counts; sheet count for XLSX
  - **JSON**: top-level type, item/key count, max nesting depth, per-key types
  - **PDF**: version, page count, `/Info` dictionary (Title, Author, Creator, Producer), encryption flag; rendered in `<iframe>`
  - **Images**: dimensions, aspect ratio (async via `<img>.onload`)
  - **Video / audio**: duration + resolution (async via `loadedmetadata`)
  - **FASTA**: total bases, GC content (skipped for proteins), min/max/avg sequence length
  - **VCF**: sample count parsed from `#CHROM` header, distinct chromosomes
  - **Single-cell / R**: AnnData (`.h5ad`), Seurat (`.h5seurat`, `.rds`), Loom (identification + "how to load" hint)
- **Memory-safe rendering**: text/CSV/JSON parsing is bounded at 10 MB (`getPreviewSlice`) to avoid browser OOM on large files. A warning banner appears when truncation occurs; truncation-affected stats (`sequenceCountIsExact`, `variantCountIsExact`) flip accordingly. `turnOffAllDisplay` now clears `textContent` / `tableContent` / `currentFile` so switching files reclaims memory. Per-MIME size cap raised to 1 GB from the prior 1–50 MB.
- **Async safety**: `ChangeDetectorRef` injected and `markForCheck()` called from media `loadedmetadata` / `<img>.onload` callbacks, preserving the existing default change-detection strategy while supporting an eventual OnPush migration.

### Files changed

- `frontend/src/app/dashboard/component/user/user-dataset/user-dataset-explorer/user-dataset-file-renderer/user-dataset-file-renderer.component.ts`: detection logic, parsers, render dispatch, metadata getter
- `…/user-dataset-file-renderer.component.html`: metadata strip, PDF iframe, truncation banner, column-type tags on table headers
- `…/user-dataset-file-renderer.component.scss`: metadata pill / column tag styles
- `…/user-dataset-file-renderer.component.spec.ts`: 28 new tests (30 total)
- `frontend/package.json`, `frontend/yarn.lock`: `file-type` dependency (MIT)

## Test plan

- [x] `yarn ng test --include="**/user-dataset-file-renderer.component.spec.ts" --watch=false`: **30 / 30 passing** (existing 2 retained, 28 new covering magic-byte detection, extension refinement, NumPy/Safetensors/GGUF header parsing, and column type inference)
- [ ] Frontend visual review: open various file types in the dataset previewer and verify the metadata strip + column type tags render
- [ ] Before/after screenshots / GIFs *(not included in this draft; per AGENTS.md these should be added before merge)*

## Notes for reviewers

- This is exploratory hackathon work; **a tracking issue should be filed before merge** per AGENTS.md.
- The 1 GB preview limit still triggers a full file download from the dataset service. A follow-up could add HTTP Range request support so identify-only formats (Parquet, HDF5, pickle, model containers) fetch only the first 64 KB.
- HDF5 sub-types (`.h5ad` / `.h5seurat` / `.loom`) are distinguished by extension because they share identical magic bytes; deep parsing would need an HDF5 reader (e.g. h5wasm) which is intentionally not included.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
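
For reviewers who want a concrete picture of the hand-rolled signatures listed under "Magic-byte detection", here is a minimal sketch of how such a check could look. The names `SIGNATURES` and `sniffSignature` are hypothetical and are not taken from the component; only the byte sequences come from the description above.

```typescript
// Hypothetical signature table: labels and structure are illustrative only.
type Signature = { label: string; bytes: number[] };

const SIGNATURES: Signature[] = [
  { label: "parquet", bytes: [0x50, 0x41, 0x52, 0x31] }, // "PAR1"
  { label: "arrow", bytes: [0x41, 0x52, 0x52, 0x4f, 0x57, 0x31] }, // "ARROW1"
  { label: "hdf5", bytes: [0x89, 0x48, 0x44, 0x46, 0x0d, 0x0a, 0x1a, 0x0a] }, // "\x89HDF\r\n\x1a\n"
  { label: "npy", bytes: [0x93, 0x4e, 0x55, 0x4d, 0x50, 0x59] }, // "\x93NUMPY"
  { label: "gguf", bytes: [0x47, 0x47, 0x55, 0x46] }, // "GGUF"
];

// Returns a format label when the leading bytes match a known signature.
function sniffSignature(head: Uint8Array): string | undefined {
  for (const sig of SIGNATURES) {
    if (head.length < sig.bytes.length) continue;
    if (sig.bytes.every((b, i) => head[i] === b)) return sig.label;
  }
  // Pickle protocols 2 through 5 begin with the PROTO opcode (0x80)
  // followed by the protocol number.
  if (head.length >= 2 && head[0] === 0x80) {
    const proto = head[1];
    if (proto >= 2 && proto <= 5) return "pickle";
  }
  return undefined;
}
```

Only the first few bytes of the file are needed, so a caller could pass a short prefix rather than the whole buffer.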
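
As an illustration of the "lightweight header parsing" item for NumPy `.npy`, the sketch below reads dtype, shape, byte order, and Fortran/C order from the file header. It follows the published NPY layout (magic, version bytes, little-endian header length, then a Python dict literal) and uses regular expressions rather than a full literal parser. `parseNpyHeader` and `NpyHeader` are hypothetical names, and structured (list-valued) dtypes are deliberately not handled.

```typescript
interface NpyHeader {
  version: string;
  dtype: string;
  byteOrder: "little" | "big" | "native" | "not-applicable";
  fortranOrder: boolean;
  shape: number[];
}

function parseNpyHeader(buf: Uint8Array): NpyHeader | undefined {
  const magic = [0x93, 0x4e, 0x55, 0x4d, 0x50, 0x59]; // "\x93NUMPY"
  if (buf.length < 12 || !magic.every((b, i) => buf[i] === b)) return undefined;

  const major = buf[6];
  const minor = buf[7];
  const view = new DataView(buf.buffer, buf.byteOffset, buf.byteLength);
  // Version 1.x stores the header length as uint16; 2.x and 3.x use uint32.
  const lenBytes = major === 1 ? 2 : 4;
  const headerLen = major === 1 ? view.getUint16(8, true) : view.getUint32(8, true);
  const start = 8 + lenBytes;
  if (buf.length < start + headerLen) return undefined;

  const text = new TextDecoder("utf-8").decode(buf.subarray(start, start + headerLen));
  // Simple dtypes only: a structured dtype has a list for 'descr' and will not match.
  const descr = /'descr':\s*'([^']+)'/.exec(text);
  const fortran = /'fortran_order':\s*(True|False)/.exec(text);
  const shape = /'shape':\s*\(([^)]*)\)/.exec(text);
  if (!descr || !fortran || !shape) return undefined;

  const dtype = descr[1];
  const orderChar = dtype.charAt(0);
  const byteOrder: NpyHeader["byteOrder"] =
    orderChar === "<" ? "little" : orderChar === ">" ? "big" : orderChar === "=" ? "native" : "not-applicable";

  return {
    version: `${major}.${minor}`,
    dtype,
    byteOrder,
    fortranOrder: fortran[1] === "True",
    // "(3, 4)" -> [3, 4]; "(3,)" -> [3]; "()" -> []
    shape: shape[1]
      .split(",")
      .map(s => s.trim())
      .filter(s => s.length > 0)
      .map(Number),
  };
}
```

Because the header sits at the start of the file, this kind of parser only needs the first few hundred bytes in the common case.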
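
The column type tags (`integer` / `double` / `boolean` / `date` / `string`) and null counts described for CSV / XLSX could be derived roughly as follows. This is a sketch under assumptions: the exact inference rules in the PR are not stated here, so the regular expressions, the treatment of empty cells as nulls, and the integer-to-double widening are illustrative choices, and `classifyCell` / `inferColumn` are hypothetical names.

```typescript
type ColumnType = "integer" | "double" | "boolean" | "date" | "string";

// Classify one non-empty cell. The patterns are assumptions, not the PR's rules.
function classifyCell(cell: string): ColumnType {
  if (/^[+-]?\d+$/.test(cell)) return "integer";
  if (/^[+-]?(\d+\.\d*|\.\d+|\d+)([eE][+-]?\d+)?$/.test(cell)) return "double";
  if (/^(true|false)$/i.test(cell)) return "boolean";
  if (/^\d{4}-\d{2}-\d{2}/.test(cell) && !Number.isNaN(Date.parse(cell))) return "date";
  return "string";
}

// Merge per-cell types into one column type and count empty cells as nulls.
function inferColumn(cells: string[]): { type: ColumnType; nullCount: number } {
  let nullCount = 0;
  let merged: ColumnType | undefined;
  for (const raw of cells) {
    const cell = raw.trim();
    if (cell === "") {
      nullCount++;
      continue;
    }
    const t = classifyCell(cell);
    if (merged === undefined || merged === t) {
      merged = t;
    } else if ((merged === "integer" && t === "double") || (merged === "double" && t === "integer")) {
      merged = "double"; // widen a mixed numeric column
    } else {
      merged = "string"; // any other mix falls back to string
    }
  }
  return { type: merged ?? "string", nullCount };
}
```

When the preview is truncated at the size bound, results like these describe only the rows that were actually parsed.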
