[
https://issues.apache.org/jira/browse/SPARK-57135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated SPARK-57135:
-----------------------------------
Labels: pull-request-available (was: )
> [SQL] Add ArchiveFormat for reading .tar / .tar.gz / .tgz archives as files
> ---------------------------------------------------------------------------
>
> Key: SPARK-57135
> URL: https://issues.apache.org/jira/browse/SPARK-57135
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Affects Versions: 4.3.0
> Reporter: Akshat Shenoi
> Priority: Major
> Labels: pull-request-available
>
> h2. Problem
> V1 {{FileFormat}} implementations (CSV, JSON, Parquet, ORC, etc.) are not
> archive-aware: if a user points a datasource reader at a {{.tar}},
> {{.tar.gz}}, or {{.tgz}} file, Spark treats it as a single opaque file
> and either fails or returns garbage instead of reading the entries inside.
> A common ingestion pattern stores many small files inside tar archives to
> reduce namespace pressure. Today there is no way to read these without first
> unpacking them externally.
> h2. Proposed Solution
> Add an {{ArchiveFormat}} utility object in
> {{org.apache.spark.sql.execution.datasources}} and hook it into the V1 scan
> pipeline:
> * *{{ArchiveFormat.readArchive}}*: at scan time, materializes one tar
> entry at a time to a local temp file and invokes the caller-supplied
> {{readFn}} against a synthetic {{PartitionedFile}} pointing at that temp
> file. Only one entry's bytes live on disk per task; the temp dir is cleaned
> up on iterator close and on task completion.
> * *{{ArchiveFormat.expandArchives}}*: at schema-inference time
> (driver-side), does the same materialization and passes the resulting
> {{FileStatus}} objects to {{inferSchema}} in place of the archive.
> * *{{ArchiveFormat.isArchivePath}}*: detects {{.tar}},
> {{.tar.gz}}, and {{.tgz}} extensions.
> * Entries whose basename starts with {{.}} are skipped (covers macOS
> AppleDouble sidecars, {{.DS_Store}}, etc.).
> * Gzip handling: Hadoop's {{CompressionCodecFactory}} auto-decompresses
> {{.tar.gz}} via {{CodecStreams}}; {{.tgz}} is not a registered Hadoop
> codec extension, so the gzip layer is unwrapped explicitly with
> {{GZIPInputStream}}.
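The behaviour described in the list above can be sketched in plain Python with the standard library. This is not Spark code: `is_archive_path`, `is_hidden_entry`, and `read_archive` are hypothetical names that mirror the proposed helpers, the extension match is assumed to be case-sensitive, and `read_fn` here takes a local path rather than a `PartitionedFile`.

```python
import gzip
import os
import shutil
import tarfile
import tempfile

ARCHIVE_SUFFIXES = (".tar", ".tar.gz", ".tgz")
GZIP_SUFFIXES = (".tar.gz", ".tgz")


def is_archive_path(path):
    """True for the three extensions listed in the issue (case-sensitive here)."""
    return path.endswith(ARCHIVE_SUFFIXES)


def is_hidden_entry(name):
    """Entries whose basename starts with '.' are skipped."""
    return os.path.basename(name).startswith(".")


def read_archive(path, read_fn):
    """Yield read_fn(local_path) for each regular, non-hidden tar entry.

    Only one entry is on disk at a time; the temp dir is removed when the
    generator is exhausted or closed early.
    """
    tmp_dir = tempfile.mkdtemp(prefix="archive-")
    raw = open(path, "rb")
    # Unwrap the gzip layer explicitly for compressed archives.
    stream = gzip.GzipFile(fileobj=raw) if path.endswith(GZIP_SUFFIXES) else raw
    try:
        # "r|" reads the tar as a forward-only stream, entry by entry.
        with tarfile.open(fileobj=stream, mode="r|") as tar:
            for index, member in enumerate(tar):
                if not member.isfile() or is_hidden_entry(member.name):
                    continue
                local = os.path.join(
                    tmp_dir, "%d-%s" % (index, os.path.basename(member.name)))
                with tar.extractfile(member) as src, open(local, "wb") as dst:
                    shutil.copyfileobj(src, dst)
                try:
                    yield read_fn(local)
                finally:
                    os.remove(local)  # drop this entry before the next one
    finally:
        stream.close()
        raw.close()
        shutil.rmtree(tmp_dir, ignore_errors=True)
```

In this sketch `read_fn` must finish with the temp file before the next entry is requested, since the file is deleted as soon as the consumer advances the generator.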
> Materializing to disk (rather than streaming) means formats that need random
> access (Parquet/ORC footers) work without modification.
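To illustrate why random access matters, here is a minimal Python sketch of the first step a Parquet reader takes: seeking to the end of the file to read the trailer (a 4-byte little-endian footer length followed by the `PAR1` magic). The function name `read_parquet_footer_length` is hypothetical, and the sketch assumes an unencrypted Parquet file that has already been materialized to a local path.

```python
import os
import struct

PARQUET_MAGIC = b"PAR1"


def read_parquet_footer_length(local_path):
    """Return the footer length stored in the last 8 bytes of a Parquet file."""
    if os.path.getsize(local_path) < 8:
        raise ValueError("file too small to be a Parquet file")
    with open(local_path, "rb") as f:
        # Random access: jump straight to the trailer at the end of the file.
        f.seek(-8, os.SEEK_END)
        tail = f.read(8)
    if tail[4:] != PARQUET_MAGIC:
        raise ValueError("missing trailing Parquet magic")
    (footer_len,) = struct.unpack("<I", tail[:4])
    return footer_len
```

A forward-only tar or gzip stream cannot serve this backwards seek cheaply, which is the motivation for writing each entry to a temp file first.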
> The feature is gated behind {{spark.sql.files.archive.enabled}} (default
> {{false}}).
> h2. Integration Points
> # {{PartitionedFileUtil.splitFiles}}: archive paths forced to a single
> split.
> # {{FileScanRDD.readCurrentFile}}: archive paths routed through
> {{ArchiveFormat.readArchive}}.
> # {{DataSource.resolve}}: both {{inferSchema}} call sites expand
> archives before delegating to the format.
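The first integration point can be pictured with a small Python sketch of split planning. It is not Spark's `splitFiles` logic: `plan_splits` and its parameters are hypothetical, and the only behaviour taken from the issue is that archive paths always yield exactly one split.

```python
ARCHIVE_SUFFIXES = (".tar", ".tar.gz", ".tgz")


def is_archive_path(path):
    """True for the three extensions listed in the issue."""
    return path.endswith(ARCHIVE_SUFFIXES)


def plan_splits(path, length, max_split_bytes):
    """Return a list of (offset, size) splits for a file of `length` bytes."""
    if max_split_bytes <= 0:
        raise ValueError("max_split_bytes must be positive")
    if is_archive_path(path):
        # An archive is read front to back by one task, so never split it.
        return [(0, length)]
    return [
        (offset, min(max_split_bytes, length - offset))
        for offset in range(0, length, max_split_bytes)
    ]
```

For example, `plan_splits("part.parquet", 250, 100)` gives three splits, while `plan_splits("batch.tar.gz", 250, 100)` gives the single split `(0, 250)`.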