Github user MaxGekk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22528#discussion_r219691537

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/CodecStreams.scala ---
@@ -41,7 +42,12 @@ object CodecStreams {
     getDecompressionCodec(config, file)
       .map(codec => codec.createInputStream(inputStream))
-      .getOrElse(inputStream)
+      .orElse {
+        if (file.getName.toLowerCase.endsWith(".zip")) {
+          val zip = new ZipArchiveInputStream(inputStream)
+          if (zip.getNextEntry != null) Some(zip) else None
+        } else None
+      }.getOrElse(inputStream)
--- End diff --

> ... but isn't this difficult to extend this support to non multiline modes

In per-line mode, we would need to teach Hadoop's line reader how to read zip archives. The only feasible way seems to be via the `Decompressor` interface. The problem is that this interface is designed to read compressed data in a streaming fashion, which contradicts the structure of the ZIP format, where the central directory is located at the end of the file.

> and multiple files) as well

In general, it is possible to support per-line mode and multiple files if `LineRecordReader` is extended in the same way as this PR does (not via the `Decompressor` API), but then `LineRecordReader`, `SplitLineReader` and maybe something else would have to be moved out of Hadoop into Spark. From my point of view it could be done, but not in this simple PR.

> Basically deflate is the same codec and I wonder if we really should allow this zip one specifically in multiline mode for CSV / JSON specifically with a clear restriction (single file).

The significant difference between `zip` and `deflate` is that the former is a container of independently compressed entries. In Spark's datasources there is currently a restriction that one file maps to one `InputStream`:

- In per-line mode:

  ```
  in = new SplitLineReader(codec.createInputStream(fileIn, decompressor),
      job, this.recordDelimiterBytes);
  filePosition = fileIn;
  ```

  If you return one `InputStream` for all files in a zip archive, it is difficult to distinguish the last line of one file from the first line of the next.

- In multi-line mode, the `readFile` methods would have to be modified to handle multiple `InputStream`s per `file: PartitionedFile` (to eliminate the restriction of one file per zip archive).

For sure, all of those issues can be solved, but:

- Multiple files per zip archive is a significantly rarer case in the wild than a single zipped JSON/CSV. Maybe you have other statistics.
- Supporting per-line mode demands fundamental changes: extending Hadoop's per-line reader and moving it into Spark.
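To illustrate why a multi-entry archive does not fit the one-`InputStream`-per-file model, here is a minimal self-contained Scala sketch. It uses the JDK's `java.util.zip.ZipInputStream` instead of commons-compress's `ZipArchiveInputStream` from the diff (so it runs without extra dependencies); `ZipEntrySketch`, `makeZip`, and `readFirstEntry` are hypothetical names, not part of the PR. Advancing to the first entry, as the diff does, exposes only that entry's bytes; the rest of the archive is invisible unless the caller knows to invoke `getNextEntry` again.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.nio.charset.StandardCharsets
import java.util.zip.{ZipEntry, ZipInputStream, ZipOutputStream}

object ZipEntrySketch {
  // Build an in-memory zip archive from (name, content) pairs; each entry
  // is compressed independently, as in a real multi-file archive.
  def makeZip(entries: Seq[(String, String)]): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val zipOut = new ZipOutputStream(bytes)
    entries.foreach { case (name, content) =>
      zipOut.putNextEntry(new ZipEntry(name))
      zipOut.write(content.getBytes(StandardCharsets.UTF_8))
      zipOut.closeEntry()
    }
    zipOut.close()
    bytes.toByteArray
  }

  // Mimic the diff above: advance to the first entry and read the stream.
  // ZipInputStream.read() returns -1 at the end of the CURRENT entry, so
  // only the first entry's bytes come back; reaching later entries would
  // require further getNextEntry calls by the consumer of the stream.
  def readFirstEntry(zipBytes: Array[Byte]): (String, String) = {
    val zipIn = new ZipInputStream(new ByteArrayInputStream(zipBytes))
    val entry = zipIn.getNextEntry
    val content = new String(zipIn.readAllBytes(), StandardCharsets.UTF_8)
    (entry.getName, content)
  }

  def main(args: Array[String]): Unit = {
    val zip = makeZip(
      Seq("a.json" -> """{"id": 1}""", "b.json" -> """{"id": 2}"""))
    val (name, content) = readFirstEntry(zip)
    // Only the first entry is visible through this single stream.
    println(s"$name: $content")
  }
}
```

This is why the diff's single-`InputStream` approach naturally restricts zip support to one-entry archives: a line reader consuming that stream has no notion of entry boundaries.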