On Wed, Apr 22, 2020 at 11:06 AM Jeff Klukas <jklu...@mozilla.com> wrote:
> Beam is able to infer compression from file extensions for a variety of
> formats, but snappy is not currently among them:
>
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/Compression.java
>
> ParquetIO and AvroIO, however, each appear to support snappy.
>
> So as best I can tell, there is no built-in support for reading text
> files compressed with snappy. I think you would need to use FileIO to
> match files, then implement a custom DoFn that takes each file object,
> streams its contents through a snappy decompressor, and outputs one
> record per line.
>
> I imagine a PR adding snappy as a supported format in Compression.java
> would be welcome.

+1, and probably not that difficult either.

> On Wed, Apr 22, 2020 at 1:16 PM Christopher Larsen <chlar...@google.com>
> wrote:
>
>> Hi devs,
>>
>> We are trying to build a pipeline to read snappy-compressed text files
>> that contain one record per line using the Java SDK.
>>
>> We have tried the following to read the files:
>>
>> p.apply("ReadLines",
>>         FileIO.match().filepattern(options.getInputFilePattern()))
>>     .apply(FileIO.readMatches())
>>     .setCoder(SnappyCoder.of(ReadableFileCoder.of()))
>>     .apply(TextIO.readFiles())
>>     .apply(ParDo.of(new TransformRecord()));
>>
>> Is there a recommended way to decompress and read Snappy files with Beam?
>>
>> Thanks,
>> Chris
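For what it's worth, a rough sketch of the custom-DoFn approach Jeff describes might look like the following. This is only a sketch, and it assumes the files were written in the stream format of the xerial snappy-java library (`org.xerial.snappy.SnappyInputStream`); files written with Hadoop's snappy codec or the framed format would need a different decompressor. Note also that `SnappyCoder` in the original pipeline wraps a coder to compress encoded *elements* in flight; as far as I can tell it does not decompress file contents, which is why a manual decompression step is needed here.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.channels.Channels;
import java.nio.charset.StandardCharsets;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.xerial.snappy.SnappyInputStream;

/**
 * Hypothetical DoFn: takes a matched file, streams its bytes through a
 * snappy decompressor, and emits one String per line.
 */
public class ReadSnappyLinesFn extends DoFn<FileIO.ReadableFile, String> {
  @ProcessElement
  public void processElement(ProcessContext c) throws Exception {
    FileIO.ReadableFile file = c.element();
    try (BufferedReader reader =
        new BufferedReader(
            new InputStreamReader(
                // file.open() returns a ReadableByteChannel; wrap it as an
                // InputStream and decompress with snappy-java's stream format.
                new SnappyInputStream(Channels.newInputStream(file.open())),
                StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        c.output(line); // one record per line
      }
    }
  }
}
```

Wired into the pipeline, this would replace the `TextIO.readFiles()` step:

```java
p.apply("MatchFiles", FileIO.match().filepattern(options.getInputFilePattern()))
    .apply(FileIO.readMatches())
    .apply(ParDo.of(new ReadSnappyLinesFn()))
    .apply(ParDo.of(new TransformRecord()));
```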