Hey Sean,

Thanks for your interest in Drill. Maybe we could take a step back here. Could you explain your use case in a little more detail? It sounds to me like you'd like the ability to write compressed Parquet files and to choose the compression codec. This might be a good feature to add as a config option, i.e., when you execute a CTAS query, you could select a compression codec, or none.

Thanks!
-- C

> On Jun 18, 2021, at 10:15 AM, Leyne, Sean <[email protected]> wrote:
>
> James,
>
>> -----Original Message-----
>> From: James Turton <[email protected]>
>
>> Zip is a file format, not a codec. Various codecs are employed in Zip
>> archives, most commonly DEFLATE. The different set of codecs that are
>> supported in the Parquet file format are described in
>> https://github.com/apache/parquet-format/blob/master/Compression.md.
>
> Thanks for the link, the problem is that often the codec and the file format
> are synonymous, so people like myself don't make the distinction.
>
> Not helping is the Drill use of the ambiguous "Compression Type" terminology
> rather than "codec" in the Drill options.
>
>
>> Since, then, Zip is not sensible or possible inside a Parquet file, the only
>> way to effect what you describe would be to embed a Parquet file inside a Zip
>> archive. This would be perverse and misguided but possibly still queryable
>> since Drill might transparently do the right things to decode it anyway.
>> Using a supported codec within the Parquet file format and forgetting about
>> Zip is certainly a better approach.
>
> Might seem perverse to you, however, given that that "zip compression"
> support for text file was added in v1.17.0 (DRILL-5674)*, I think it is a
> reasonable question to ask about support for Parquet files.
>
> *there were no details on which of the codecs are supported.
>
>
>> If you want compression ratios comparable to
>> those found in Zip files then you would choose GZip and pay with CPU
>> cycles. When Drill gains support for Zstandard there will be little reason
>> to choose anything else.
>
> This is another area of confusion, if Parquet provides support for ZSTD (as
> well as other codecs) why doesn't Drill?
>
> Isn't there a standard "Parquet Library" that is available which enables
> Parquet file support with all "features", which any project implementing
> Parquet file support would use?
>
>
>>
>> On 2021/06/17 18:59, Leyne, Sean wrote:
>>> Luoc,
>>>
>>>> Could you please tell me first which case you are talking about?
>>>> Only write(CTAS syntax) or read(SELECT)?
>>> Really both, since you need a mechanism to create the zip'd parquet file to
>>> begin with. Having to create a special/side process to zip the file outside
>>> of drill would be ... awkward.
>>>
>>>
>>> Sean
>>>
>>>>> On Jun 16, 2021, at 02:26, Leyne, Sean <[email protected]> wrote:
>>>>> All,
>>>>>
>>>>> The documentation describes gzip/gz compression as supported for
>>>>> text files, and says that snappy and gzip are supported for parquet files.
>>>>> I have also read that zip compression was also added (though not
>>>>> documented) for text files.
>>>>>
>>>>> But is zip also supported for parquet files?
>>>>>
>>>>> What about support for other compression algorithms/methods? LZ4?
>>>>> Bzip2? zstd??
>>>>>
>>>>> Sean
>>>>>
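
For anyone following the thread, James's distinction between Zip as a file format and the codecs used inside it can be illustrated with a short sketch using only Python's standard zipfile module. Nothing here touches Drill or Parquet; the helper name zip_with, the entry name data.txt, and the sample payload are made up for illustration. The same payload is written into four archives, each a valid Zip file, with a different codec recorded per entry.

```python
import io
import zipfile

# Codecs that Python's zipfile module can store inside a Zip archive.
# The archive format is the same in every case; only the per-entry codec differs.
METHODS = {
    "STORED": zipfile.ZIP_STORED,      # no compression
    "DEFLATE": zipfile.ZIP_DEFLATED,   # the most common codec in Zip archives
    "BZIP2": zipfile.ZIP_BZIP2,
    "LZMA": zipfile.ZIP_LZMA,
}


def zip_with(method: int, payload: bytes) -> bytes:
    """Return an in-memory Zip archive holding payload compressed with method."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=method) as zf:
        zf.writestr("data.txt", payload)  # hypothetical entry name
    return buf.getvalue()


# Illustrative, highly repetitive sample data (not taken from the thread).
payload = b"drill parquet codec " * 500

results = {}
for name, method in METHODS.items():
    archive = zip_with(method, payload)
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        info = zf.getinfo("data.txt")
        # Every archive decodes back to the same bytes, whatever codec was used.
        assert zf.read("data.txt") == payload
        results[name] = (info.compress_type, info.compress_size)
    print(f"{name:8s} codec id={results[name][0]:2d} "
          f"compressed bytes={results[name][1]}")
```

Each archive records which codec was applied to the entry, which is the sense in which "zip" names a container and not a compression algorithm. Parquet makes the analogous choice per file from its own codec list, so the relevant question for Drill is which of those codecs it can read and write.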
