Hey Sean, 
Thanks for your interest in Drill.  Maybe we could take a step back here.  
Could you explain your use case in a little more detail?  It sounds to me like 
you'd like the ability to write compressed parquet files and to choose the 
compression codec.  This might be a good feature to add as a config option.  
I.e., when you execute a CTAS query, you could select compression... or not.

Thanks!
-- C


> On Jun 18, 2021, at 10:15 AM, Leyne, Sean <[email protected]> wrote:
> 
> James,
> 
>> -----Original Message-----
>> From: James Turton <[email protected]>
> 
>> Zip is a file format, not a codec.  Various codecs are employed in Zip
>> archives, most commonly DEFLATE.  The set of codecs supported in the
>> Parquet file format is described in
>> https://github.com/apache/parquet-format/blob/master/Compression.md.
> 
> Thanks for the link.  The problem is that the codec and the file format are
> often treated as synonymous, so people like me don't make the distinction.
> 
> It does not help that the Drill options use the ambiguous "Compression Type"
> terminology rather than "codec".
> 
> 
>> Since, then, Zip is not sensible or possible inside a Parquet file, the only
>> way to effect what you describe would be to embed a Parquet file inside a
>> Zip archive.  This would be perverse and misguided but possibly still
>> queryable since Drill might transparently do the right things to decode it
>> anyway.  Using a supported codec within the Parquet file format and
>> forgetting about Zip is certainly a better approach.
> 
> It might seem perverse to you; however, given that "zip compression"
> support for text files was added in v1.17.0 (DRILL-5674)*, I think it is a
> reasonable question to ask about support for Parquet files.
> 
> * There were no details on which codecs are supported.
> 
> 
>> If you want compression ratios comparable to those found in Zip files then
>> you would choose GZip and pay with CPU cycles.  When Drill gains support
>> for Zstandard there will be little reason to choose anything else.
> 
> This is another area of confusion: if Parquet supports ZSTD (as well as
> other codecs), why doesn't Drill?
> 
> Isn't there a standard "Parquet Library" that provides Parquet file support
> with all "features", and that any project implementing Parquet file support
> would use?
> 
> 
> 
>> 
>> On 2021/06/17 18:59, Leyne, Sean wrote:
>>> Luoc,
>>> 
>>>>   Could you please tell me first which case you are talking about?
>>>> Only write(CTAS syntax) or read(SELECT)?
>>> Really both, since you need a mechanism to create the zip'd parquet file
>>> to begin with.  Having to create a special/side process to zip the file
>>> outside of drill would be ... awkward.
>>> 
>>> 
>>> Sean
>>> 
>>>> On Jun 16, 2021, at 02:26, Leyne, Sean <[email protected]> wrote:
>>>>> All,
>>>>> 
>>>>> The documentation describes gzip/gz compression as supported for text
>>>>> files, and snappy and gzip as supported for parquet files.
>>>>> I have also read that zip compression was added (though not
>>>>> documented) for text files.
>>>>> 
>>>>> But is zip also supported for parquet files?
>>>>> 
>>>>> What about support for other compression algorithms/methods?  LZ4?
>>>>> Bzip2? zstd??
>>>>> 
>>>>> Sean
>>>>> 
>>>>> 
>>>>> 
> 
