org.apache.hadoop.yarn.util.FSDownload only recognizes tar archives with the 
extensions .tar.gz, .tgz, and .tar.
Assuming the file name ends with one of these extensions, the input stream is 
passed to FileUtil.unTar.
org.apache.hadoop.fs.FileUtil.unTar calls unTarUsingTar on non-Windows 
systems, which feeds the input stream to stdin and essentially pipes it 
through the tar -x command.
If it's a .tar.gz or .tgz file, it is run through gzip -dc before being piped 
into tar -x.
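
The behavior described above can be sketched roughly as follows. This is only 
an illustration, not the actual Hadoop source: the class and method names 
(UntarCommandSketch, isGzippedTar, isTarArchive, buildCommand) are 
hypothetical, the case-insensitive extension match is an assumption, and the 
quoting of the target directory is deliberately naive. "-f -" is used to make 
reading from stdin explicit.

```java
import java.util.Locale;

// Hypothetical sketch of the extension check and command construction
// described above. Not the actual FSDownload/FileUtil code.
public class UntarCommandSketch {

    // True for names treated as gzip-compressed tar archives.
    static boolean isGzippedTar(String name) {
        String lower = name.toLowerCase(Locale.ROOT); // assumption: case-insensitive
        return lower.endsWith(".tar.gz") || lower.endsWith(".tgz");
    }

    // True for any name currently recognized as a tar archive.
    static boolean isTarArchive(String name) {
        return isGzippedTar(name)
            || name.toLowerCase(Locale.ROOT).endsWith(".tar");
    }

    // Builds the shell pipeline that would consume the archive on stdin.
    // Naive quoting: a target directory containing ' would break this.
    static String buildCommand(String name, String targetDir) {
        if (!isTarArchive(name)) {
            throw new IllegalArgumentException("Not a recognized tar archive: " + name);
        }
        String untar = "(cd '" + targetDir + "' && tar -xf -)";
        return isGzippedTar(name) ? "gzip -dc | " + untar : untar;
    }

    public static void main(String[] args) {
        System.out.println(buildCommand("deps.tgz", "/tmp/out"));
        System.out.println(buildCommand("deps.tar", "/tmp/out"));
        System.out.println(isTarArchive("deps.tar.zst")); // not recognized today
    }
}
```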
This is very restrictive given tar supports many different compression types. 
Specifically, we would like to add ZStandard compressed tar archives to our 
distributed cache. However, these are not appropriately recognized as archives 
because the .tar.zst and .tzst extensions are not supported.
This could be supported either by adding "zstd -dc | " before tar, as is done 
with gzip, or by running tar -x --zstd.
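
A minimal sketch of the two options, for discussion only. The class, enum, 
and method names (ZstdUntarSketch, Strategy, isZstdTar, buildCommand) are 
hypothetical and do not exist in Hadoop. The first option assumes the zstd 
binary is on the PATH; the second assumes a tar build that understands 
--zstd, which still needs zstd installed.

```java
import java.util.Locale;

// Hypothetical sketch of the two proposed ways to unpack a zstd-compressed
// tar archive that arrives on stdin. Not an existing Hadoop API.
public class ZstdUntarSketch {

    enum Strategy {
        PIPE_THROUGH_ZSTD, // decompress with "zstd -dc", then pipe into tar
        TAR_ZSTD_FLAG      // let tar decompress itself via --zstd
    }

    // True for the extensions the request asks to recognize.
    static boolean isZstdTar(String name) {
        String lower = name.toLowerCase(Locale.ROOT); // assumption: case-insensitive
        return lower.endsWith(".tar.zst") || lower.endsWith(".tzst");
    }

    // Builds the shell pipeline for the chosen strategy.
    // Naive quoting: a target directory containing ' would break this.
    static String buildCommand(String targetDir, Strategy strategy) {
        String cd = "cd '" + targetDir + "' && ";
        switch (strategy) {
            case PIPE_THROUGH_ZSTD:
                return "zstd -dc | (" + cd + "tar -xf -)";
            case TAR_ZSTD_FLAG:
                return "(" + cd + "tar --zstd -xf -)";
            default:
                throw new IllegalArgumentException("Unknown strategy: " + strategy);
        }
    }

    public static void main(String[] args) {
        System.out.println(isZstdTar("deps.tar.zst"));
        System.out.println(buildCommand("/tmp/out", Strategy.PIPE_THROUGH_ZSTD));
        System.out.println(buildCommand("/tmp/out", Strategy.TAR_ZSTD_FLAG));
    }
}
```

The pipe variant mirrors the existing gzip handling and does not depend on 
the tar version, while the --zstd variant keeps the command shorter but only 
works where tar supports that flag.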
Given all the work that has gone into supporting ZStandard in Hadoop since 
Hadoop 3.x, this seems like a reasonable update.
Would it be possible to add ZStandard support to distributed cache archives?