This is awesome! Congratulations!

For anyone who wants to set this header on files written to S3, it is fairly
straightforward on Hadoop 3.3.5+:

FSDataOutputStream out = fs.createFile(new Path("s3a://data/output.parquet"))
  .overwrite(true)  // saves a HEAD
  .opt("fs.s3a.create.header.Content-Type", "application/vnd.apache.parquet")
  .progress(heartbeatCallback)  // a Progressable, if your app needs to heartbeat in close()
  .build();

// do all the writing...
out.close();

createFile() has been around since Hadoop 2.9.0, though support for
setting HTTP headers on objects is only in Hadoop 3.3.5+; the option is
ignored on older versions *or with other stores*.
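
Because the option is ignored silently, an application that cares may want to log a warning when it is running against an older release. Here is a rough sketch of such a check in plain Java. The class and method names are made up for illustration, and only the 3.3.5 threshold comes from the text above. In a real application the version string would presumably come from org.apache.hadoop.util.VersionInfo.getVersion(). Vendor builds with backports or unusual version strings are not handled.

```java
public class CreateHeaderSupport {

    // Hypothetical helper: true if the version string is 3.3.5 or later.
    public static boolean supportsCreateHeaders(String hadoopVersion) {
        // Drop any suffix such as "-SNAPSHOT" before comparing numerically.
        String[] parts = hadoopVersion.split("-")[0].split("\\.");
        int[] min = {3, 3, 5};
        for (int i = 0; i < min.length; i++) {
            // Treat a missing component as 0, so "3.4" compares as 3.4.0.
            int v = i < parts.length ? Integer.parseInt(parts[i]) : 0;
            if (v != min[i]) {
                return v > min[i];
            }
        }
        return true; // exactly 3.3.5
    }

    public static void main(String[] args) {
        for (String v : new String[] {"2.9.0", "3.3.4", "3.3.5", "3.4.0-SNAPSHOT"}) {
            System.out.println(v + " -> " + supportsCreateHeaders(v));
        }
    }
}
```

This only decides whether to warn; the createFile() call itself is the same either way, since .opt() keys that a store does not recognise are skipped.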

Drop that hadoop-2 profile and the parquet library writer can automatically
set it.

Note: AWS S3 doesn't let you update an object's MIME type after creation,
so you will need to set it when the file is created or when copying
objects from one place to another.



On Tue, 5 Mar 2024 at 19:38, Bryce Mecum wrote:

> Hi all, the Parquet format now has an official IANA media type:
> application/vnd.apache.parquet [1] and I wanted to announce that here
> so folks are aware and can share the news within their communities.
> Please see the Parquet Jira ticket [2] for more information.
>
> I think this is important because having an official media type
> reduces friction when integrating Parquet into other systems and
> prevents the proliferation of non-standard alternatives. This leads me
> to my next point:
>
> Use of the unregistered/non-standard application/x-parquet has become
> quite widespread over the years [3] and a considerable number of
> systems may want to consider adapting their systems in one way or
> another around this change. All thoughts about whether or how to go
> about this are welcome.
>
> Last, is it possible I could get someone to tweet about this on the
> @ApacheParquet account [4]? I'm happy to help provide text but I'm not
> sure who to reach out to.
>
> PS: Many thanks to Gang Wu, Gidon Gershinsky, and Gabor Szadovszky for
> recent feedback on the Jira ticket.
>
> [1]
> https://www.iana.org/assignments/media-types/application/vnd.apache.parquet
> [2] https://issues.apache.org/jira/browse/PARQUET-1889
> [3] https://github.com/search?q=application%2Fx-parquet&type=code
> [4] https://twitter.com/ApacheParquet
>
