Hi,

Sorry for not responding to this sooner...

Weston, thanks for writing up the draft!

  https://docs.google.com/document/d/1PmZFoSifV_TX4vXnv775WiOtqCgz5zLF5ryFRWio3HQ/edit?usp=sharing

Here are items we need to discuss before we apply to IANA for a
media type:

1. Interoperability Considerations

Draft:

> The Apache arrow format is intended to be a language
> independent columnar memory format for flat and
> hierarchical data. It has been shown to work in a variety
> of languages and applications. Arrow files can be
> provided in two different formats, a streaming format
> (vnd.apache.arrow.stream) and a random access format
> (vnd.apache.arrow.file). Applications should be aware of
> which format they are processing as the two are not
> interchangeable.

Note in draft:

> Should we mention something like "applications should
> make sure to check the 'version' field to ensure they
> can process the file"?

How about referring to our format document for further
information instead of mentioning the 'version' field?

  https://arrow.apache.org/docs/format/Columnar.html

XML Media Types also refers to the XML specification for
further information:

  https://tools.ietf.org/html/rfc7303#section-9.1

> For further information, see Section 2.9 "Standalone
> Document Declaration" and Section 5 "Conformance" of [XML].

2. File extension(s)

Draft:

> N/A

Note in draft:

> Again, there are no formal extensions that have been
> recommended before. Do we want to introduce any? I'm
> pretty sure this is in no way binding (and it's unlikely
> anyone will ever see it).

I would like us to recommend extensions so that we avoid a
spread of different extensions for Apache Arrow formats.
How about the following?

  * vnd.apache.arrow.file: .arrow
  * vnd.apache.arrow.stream: NA
    (Generally, this format isn't saved as a file. It is
    used for pipes, sending/receiving via sockets and so on.)

FYI: Here is a list of the extensions used in our code base.

Our integration test uses the following extensions:

  * vnd.apache.arrow.file: .arrow_file
  * vnd.apache.arrow.stream: .stream

  https://github.com/apache/arrow/blob/master/dev/archery/archery/integration/runner.py#L250-L257

    log('-- Validating file')
    producer_file_path = os.path.join(
        gold_dir, "generated_" + test_case.name + ".arrow_file")
    consumer.validate(json_path, producer_file_path)
    log('-- Validating stream')
    consumer_stream_path = os.path.join(
        gold_dir, "generated_" + test_case.name + ".stream")

Our C++ tests use the following extensions:

  * vnd.apache.arrow.file: Not used (in-memory buffer is used)
  * vnd.apache.arrow.stream: Not used (in-memory buffer is used)

Our C++ examples use the following extensions:

  * vnd.apache.arrow.file: .arrow
  * vnd.apache.arrow.stream: NA

  https://github.com/apache/arrow/blob/master/cpp/examples/minimal_build/example.cc#L34

    const char* arrow_filename = "test.arrow";

Our Python documentation uses the following extensions:

  * vnd.apache.arrow.file: .arrow
  * vnd.apache.arrow.stream: Not used (in-memory buffer is used)

  https://github.com/apache/arrow/blob/master/docs/source/python/filesystems.rst

    with local.open_output_stream("test.arrow") as file:

Our Go tests use the following extensions:

  * vnd.apache.arrow.file: Not used (no extension)
  * vnd.apache.arrow.stream: Not used (no extension)

Our Java tests use the following extensions:

  * vnd.apache.arrow.file: .arrow
  * vnd.apache.arrow.stream: .arrow
    but most of the tests use an in-memory buffer

  https://github.com/apache/arrow/blob/master/java/vector/src/test/java/org/apache/arrow/vector/ipc/TestArrowFile.java#L51

    File file = new File("target/mytest_write.arrow");

  https://github.com/apache/arrow/blob/master/java/vector/src/test/java/org/apache/arrow/vector/ipc/TestRoundTrip.java#L176

    final File temp = File.createTempFile("arrow-test-" + name + "-", ".arrow");

Our JavaScript tests use the following extensions:

  * vnd.apache.arrow.file: Not used (in-memory buffer is used)
  * vnd.apache.arrow.stream: Not used (in-memory buffer is used)

Our Julia tests use the following extensions:

  * vnd.apache.arrow.file: Not used (in-memory buffer is used)
  * vnd.apache.arrow.stream: Not used (in-memory buffer is used)

Our Rust tests use the following extensions:

  * vnd.apache.arrow.file: .arrow_file
  * vnd.apache.arrow.stream: .stream

  Note that they use data from our integration tests.


Thanks,
--
kou

In <cajpuwmckzuppmol-o0+d6fjwk-eas2teyf_pw0qzthhvx-9...@mail.gmail.com>
  "Re: Please Review: Application for a Media Type" on Fri, 22 Jan 2021 14:37:35 -0600,
  Wes McKinney <wesmck...@gmail.com> wrote:

> Thank you for taking the lead on this. I gave a brief read through and
> I think it makes sense using Thrift or Protocol Buffers as a
> guideline. Would be good for some others to review who might be
> familiar with IANA media formats
>
> On Wed, Jan 20, 2021 at 6:17 PM Weston Pace <weston.p...@gmail.com> wrote:
>>
>> Per a previous discussion
>> (https://lists.apache.org/thread.html/b15726d0c0da2223ba1b45a226ef86263f688b20532a30535cd5e267%40%3Cdev.arrow.apache.org%3E)
>> and the resulting JIRA issue ARROW-7396
>> (https://issues.apache.org/jira/browse/ARROW-7396) there is a desire
>> to register the arrow format with the IANA as a formal media type
>> (actually two media types, one for the streaming format and one for
>> the file format).
>>
>> The form for applying is here: https://www.iana.org/form/media-types
>>
>> I have created a draft registration document (link below).
>>
>> The only fields with any real flexibility are "Security
>> Considerations", "Interoperability Considerations", and "Application
>> Usage". I reviewed the applications for XML, JSON, and Thrift and
>> I've made a best attempt at these fields as well as posted examples
>> from the other languages. Please review and feel free to suggest
>> changes.
>>
>> https://docs.google.com/document/d/1PmZFoSifV_TX4vXnv775WiOtqCgz5zLF5ryFRWio3HQ/edit?usp=sharing
>>
>> Once we align on the content we should probably have a PMC member
>> actually make the submission and be listed as contact person.
>>
>> Thanks,
>>
>> Weston Pace
>> Ursa Computing
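
As a rough illustration of the extension proposal in item 2 (.arrow for the file format, no extension for the stream format), here is a minimal Python sketch using the standard mimetypes module. The "application/" top-level type and the helper name register_arrow_extensions are assumptions made for illustration; the draft itself only names the vnd.apache.arrow.file and vnd.apache.arrow.stream subtypes.

```python
import mimetypes

# Assumption: the subtype names come from the draft; the "application/"
# top-level type is assumed here for illustration only.
ARROW_FILE_TYPE = "application/vnd.apache.arrow.file"
ARROW_STREAM_TYPE = "application/vnd.apache.arrow.stream"


def register_arrow_extensions(db=None):
    """Register the proposed extension in a MimeTypes registry.

    Hypothetical helper: only the file format gets an extension (.arrow).
    The stream format is deliberately left without one, following the
    proposal that streams are normally not saved as files.
    """
    if db is None:
        db = mimetypes.MimeTypes()
    db.add_type(ARROW_FILE_TYPE, ".arrow")
    return db


db = register_arrow_extensions()

# Look up the media type for a file name, and the extension for each type.
file_type, _encoding = db.guess_type("test.arrow")
file_ext = db.guess_extension(ARROW_FILE_TYPE)
stream_ext = db.guess_extension(ARROW_STREAM_TYPE)

print("test.arrow ->", file_type)
print("file format extension:", file_ext)
print("stream format extension:", stream_ext)
```

With this mapping, a name ending in .arrow resolves to the file format only, and nothing resolves to the stream format by extension, so a consumer of a stream would need to learn the media type from the transport (for example a Content-Type header) instead.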