Hi,

Sorry for the late response...

Weston, thanks for writing up the draft!
https://docs.google.com/document/d/1PmZFoSifV_TX4vXnv775WiOtqCgz5zLF5ryFRWio3HQ/edit?usp=sharing

Here are the items we need to discuss before we apply to
IANA for a media type:

1. Interoperability Considerations

Draft:

> The Apache arrow format is intended to be a language
> independent columnar memory format for flat and
> hierarchical data.  It has been shown to work in a variety
> of languages and applications.  Arrow files can be
> provided in two different formats, a streaming format
> (vnd.apache.arrow.stream) and a random access format
> (vnd.apache.arrow.file).  Applications should be aware of
> which format they are processing as the two are not
> interchangeable.

Note in draft:

> Should we mention something like "applications should
> make sure to check the 'version' field to ensure they
> can process the file"?

How about referring to our format document for further
information instead of mentioning the 'version' field?
https://arrow.apache.org/docs/format/Columnar.html
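
The format document also spells out why the two formats are
not interchangeable: the file format is wrapped in an
"ARROW1" magic string, while each stream message opens with
a 0xFFFFFFFF continuation marker. Here is a minimal sketch of
how an application could tell them apart. The helper name is
hypothetical, and streams written by old Arrow versions
(before the continuation marker was introduced) would come
back as "unknown":

```python
import struct

ARROW_FILE_MAGIC = b"ARROW1"       # the file format starts (and ends) with this magic
STREAM_CONTINUATION = 0xFFFFFFFF   # marker that opens each message in the stream format


def sniff_arrow_ipc_format(head: bytes) -> str:
    """Guess the IPC format from the leading bytes (hypothetical helper)."""
    if head[:6] == ARROW_FILE_MAGIC:
        return "vnd.apache.arrow.file"
    # The marker is a little-endian unsigned 32-bit integer.
    if len(head) >= 4 and struct.unpack("<I", head[:4])[0] == STREAM_CONTINUATION:
        return "vnd.apache.arrow.stream"
    return "unknown"
```

Passing the first 8 bytes of the data is enough for this check.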

The XML Media Types RFC also refers to the XML specification
for further information:

https://tools.ietf.org/html/rfc7303#section-9.1

> For further information, see Section 2.9 "Standalone
> Document Declaration" and Section 5 "Conformance" of [XML].


2. File extension(s)

Draft:

> N/A

Note in draft:

> Again, there are no formal extensions that have been
> recommended before.  Do we want to introduce any?  I'm
> pretty sure this is in no way binding (and it's unlikely
> anyone will ever see it).

I'd like us to recommend extensions so that we avoid a
proliferation of different extensions for the Apache Arrow
formats.

How about the following?

  * vnd.apache.arrow.file: .arrow
  * vnd.apache.arrow.stream: N/A
    (Generally, this format isn't saved as a file. It is used
    for pipes, sending/receiving via sockets and so on.)
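
To illustrate what a recommended extension buys us, here is
a rough sketch using Python's standard mimetypes module. It
assumes the full media type name would be
application/vnd.apache.arrow.file (only the subtype is given
in this thread), and the helper name is made up:

```python
import mimetypes

# Assumed full media type name; only the subtype appears in this thread.
ARROW_FILE_TYPE = "application/vnd.apache.arrow.file"

# Register the proposed extension for the file format.
# No extension is registered for the stream format.
mimetypes.add_type(ARROW_FILE_TYPE, ".arrow")


def media_type_for(path: str):
    """Return the media type guessed from the file name, or None (hypothetical helper)."""
    return mimetypes.guess_type(path)[0]
```

With one agreed extension, tools can map "test.arrow" to the
file format without opening the file.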

FYI: Here is a list of the extensions currently used in our
code base.

Our integration test uses the following extensions:

  * vnd.apache.arrow.file: .arrow_file
  * vnd.apache.arrow.stream: .stream

https://github.com/apache/arrow/blob/master/dev/archery/archery/integration/runner.py#L250-L257

    log('-- Validating file')
    producer_file_path = os.path.join(
        gold_dir, "generated_" + test_case.name + ".arrow_file")
    consumer.validate(json_path, producer_file_path)

    log('-- Validating stream')
    consumer_stream_path = os.path.join(
        gold_dir, "generated_" + test_case.name + ".stream")

Our C++ tests use the following extensions:

  * vnd.apache.arrow.file: Not used (in-memory buffer is used)
  * vnd.apache.arrow.stream: Not used (in-memory buffer is used)

Our C++ examples use the following extensions:

  * vnd.apache.arrow.file: .arrow
  * vnd.apache.arrow.stream: N/A

https://github.com/apache/arrow/blob/master/cpp/examples/minimal_build/example.cc#L34

    const char* arrow_filename = "test.arrow";

Our Python documentation uses the following extensions:

  * vnd.apache.arrow.file: .arrow
  * vnd.apache.arrow.stream: Not used (in-memory buffer is used)

https://github.com/apache/arrow/blob/master/docs/source/python/filesystems.rst

   with local.open_output_stream("test.arrow") as file:

Our Go tests use the following extensions:

  * vnd.apache.arrow.file: Not used (no extension)
  * vnd.apache.arrow.stream: Not used (no extension)

Our Java tests use the following extensions:

  * vnd.apache.arrow.file: .arrow
  * vnd.apache.arrow.stream: .arrow, but most tests use an in-memory buffer

https://github.com/apache/arrow/blob/master/java/vector/src/test/java/org/apache/arrow/vector/ipc/TestArrowFile.java#L51

    File file = new File("target/mytest_write.arrow");

https://github.com/apache/arrow/blob/master/java/vector/src/test/java/org/apache/arrow/vector/ipc/TestRoundTrip.java#L176

    final File temp = File.createTempFile("arrow-test-" + name + "-", ".arrow");

Our JavaScript tests use the following extensions:

  * vnd.apache.arrow.file: Not used (in-memory buffer is used)
  * vnd.apache.arrow.stream: Not used (in-memory buffer is used)

Our Julia tests use the following extensions:

  * vnd.apache.arrow.file: Not used (in-memory buffer is used)
  * vnd.apache.arrow.stream: Not used (in-memory buffer is used)

Our Rust tests use the following extensions:

  * vnd.apache.arrow.file: .arrow_file
  * vnd.apache.arrow.stream: .stream

Note that they use the data from our integration test.


Thanks,
--
kou

In <cajpuwmckzuppmol-o0+d6fjwk-eas2teyf_pw0qzthhvx-9...@mail.gmail.com>
  "Re: Please Review: Application for a Media Type" on Fri, 22 Jan 2021 14:37:35 -0600,
  Wes McKinney <wesmck...@gmail.com> wrote:

> Thank you for taking the lead on this. I gave a brief read through and
> I think it makes sense using Thrift or Protocol Buffers as a
> guideline. Would be good for some others to review who might be
> familiar with IANA media formats
> 
> On Wed, Jan 20, 2021 at 6:17 PM Weston Pace <weston.p...@gmail.com> wrote:
>>
>> Per a previous discussion
>> (https://lists.apache.org/thread.html/b15726d0c0da2223ba1b45a226ef86263f688b20532a30535cd5e267%40%3Cdev.arrow.apache.org%3E)
>> and the resulting JIRA issue ARROW-7396
>> (https://issues.apache.org/jira/browse/ARROW-7396) there is a desire
>> to register the arrow format with the IANA as a formal media type
>> (actually two media types, one for the streaming format and one for
>> the file format).
>>
>> The form for applying is here: https://www.iana.org/form/media-types
>>
>> I have created a draft registration document (link below).
>>
>> The only fields with any real flexibility are "Security
>> Considerations", "Interoperability Considerations", and "Application
>> Usage".  I reviewed the applications for XML, JSON, and Thrift and
>> I've made a best attempt at these fields as well as posted examples
>> from the other languages.  Please review and feel free to suggest
>> changes.
>>
>> https://docs.google.com/document/d/1PmZFoSifV_TX4vXnv775WiOtqCgz5zLF5ryFRWio3HQ/edit?usp=sharing
>>
>> One we align on the content we should probably have a PMC member
>> actually make the submission and be listed as contact person.
>>
>> Thanks,
>>
>> Weston Pace
>> Ursa Computing
