Re: Should EOS be mandatory for IPC File format?

2019-05-24 Thread John Muehlhausen
I will restructure the existing PR and create new ones (with JIRA) for JS/Java. Just haven't gotten around to it yet. On Wed, May 22, 2019 at 9:10 PM Wes McKinney wrote: > Yes, I think separate JIRA issues for Java and JS would be best. I'd > recommend having one patch for each, so maybe we can

Re: Should EOS be mandatory for IPC File format?

2019-05-22 Thread Wes McKinney
Yes, I think separate JIRA issues for Java and JS would be best. I'd recommend having one patch for each, so maybe we can just sort out C++ for now On Wed, May 22, 2019 at 3:03 PM John Muehlhausen wrote: > > I added this to https://github.com/apache/arrow/pull/4372 and am hoping CI > will test it

Re: Should EOS be mandatory for IPC File format?

2019-05-22 Thread John Muehlhausen
I added this to https://github.com/apache/arrow/pull/4372 and am hoping CI will test it for me. Do Java/JS require separate JIRA entries? On Wed, May 22, 2019 at 2:53 PM Wes McKinney wrote: > It doesn't appear that Java writes an EOS (4-byte 0) after the last > record batch in ArrowFileWriter.j

Re: Should EOS be mandatory for IPC File format?

2019-05-22 Thread Wes McKinney
It doesn't appear that Java writes an EOS (4-byte 0) after the last record batch in ArrowFileWriter.java [1] so may want to open a JIRA to make this simple change. Note that this addition (not even a change) is backward compatible, old readers will be unaffected [1]: https://github.com/apache/ar

Re: Should EOS be mandatory for IPC File format?

2019-05-22 Thread John Muehlhausen
https://github.com/apache/arrow/pull/4372 First contribution attempt... sorry in advance if I'm not coloring inside the lines! On Wed, May 22, 2019 at 9:06 AM John Muehlhausen wrote: > I will submit a patch once I get set up for that. My crystal ball says > that some people will rely on sequen

Re: Should EOS be mandatory for IPC File format?

2019-05-22 Thread John Muehlhausen
I will submit a patch once I get set up for that. My crystal ball says that some people will rely on sequential rather than seeking access to File data. E.g. a utility that can either write the same data to a File or to stdout, piped to another process that can read a File from stdin and therefor

Re: Should EOS be mandatory for IPC File format?

2019-05-22 Thread Wes McKinney
The file format isn't intended to be used in a streaming setting. Use RecordBatchStreamWriter if you need to be able to read a dataset as a stream That being said I don't have a problem with writing the EOS in the file, but the current implementation is not "wrong". On Wed, May 22, 2019 at 8:37 A

Re: Should EOS be mandatory for IPC File format?

2019-05-22 Thread John Muehlhausen
I believe the change involves updating the File format notes as above, as well as something like the following. The format also mentions "there is no requirement that dictionary keys should be defined in a DictionaryBatch before they are used in a RecordBatch, as long as the keys are defined somew

Re: Should EOS be mandatory for IPC File format?

2019-05-22 Thread Antoine Pitrou
Hi John, It the IPC code segfaults on invalid input then it's worth opening an issue on JIRA. Regards Antoine. Le 21/05/2019 à 23:52, John Muehlhausen a écrit : > Wes, > > Check out reader.cpp.  It seg faults when it gets to the next > message-that-is-not-a-message... it is a footer.  But I

Re: Should EOS be mandatory for IPC File format?

2019-05-21 Thread Micah Kornfield
This seems like a reasonable change. Is there any reason that we shouldnt always append EOS? On Tuesday, May 21, 2019, John Muehlhausen wrote: > Wes, > > Check out reader.cpp. It seg faults when it gets to the next > message-that-is-not-a-message... it is a footer. But I have no way to > know

Re: Should EOS be mandatory for IPC File format?

2019-05-21 Thread John Muehlhausen
Wes, Check out reader.cpp. It seg faults when it gets to the next message-that-is-not-a-message... it is a footer. But I have no way to know this in reader.cpp because I'm piping the File in via stdin. In seeker.cpp I seek to the end and figure out where the footer is (this is a py-arrow-writte

Re: Should EOS be mandatory for IPC File format?

2019-05-21 Thread Wes McKinney
hi John, I'm not sure I follow. The EOS you're referring to is part of the streaming format. It's designed to be readable using an InputStream interface that does not support seeking at all. You can see the core logic where messages are popped off the InputStream here https://github.com/apache/ar

Should EOS be mandatory for IPC File format?

2019-05-21 Thread John Muehlhausen
https://arrow.apache.org/docs/format/IPC.html#file-format If this stream marker is optional in the file format, doesn't this prevent someone from reading the file without being able to seek() it, e.g. if it is "piped in" to a program? Or otherwise they'll have to stream in the entire thing befo