Re: bug? pyarrow.Table.from_pydict does not handle binary type correctly with embedded 00 bytes?

2020-11-04 Thread Jason Sachs
GAH! It looks like it might be my problem, not pyarrow; type code 'S' is null-terminated data: https://numpy.org/doc/stable/reference/arrays.dtypes.html ("'S', 'a': zero-terminated bytes (not recommended)"). Now I have to figure out why I'm getting that S code (it's generated through some sort of …
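For reference, a minimal sketch of the behavior those docs describe: numpy's 'S' dtype strips trailing NUL bytes on element access (only trailing ones; embedded NULs survive in the element), while an object array of Python bytes keeps the payload intact.

    import numpy as np

    # numpy's 'S' dtype treats values as zero-terminated:
    # trailing \x00 bytes disappear on element access.
    a = np.array([b"ab\x00"], dtype="S3")
    print(a[0])          # b'ab'      -- trailing NUL stripped
    print(a.tobytes())   # b'ab\x00'  -- the raw buffer still holds it

    # An object array of Python bytes keeps the value intact.
    b = np.array([b"ab\x00"], dtype=object)
    print(b[0])          # b'ab\x00'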

Re: bug? pyarrow.Table.from_pydict does not handle binary type correctly with embedded 00 bytes?

2020-11-04 Thread Jason Sachs
> Seems a bit buggy

Yeah, that's a bit of an understatement :/ Done: https://issues.apache.org/jira/browse/ARROW-10498

I'm trying to poke around, but it looks like it may affect all of the from_* methods. I don't grok Cython very well, so I'm not sure I can get to a root cause easily.

On 2020/…

Re: bug? pyarrow.Table.from_pydict does not handle binary type correctly with embedded 00 bytes?

2020-11-04 Thread Wes McKinney
Seems a bit buggy, can you open a Jira issue? Thanks

On Wed, Nov 4, 2020 at 5:05 PM Jason Sachs wrote:
>
> It looks like pyarrow.Table.from_pydict() cuts off binary data after an
> embedded 00 byte. Is this a known bug?
>
> (py3) C:\>python
> Python 3.8.5 (default, Sep 3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)] :: Anaconda, Inc. on win32 …

bug? pyarrow.Table.from_pydict does not handle binary type correctly with embedded 00 bytes?

2020-11-04 Thread Jason Sachs
It looks like pyarrow.Table.from_pydict() cuts off binary data after an embedded 00 byte. Is this a known bug?

(py3) C:\>python
Python 3.8.5 (default, Sep 3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)] :: Anaconda, Inc. on win32
Type "help", "copyright", "credits" or "license" for more information. …
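The transcript is cut off above, but the shape of a workaround is straightforward to sketch. Assuming the values arrive as Python bytes (the column name "payload" is made up for illustration), pinning the column to pa.binary() with an explicit schema keeps the data away from numpy's zero-terminated 'S' dtype, and the embedded 00 byte should round-trip:

    import pyarrow as pa

    # Hypothetical data: bytes values containing an embedded 0x00 byte.
    data = {"payload": [b"abc\x00def", b"\x00\x01\x02"]}

    # An explicit schema pins the column to Arrow's variable-length
    # binary type instead of letting type inference decide.
    schema = pa.schema([("payload", pa.binary())])
    table = pa.Table.from_pydict(data, schema=schema)

    assert table.column("payload").to_pylist() == [b"abc\x00def", b"\x00\x01\x02"]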

Re: Compressing parquet metadata?

2020-11-04 Thread Jason Sachs
Yes. Ouch, so there's a 4/3 size hit there for base64. (Is that always the case, or does it use plaintext if possible?) I'm trying to figure out what kind of request to file in the issue tracker to help support my use case (data logging). I have enough stuff I want to put in metadata that the use of …
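Absent a built-in option (see downthread), a sketch of the manual approach: compress the JSON blob yourself and store the result as a binary metadata value. The key name "log_meta" and the file name are made up for illustration; the compressed bytes round-trip, at the cost of the base64 expansion in the Parquet footer.

    import json
    import zlib
    import pyarrow as pa
    import pyarrow.parquet as pq

    # Compress the large JSON blob before attaching it as key-value metadata.
    payload = json.dumps({"channels": ["volts", "amps"], "notes": "..."}).encode("utf-8")
    packed = zlib.compress(payload)

    table = pa.table({"volts": [1.0, 2.0]})
    table = table.replace_schema_metadata({b"log_meta": packed})
    pq.write_table(table, "log.parquet")

    # Reading it back: decompress the binary value to recover the JSON.
    raw = pq.read_table("log.parquet").schema.metadata[b"log_meta"]
    meta = json.loads(zlib.decompress(raw).decode("utf-8"))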

Re: Compressing parquet metadata?

2020-11-04 Thread Wes McKinney
You mean the key-value metadata at the schema/field level? That can be binary (it gets base64-encoded when written to Parquet).

On Wed, Nov 4, 2020 at 10:22 AM Jason Sachs wrote:
>
> OK. If I take the manual approach, do parquet / arrow care whether metadata
> is binary or not?
>
> On 2020/11/04 …

Re: Compressing parquet metadata?

2020-11-04 Thread Jason Sachs
OK. If I take the manual approach, do parquet / arrow care whether metadata is binary or not?

On 2020/11/04 14:16:37, Wes McKinney wrote:
> There is not to my knowledge.
>
> On Tue, Nov 3, 2020 at 5:55 PM Jason Sachs wrote:
> >
> > Is there any built-in method to compress parquet metadata? From what I can
> > tell, the main table columns are compressed, but not the metadata. …
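A quick sketch answering the binary question: Arrow's key-value metadata is a bytes-to-bytes mapping, so an arbitrary binary value (embedded NULs included) is acceptable and survives a Parquet round trip via the base64 encoding Wes describes in his reply. Names here ("blob", "t.parquet") are made up.

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Key-value metadata is a bytes -> bytes mapping, so arbitrary
    # binary values are fine at the Arrow level.
    t = pa.table({"x": [1, 2]}).replace_schema_metadata({b"blob": b"\x00\xff\x00"})
    pq.write_table(t, "t.parquet")

    # The value survives the round trip (base64-encoded inside the footer).
    assert pq.read_table("t.parquet").schema.metadata[b"blob"] == b"\x00\xff\x00"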

Re: Compressing parquet metadata?

2020-11-04 Thread Wes McKinney
There is not to my knowledge.

On Tue, Nov 3, 2020 at 5:55 PM Jason Sachs wrote:
>
> Is there any built-in method to compress parquet metadata? From what I can
> tell, the main table columns are compressed, but not the metadata.
>
> I have metadata which includes 100-200KB of text (JSON format) …

Arrow java implementation: Compatible IO streams.

2020-11-04 Thread Saloni Udani
Hello, I have a use case where I want to write an Arrow batch to my existing output stream (a custom stream extending java.io.OutputStream) and read it back from my existing input stream (a custom stream extending java.io.InputStream). I used ArrowStreamWriter and ArrowStreamReader, but on the reader side …