Re: [Format][Union] polymorphic vectors vs ADT style vectors

2024-04-02 Thread Steve Kim
Thank you for asking this question. I have the same question. I noted a similar problem in the C++/Python implementation: https://github.com/apache/arrow/issues/19157#issuecomment-1528037394

On Tue, Apr 2, 2024, 04:30 Finn Völkel wrote:
> Hi,
>
> my question primarily concerns the union layout …
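(Not part of the original message, but a minimal pyarrow sketch of the dense union layout under discussion; the type codes, offsets, and child arrays below are purely illustrative.)

```python
import pyarrow as pa

# Dense union: a type-code buffer picks the child array, and an offsets
# buffer picks the slot within that child. Children are packed densely.
type_codes = pa.array([0, 1, 0], type=pa.int8())
offsets = pa.array([0, 0, 1], type=pa.int32())
ints = pa.array([10, 20], type=pa.int64())
strs = pa.array(["a"], type=pa.string())

union = pa.UnionArray.from_dense(type_codes, offsets, [ints, strs], ["i", "s"])
print(union.type)         # dense union type with children i: int64, s: string
print(union.to_pylist())  # [10, 'a', 20]
```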

row counts in footer of IPC file format

2023-03-18 Thread Steve Kim
Hello everyone, I would like to be able to quickly seek to an arbitrary row in an Arrow file. With the current file format, reading the file footer alone is not enough to determine the record batch that contains a given row index. The row counts of the record batches are only found in the metadata …
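(As an illustration of the limitation described above, a sketch not taken from the message: with the current format you have to touch each record batch to learn its row count before you can seek. The helper name and file path are hypothetical.)

```python
import pyarrow.ipc as ipc

def locate_row(path, row_index):
    """Return (batch_index, offset_within_batch) for a global row index.

    Per-batch row counts are not in the footer, so each batch must be
    loaded just to accumulate row counts before seeking.
    """
    reader = ipc.open_file(path)
    start = 0
    for i in range(reader.num_record_batches):
        n = reader.get_batch(i).num_rows
        if row_index < start + n:
            return i, row_index - start
        start += n
    raise IndexError(row_index)
```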

Re: [DISCUSS] Revisiting LZ4 Compression for Arrow Buffers

2021-03-11 Thread Steve Kim
I prefer the lz4 frame format for the reasons that Antoine stated. To be friendly to users, the Arrow IPC documentation could mention that lz4 compression may break Java interoperability. If block dependency is the only obstacle to Java interoperability, the Arrow IPC implementation could disable …
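(For reference, a short sketch not part of the message: enabling LZ4 buffer compression when writing an IPC file with pyarrow. The table contents and file name are illustrative.)

```python
import pyarrow as pa
import pyarrow.ipc as ipc

table = pa.table({"x": list(range(1000))})
# "lz4" selects the LZ4 frame format for IPC body buffer compression.
options = ipc.IpcWriteOptions(compression="lz4")
with pa.OSFile("data.arrow", "wb") as sink:
    with ipc.new_file(sink, table.schema, options=options) as writer:
        writer.write_table(table)
```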

Re: java arrow: memory management with multiple references to same batch

2021-01-29 Thread Steve Kim
I recommend that you direct these questions to u...@arrow.apache.org (https://mail-archives.apache.org/mod_mbox/arrow-user/).

On Fri, Jan 29, 2021 at 7:07 AM Joris Peeters wrote:
>
> Hello,
>
> I'm writing an HTTP server in Java that provides Arrow data to users. For
> performance, I keep the mo…

Re: [C++] read Parquet columns into 64-bit offset types

2021-01-17 Thread Steve Kim
> This should be possible already, at least on git master but perhaps also
> in 2.0.0. Which problem are you encountering?

With pyarrow 2.0.0, I encountered the following:

```
>>> import pyarrow as pa
>>> import pyarrow.parquet as pq
>>> import pyarrow.dataset as ds
>>> pa.__version__
'2.0.0'
>>> …
```
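(Not from the thread, but a sketch of one way to end up with 64-bit offset types from a Parquet file in pyarrow: read the table and cast string columns to `large_string` afterwards. The column name and file name are made up.)

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Write a small file with a regular (32-bit offset) string column.
pq.write_table(pa.table({"s": ["a", "bb", "ccc"]}), "example.parquet")

# Read it back, then cast to large_string (64-bit offsets).
table = pq.read_table("example.parquet")
table64 = table.cast(pa.schema([pa.field("s", pa.large_string())]))
print(table64.schema)  # s: large_string
```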

[C++] read Parquet columns into 64-bit offset types

2021-01-08 Thread Steve Kim
… more generally by enabling users to specify type coercion/promotion when mapping Parquet types to Arrow types. Are other users interested in this feature? Is anyone opposed?

Steve Kim

language independent representation of filter expressions

2020-07-06 Thread Steve Kim
I have been following the discussion on a pull request (https://github.com/apache/arrow/pull/7030) by Hongze Zhang to use the high-level dataset API via JNI. An obstacle that was encountered in this PR is that there is not a good way to pass a filter expression via JNI. Expressions have a defined …
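(To make the kind of expression concrete, a sketch not from the thread; file and column names are illustrative. This is the sort of in-process filter expression that would need a language-independent encoding to cross a JNI boundary.)

```python
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

pq.write_table(pa.table({"a": [1, 5, 9], "b": ["x", "y", "x"]}), "part.parquet")

dataset = ds.dataset("part.parquet", format="parquet")
# Built as a Python object; the question is how to carry such an
# expression across language boundaries in a neutral representation.
expr = (ds.field("a") > 3) & (ds.field("b") == "x")
print(dataset.to_table(filter=expr).to_pydict())  # {'a': [9], 'b': ['x']}
```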

Re: [DISCUSS] Ongoing LZ4 problems with Parquet files

2020-07-06 Thread Steve Kim
> Would that keep compatibility with existing files produced by Parquet C++?

Changing the lz4 implementation to be compatible with parquet-mr/Hadoop would break compatibility with any existing files that were written by Parquet C++ using lz4 compression. I believe that it is not possible to reliably …

Re: [DISCUSS] Ongoing LZ4 problems with Parquet files

2020-07-06 Thread Steve Kim
The Parquet format specification is ambiguous about the exact details of LZ4 compression. However, the *de facto* reference implementation in Java (parquet-mr) uses the Hadoop LZ4 codec. I think that it is important for Parquet C++ to have compatibility and feature parity with parquet-mr when possible …
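(Not part of the message, but a sketch of the practical workaround this thread implies: when cross-implementation compatibility matters, pick a codec other than LZ4 when writing Parquet from pyarrow. Table contents and file name are illustrative.)

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"x": [1, 2, 3]})
# ZSTD (or Snappy) sidesteps the LZ4 framing mismatch between
# Parquet C++ and parquet-mr/Hadoop discussed in this thread.
pq.write_table(table, "data.parquet", compression="zstd")
```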