Re: [ARROW-17255] Logical JSON type in Arrow

2022-08-01 Thread Micah Kornfield
> 2. What do we do about different non-utf8 encodings? There does not appear to be a consensus yet on this point. One option is to only allow utf8 encoding and force implementers to convert non-utf8 to utf8. Second option is to allow all encodings and capture the encoding in the

Re: [ARROW-17255] Logical JSON type in Arrow

2022-08-01 Thread Antoine Pitrou
On 01/08/2022 at 22:53, Pradeep Gollakota wrote: Thanks for all the great feedback. To proceed forward, we seem to need decisions around the following: 1. Whether to use arrow extensions or first class types. The consensus is building towards using arrow extensions. +1 2. What do we do

Re: [DISCUSS][Format] Starting to do some concrete work on the new "StringView" columnar data type

2022-08-01 Thread Jorge Cardoso Leitão
I am +1 on either - imo: * it is important to have either available * both provide a non-trivial improvement over what we have * the trade-off is difficult to decide upon - I trust whoever is implementing it to experiment and decide which better fits Arrow and the ecosystem. Thank you so much fo

Re: [DISCUSS][Format] Starting to do some concrete work on the new "StringView" columnar data type

2022-08-01 Thread Wes McKinney
On Sun, Jul 31, 2022 at 8:05 AM Antoine Pitrou wrote: > Hi Wes, > On 31/07/2022 at 00:02, Wes McKinney wrote: > > I understand there are still some aspects of this project that cause some squeamishness (like having arbitrary memory addresses embedded within array values whose l

[QUESTION] How is mmap implemented for 8bit padded files?

2022-08-01 Thread Jorge Cardoso Leitão
Hi, I am trying to follow the C++ implementation with respect to mmap IPC files and reading them zero-copy, in the context of reproducing it in Rust. My understanding from reading the source code is that we essentially: * identify the memory regions (offset and length) of each of the buffers, via
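The general idea raised in this message (locate each buffer by offset and length, then view it in place without copying) can be sketched with Python's standard `mmap` module. This is only an illustration under stated assumptions, not the Arrow C++ or Rust implementation: the file layout, the 8-byte padding, and the helper names `write_padded`, `read_regions`, and `demo` are all hypothetical.

```python
import mmap
import os
import tempfile


def write_padded(path, payloads, alignment=8):
    """Write each payload padded to `alignment` bytes; return (offset, length) pairs."""
    regions = []
    with open(path, "wb") as f:
        for payload in payloads:
            offset = f.tell()
            f.write(payload)
            # Zero-fill up to the next multiple of `alignment`.
            f.write(b"\0" * (-len(payload) % alignment))
            regions.append((offset, len(payload)))
    return regions


def read_regions(path, regions, consume):
    """Map the file read-only and call `consume` on a zero-copy view of each region."""
    results = []
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        with memoryview(mm) as whole:
            for offset, length in regions:
                # Slicing a memoryview does not copy; it points into the mapping.
                with whole[offset:offset + length] as view:
                    results.append(consume(view))
    return results


def demo():
    """Round-trip two small buffers through a padded file (illustrative only)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "buffers.bin")
        regions = write_padded(path, [b"abc", b"hello"])
        # `bytes` copies here only so the result outlives the mapping.
        return read_regions(path, regions, bytes)
```

The views are only valid while the mapping is open, which is why the sketch hands them to a callback instead of returning them.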

Help with writing/reading from s3

2022-08-01 Thread Li Jin
Hello! We recently updated Arrow to 7.0.0 and hit an error with our old code (details below). I wonder if there is a new way to do this with the current version? import pyarrow import pyarrow.parquet as pq df = pd.DataFrame({"aa": [1, 2, 3], "bb": [1, 2, 3]}) uri = "gs://amp_bucket_liao/tr

Re: [ARROW-17255] Logical JSON type in Arrow

2022-08-01 Thread Micah Kornfield
> It would be reasonable to restrict JSON to utf8, and tell people they need to transcode in the rare cases where some obnoxious software outputs utf16-encoded JSON. +1 I think this aligns with the latest JSON RFC [1] as well. Sounds good to me too. +1 on the canonical extension type option
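The transcoding step mentioned in this message could look roughly like the following sketch. It assumes the producer marks UTF-16 input with a byte-order mark and that anything else is already UTF-8; the helper name `to_utf8_json` is hypothetical and is not part of any Arrow API.

```python
import codecs
import json


def to_utf8_json(raw: bytes) -> bytes:
    """Return JSON bytes re-encoded as UTF-8 (hypothetical helper, BOM-based detection)."""
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The "utf-16" codec reads the BOM to pick the byte order and strips it.
        text = raw.decode("utf-16")
    else:
        # Assumption: everything else is UTF-8, optionally with a UTF-8 BOM.
        text = raw.decode("utf-8-sig")
    # Fail early if the payload is not valid JSON.
    json.loads(text)
    return text.encode("utf-8")
```

Input without a BOM in some other encoding would need a different detection strategy, which this sketch does not attempt.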

Replace conda with mamba in docs?

2022-08-01 Thread Jacob Wujciak
Hello Everyone, in ARROW-17224 [1] the painfully slow solve times of conda (while installing arrow) were surfaced. The only solution seems to be to use mamba (or wait until the mamba solver is integrated in conda...). As waiting for a very long time is pretty bad U/DX, should we recommend mamba in

Re: [DISCUSS][Format] Dynamic data encodings in the IPC format and C ABI

2022-08-01 Thread Antoine Pitrou
Potentially extending the IPC format to support these additional flexibilities is the easy part. The difficult part is to shoehorn the newfound flexibility into existing APIs, also leaking into the expectations of downstream users. For example, in C++ it is expected that a RecordBatchRea

Re: [VOTE] Release Apache Arrow 9.0.0 - RC2

2022-08-01 Thread Raul Cumplido Dominguez
Hi, +1 (non-binding) TL;DR: I found an issue in the integration tests on Ubuntu 22.04 with OpenJDK 18, but it seems to be related to how archery runs integration tests (see JIRA [1]). I've been able to verify the release without issues: TEST_DEFAULT=0 TEST_SOURCE=1 dev/release/verify-release-candidate.s

Re: [DISCUSS][Format] Dynamic data encodings in the IPC format and C ABI

2022-08-01 Thread Andrew Lamb
I like the idea of adding these specialized encodings in some sort of optional extension or wrapper around a RecordBatch, that maybe even isn't standardized at all in Arrow. As Sasha observes, with the notable exception of Dictionary, Arrow has exactly one physical encoding for each logical type.