Re: [cpp] Alignment and Padding

2022-08-04 Thread James
Perhaps the thing I’m misunderstanding is that the compiler flag in question only pertains to data loaded in an AVX register?

[cpp] Alignment and Padding

2022-08-04 Thread James
In the columnar format doc, it is noted that buffers ought to be allocated such that they're 64 byte aligned and padded. It is also noted that this allows the use of compiler options such as -qopt-assume-safe-padding. However, my understanding
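The 64-byte rule covers both the start address and the allocated length: the length is rounded up to the next multiple of 64 so that vectorized loads past the logical end of the buffer stay inside owned memory. A minimal sketch of the length-padding arithmetic (plain Python, not Arrow's allocator):

```python
def pad_to_64(nbytes: int) -> int:
    """Round a buffer length up to the next multiple of 64 bytes."""
    return (nbytes + 63) & ~63

# A validity bitmap for 100 values needs ceil(100 / 8) = 13 bytes,
# but a padded allocation reserves a full 64-byte region, so a
# 64-byte (512-bit) vector load at the end cannot overrun the buffer.
assert pad_to_64(13) == 64
assert pad_to_64(64) == 64
assert pad_to_64(65) == 128
```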

Re: Issue filtering partitioned Parquet files on partition keys using PyArrow

2022-08-04 Thread David Li
FWIW, we _should_ already perform the "subtree" filtering (see subtree_internal.h [1]), so either it's not the bottleneck or the optimization is not as effective as we would like. Or possibly we need to maintain the files as a tree in the first place instead of trying to recover the structure
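A rough illustration of the subtree idea (hypothetical names, not the actual subtree_internal.h API): group file paths under their top-level partition directory so the filter is evaluated once per directory, pruning a whole subtree of files with a single check.

```python
from collections import defaultdict

def prune_by_subtree(files, keep_prefix):
    """Group paths like 'year=2021/part-0.parquet' by their first
    partition directory, then apply the filter once per group
    instead of once per file."""
    subtrees = defaultdict(list)
    for path in files:
        prefix, _, _rest = path.partition("/")
        subtrees[prefix].append(path)
    kept = []
    for prefix, members in subtrees.items():
        if keep_prefix(prefix):  # one check keeps or drops the whole subtree
            kept.extend(members)
    return kept

files = [
    "year=2021/month=1/part-0.parquet",
    "year=2021/month=2/part-0.parquet",
    "year=2022/month=1/part-0.parquet",
]
print(prune_by_subtree(files, lambda p: p == "year=2022"))
# ['year=2022/month=1/part-0.parquet']
```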

Re: Issue filtering partitioned Parquet files on partition keys using PyArrow

2022-08-04 Thread Weston Pace
Awesome.

# Partitioning (src/arrow/dataset/partition.h)

The first spot to look at might be to understand the Partitioning class. A Partitioning (e.g. hive partitioning, directory partitioning, filename partitioning) has two main methods that convert between a path (e.g.
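Those two directions can be sketched in plain Python as a toy stand-in for the C++ Partitioning class (the real methods produce and consume Arrow Expressions; the function names here are hypothetical):

```python
def parse_hive_path(path):
    """Path -> partition keys: 'year=2022/month=8' ->
    {'year': '2022', 'month': '8'}.
    (The real Partitioning::Parse yields a filter Expression.)"""
    return dict(seg.split("=", 1) for seg in path.split("/") if "=" in seg)

def format_hive_path(keys):
    """Partition keys -> path: the inverse direction,
    used when writing a partitioned dataset."""
    return "/".join(f"{k}={v}" for k, v in keys.items())

assert parse_hive_path("year=2022/month=8") == {"year": "2022", "month": "8"}
assert format_hive_path({"year": "2022", "month": "8"}) == "year=2022/month=8"
```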

RE: Issue filtering partitioned Parquet files on partition keys using PyArrow

2022-08-04 Thread Tomaz Maia Suller
Weston, I'm interested in following up.

Re: Issue filtering partitioned Parquet files on partition keys using PyArrow

2022-08-04 Thread Weston Pace
There is a lot of room for improvement here. In the datasets API the call that you have described (read_parquet) is broken into two steps:

* dataset discovery

During dataset discovery we don't use any partition filter. The goal is to create the "total dataset" of all the files. So in your
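A toy model of that two-step split (hypothetical helpers, not the datasets API itself): discovery enumerates every file with no filter applied, and only the later scan step uses the partition predicate to drop files.

```python
def discover(all_paths):
    # Step 1: dataset discovery -- no partition filter is applied,
    # so every file in the dataset is listed.
    return list(all_paths)

def scan(dataset, predicate):
    # Step 2: only at scan time is the partition filter consulted,
    # dropping files whose partition values fail the predicate.
    kept = []
    for path in dataset:
        keys = dict(seg.split("=", 1) for seg in path.split("/") if "=" in seg)
        if predicate(keys):
            kept.append(path)
    return kept

paths = ["year=2021/part-0.parquet", "year=2022/part-0.parquet"]
dataset = discover(paths)  # touches both files regardless of the filter
print(scan(dataset, lambda k: k.get("year") == "2022"))
# ['year=2022/part-0.parquet']
```

This is why a highly selective partition filter does not speed up discovery itself: the full file listing happens before the filter is ever seen.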

RE: Issue filtering partitioned Parquet files on partition keys using PyArrow

2022-08-04 Thread Tomaz Maia Suller
Hi David, I wonder if the problem with the attachments has to do with the files not having extensions... I'm trying to send them with .prof this time. Anyway: 1. I'm writing to a local filesystem; I've mounted an NTFS partition which is on an HDD. Since the dataset is only ~1.5 GB, I'll try