hashing Arrow structures

2023-07-21 Thread Yaron Gvili
Hi, What are the recommended ways to hash Arrow structures? What are the pros and cons of each approach? Looking a bit through the code, I've so far found two different hashing approaches, which I describe below. Are there any others? A first approach I found is using `Hashing32` and

Re: [ANNOUNCE] New Arrow PMC member: Will Jones

2023-03-13 Thread Yaron Gvili
Congrats, Will! Yaron. From: Rok Mihevc Sent: Monday, March 13, 2023 9:48 PM To: dev@arrow.apache.org Subject: Re: [ANNOUNCE] New Arrow PMC member: Will Jones Congratulations Will! Rok On Mon, Mar 13, 2023 at 8:37 PM Steph Hazlitt wrote: > Congrats Will! >

testing of back-pressure

2023-02-16 Thread Yaron Gvili
Hi, What testing of back-pressure exists in Acero? I'm mostly interested in testing of back-pressure that applies to any ExecNode, but could also learn from more specific testing. If this is not well covered, I'd look into implementing such testing. Cheers, Yaron.

Re: Build issues (Protobuf internal symbols)

2023-02-13 Thread Yaron Gvili
@Li, my understanding is that if you generate headers from the same Arrow protobuf files using the same toolchain (including protoc) used in the Arrow build, then you will be able to use these headers to correctly access protobuf objects within Arrow structures. This doesn't unhide symbols in

Re: measuring memory usage of Arrow structures

2022-10-28 Thread Yaron Gvili
ol_jemalloc.cc#L157 >> [2] >> https://github.com/apache/arrow/blob/master/cpp/src/arrow/memory_pool_test.cc >> >> On Fri, Oct 28, 2022 at 3:10 PM Yaron Gvili wrote: >> >>> Hi, >>> >>> Is there a supported/convenient way for mea

measuring memory usage of Arrow structures

2022-10-28 Thread Yaron Gvili
Hi, Is there a supported/convenient way for measuring the memory usage of Arrow structures? For my specific use case, measuring memory usage of either a record batch or an array would be sufficiently convenient. Cheers, Yaron.

Re: archery-lint unknown targets

2022-10-21 Thread Yaron Gvili
. Le 21/10/2022 à 17:49, Yaron Gvili a écrit : > Hi, > > I got the errors below from `archery lint --cpplint --clang-format > --clang-tidy` and I'm wondering how to figure them out. > > ninja: error: unknown target 'check-format' > ninja: error: unknown target 'check-clang-tid

archery-lint unknown targets

2022-10-21 Thread Yaron Gvili
Hi, I got the errors below from `archery lint --cpplint --clang-format --clang-tidy` and I'm wondering how to figure them out. ninja: error: unknown target 'check-format' ninja: error: unknown target 'check-clang-tidy' Note that I am working on a modified local clone of a fork of Arrow, so

Re: [Acero] Error handling in ExecNode

2022-10-18 Thread Yaron Gvili
Hi Li, One way I've seen (which hopefully is the right way) is invoking `ExecNode::ErrorIfNotOk(Status)`. If `WriteRecordBatch` returns a `Status` then just pass it; if it returns a `Result` then you can pass its `.status()`. Yaron. From: Li Jin Sent:

Re: Register custom ExecNode factories

2022-09-28 Thread Yaron Gvili
I agree with Weston about dynamically loading a shared object with initialization code for registering node factories. For custom node factories, I think this loading would best be done from a separate Python module, different than "_exec_plan.pyx", that the user would need to import for

Re: unclear compilation errors with util::optional

2022-09-22 Thread Yaron Gvili
and therefore removed compatibility backports such as arrow::util::optional. Now you should just use std::optional. So be sure to rebase your work on master and fix any reference to those compatibility backports in your code. Regards Antoine. Le 22/09/2022 à 10:26, Yaron Gvili a écrit : > Hi, > >

unclear compilation errors with util::optional

2022-09-22 Thread Yaron Gvili
Hi, In a PR I'm working on [1], I get compilation errors in CI jobs that I don't see the reason for. I'd appreciate help with this. For example, one job's [2] compilation complains about the util::optional symbol not being declared (this happens in other jobs too). This is unclear for a

Re: apparently misleading test assertion printout

2022-09-19 Thread Yaron Gvili
Surely not right away, but we'll see :) Yaron. From: Antoine Pitrou Sent: Monday, September 19, 2022 4:09 AM To: dev@arrow.apache.org Subject: Re: apparently misleading test assertion printout Le 19/09/2022 à 10:05, Yaron Gvili a écrit : > Hi Antoine, >

Re: apparently misleading test assertion printout

2022-09-19 Thread Yaron Gvili
to print out a value. Guidance to fix this at: https://github.com/google/googletest/blob/main/docs/advanced.md#teaching-googletest-how-to-print-your-values Regards Antoine. Le 19/09/2022 à 09:54, Yaron Gvili a écrit : > Hi, > > In my local code, I observed a test assertion printout t

apparently misleading test assertion printout

2022-09-19 Thread Yaron Gvili
Hi, In my local code, I observed a test assertion printout that seems misleading. The printout looks like this: Expected equality of these values: expected_empty_segment Which is: 24-byte object <00-00 00-00 00-00 00-00 02-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00> empty_segment

Re: Integration between Flight and Acero

2022-09-13 Thread Yaron Gvili
cing seems quite a bit of change (also I don't think SourceNode is exposed via public header). But let me know if you think I am missing something. Li On Tue, Sep 6, 2022 at 4:57 AM Yaron Gvili wrote: > Hi Li, > > Here's my 2 cents about the Ibis/Substrait part of this. > > An Ibis ex

Re: PyArrow build problem

2022-09-13 Thread Yaron Gvili
and share the URL? Or you can open a Jira issue and continue this because we can attach files on Jira. Thanks, -- kou In "Re: PyArrow build problem" on Mon, 12 Sep 2022 19:36:51 +0000, Yaron Gvili wrote: > Hi Kou, > > I'm attaching the cmake log files for the sam

Re: PyArrow build problem

2022-09-12 Thread Yaron Gvili
2022 13:57:53 +, Yaron Gvili wrote: > Hi All, > > I got the error below while running "cd python/ && python setup.py build_ext > --inplace", after successfully building and installing a recent master > version (a63e60bad89b41266d155bc496eb383765702492)

Re: PyArrow build problem

2022-09-11 Thread Yaron Gvili
couldn't get it to build nor find instructions for doing so, and I suspect the build problem I described earlier here would get fixed with this build. Could anyone explain how the PyArrow build works now? Cheers, Yaron. From: Yaron Gvili Sent: Sunday, September 11

PyArrow build problem

2022-09-11 Thread Yaron Gvili
Hi All, I got the error below while running "cd python/ && python setup.py build_ext --inplace", after successfully building and installing a recent master version (a63e60bad89b41266d155bc496eb383765702492) of Arrow C++ under a pyarrow-dev Conda environment, as in the Python dev doc

Re: design for ordered aggregation

2022-09-07 Thread Yaron Gvili
simplification / cleaning pass. Probably only benchmarks will really say. On Tue, Sep 6, 2022 at 10:26 AM Yaron Gvili wrote: > > Hi All, > > I'm working on a design for ordered aggregations in Arrow C++ and would like > to get some opinions about it. Ordered aggregation is sim

design for ordered aggregation

2022-09-06 Thread Yaron Gvili
Hi All, I'm working on a design for ordered aggregations in Arrow C++ and would like to get some opinions about it. Ordered aggregation is similar to grouped aggregation except that one column in the grouping key is (known to be) ordered. The result of both types of aggregations is the same
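To make the distinction concrete, here is a minimal sketch in plain Python (not Arrow code). It assumes a single-column key and a sum aggregate, which is a simplification of the multi-column key described above; the function names are hypothetical. The grouped version keeps an accumulator for every key until the input ends, while the ordered version relies on the key being sorted, so each group is a contiguous run that can be emitted as soon as the key changes.

```python
from itertools import groupby
from operator import itemgetter


def grouped_sum(rows):
    """Hash-style grouped aggregation: one accumulator per key, held until the end."""
    totals = {}
    for key, value in rows:
        totals[key] = totals.get(key, 0) + value
    return sorted(totals.items())


def ordered_sum(rows):
    """Ordered aggregation: assumes rows arrive sorted by key, so each group
    is a contiguous run and its result is emitted when the key changes."""
    for key, run in groupby(rows, key=itemgetter(0)):
        yield key, sum(value for _, value in run)


rows = [(1, 10), (1, 5), (2, 7), (3, 1), (3, 2)]  # already sorted by key
print(list(ordered_sum(rows)))
print(grouped_sum(rows))
```

Both calls produce the same pairs; the difference is that `ordered_sum` only ever holds the current group in memory, which is only valid when the input really is sorted by the key.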

Re: Integration between Flight and Acero

2022-09-06 Thread Yaron Gvili
Hi Li, Here's my 2 cents about the Ibis/Substrait part of this. An Ibis expression carries a schema. If you're planning to create an integrated Ibis/Substrait/Arrow solution, then you'll need the schema to be available to Ibis in Python. So, you'll need a Python wrapper for the C++

Re: [ANNOUNCE] New Arrow PMC member: Weston Pace

2022-09-05 Thread Yaron Gvili
Congratulations Weston! From: Raul Cumplido Dominguez Sent: Monday, September 5, 2022 10:04 AM To: dev@arrow.apache.org Subject: Re: [ANNOUNCE] New Arrow PMC member: Weston Pace Congratulations Weston! On Mon, Sep 5, 2022 at 3:37 PM Niranda Perera wrote: >

Re: [C++] Read Flight data source into Acero

2022-08-18 Thread Yaron Gvili
I have code in source_node.cc in a local branch adding factories for other sources in SourceNode (e.g., streams of RecordBatch, ExecBatch, or ArrayVector) which I could make a PR for, if there is interest. Yaron. From: David Li Sent: Wednesday, August 17, 2022

Re: dealing with tester timeout in a CI job

2022-08-17 Thread Yaron Gvili
of nightly (crossbow) test with a longer timeout but I don't know that we've had to resort to that yet. On Wed, Aug 17, 2022 at 3:41 AM Yaron Gvili wrote: > > It looks like the test normally takes less than a second. The gap in > running-time is not surprising because the tests I locally ad

Re: dealing with tester timeout in a CI job

2022-08-17 Thread Yaron Gvili
are. Yaron. From: Li Jin Sent: Wednesday, August 17, 2022 9:04 AM To: dev@arrow.apache.org Subject: Re: dealing with tester timeout in a CI job Yaron, how long do the asof join tests normally take? On Wed, Aug 17, 2022 at 6:13 AM Yaron Gvili wrote: > Sorry, yes

Re: dealing with tester timeout in a CI job

2022-08-17 Thread Yaron Gvili
the URL of the failed macOS related CI job? Thanks, -- kou In "dealing with tester timeout in a CI job" on Tue, 16 Aug 2022 16:34:24 +0000, Yaron Gvili wrote: > Hi, > > What are some acceptable ways to handle a timeout failure in a CI job for a > tester I implemented

dealing with tester timeout in a CI job

2022-08-16 Thread Yaron Gvili
Hi, What are some acceptable ways to handle a timeout failure in a CI job for a tester I implemented? For reference, I got such a timeout for only one MacOS related CI job, while the other CI jobs did not get such a timeout. Let's assume that I cannot (easily) make the tests run any faster. Is

Re: [C++] Clarifying the behavior of source node and executor

2022-07-25 Thread Yaron Gvili
aps we can look into adding an option to Source node to ensure "sequential".. Li On Mon, Jul 25, 2022 at 11:18 AM Yaron Gvili wrote: > I've also been using source node with a generator, but observed batches in > random order (in a 1-to-2-months old version of Arrow). So, I'd be

Re: [C++] Clarifying the behavior of source node and executor

2022-07-25 Thread Yaron Gvili
I've also been using source node with a generator, but observed batches in random order (in a 1-to-2-months old version of Arrow). So, I'd be surprised if ordering is guaranteed, and I'm also interested in how to obtain such a guarantee. Yaron. From: Li Jin

Re: [C++] Moving from -O3 to -O2 optimization level in release builds

2022-07-21 Thread Yaron Gvili
> only enable -O3 on source files selectively that can be demonstrated to > benefit from it Unfortunately, actual benefits from -O3 are application dependent. As https://www.linuxjournal.com/article/7269 explains: "Although -O3 can produce fast code, the increase in the size of the image can

Re: ExecutionContext, batch ordering clarification

2022-07-19 Thread Yaron Gvili
Hi, I also have a related question: could you recommend a way to get the batches in order when using a source node? If necessary, a way that involves changing or wrapping the source node's code is acceptable. Yaron. From: Li Jin Sent: Tuesday, July 19, 2022
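One generic way to recover order when wrapping a source is to tag each batch with a sequence index where it is produced and resequence downstream. The sketch below is plain Python, not Acero's API; the `in_order` helper and the `(index, batch)` tagging are assumptions made for illustration. It buffers batches that arrive early in a min-heap and releases them once every earlier index has been seen.

```python
import heapq


def in_order(indexed_batches):
    """Yield batches by ascending index, buffering any that arrive early.

    `indexed_batches` yields (index, batch) pairs; the indices 0, 1, 2, ...
    are assumed to appear exactly once each, in arbitrary order.
    """
    pending = []    # min-heap of (index, batch) pairs that arrived too early
    next_index = 0  # index of the next batch that may be emitted
    for index, batch in indexed_batches:
        heapq.heappush(pending, (index, batch))
        # Drain every buffered batch that is now contiguous with the output.
        while pending and pending[0][0] == next_index:
            yield heapq.heappop(pending)[1]
            next_index += 1


arrivals = [(2, "c"), (0, "a"), (1, "b"), (3, "d")]
print(list(in_order(arrivals)))
```

The cost of this approach is the buffer: in the worst case (batch 0 arrives last) every other batch is held in memory before anything is emitted.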

Re: [C++] Question about substrait dependency in C++

2022-07-18 Thread Yaron Gvili
++] Question about substrait dependency in C++ Thanks both! Let me try changing ARROW_SUBSTRAIT_URL. Should I set ARROW_SUBSTRAIT_URL just to local substrait tarball or sth else? On Mon, Jul 18, 2022 at 2:28 PM Yaron Gvili wrote: > Hi Li, > > I was just writing this. > > AFAIK, curren

Re: [C++] Question about substrait dependency in C++

2022-07-18 Thread Yaron Gvili
Hi Li, I was just writing this. AFAIK, currently the Arrow C++ build system does not take prebuilt Substrait C++ classes. The usual way is rebuilding Arrow C++ with a custom Substrait repository, which is done by setting ARROW_SUBSTRAIT_URL to a local Substrait repository. You can download

Re: cpp: Debugging 'plan destruction before finishing'

2022-07-15 Thread Yaron Gvili
I ran into similar issues where a bug in a node's code led to an error that caused difficult-to-debug hangs or crashes during execution. I think a common problem with diagnosing such issues is that error messages (within Status instances) during execution do not always get communicated. Perhaps

Re: accessing Substrait protobuf Python classes from PyArrow

2022-07-07 Thread Yaron Gvili
ly 7, 2022 5:15 AM To: dev@arrow.apache.org Subject: Re: accessing Substrait protobuf Python classes from PyArrow Hi Yaron, Le 07/07/2022 à 10:48, Yaron Gvili a écrit : > It looks like the main decision to make is whether accessing Substrait > protobuf Python classes from PyArrow is

Re: accessing Substrait protobuf Python classes from PyArrow

2022-07-07 Thread Yaron Gvili
pendency? TMK, that should be fine for a build dependency. [1] https://github.com/apache/arrow/pull/13500 On Wed, Jul 6, 2022 at 5:07 AM Yaron Gvili wrote: > > Regarding Rope that I mentioned earlier in this thread, it has an LGPL v3+ > license. Is this license acceptable for a build d

Re: accessing Substrait protobuf Python classes from PyArrow

2022-07-06 Thread Yaron Gvili
Regarding Rope that I mentioned earlier in this thread, it has an LGPL v3+ license. Is this license acceptable for a build dependency? Yaron. From: Yaron Gvili Sent: Wednesday, July 6, 2022 7:26 AM To: dev@arrow.apache.org Subject: Re: accessing Substrait

Re: accessing Substrait protobuf Python classes from PyArrow

2022-07-06 Thread Yaron Gvili
thon code > than using Cython or C++. I'm not quite certain why this requires a modification to the plan. On Tue, Jul 5, 2022 at 7:45 AM Yaron Gvili wrote: > > @Li, yes though in a new way. This came up in a data-source UDF scenario > where the implementation is a Python stream fa

Re: accessing Substrait protobuf Python classes from PyArrow

2022-07-05 Thread Yaron Gvili
? On Mon, Jul 4, 2022 at 1:24 PM Yaron Gvili wrote: > This rewriting of the package is basically what I had in mind; the `_ep` > was just to signal a private package, which cannot be enforced, of course. > Assuming this rewriting would indeed avoid conflict with any standard > protobuf pa

Re: accessing Substrait protobuf Python classes from PyArrow

2022-07-04 Thread Yaron Gvili
t looks like pyarrow currently only depends on numpy, which is pretty awesome... so I feel like we should keep it that way. Not sure what the best course of action is. Jeroen On Sun, 3 Jul 2022 at 22:55, Yaron Gvili wrote: > Thanks, the Google protobuf exposure concerns are clear. Another concer

Re: accessing Substrait protobuf Python classes from PyArrow

2022-07-03 Thread Yaron Gvili
> several months (though I have a PR open for Substrait 0.6). I guess from > that point of view distributing the right version along with pyarrow seems > nice, but the issues of Google's protobuf implementation remain. This being > an issue at all is also very much a Substrait problem, not an

Re: accessing Substrait protobuf Python classes from PyArrow

2022-07-03 Thread Yaron Gvili
sound like a reasonable approach? Yaron. From: Yaron Gvili Sent: Saturday, July 2, 2022 8:55 AM To: dev@arrow.apache.org ; Phillip Cloud Subject: Re: accessing Substrait protobuf Python classes from PyArrow I'm somewhat confused by this answer because I think resolving t

Re: accessing Substrait protobuf Python classes from PyArrow

2022-07-02 Thread Yaron Gvili
plication import both pyarrow and substrait-python independently. Perhaps @Phillip Cloud or someone from the Ibis space might have some ideas on where this might be found. -Weston On Thu, Jun 30, 2022 at 10:06 AM Yaron Gvili wrote: > > Hi, > > Is there support for accessing Su

accessing Substrait protobuf Python classes from PyArrow

2022-06-30 Thread Yaron Gvili
Hi, Is there support for accessing Substrait protobuf Python classes (such as Plan) from PyArrow? If not, how should such support be added? For example, should the PyArrow build system pull in the Substrait repo as an external project and build its protobuf Python classes, in a manner similar

Re: user-defined Python-based data-sources in Arrow

2022-06-24 Thread Yaron Gvili
rds, can we call the function before the previous call finishes if we want to read the source in parallel? [1] https://github.com/apache/arrow/blob/86915807af6fe10f44bc881e57b2f425f97c56c7/cpp/src/arrow/record_batch.h#L219 On Wed, Jun 22, 2022 at 9:47 AM Yaron Gvili wrote: > > Sure, it can be f

Re: user-defined Python-based data-sources in Arrow

2022-06-22 Thread Yaron Gvili
discussion here? On Wed, Jun 22, 2022 at 11:40 AM Yaron Gvili wrote: > Hi, > > I'd like to get the community's feedback about a design proposal > (discussed below) for integrating user-defined Python-based data-sources in > Arrow. This is part of a larger project I'm working on t

user-defined Python-based data-sources in Arrow

2022-06-22 Thread Yaron Gvili
Hi, I'd like to get the community's feedback about a design proposal (discussed below) for integrating user-defined Python-based data-sources in Arrow. This is part of a larger project I'm working on to provide end-to-end (Ibis/Ibis-Substrait/Arrow) support for such data-sources. A

problem building Arrow under pyarrow-dev in debug-mode

2022-06-07 Thread Yaron Gvili
Hi, I tried following the instructions in the Python development page and ran into a problem building Arrow under pyarrow-dev in debug-mode. What am I doing wrong? For the release-mode, which does build and run OK, I use the following commands: $ cmake -GNinja -DCMAKE_INSTALL_PREFIX=$ARROW_HOME

Re: arithmetic manipulation of PyArrow numeric arrays

2022-06-07 Thread Yaron Gvili
at 2:54 PM Yaron Gvili wrote: > Hi, > > This is likely a question (or two) with a simple answer that I couldn't > easily find. While working with PyArrow UDFs, I tried implementing a simple > UDF (see first function below) and noticed that it failed upon receiving a > pyarrow.lib

arithmetic manipulation of PyArrow numeric arrays

2022-06-06 Thread Yaron Gvili
Hi, This is likely a question (or two) with a simple answer that I couldn't easily find. While working with PyArrow UDFs, I tried implementing a simple UDF (see first function below) and noticed that it failed upon receiving a pyarrow.lib.DoubleArray which cannot be directly manipulated with

Re: data-source UDFs

2022-06-06 Thread Yaron Gvili
for users to follow than just registering a type with a defined interface. 4. Is there a particular reason in your use case for using the function registry for this? 5. Do you imagine these UDFs would always be specific to particular users? Or would it be possible for such a UDF to be shared as

Re: data-source UDFs

2022-06-04 Thread Yaron Gvili
Thanks for the detailed overview, Weston. I agree with David this would be very useful to have in a public doc. Weston and David's discussion is a good one, however, I see it as separate from the discussion I brought up. The former is about facilities (like extension points) for implementing

data-source UDFs

2022-06-03 Thread Yaron Gvili
Hi, I'm working on support for data-source UDFs and would like to get feedback about the design I have in mind for it. By support for data-source UDFs, at a basic level, I mean enabling a user to define using PyArrow APIs a record-batch-generating function implemented in Python that would be

Re: design for Python UDFs in an Ibis/Substrait/Arrow workflow

2022-05-25 Thread Yaron Gvili
tion/expression in Arrow compute and > execute > > the UDF (either using the approach in the current Scalar UDF prototype or > > do sth else) > > (Same as the Yaron layout above). > > > > Now I think we have reasonable solutions are (1) and (2) (at least for > PoC >

design for Python UDFs in an Ibis/Substrait/Arrow workflow

2022-05-15 Thread Yaron Gvili
Hi, I'm working on a Python UDFs PoC and would like to get the community's feedback on its design. The goal of this PoC is to enable a user to integrate Python UDFs in an Ibis/Substrait/Arrow workflow. The basic idea is that the user would create an Ibis expression that includes Python UDFs

Re: PyArrow builds but fails to load pyarrow._dataset

2022-05-10 Thread Yaron Gvili
t Yaron. From: Yaron Gvili Sent: Tuesday, May 10, 2022 1:24 PM To: dev@arrow.apache.org Subject: Re: PyArrow builds but fails to load pyarrow._dataset > Does `import pyarrow` work? Yes. Also, all but one unit te

Re: PyArrow builds but fails to load pyarrow._dataset

2022-05-10 Thread Yaron Gvili
>> export PYARROW_WITH_DATASET=1 >> >> On Tue, May 10, 2022 at 7:07 AM Yaron Gvili wrote: >>> >>> Hello, >>> >>> I ran into a problem with running PyArrow that I locally built. The build >>> worked fine (or so it seems) but then

PyArrow builds but fails to load pyarrow._dataset

2022-05-10 Thread Yaron Gvili
Hello, I ran into a problem with running PyArrow that I locally built. The build worked fine (or so it seems) but then the testing procedure had a failure due to not being able to load pyarrow._dataset, which I manually confirmed. I'd appreciate any guidance on how to fix this error. Below

Re: ExecBatch in arrow execution engine

2022-05-09 Thread Yaron Gvili
Hi Yue, From my limited experience with the execution engine, my understanding is that the API allows streaming only an ExecBatch from one node to another. A possible solution is to derive from ExecBatch your own class (say) RichExecBatch that carries any extra metadata you want. If in
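The shape of that suggestion can be sketched as follows. This is plain Python with a stand-in `ExecBatch` class, not Arrow's C++ type; the fields, the `metadata` dictionary, and the `consume` function are hypothetical and only illustrate a derived batch passing through code written against the base type.

```python
from dataclasses import dataclass, field


@dataclass
class ExecBatch:
    """Stand-in for Arrow's C++ ExecBatch (hypothetical, for illustration only)."""
    values: list
    length: int


@dataclass
class RichExecBatch(ExecBatch):
    """Derived batch that carries extra metadata alongside the data."""
    metadata: dict = field(default_factory=dict)


def consume(batch: ExecBatch) -> int:
    """A node-like consumer that only knows about the base ExecBatch."""
    return batch.length


rich = RichExecBatch(values=[[1, 2, 3]], length=3, metadata={"source": "sensor-a"})
print(consume(rich), rich.metadata)
```

One caveat when carrying this idea back to C++: a derived object passed or stored by value as its base type is sliced, so the extra members would survive only if the batch travels by pointer or reference.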

Re: RFC: Out of Process Python UDFs in Arrow Compute

2022-05-06 Thread Yaron Gvili
The general design seems reasonable to me. However, I think the multithreading issue warrants a (perhaps separate) discussion, in view of the risk that Arrow's multithreading model would end up being hard to interoperate with that of other libraries used to implement UDFs. Such interoperability

Re: [C++] output field names in Arrow Substrait

2022-04-20 Thread Yaron Gvili
> > see why anyone would want to match *generated* names in any > functional > >>> > way; that's a recipe for undefined behavior. > >>> > > >>> > None of Substrait's built-in things make use of column names though, > >>> > wh

Re: [C++] output field names in Arrow Substrait

2022-04-20 Thread Yaron Gvili
nput's columns whose name starts > with the given string-name or a node that operates on an input > column whose name is given as data in another input column. In both of those cases the field names are not part of the plan itself. On Tue, Apr 19, 2022 at 9:16 AM Yaron Gvili wrote: > >

Re: [C++] output field names in Arrow Substrait

2022-04-19 Thread Yaron Gvili
ity would be welcome, but I don't know if anyone working on the Arrow/Substrait integration has that goal in mind. If that is your goal I might be curious to learn more about your use cases. On Tue, Apr 19, 2022 at 6:11 AM Yaron Gvili wrote: > > Hi, > > > We ran into an issue due to th

[C++] output field names in Arrow Substrait

2022-04-19 Thread Yaron Gvili
Hi, We ran into an issue due to the fact that, for intermediate relations, Substrait neither automatically computes output field names nor allows one to explicitly name output fields [1]. This leads to trouble when one needs to refer to these output fields by name [2]. We run into this

Re: [C++] Build/Link against master / custom branch

2022-02-04 Thread Yaron Gvili
Hello, On Ubuntu, I managed to get a local project, external to Arrow, to build against a locally custom-built Arrow using the following: * Locally build Arrow and install it to a directory $ARROW_ROOT_DIR * Configure the external project build using: cmake