Hi,
What are the recommended ways to hash Arrow structures? What are the pros and
cons of each approach?
Looking a bit through the code, I've so far found two different hashing
approaches, which I describe below. Are there any others?
A first approach I found is using `Hashing32` and `Hashing6
Congrats, Will!
Yaron.
From: Rok Mihevc
Sent: Monday, March 13, 2023 9:48 PM
To: dev@arrow.apache.org
Subject: Re: [ANNOUNCE] New Arrow PMC member: Will Jones
Congratulations Will!
Rok
On Mon, Mar 13, 2023 at 8:37 PM Steph Hazlitt
wrote:
> Congrats Will!
>
Hi,
What testing of back-pressure exists in Acero? I'm mostly interested in testing
of back-pressure that applies to any ExecNode, but could also learn from more
specific testing. If this is not well covered, I'd look into implementing such
testing.
Cheers,
Yaron.
@Li, my understanding is that if you generate headers from the same Arrow
protobuf files using the same toolchain (including protoc) used in the Arrow
build, then you will be able to use these headers to correctly access protobuf
objects within Arrow structures. This doesn't unhide symbols in th
arrow/memory_pool_jemalloc.cc#L157
>> [2]
>> https://github.com/apache/arrow/blob/master/cpp/src/arrow/memory_pool_test.cc
>>
>> On Fri, Oct 28, 2022 at 3:10 PM Yaron Gvili wrote:
>>
>>> Hi,
>>>
>>> Is there a supported/convenient
Hi,
Is there a supported/convenient way for measuring the memory usage of Arrow
structures? For my specific use case, measuring memory usage of either a record
batch or an array would be sufficiently convenient.
Cheers,
Yaron.
toine.
Le 21/10/2022 à 17:49, Yaron Gvili a écrit :
> Hi,
>
> I got the errors below from `archery lint --cpplint --clang-format
> --clang-tidy` and I'm wondering how to figure them out.
>
> ninja: error: unknown target 'check-format'
> ninja: error: unknown tar
Hi,
I got the errors below from `archery lint --cpplint --clang-format
--clang-tidy` and I'm wondering how to figure them out.
ninja: error: unknown target 'check-format'
ninja: error: unknown target 'check-clang-tidy'
Note that I am working on a modified local clone of a fork of Arrow, so
rep
Hi Li,
One way I've seen (which hopefully is the right way) is invoking
`ExecNode::ErrorIfNotOk(Status)`. If `WriteRecordBatch` returns a `Status` then
just pass it; if it returns a `Result` then you can pass its `.status()`.
Yaron.
From: Li Jin
Sent: Tuesday,
I agree with Weston about dynamically loading a shared object with
initialization code for registering node factories. For custom node factories,
I think this loading would best be done from a separate Python module,
different than "_exec_plan.pyx", that the user would need to import for
trigge
therefore removed
compatibility backports such as arrow::util::optional. Now you should
just use std::optional.
So be sure to rebase your work on master and fix any reference to those
compatibility backports in your code.
Regards
Antoine.
Le 22/09/2022 à 10:26, Yaron Gvili a écrit :
> Hi,
>
>
Hi,
In a PR I'm working on [1], I get compilation errors in CI jobs that I don't
see the reason for. I'd appreciate help with this.
For example, one job's [2] compilation complains about the util::optional
symbol not being declared (this happens in other jobs too). This is unclear for
a couple
Surely not right away, but we'll see :)
Yaron.
From: Antoine Pitrou
Sent: Monday, September 19, 2022 4:09 AM
To: dev@arrow.apache.org
Subject: Re: apparently misleading test assertion printout
Le 19/09/2022 à 10:05, Yaron Gvili a écrit :
> Hi Ant
t know how to print out a
value. Guidance to fix this at:
https://github.com/google/googletest/blob/main/docs/advanced.md#teaching-googletest-how-to-print-your-values
Regards
Antoine.
Le 19/09/2022 à 09:54, Yaron Gvili a écrit :
> Hi,
>
> In my local code, I observed a test assertion
Hi,
In my local code, I observed a test assertion printout that seems misleading.
The printout looks like this:
Expected equality of these values:
expected_empty_segment
Which is: 24-byte object <00-00 00-00 00-00 00-00 02-00 00-00 00-00 00-00
00-00 00-00 00-00 00-00>
empty_segment
ode::StartProducing seems quite a bit of change (also I don't think
SourceNode is exposed via public header). But let me know if you think I am
missing something.
Li
On Tue, Sep 6, 2022 at 4:57 AM Yaron Gvili wrote:
> Hi Li,
>
> Here's my 2 cents about the Ibis/Substrait part of t
t and share
the URL? Or you can open a Jira issue and continue this
because we can attach files on Jira.
Thanks,
--
kou
In
"Re: PyArrow build problem" on Mon, 12 Sep 2022 19:36:51 +,
Yaron Gvili wrote:
> Hi Kou,
>
> I'm attaching the cmake log files f
un, 11 Sep 2022 13:57:53 +,
Yaron Gvili wrote:
> Hi All,
>
> I got the error below while running "cd python/ && python setup.py build_ext
> --inplace", after successfully building and installing a recent master
> version (a63e60bad89b41266d155bc496eb3837
couldn't get it to build nor find instructions for doing so, and I suspect the
build problem I described earlier here would get fixed with this build.
Could anyone explain how the PyArrow build works now?
Cheers,
Yaron.
From: Yaron Gvili
Sent: Sunday, Septemb
Hi All,
I got the error below while running "cd python/ && python setup.py build_ext
--inplace", after successfully building and installing a recent master version
(a63e60bad89b41266d155bc496eb383765702492) of Arrow C++ under a pyarrow-dev
Conda environment, as in the Python dev doc
(https://a
simplification / cleaning pass. Probably only benchmarks will really
say.
On Tue, Sep 6, 2022 at 10:26 AM Yaron Gvili wrote:
>
> Hi All,
>
> I'm working on a design for ordered aggregations in Arrow C++ and would like
> to get some opinions about it. Ordered aggregation i
Hi All,
I'm working on a design for ordered aggregations in Arrow C++ and would like to
get some opinions about it. Ordered aggregation is similar to grouped
aggregation except that one column in the grouping key is (known to be)
ordered. The result of both types of aggregations is the same but
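The distinction described in the message above can be sketched in plain Python. This is only an illustration under stated assumptions, not the proposed Arrow C++ design: the row layout `(time, symbol, value)`, the choice of `time` as the ordered key column, and the function names `grouped_sum` and `ordered_sum` are all hypothetical.

```python
from itertools import groupby


def grouped_sum(rows):
    """Grouped aggregation: keeps one accumulator per (time, symbol) key
    until the whole input has been consumed."""
    totals = {}
    for time, symbol, value in rows:
        totals[(time, symbol)] = totals.get((time, symbol), 0) + value
    return sorted(totals.items())


def ordered_sum(rows):
    """Ordered aggregation: assumes rows arrive sorted by `time`, so all
    groups for one time value can be emitted as soon as `time` changes."""
    for time, segment in groupby(rows, key=lambda row: row[0]):
        totals = {}  # accumulators for the current time segment only
        for _, symbol, value in segment:
            totals[symbol] = totals.get(symbol, 0) + value
        for symbol in sorted(totals):
            yield (time, symbol), totals[symbol]


# Hypothetical input, already ordered by the first column.
rows = [(1, "a", 10), (1, "b", 5), (1, "a", 1), (2, "a", 7), (3, "b", 1), (3, "b", 2)]
```

Both functions produce the same result for this input; the ordered variant only ever holds the accumulators of one segment, which is the property that makes ordering worth exploiting.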
Hi Li,
Here's my 2 cents about the Ibis/Substrait part of this.
An Ibis expression carries a schema. If you're planning to create an integrated
Ibis/Substrait/Arrow solution, then you'll need the schema to be available to
Ibis in Python. So, you'll need a Python wrapper for the C++ implementati
Congratulations Weston!
From: Raul Cumplido Dominguez
Sent: Monday, September 5, 2022 10:04 AM
To: dev@arrow.apache.org
Subject: Re: [ANNOUNCE] New Arrow PMC member: Weston Pace
Congratulations Weston!
On Mon, Sep 5, 2022 at 3:37 PM Niranda Perera
wrote:
> Con
I have code in source_node.cc in a local branch adding factories for other
sources in SourceNode (e.g., streams of RecordBatch, ExecBatch, or ArrayVector)
which I could make a PR for, if there is interest.
Yaron.
From: David Li
Sent: Wednesday, August 17, 2022
ially investigate some kind of
nightly (crossbow) test with a longer timeout but I don't know that
we've had to resort to that yet.
On Wed, Aug 17, 2022 at 3:41 AM Yaron Gvili wrote:
>
> It looks like the test normally takes less than a second. The gap in
> running-time is not sur
s are.
Yaron.
From: Li Jin
Sent: Wednesday, August 17, 2022 9:04 AM
To: dev@arrow.apache.org
Subject: Re: dealing with tester timeout in a CI job
Yaron, how long do the asof join tests normally take?
On Wed, Aug 17, 2022 at 6:13 AM Yaron Gvili wrote:
> Sorry, yes,
ld you show the URL of the failed macOS related CI job?
Thanks,
--
kou
In
"dealing with tester timeout in a CI job" on Tue, 16 Aug 2022 16:34:24 +,
Yaron Gvili wrote:
> Hi,
>
> What are some acceptable ways to handle a timeout failure in a CI job for a
> tester I i
Hi,
What are some acceptable ways to handle a timeout failure in a CI job for a
tester I implemented? For reference, I got such a timeout for only one MacOS
related CI job, while the other CI jobs did not get such a timeout.
Let's assume that I cannot (easily) make the tests run any faster. Is
Perhaps we
can look into adding an option to Source node to ensure "sequential"..
Li
On Mon, Jul 25, 2022 at 11:18 AM Yaron Gvili wrote:
> I've also been using source node with a generator, but observed batches in
> random order (in a 1-to-2-months old version of Arrow). So
I've also been using source node with a generator, but observed batches in
random order (in a 1-to-2-months old version of Arrow). So, I'd be surprised if
ordering is guaranteed, and I'm also interested in how to obtain such a
guarantee.
Yaron.
From: Li Jin
Se
> only enable -O3 on source files selectively that can be demonstrated to
> benefit from it
Unfortunately, actual benefits from -O3 are application dependent. As
https://www.linuxjournal.com/article/7269 explains:
"Although -O3 can produce fast code, the increase in the size of the image can
h
Hi,
I also have a related question: could you recommend a way to get the batches in
order when using a source node? If necessary, a way that involves changing or
wrapping the source node's code is acceptable.
Yaron.
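One generic way to restore order when wrapping a source, sketched in plain Python, is to tag each batch with a sequence index at the point where it is produced and buffer out-of-order arrivals downstream. This is an assumption for illustration only and not an Arrow API: the name `in_order` and the `(index, batch)` pairing are hypothetical, and the thread does not confirm that a source node exposes such an index.

```python
import heapq


def in_order(tagged_batches):
    """Yield batches in index order from (index, batch) pairs that may
    arrive out of order. Indices are assumed unique and to start at 0."""
    pending = []  # min-heap of (index, batch) pairs not yet emitted
    next_index = 0
    for index, batch in tagged_batches:
        heapq.heappush(pending, (index, batch))
        # Emit every buffered batch that is now contiguous with the output.
        while pending and pending[0][0] == next_index:
            yield heapq.heappop(pending)[1]
            next_index += 1


# Hypothetical arrivals; strings stand in for record batches.
arrivals = [(2, "c"), (0, "a"), (1, "b"), (3, "d")]
ordered = list(in_order(arrivals))
```

The buffer grows only as far as batches arrive ahead of their turn, so memory use depends on how far out of order the source delivers.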
From: Li Jin
Sent: Tuesday, July 19, 2022 10
++] Question about substrait dependency in C++
Thanks both! Let me try changing ARROW_SUBSTRAIT_URL. Should I set
ARROW_SUBSTRAIT_URL just to local substrait tarball or sth else?
On Mon, Jul 18, 2022 at 2:28 PM Yaron Gvili wrote:
> Hi Li,
>
> I was just writing this.
>
> AFAIK, curren
Hi Li,
I was just writing this.
AFAIK, currently the Arrow C++ build system does not take prebuilt Substrait
C++ classes. The usual way is rebuilding Arrow C++ with a custom Substrait
repository, which is done by setting ARROW_SUBSTRAIT_URL to a local Substrait
repository. You can download thi
I ran into similar issues where a bug in a node's code led to an error that
caused difficult-to-debug hangs or crashes during execution. I think a common
problem with diagnosing such issues is that error messages (within Status
instances) during execution do not always get communicated. Perhaps
ursday, July 7, 2022 5:15 AM
To: dev@arrow.apache.org
Subject: Re: accessing Substrait protobuf Python classes from PyArrow
Hi Yaron,
Le 07/07/2022 à 10:48, Yaron Gvili a écrit :
> It looks like the main decision to make is whether accessing Substrait
> protobuf Python classes from
v3+
> license. Is this license acceptable for a build dependency?
TMK, that should be fine for a build dependency.
[1] https://github.com/apache/arrow/pull/13500
On Wed, Jul 6, 2022 at 5:07 AM Yaron Gvili wrote:
>
> Regarding Rope that I mentioned earlier in this thread, it has an LGPL
Regarding Rope that I mentioned earlier in this thread, it has an LGPL v3+
license. Is this license acceptable for a build dependency?
Yaron.
From: Yaron Gvili
Sent: Wednesday, July 6, 2022 7:26 AM
To: dev@arrow.apache.org
Subject: Re: accessing Substrait
m using Python code
> than using Cython or C++.
I'm not quite certain why this requires a modification to the plan.
On Tue, Jul 5, 2022 at 7:45 AM Yaron Gvili wrote:
>
> @Li, yes though in a new way. This came up in a data-source UDF scenario
> where the implementation is a P
Mon, Jul 4, 2022 at 1:24 PM Yaron Gvili wrote:
> This rewriting of the package is basically what I had in mind; the `_ep`
> was just to signal a private package, which cannot be enforced, of course.
> Assuming this rewriting would indeed avoid conflict with any standard
> protobuf pa
f their system
library. It looks like pyarrow currently only depends on numpy, which is
pretty awesome... so I feel like we should keep it that way.
Not sure what the best course of action is.
Jeroen
On Sun, 3 Jul 2022 at 22:55, Yaron Gvili wrote:
> Thanks, the Google protobuf exposure concerns
> semi-regularly pushes breaking changes and Arrow currently lags behind by
> several months (though I have a PR open for Substrait 0.6). I guess from
> that point of view distributing the right version along with pyarrow seems
> nice, but the issues of Google's protobuf implementation r
und like a reasonable approach?
Yaron.
From: Yaron Gvili
Sent: Saturday, July 2, 2022 8:55 AM
To: dev@arrow.apache.org ; Phillip Cloud
Subject: Re: accessing Substrait protobuf Python classes from PyArrow
I'm somewhat confused by this answer because I think resolv
e I would rather prefer that the user application import
both pyarrow and substrait-python independently.
Perhaps @Phillip Cloud or someone from the Ibis space might have some
ideas on where this might be found.
-Weston
On Thu, Jun 30, 2022 at 10:06 AM Yaron Gvili wrote:
>
> Hi,
>
> I
Hi,
Is there support for accessing Substrait protobuf Python classes (such as Plan)
from PyArrow? If not, how should such support be added? For example, should the
PyArrow build system pull in the Substrait repo as an external project and
build its protobuf Python classes, in a manner similar t
be called reentrantly? In other words, can we
call the function before the previous call finishes if we want to read
the source in parallel?
[1]
https://github.com/apache/arrow/blob/86915807af6fe10f44bc881e57b2f425f97c56c7/cpp/src/arrow/record_batch.h#L219
On Wed, Jun 22, 2022 at 9:47 AM Yaron G
discussion here?
On Wed, Jun 22, 2022 at 11:40 AM Yaron Gvili wrote:
> Hi,
>
> I'd like to get the community's feedback about a design proposal
> (discussed below) for integrating user-defined Python-based data-sources in
> Arrow. This is part of a larger project I'm
Hi,
I'd like to get the community's feedback about a design proposal (discussed
below) for integrating user-defined Python-based data-sources in Arrow. This is
part of a larger project I'm working on to provide end-to-end
(Ibis/Ibis-Substrait/Arrow) support for such data-sources.
A user-define
Hi,
I tried following the instructions on the Python development page and ran into a
problem building Arrow under pyarrow-dev in debug-mode. What am I doing wrong?
For the release-mode, which does build and run OK, I use the following commands:
$ cmake -GNinja -DCMAKE_INSTALL_PREFIX=$ARROW_HOME -DCM
2:54 PM Yaron Gvili wrote:
> Hi,
>
> This is likely a question (or two) with a simple answer that I couldn't
> easily find. While working with PyArrow UDFs, I tried implementing a simple
> UDF (see first function below) and noticed that it failed upon receiving a
> pyarro
Hi,
This is likely a question (or two) with a simple answer that I couldn't easily
find. While working with PyArrow UDFs, I tried implementing a simple UDF (see
first function below) and noticed that it failed upon receiving a
pyarrow.lib.DoubleArray which cannot be directly manipulated with ar
I find it to be more difficult for users to follow than just
registering a type with a defined interface.
4. Is there a particular reason in your use case for using the
function registry for this?
5. Do you imagine these UDFs would always be specific to particular
users? Or would it be possible f
Thanks for the detailed overview, Weston. I agree with David this would be very
useful to have in a public doc.
Weston and David's discussion is a good one, however, I see it as separate from
the discussion I brought up. The former is about facilities (like extension
points) for implementing cu
Hi,
I'm working on support for data-source UDFs and would like to get feedback
about the design I have in mind for it.
By support for data-source UDFs, at a basic level, I mean enabling a user to
define using PyArrow APIs a record-batch-generating function implemented in
Python that would be e
(3)
> > Deserialize the Substrait relation/expression in Arrow compute and
> execute
> > the UDF (either using the approach in the current Scalar UDF prototype or
> > do sth else)
> > (Same as the Yaron layout above).
> >
> > Now I think we have reasonable solutio
Hi,
I'm working on a Python UDFs PoC and would like to get the community's feedback
on its design.
The goal of this PoC is to enable a user to integrate Python UDFs in an
Ibis/Substrait/Arrow workflow. The basic idea is that the user would create an
Ibis expression that includes Python UDFs im
t
Yaron.
From: Yaron Gvili
Sent: Tuesday, May 10, 2022 1:24 PM
To: dev@arrow.apache.org
Subject: Re: PyArrow builds but fails to load pyarrow._dataset
> Does `import pyarrow` work?
Yes. Also, all but one unit te
>>
>> export PYARROW_WITH_DATASET=1
>>
>> On Tue, May 10, 2022 at 7:07 AM Yaron Gvili wrote:
>>>
>>> Hello,
>>>
>>> I ran into a problem with running PyArrow that I locally built. The build
>>> worked fine (or so it seems)
Hello,
I ran into a problem with running PyArrow that I locally built. The build
worked fine (or so it seems) but then the testing procedure had a failure due
to not being able to load pyarrow._dataset, which I manually confirmed. I'd
appreciate any guidance on how to fix this error.
Below are
Hi Yue,
From my limited experience with the execution engine, my understanding is that
the API allows streaming only an ExecBatch from one node to another. A
possible solution is to derive from ExecBatch your own class (say)
RichExecBatch that carries any extra metadata you want. If in your
The general design seems reasonable to me. However, I think the multithreading
issue warrants a (perhaps separate) discussion, in view of the risk that
Arrow's multithreading model would end up being hard to interoperate with that
of other libraries used to implement UDFs. Such interoperability
nk the
> size
> >>> > explosion Phillip is talking about would be avoided. I *really* don't
> >>> > see why anyone would want to match *generated* names in any
> functional
> >>> > way; that's a recipe for undefined behavior.
>
de that operates on the input's columns whose name starts
> with the given string-name or a node that operates on an input
> column whose name is given as data in another input column.
In both of those cases the field names are not part of the plan itself.
On Tue, Apr 19, 2022 at 9:16
tainly a potential goal, and PRs to add that capability would be
welcome, but I don't know if anyone working on the Arrow/Substrait
integration has that goal in mind. If that is your goal I might be
curious to learn more about your use cases.
On Tue, Apr 19, 2022 at 6:11 AM Yaron Gvili wrote:
Hi,
We ran into an issue due to the fact that, for intermediate relations,
Substrait neither automatically computes output field names nor allows one to
explicitly name output fields [1]. This leads to trouble when one needs to
refer to these output fields by name [2]. We run into this trouble
Hello,
On Ubuntu, I managed to get a local project that is external to Arrow to build
against a locally custom-built Arrow using the following:
* Locally build Arrow and install it to a directory $ARROW_ROOT_DIR
* Configure the external project build using: cmake
-D