Re: ExecBatch in arrow execution engine

2022-05-09 Thread Yue Ni
Thanks all for the suggestions. > A possible solution is to derive from ExecBatch your own class I haven't tried it yet, but that was my initial thought, and I am not sure whether there is a better, more idiomatic solution in the query engine to do this. > Does the existing filter "guarantee" mechanism w

Re: Flight/FlightSQL Optimization for Small Results?

2022-05-09 Thread Micah Kornfield
I'm sorry I haven't had the time I wanted to spend on implementation. I'd still like to get to it but cannot commit to a timeline for a little bit. If anybody would like to take on the implementation work, I'm happy to review. On Wed, Apr 27, 2022 at 9:06 AM Micah Kornfield wrote: > Yes, next st

Re: Question: What should the offsets buffer be for an empty (list, binary, string) array?

2022-05-09 Thread Micah Kornfield
I think the behavior is undefined. For an empty string array the offsets buffer generally shouldn't be referenced. On Mon, May 9, 2022 at 10:27 AM Sasha Krassovsky wrote: > Hello, > I think an empty string array will have an offsets buffer of length 1 with > the value 0. > > Sasha Krassovsky >

Re: mmap only, read data later?

2022-05-09 Thread Andrew Piskorski
On Mon, May 09, 2022 at 07:00:47PM +0200, Antoine Pitrou wrote: > Generally, the Arrow IPC file/stream formats are designed for large > data. If you have many very small files you might try to rethink how you > store your data on disk. Ah. Is this because of the overhead of mmap itself, or the

Re: [DISCUSS] "Naming" the Arrow C++ execution engine subproject?

2022-05-09 Thread Sutou Kouhei
Hi, I like this ("Arrow C++ execution engine") or "Arrow C++ Executor" rather than "ACE"/"Arrow C++ Engine"/"Arrow C++ Compute Engine". I think that we don't need an acronym because we don't have an acronym for existing "Arrow C++ Dataset" and "Arrow C++ Filesystem". And it may confuse us and use

Re: [DISCUSS] "Naming" the Arrow C++ execution engine subproject?

2022-05-09 Thread Aldrin
in that vein, I feel like you could also say that "ACE" has an "an" prefix to deflect the connotation of primacy: - An Arrow Compute Engine - An Arrow C++ Compute Engine Aldrin Montana Computer Science PhD Student UC Santa Cruz On Mon, May 9, 2022 at 2:12 PM Ian Cook wrote: > If we wish for th

Re: [DISCUSS] "Naming" the Arrow C++ execution engine subproject?

2022-05-09 Thread Ian Cook
If we wish for the "C" to stand for both "C++" and "Compute," we could just say that it stands for both, and still use the acronym "ACE", which is a nice acronym because of its unambiguous spelling and pronunciation. That sort of thing has been done before [1] [1] https://english.stackexchange.com

Re: [DISCUSS] "Naming" the Arrow C++ execution engine subproject?

2022-05-09 Thread Jacob Wujciak
AC²E? Phonetically still ACE but visually and meaningwise distinct. On Mon, May 9, 2022 at 10:06 PM Andy Grove wrote: > I also spent a bit of time thinking about this but did not come up with > anything great. I thought about Arrow C++ Compute Engine which is quite an > accurate description but

Re: [DISCUSS] "Naming" the Arrow C++ execution engine subproject?

2022-05-09 Thread Andy Grove
I also spent a bit of time thinking about this but did not come up with anything great. I thought about Arrow C++ Compute Engine which is quite an accurate description but has the awkward acronym ACCE, and then I tried to invent an L to go on the end for ACCEL which is the base of "accelerate", whi

Re: [DISC][Release] More control on Release Candidates commits

2022-05-09 Thread Antoine Pitrou
Well, in any case, the release manager should make the final call, so a label would mostly be a sophisticated way of pinging them. On 09/05/2022 at 20:45, Weston Pace wrote: How should we indicate whether a JIRA is a bugfix, which should be included in the next RC, or something else that

Re: ExecBatch in arrow execution engine

2022-05-09 Thread David Li
Also see this related discussion, which petered out: https://issues.apache.org/jira/browse/ARROW-12873 On Mon, May 9, 2022, at 15:40, Weston Pace wrote: > Any kind of "batch-level" information is a little tricky in the > execution engine because nodes are free to chop up and recombine > batches a

Re: [DISCUSS] "Naming" the Arrow C++ execution engine subproject?

2022-05-09 Thread Ian Cook
Reflecting on this discussion six weeks after Wes’s initial message: I like the “ACE” name. I have been using it to refer to the Arrow C++ execution engine in verbal conversations with contributors, and it has been a much-needed convenient monosyllabic shorthand for a part of the Arrow project that

Re: Arrow C-Data and DuckDB

2022-05-09 Thread Antoine Pitrou
On 09/05/2022 at 20:28, Tomek Drabas wrote: I am new to this board so please, let me know if any of this doesn't make sense. I am building a FlightSQL example with DuckDB backend. DuckDB already has an Arrow interface defined in duckdb.h that returns ArrowArray. However, the import is not gu

Re: ExecBatch in arrow execution engine

2022-05-09 Thread Weston Pace
Any kind of "batch-level" information is a little tricky in the execution engine because nodes are free to chop up and recombine batches as they see fit. For example, the output of a join node is going to contain data from at least two different input batches. Even nodes with a single input and s

Re: mmap only, read data later?

2022-05-09 Thread Weston Pace
> Or ways to verify precisely what is happening? Regrettably, mmap is quite difficult to monitor. With strace you can verify the mapping is being set up: strace -y R --no-save < /tmp/script.R 2>&1 | grep -i foo.arrow ... mmap(NULL, 490, PROT_READ, MAP_PRIVATE, 3... Once the mapping i

Re: Arrow C-Data and DuckDB

2022-05-09 Thread Dewey Dunnington
I would also love to see a canonical way to do this! My personal workaround has been to guard my own include with #ifndef ARROW_FLAG_DICTIONARY_ORDERED (but that's clearly a hack). On Mon, May 9, 2022 at 3:28 PM Tomek Drabas wrote: > I am new to this board so please, let me know if any of this d

Re: [DISC][Release] More control on Release Candidates commits

2022-05-09 Thread Weston Pace
How should we indicate whether a JIRA is a bugfix, which should be included in the next RC, or something else that shouldn't be included in the next RC? Right now I think this is a somewhat manual process with us dropping a note in the Github, or Zulip, or the person packaging the RC using their b

Arrow C-Data and DuckDB

2022-05-09 Thread Tomek Drabas
I am new to this board so please, let me know if any of this doesn't make sense. I am building a FlightSQL example with DuckDB backend. DuckDB already has an Arrow interface defined in duckdb.h that returns ArrowArray. However, the import is not guarded in any way, and ArrowArray is redefined in d

Re: Question: What should the offsets buffer be for an empty (list, binary, string) array?

2022-05-09 Thread Sasha Krassovsky
Hello, I think an empty string array will have an offsets buffer of length 1 with the value 0. Sasha Krassovsky > On May 9, 2022, at 05:23, Yang hao <1371656737...@gmail.com> wrote: > > For an empty (list, binary, string) array, what should the offsets buffer > be? Empty buffer or a buff
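To make the two positions in this thread concrete, here is a minimal pure-Python sketch of a string array's offsets layout. It does not use the Arrow libraries; the helper names (encode_utf8_strings, offsets_buffer_bytes, string_count) are hypothetical and exist only for illustration. It assumes little-endian 32-bit offsets and the "length + 1 offsets" convention, so a writer built this way produces a single zero for an empty array, while the reader-side helper also tolerates a completely empty offsets buffer, in line with the view that the buffer should not be referenced for an empty array.

```python
import struct


def encode_utf8_strings(values):
    """Build (offsets, data) for a list of str, using length + 1 offsets."""
    offsets = [0]  # the first offset is always 0, even with no values
    chunks = []
    for value in values:
        raw = value.encode("utf-8")
        chunks.append(raw)
        offsets.append(offsets[-1] + len(raw))
    return offsets, b"".join(chunks)


def offsets_buffer_bytes(offsets):
    """Serialize offsets as little-endian int32 (an assumption of this sketch)."""
    return struct.pack(f"<{len(offsets)}i", *offsets)


def string_count(offsets):
    """Reader side: accept either [0] or an empty offsets list for zero values."""
    return max(len(offsets) - 1, 0)


empty_offsets, empty_data = encode_utf8_strings([])
some_offsets, some_data = encode_utf8_strings(["ab", "", "c"])
```

A reader written like string_count never dereferences the offsets of an empty array, so it works whichever convention the producer followed.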

Re: mmap only, read data later?

2022-05-09 Thread Sasha Krassovsky
Hi Andrew, Unfortunately mmap is made to implement “transparent paging”, meaning that the OS takes control of when to read pages of the file to and from disk. This means that Arrow has no way of controlling when the file is actually read, and it’s possible that the OS is prefetching the who

Re: mmap only, read data later?

2022-05-09 Thread Antoine Pitrou
Hi Andrew, If the Arrow files are small, chances are the metadata (which is always being read) is as large on disk as the actual data (which is "only" mmap'ed). Also, mmap'ing works on a page granularity (a page being typically 4 kB on x86, sometimes a bit larger on other architectures), an
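The page-granularity point can be checked with a little arithmetic. The sketch below is illustrative only: pages_spanned and mapped_bytes are hypothetical helpers, and the 4 kB default is just the typical x86 page size mentioned above. The constant mmap.PAGESIZE from the Python standard library reports the page size of the machine the script runs on.

```python
import mmap


def pages_spanned(n_bytes, page_size=4096):
    """Number of whole pages needed to cover n_bytes (ceiling division)."""
    return -(-n_bytes // page_size)


def mapped_bytes(n_bytes, page_size=4096):
    """Bytes actually covered once the length is rounded up to whole pages."""
    return pages_spanned(n_bytes, page_size) * page_size


# Page size of the local machine, for comparison with the 4 kB assumption.
LOCAL_PAGE_SIZE = mmap.PAGESIZE

small_file = 25 * 1024  # a file of roughly 25 kB, as in the original question
tiny_mapping = 490      # the mapping length shown in the strace output in this thread
```

With 4 kB pages, the 25 kB file spans 7 pages (28672 bytes) and the 490-byte mapping still occupies one full page, so for very small files the rounding is a noticeable fraction of the data.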

mmap only, read data later?

2022-05-09 Thread Andrew Piskorski
Hello, I'm using R package arrow_7.0.0.tar.gz, in R 4.1.1, on Linux (Ubuntu 18.04.4 LTS). In R, I am mmap-ing many small Arrow files by calling arrow::read_feather() with as_data_frame=FALSE on each one. Compressed with lz4, each file is quite small, often only 25 kB or so, but I'll often be mmap

Re: RFC: Out of Process Python UDFs in Arrow Compute

2022-05-09 Thread Li Jin
@Vibhatha Wow appreciate the deep thoughts. One thing I'd like to clarify is that we are not trying to run a complicated program inside any kind of UDF. In fact, in our research environment we do not allow people to use MPI/Spark/Flint/Dask inside a UDF. Personally I think it is a bad idea and sho

Re: ExecBatch in arrow execution engine

2022-05-09 Thread Yaron Gvili
Hi Yue, From my limited experience with the execution engine, my understanding is that the API allows streaming only an ExecBatch from one node to another. A possible solution is to derive from ExecBatch your own class (say) RichExecBatch that carries any extra metadata you want. If in your

Re: [DISC][Release] More control on Release Candidates commits

2022-05-09 Thread Krisztián Szűcs
Hi, Thanks Raúl for bringing this up since it's an important topic! I'd like to provide more context for your proposal and share my particular problems with the release process. On Mon, May 9, 2022 at 2:33 PM Raul Cumplido wrote: > > Hi, > > I would like to propose a change in our release proces

ExecBatch in arrow execution engine

2022-05-09 Thread Yue Ni
Hi there, I would like to use the Apache Arrow execution engine for some computation. I found that `ExecBatch`, rather than `RecordBatch`, is used by the execution engine's nodes, and I wonder how I can attach some additional information such as schema/metadata to the `ExecBatch` during execution so that they ca

Re: [DISC][Release] More control on Release Candidates commits

2022-05-09 Thread Antoine Pitrou
+1 from me. I'm actually surprised that we didn't do something like that already. Adding new features from one RC to another sounds like a very bad idea. Regards Antoine. On 09/05/2022 at 14:33, Raul Cumplido wrote: Hi, I would like to propose a change in our release process. The rat

Re: [DISC] (Python) Dropping support for manylinux2010

2022-05-09 Thread Joris Van den Bossche
+1 as well Joris On Thu, 5 May 2022 at 22:29, Sutou Kouhei wrote: > +1 > > Our next major release will be in July or August. I think > that pypa will drop support for manylinux2010 officially > when release a next major version. > > Thanks, > -- > kou > > In > "[DISC] (Python) Dropping suppo

[DISC][Release] More control on Release Candidates commits

2022-05-09 Thread Raul Cumplido
Hi, I would like to propose a change in our release process. The rationale for the change is to avoid introducing new issues once a Release Candidate has already been cut by only merging specific commits to new release candidates. Currently once a new Release Candidate is required we drop the pr

Question: What should the offsets buffer be for an empty (list, binary, string) array?

2022-05-09 Thread Yang hao
For an empty (list, binary, string) array, what should the offsets buffer be? Empty buffer or a buffer containing a single zero? Or both are valid? There is some related information I found: 1. In the Apache Arrow Format: link