Re: post-release tasks (4.0.1)

2021-06-09 Thread Jorge Cardoso Leitão
I have been unable to generate the docs on either of my two machines (my MacBook and a VM on Azure), and I do not think we should delay this further. Could someone kindly create a PR with the generated docs to the website? I think that the command amounts to "dev/release/post-09-docs.sh 4.0.1".

Re: C++ Segmentation Fault RecordBatchReader::ReadNext in CentOS only

2021-06-09 Thread Sutou Kouhei
Hi,
> Then I went back to the pre-built binaries for 3.0.0 and 4.0.0 from JFrog
> and the issue reappeared. I can only infer that it has to do with the way
> the pre-built binaries are generated...
The pre-built binaries are the official RPM packages, right? They are built with the default

Re: C++ Segmentation Fault RecordBatchReader::ReadNext in CentOS only

2021-06-09 Thread Rares Vernica
I got the apache-arrow-4.0.1 source and compiled it with the Debug flag. No segmentation fault occurred. I then removed the Debug flag and still no segmentation fault. I then tried the 4.0.0 source. Still no issues. Finally, I tried the 3.0.0 source and still no issues. Then I went back to the

Re: Complex Number support in Arrow

2021-06-09 Thread Micah Kornfield
Hi Antoine,
In regard to conceptual simplicity, I might have misinterpreted when you wrote:
> Since complex numbers are quite common in some domains, and since they
> are conceptually simple,
It seemed like a justification for adding them as a first-class type.
Thanks,
Micah
On Wed, Jun 9,

Re: Complex Number support in Arrow

2021-06-09 Thread Antoine Pitrou
On 10/06/2021 at 00:05, Micah Kornfield wrote: While dedicated types are not strictly required, compute functions would be much easier to add for a first-class dedicated complex datatype rather than for an extension type. It seems like maybe this is an area to focus on? I'm not sure

Re: Complex Number support in Arrow

2021-06-09 Thread Micah Kornfield
> > While dedicated types are not strictly required, compute functions would
> > be much easier to add for a first-class dedicated complex datatype
> > rather than for an extension type.
It seems like maybe this is an area to focus on? I'm not sure conceptually simple is the right criterion to apply

Re: Complex Number support in Arrow

2021-06-09 Thread Wes McKinney
I think that having a top-level type for complex numbers would be nicer than extension types, so it would look like table Complex { precision: Precision; } and the representation is a packed tuple of two floating point numbers of the indicated precision (I think this is the standard way that
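To make the described layout concrete, here is a minimal sketch of a packed tuple of two floating-point numbers. It assumes double precision and uses a plain `std::vector<double>` as a stand-in for a values buffer; `ComplexAt` is a hypothetical helper, and none of this is Arrow API.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

// Hypothetical helper: the buffer holds interleaved (real, imaginary)
// pairs, so element i occupies slots 2*i and 2*i + 1.
inline std::complex<double> ComplexAt(const std::vector<double>& values,
                                      std::size_t i) {
  return std::complex<double>(values[2 * i], values[2 * i + 1]);
}

// std::complex<double> is itself laid out as two adjacent doubles,
// which is what makes the packed representation convenient in C++.
static_assert(sizeof(std::complex<double>) == 2 * sizeof(double),
              "complex<double> is expected to be two packed doubles");
```

With a buffer `{1.0, 2.0, 3.0, -4.0}`, `ComplexAt(values, 1)` yields the second value, 3 - 4i.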

Re: Delta Lake support for DataFusion

2021-06-09 Thread Andrew Lamb
> And probably some more I don't think of currently. I think this is useful
> work as it also would enable other "extensions" to work in a similar way
I 100% agree
On Wed, Jun 9, 2021 at 2:30 PM Daniël Heres wrote:
> Thanks all for the valuable input!
>
> I agree following the plugin / model

Re: Delta Lake support for DataFusion

2021-06-09 Thread Daniël Heres
Thanks all for the valuable input! I agree following the plugin / model makes a lot of sense for now (either in arrow-datafusion repo or somewhere external, for example in delta-rs if we're OK with it not being part of Apache right now). In order to support certain Delta Lake features including SQL

Re: Delta Lake support for DataFusion

2021-06-09 Thread Neville Dipale
The correct approach might be to improve DataFusion support in delta-rs. TableProvider is already implemented here: https://github.com/delta-io/delta-rs/blob/main/rust/src/delta_datafusion.rs I've pinged QP to ask for their advice. Neville On Wed, 9 Jun 2021 at 19:58, Andrew Lamb wrote: > I

Re: [C++][Discuss] Switch to C++17

2021-06-09 Thread Neal Richardson
Responding to Antoine's specific questions: * 429 R packages on CRAN list C++11 as a SystemRequirement. These numbers may be a slight undercount because the SystemRequirements field is not machine-read. Some packages (e.g. https://github.com/eddelbuettel/rcppsimdjson/) appear to actually require

Re: Delta Lake support for DataFusion

2021-06-09 Thread Andrew Lamb
I think the idea of DataFusion + DeltaLake is quite compelling and likely useful. However, I think DataFusion is ideally an "embeddable query engine" rather than a database system in itself, so in that mental model Delta Lake integration belongs somewhere other than the core DataFusion crate.

Re: Complex Number support in Arrow

2021-06-09 Thread Antoine Pitrou
On 09/06/2021 at 17:52, Micah Kornfield wrote: Adding a new first-class type in Arrow requires working integration tests between C++ and Java libraries (once the idea is informally agreed upon) and then a final vote for approval. We haven't formalized extension types but I imagine a

Re: [C++][Discuss] Switch to C++17

2021-06-09 Thread Benjamin Kietzman
One improvement in read/writability which might be my favorite is the removal of SFINAE-controlled template instantiation in favor of compile time branching with `if constexpr`. Here's an example of that in the draft PR:
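The example from the draft PR is not reproduced in this excerpt. As a generic illustration only (this is not Arrow code, and the names `DescribeSfinae` and `Describe` are hypothetical), the following sketch contrasts the two styles:

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// C++11 style: two overloads; std::enable_if removes the one that does
// not apply from overload resolution (SFINAE).
template <typename T>
typename std::enable_if<std::is_integral<T>::value, std::string>::type
DescribeSfinae(T) {
  return "integral";
}

template <typename T>
typename std::enable_if<!std::is_integral<T>::value, std::string>::type
DescribeSfinae(T) {
  return "other";
}

// C++17 style: a single template; the branch that is not taken is
// discarded at compile time and is not instantiated for that T.
template <typename T>
std::string Describe(T) {
  if constexpr (std::is_integral_v<T>) {
    return "integral";
  } else {
    return "other";
  }
}
```

Both versions behave identically for callers; the C++17 form keeps the decision logic in one function body instead of spreading it across overload signatures.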

Re: [C++][Discuss] Switch to C++17

2021-06-09 Thread Antoine Pitrou
On 09/06/2021 at 19:25, Eduardo Ponce wrote: Measurable metrics: * code size (source and binary) - measured in bytes [...] Qualitative metrics: * code structure/maintainability - how would it improve development? * code readability - ease of understanding details for new/current

Re: [C++][Discuss] Switch to C++17

2021-06-09 Thread Eduardo Ponce
After the discussion in today's Arrow sync call, I do think it would be beneficial to come up with a formal process for deciding when it is the "right time" to upgrade Arrow to a newer C++ standard. I suggest we could consider a set of general metrics/criteria that try to summarize the benefits and

Re: [C++] Adopting a library for (distributed) tracing

2021-06-09 Thread David Li
I just updated the PR with support for exporting to Jaeger[1], which has a built-in trace viewer.
1. Download and run the all-in-one Jaeger binary locally[2] (or their Docker image)
2. Build Arrow with `-DARROW_WITH_OPENTELEMETRY=ON -DARROW_THRIFT=ON`
3. Run your application with `env

Re: Complex Number support in Arrow

2021-06-09 Thread Micah Kornfield
Hi Simon,
Please see a recent discussion on adding new types [1]
> - Adding first class complex types seems to involve modifying
> cpp/src/arrow/ipc/feather.fbs which may change the protocol and
> introduce breaking changes. I'm not sure about this and seek advice on how
> invasive

Re: Delta Lake support for DataFusion

2021-06-09 Thread Jorge Cardoso Leitão
Hi, Some questions that come to mind: 1. If we add vendor X to datafusion, will we be open to other vendor Y? How do we compare vendors? How do we draw the line of "not sufficiently relevant"? 2. How do we ensure that we do not distort the same level playing field that some people expect from

Re: [C++][Discuss] Switch to C++17

2021-06-09 Thread Antoine Pitrou
On Tue, 8 Jun 2021 17:37:30 -0500 Jonathan Keane wrote:
> I've been digging a bit to try and put numbers on those users that Neal
> mentions. Specifically, we know that requiring C++17 will mean that R
> users on Windows using versions of R before 4.0.0 will not be able to
> compile/install arrow.

Re: [C++][Discuss] Switch to C++17

2021-06-09 Thread Antoine Pitrou
On Tue, 8 Jun 2021 14:39:27 -0700 Neal Richardson wrote:
> I'm guessing there hasn't been opposition on this thread because the users
> that this might affect aren't following this mailing list.
>
> I'd be interested to see which other major C++ projects out there have
> bumped their requirement

Re: Representation of "null" values for non-numeric types in Arrow/Pandas interop

2021-06-09 Thread Wes McKinney
To my knowledge, "None" has always been the preferred null sentinel value for object-dtype arrays in pandas, but since sometimes these arrays originate from transposes or other join/append operations that merge numeric arrays (which have NaN sentinels) into non-numeric arrays to create object

Re: Moving automated nightly build e-mails to a separate mailing list

2021-06-09 Thread Antoine Pitrou
Hello,
bui...@arrow.apache.org now also has a GMane mirror at gmane.comp.apache.arrow.builds.
Regards
Antoine.
On Sun, 23 May 2021 08:13:37 -0700 Wes McKinney wrote:
> hi folks,
>
> In an effort to increase the signal-to-noise ratio on dev@, I suggest
> that we move the [NIGHTLY] e-mails

Re: Representation of "null" values for non-numeric types in Arrow/Pandas interop

2021-06-09 Thread Joris Van den Bossche
That won't help in this specific case, since it is for an array of strings (which you can't fill with NaN), and for floating point arrays, we already use np.nan as "null" representation when converting to numpy/pandas.
On Wed, 9 Jun 2021 at 03:37, Benjamin Kietzman wrote:
> > As a workaround,

Delta Lake support for DataFusion

2021-06-09 Thread Daniël Heres
Hi all, I would like to receive some feedback about adding Delta Lake support to DataFusion (https://github.com/apache/arrow-datafusion/issues/525). As you might know, Delta Lake is a format adding features like ACID transactions, statistics, and storage optimization to