Delta Lake support for DataFusion

2021-06-08 Thread Daniël Heres
Hi all,

I would like to receive some feedback about adding Delta Lake support to
DataFusion (https://github.com/apache/arrow-datafusion/issues/525).
As you might know, Delta Lake is a format that adds
features like ACID transactions, statistics, and storage optimization on top
of Parquet, and it is gaining quite a bit of traction for managing data lakes.
It seems a great feature to have in DataFusion as well.

The delta-rs project provides a
native, Apache-licensed Rust implementation of Delta Lake that already
supports a large part of the format and its operations.
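
To make the current state a bit more concrete, here is a minimal sketch
using the delta-rs Python bindings (the "deltalake" package); the table
path is hypothetical and the exact API may differ between delta-rs
releases:

from deltalake import DeltaTable

# Open an existing Delta table (hypothetical path) and list its data files
dt = DeltaTable("/data/events")
print(dt.files())

# Materialize the current snapshot of the table as an Arrow table
table = dt.to_pyarrow_table()
print(table.schema)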

The first integration I would like to propose is adding read support via a
new TableProvider. There might be some work to do around dependencies as
both DataFusion and delta-rs rely on (certain versions of) Arrow and
Parquet.

Let me know if you have any further ideas or concerns.

Best regards,

Daniël Heres


Re: Representation of "null" values for non-numeric types in Arrow/Pandas interop

2021-06-08 Thread Benjamin Kietzman
As a workaround, the "fill_null" compute function can be used to replace
nulls with nans:

>>> nan = pa.scalar(np.NaN, type=pa.float64())
>>> pa.Array.from_pandas(s).fill_null(nan).to_pandas()
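
For a fully self-contained version of that workaround, the only extra
pieces needed are the imports and the example series from Li's message:

>>> import numpy as np
>>> import pandas as pd
>>> import pyarrow as pa
>>> s = pd.Series(["some_string", np.NaN])
>>> nan = pa.scalar(np.NaN, type=pa.float64())
>>> pa.Array.from_pandas(s).fill_null(nan).to_pandas()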

On Tue, Jun 8, 2021, 16:15 Joris Van den Bossche <
jorisvandenboss...@gmail.com> wrote:

> Hi Li,
>
> It's correct that arrow uses "None" for null values when converting a
> string array to numpy / pandas.
> As far as I am aware, there is currently no option to control that
> (and to make it use np.nan instead), and I am not sure there would be
> much interest in adding such an option.
>
> Now, I know this doesn't give an exact roundtrip in this case, but
> pandas does treat both np.nan and None as missing values in object
> dtype columns, so behaviour-wise this shouldn't give any difference
> and the roundtrip is still faithful on that aspect.
>
> Best,
> Joris
>
> On Tue, 8 Jun 2021 at 21:59, Li Jin  wrote:
> >
> > Hello!
> >
> > Apologies if this has been brought before. I'd like to get devs' thoughts
> > on this potential inconsistency of "what are the python objects for null
> > values" between pandas and pyarrow.
> >
> > Demonstrated with the following example:
> >
> > (1) pandas seems to use "np.NaN" to represent a missing value (with
> > pandas 1.2.4):
> >
> > In [32]: df
> > Out[32]:
> >            value
> > key
> > 1    some_strign
> >
> > In [33]: df2
> > Out[33]:
> >                 value2
> > key
> > 2    some_other_string
> >
> > In [34]: df.join(df2)
> > Out[34]:
> >            value value2
> > key
> > 1    some_strign    NaN
> >
> > (2) pyarrow seems to use "None" to represent a missing value (4.0.1)
> >
> > >>> s = pd.Series(["some_string", np.NaN])
> > >>> s
> > 0    some_string
> > 1            NaN
> > dtype: object
> >
> > >>> pa.Array.from_pandas(s).to_pandas()
> > 0    some_string
> > 1           None
> > dtype: object
> >
> >
> > I have looked around the pyarrow doc and didn't find an option to use
> > np.NaN for null values with to_pandas, so it's a bit hard to get
> > round-trip consistency.
> >
> >
> > I appreciate any thoughts on this as to how to achieve consistency here.
> >
> >
> > Thanks!
> >
> > Li
>


Re: C++ Segmentation Fault RecordBatchReader::ReadNext in CentOS only

2021-06-08 Thread Sutou Kouhei
Hi,

Could you try building Apache Arrow C++ with
-DCMAKE_BUILD_TYPE=Debug and get a backtrace again? It will
show the source location of the segmentation fault.

Thanks,
-- 
kou

In 
  "C++ Segmentation Fault RecordBatchReader::ReadNext in CentOS only" on Tue, 8 
Jun 2021 12:01:27 -0700,
  Rares Vernica  wrote:

> Hello,
> 
> We recently migrated our C++ Arrow code from 0.16 to 3.0.0. The code works
> fine on Ubuntu, but we get a segmentation fault in CentOS while reading
> Arrow Record Batch files. We can successfully read the files from Python or
> Ubuntu so the files and the writer are fine.
> 
> We use Record Batch Stream Reader/Writer to read/write data to files.
> Sometimes we use GZIP to compress the streams. The migration to 3.0.0 was
> pretty straight forward with minimal changes to the code
> https://github.com/Paradigm4/bridge/commit/03e896e84230ddb41bfef68cde5ed9b21192a0e9
> We have an extensive test suite and all is good on Ubuntu. On CentOS the
> write works OK but we get a segmentation fault during reading from C++. We
> can successfully read the files using PyArrow. Moreover, the files written
> by CentOS can be successfully read from C++ in Ubuntu.
> 
> Here is the backtrace I got from gdb when the segmentation fault occurred:
> 
> Program received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 0x7f548c7fb700 (LWP 2649)]
> 0x7f545c003340 in ?? ()
> (gdb) bt
> #0  0x7f545c003340 in ?? ()
> #1  0x7f54903377ce in arrow::ipc::ArrayLoader::GetBuffer(int,
> std::shared_ptr*) () from /lib64/libarrow.so.300
> #2  0x7f549034006c in arrow::Status
> arrow::VisitTypeInline(arrow::DataType const&,
> arrow::ipc::ArrayLoader*) () from /lib64/libarrow.so.300
> #3  0x7f5490340db4 in arrow::ipc::ArrayLoader::Load(arrow::Field
> const*, arrow::ArrayData*) () from /lib64/libarrow.so.300
> #4  0x7f5490318b5b in
> arrow::ipc::LoadRecordBatchSubset(org::apache::arrow::flatbuf::RecordBatch
> const*, std::shared_ptr const&, std::vector std::allocator > const*, arrow::ipc::DictionaryMemo const*,
> arrow::ipc::IpcReadOptions const&, arrow::ipc::MetadataVersion,
> arrow::Compression::type, arrow::io::RandomAccessFile*) () from
> /lib64/libarrow.so.300
> #5  0x7f549031952a in
> arrow::ipc::LoadRecordBatch(org::apache::arrow::flatbuf::RecordBatch
> const*, std::shared_ptr const&, std::vector std::allocator > const&, arrow::ipc::DictionaryMemo const*,
> arrow::ipc::IpcReadOptions const&, arrow::ipc::MetadataVersion,
> arrow::Compression::type, arrow::io::RandomAccessFile*) () from
> /lib64/libarrow.so.300
> #6  0x7f54903197ce in arrow::ipc::ReadRecordBatchInternal(arrow::Buffer
> const&, std::shared_ptr const&, std::vector std::allocator > const&, arrow::ipc::DictionaryMemo const*,
> arrow::ipc::IpcReadOptions const&, arrow::io::RandomAccessFile*) () from
> /lib64/libarrow.so.300
> #7  0x7f5490345d9c in
> arrow::ipc::RecordBatchStreamReaderImpl::ReadNext(std::shared_ptr*)
> () from /lib64/libarrow.so.300
> #8  0x7f549109b479 in scidb::ArrowReader::readObject
> (this=this@entry=0x7f548c7f7d80,
> name="index/0", reuse=reuse@entry=true, arrowBatch=std::shared_ptr (empty)
> 0x0) at XIndex.cpp:104
> #9  0x7f549109cb0a in scidb::XIndex::load (this=this@entry=0x7f545c003ab0,
> driver=std::shared_ptr (count 3, weak 0) 0x7f545c003e70, query=warning:
> RTTI symbol not found for class 'std::_Sp_counted_ptr_inplace std::allocator, (__gnu_cxx::_Lock_policy)2>'
> warning: RTTI symbol not found for class
> 'std::_Sp_counted_ptr_inplace,
> (__gnu_cxx::_Lock_policy)2>'
> std::shared_ptr (count 7, weak 7) 0x7f546c005330) at XIndex.cpp:286
> 
> I also tried Arrow 4.0.0. The code compiled just fine and the behavior was
> the same, with the same backtrace.
> 
> The code where the segmentation fault occurs is trying to read a GZIP
> compressed Record Batch Stream. The file is 144 bytes and has only one
> column with three int64 values.
> 
>> file 0
> 0: gzip compressed data, from Unix
> 
>> stat 0
>   File: ‘0’
>   Size: 144   Blocks: 8  IO Block: 4096   regular file
> Device: 10302h/66306d Inode: 33715444Links: 1
> Access: (0644/-rw-r--r--)  Uid: ( 1001/   scidb)   Gid: ( 1001/   scidb)
> Context: unconfined_u:object_r:user_tmp_t:s0
> Access: 2021-06-08 04:42:28.653548604 +
> Modify: 2021-06-08 04:14:14.638927052 +
> Change: 2021-06-08 04:40:50.221279208 +
>  Birth: -
> 
> In [29]: s = pyarrow.input_stream('/tmp/bridge/foo/index/0',
> compression='gzip')
> In [30]: b = pyarrow.RecordBatchStreamReader(s)
> In [31]: t = b.read_all()
> In [32]: t.columns
> Out[32]:
> [
>  [
>[
>  0,
>  5,
>  10
>]
>  ]]
> 
> I removed the GZIP compression in both the writer and the reader but the
> issue persists. So I don't think it is because of the compression.
> 
> Here is the ldd on the library file which contains the reader and writers
> that use the Arrow library. It is built on a CentOS 7 with the g++ 4.9.2
> compiler.
> 
>> ldd libbridge.so
> linux-vdso.

Re: Arrow sync call May 26 at 12:00 US/Eastern, 16:00 UTC

2021-06-08 Thread Neal Richardson
Belated notes from the call last time:

Attendees:

Nate Bauernfeind
Ian Cook
Nic Crane
James Duong
Tiffany Lam
Jorge Cardoso Leitão
Rok Mihevc
Gyan Prakash
Neal Richardson

Discussion:

- 4.0.1 patch release: vote passed, doing the post release tasks
- FlightSQL: James and Tiffany picking back up on the Java side, will
follow up on Jira and elsewhere.


On Tue, May 25, 2021 at 12:34 PM Neal Richardson <
neal.p.richard...@gmail.com> wrote:

> Hi all,
> Our biweekly call is tomorrow at https://meet.google.com/vtm-teks-phx.
> All are welcome to join. Notes will be shared with the mailing list
> afterward.
>
> Neal
>


Arrow sync call June 9 at 12:00 US/Eastern, 16:00 UTC

2021-06-08 Thread Neal Richardson
Hi all,
Our biweekly call is tomorrow at https://meet.google.com/vtm-teks-phx. All
are welcome to join. Notes will be shared with the mailing list afterward.

Neal


Re: [C++][Discuss] Switch to C++17

2021-06-08 Thread Jonathan Keane
I've been digging a bit to try and put numbers on those users that Neal
mentions. Specifically, we know that requiring C++17 will mean that R
users on windows using versions of R before 4.0.0 will not be able to
compile/install arrow. Although R version 3.6 is no longer supported
by CRAN [1], many people hang on to older versions for an extended
period of time.

We are still working on getting more solid numbers about how many
people might still be on these old versions, but here is what I have
so far:

Using RStudio's CRAN mirror logs of package installations [2] (and
with the help of Arrow datasets to process/filter these files 🎉) for
the period from 2021-05-18 [3] to today: among the installations that
report an R version, approximately 27% of the Windows package
installs are on versions before 4.0.0 (and therefore would be unable
to install arrow if we required C++17 right now).

There are a number of caveats about this data, however:
* the "that have an r version reported" is very important: only ~17%
of the installations provide an R version. It's possible (and very
likely) that the installations that don't include this information are
not distributed like those that do. This is the biggest problem with
this dataset/analysis and we're trying to see if others have better
information here.
* This is limited to one of many cran repositories. There's no
indication that folks using this repository are more likely to be
using older versions (if anything it is probably the opposite), but we
don't have that information directly.
* There isn't a way to filter out CI and other automated installations
that aren't representative of real-world use cases.
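
To give a rough idea of the filtering involved, here is a small Python
sketch over a single day of those logs (the actual analysis was done over
the full date range with Arrow datasets; the file path is hypothetical, and
the column names r_os/r_version plus the "mingw32" value for Windows are
assumptions about the cran-logs CSV format):

import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.csv as csv

# One day of install logs from cran-logs.rstudio.com (gzipped CSV)
table = csv.read_csv(pa.input_stream("2021-06-07.csv.gz", compression="gzip"))

# Keep only Windows installs, then the ones that actually report an R version
windows = table.filter(pc.equal(table["r_os"], "mingw32"))
versions = [v for v in windows["r_version"].to_pylist() if v]

# Share of reporting Windows installs still on R 3.x (i.e. before 4.0.0)
pre_4 = sum(1 for v in versions if v.startswith("3."))
print(pre_4 / max(len(versions), 1))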

If we get a more reliable dataset for this I will update these
numbers. I'm not sure what the threshold is for this impacting too
many people (or whether these numbers are above it), but I wanted to
get this information out here for us to think about. Additionally, it
might be useful to think about how quickly we cut off support for
client languages: if we release on our typical schedule (in July),
people who installed R 1.25 years ago (on Windows) would be required
to upgrade R in order to install arrow. That might be long enough, or
the benefits of C++17 may outweigh this, but as Neal mentions, the
people likely to run into this are probably not on this list.


[1] - the last release in the 3.6 line (3.6.3) was released on
2020-02-29, and was superseded by 4.0.0 on 2020-04-24
[2] - http://cran-logs.rstudio.com
[3] - this is the day that R 4.1.0 was released and 3.6.0 stopped
being supported by CRAN

-Jon

On Tue, Jun 8, 2021 at 4:39 PM Neal Richardson
 wrote:
>
> I'm guessing there hasn't been opposition on this thread because the users
> that this might affect aren't following this mailing list.
>
> I'd be interested to see which other major C++ projects out there have
> bumped their requirement to C++17, and how that experience was for
> everyone--the user community as well as the developers. Do you know of good
> examples? I just checked on CRAN today, and of the 17,694 R packages there,
> only 3 require C++17 (none of which have wide adoption) and only 20 require
> C++14.
>
> Neal
>
> On Tue, Jun 8, 2021 at 6:17 AM Antoine Pitrou  wrote:
>
> >
> > Hello,
> >
> > Note the change in the message topic :-)
> > We now have a draft PR up to switch the C++ standard level to C++17.
> > This allows very nice simplifications in the code, especially the use
> > of elegant constructs that can replace some cumbersome uses of
> > std::enable_if, SFINAE and other pain points.
> >
> > https://github.com/apache/arrow/pull/10414
> >
> > It seems we were finally able to overcome the main platform
> > compatibility (CI) hurdles, though some effort will probably be
> > necessary to squash all regressions in that area.
> >
> > I haven't seen any opposition previously in this thread, so if you are
> > really concerned by this, it would be better to speak up quickly, as
> > otherwise we may decide to move forward with the change.
> >
> > Best regards
> >
> > Antoine.
> >
> >
> > On Thu, 27 May 2021 10:03:03 +0200
> > Antoine Pitrou  wrote:
> > > Hello,
> > >
> > > It seems the only two platforms that constrained us to C++11 will not be
> > > supported anymore (those platforms are RTools 3.5 for R packages, and
> > > manylinux1 for Python packages).
> > >
> > > It would be beneficial to bump our C++ requirement to C++14.  There is
> > > an issue open listing benefits:
> > > https://issues.apache.org/jira/browse/ARROW-12816
> > >
> > > An additional benefit is that some useful third-party libraries for us
> > > may or will require C++14, including in their headers.
> > >
> > > Is anyone opposed to doing the switch?  Please speak up.
> > >
> > > Best regards
> > >
> > > Antoine.
> > >
> >
> >
> >
> >


Re: [C++][Discuss] Switch to C++17

2021-06-08 Thread Neal Richardson
I'm guessing there hasn't been opposition on this thread because the users
that this might affect aren't following this mailing list.

I'd be interested to see which other major C++ projects out there have
bumped their requirement to C++17, and how that experience was for
everyone--the user community as well as the developers. Do you know of good
examples? I just checked on CRAN today, and of the 17,694 R packages there,
only 3 require C++17 (none of which have wide adoption) and only 20 require
C++14.

Neal

On Tue, Jun 8, 2021 at 6:17 AM Antoine Pitrou  wrote:

>
> Hello,
>
> Note the change in the message topic :-)
> We now have a draft PR up to switch the C++ standard level to C++17.
> This allows very nice simplifications in the code, especially the use
> of elegant constructs that can replace some cumbersome uses of
> std::enable_if, SFINAE and other pain points.
>
> https://github.com/apache/arrow/pull/10414
>
> It seems we were finally able to overcome the main platform
> compatibility (CI) hurdles, though some effort will probably be
> necessary to squash all regressions in that area.
>
> I haven't seen any opposition previously in this thread, so if you are
> really concerned by this, it would be better to speak up quickly, as
> otherwise we may decide to move forward with the change.
>
> Best regards
>
> Antoine.
>
>
> On Thu, 27 May 2021 10:03:03 +0200
> Antoine Pitrou  wrote:
> > Hello,
> >
> > It seems the only two platforms that constrained us to C++11 will not be
> > supported anymore (those platforms are RTools 3.5 for R packages, and
> > manylinux1 for Python packages).
> >
> > It would be beneficial to bump our C++ requirement to C++14.  There is
> > an issue open listing benefits:
> > https://issues.apache.org/jira/browse/ARROW-12816
> >
> > An additional benefit is that some useful third-party libraries for us
> > may or will require C++14, including in their headers.
> >
> > Is anyone opposed to doing the switch?  Please speak up.
> >
> > Best regards
> >
> > Antoine.
> >
>
>
>
>


Re: Representation of "null" values for non-numeric types in Arrow/Pandas interop

2021-06-08 Thread Joris Van den Bossche
Hi Li,

It's correct that arrow uses "None" for null values when converting a
string array to numpy / pandas.
As far as I am aware, there is currently no option to control that
(and to make it use np.nan instead), and I am not sure there would be
much interest in adding such an option.

Now, I know this doesn't give an exact roundtrip in this case, but
pandas does treat both np.nan and None as missing values in object
dtype columns, so behaviour-wise this shouldn't give any difference
and the roundtrip is still faithful on that aspect.
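
For what it's worth, pandas indeed treats both sentinels the same way for
object dtype, which is why the round trip stays faithful in practice:

>>> import numpy as np, pandas as pd
>>> pd.Series(["some_string", None]).isna().tolist()
[False, True]
>>> pd.Series(["some_string", np.nan]).isna().tolist()
[False, True]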

Best,
Joris

On Tue, 8 Jun 2021 at 21:59, Li Jin  wrote:
>
> Hello!
>
> Apologies if this has been brought before. I'd like to get devs' thoughts
> on this potential inconsistency of "what are the python objects for null
> values" between pandas and pyarrow.
>
> Demonstrated with the following example:
>
> (1) pandas seems to use "np.NaN" to represent a missing value (with pandas
> 1.2.4):
>
> In [32]: df
> Out[32]:
>            value
> key
> 1    some_strign
>
> In [33]: df2
> Out[33]:
>                 value2
> key
> 2    some_other_string
>
> In [34]: df.join(df2)
> Out[34]:
>            value value2
> key
> 1    some_strign    NaN
>
> (2) pyarrow seems to use "None" to represent a missing value (4.0.1)
>
> >>> s = pd.Series(["some_string", np.NaN])
> >>> s
> 0    some_string
> 1            NaN
> dtype: object
>
> >>> pa.Array.from_pandas(s).to_pandas()
> 0    some_string
> 1           None
> dtype: object
>
>
> I have looked around the pyarrow doc and didn't find an option to use
> np.NaN for null values with to_pandas, so it's a bit hard to get
> round-trip consistency.
>
>
> I appreciate any thoughts on this as to how to achieve consistency here.
>
>
> Thanks!
>
> Li


Re: Representation of "null" values for non-numeric types in Arrow/Pandas interop

2021-06-08 Thread Jorge Cardoso Leitão
Semantically, a NaN is defined by the IEEE 754 floating-point standard,
while a null represents a value that is undefined, unknown, missing, etc.

An important part of what arrow solves is that it has a native
representation for null values (independent of NaNs): arrow's in-memory
model is designed from the ground up to support nulls. Other in-memory
representations sometimes use NaN or other sentinel values to represent
nulls, which can break memory alignments that are useful in compute.

In Arrow, the value of a floating point array can be "non-null" or "null".
When non-null, it can be any valid value for the corresponding type. For
floats, that means any valid floating point number, including, NaN, inf,
-0.0, 0.0, etc.
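
A quick pyarrow illustration of that distinction (the None slot is null,
while the NaN is an ordinary, non-null float value):

>>> import pyarrow as pa
>>> arr = pa.array([1.0, None, float("nan")], type=pa.float64())
>>> arr.null_count
1
>>> arr.to_pylist()
[1.0, None, nan]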

Best,
Jorge



On Tue, Jun 8, 2021 at 9:59 PM Li Jin  wrote:

> Hello!
>
> Apologies if this has been brought before. I'd like to get devs' thoughts
> on this potential inconsistency of "what are the python objects for null
> values" between pandas and pyarrow.
>
> Demonstrated with the following example:
>
> (1) pandas seems to use "np.NaN" to represent a missing value (with pandas
> 1.2.4):
>
> In [32]: df
> Out[32]:
>            value
> key
> 1    some_strign
>
> In [33]: df2
> Out[33]:
>                 value2
> key
> 2    some_other_string
>
> In [34]: df.join(df2)
> Out[34]:
>            value value2
> key
> 1    some_strign    NaN
>
> (2) pyarrow seems to use "None" to represent a missing value (4.0.1)
>
> >>> s = pd.Series(["some_string", np.NaN])
> >>> s
> 0    some_string
> 1            NaN
> dtype: object
>
> >>> pa.Array.from_pandas(s).to_pandas()
> 0    some_string
> 1           None
> dtype: object
>
>
> I have looked around the pyarrow doc and didn't find an option to use
> np.NaN for null values with to_pandas, so it's a bit hard to get
> round-trip consistency.
>
>
> I appreciate any thoughts on this as to how to achieve consistency here.
>
>
> Thanks!
>
> Li
>


Representation of "null" values for non-numeric types in Arrow/Pandas interop

2021-06-08 Thread Li Jin
Hello!

Apologies if this has been brought before. I'd like to get devs' thoughts
on this potential inconsistency of "what are the python objects for null
values" between pandas and pyarrow.

Demonstrated with the following example:

(1) pandas seems to use "np.NaN" to represent a missing value (with pandas
1.2.4):

In [32]: df
Out[32]:
           value
key
1    some_strign

In [33]: df2
Out[33]:
                value2
key
2    some_other_string

In [34]: df.join(df2)
Out[34]:
           value value2
key
1    some_strign    NaN

(2) pyarrow seems to use "None" to represent a missing value (4.0.1)

>>> s = pd.Series(["some_string", np.NaN])
>>> s
0    some_string
1            NaN
dtype: object

>>> pa.Array.from_pandas(s).to_pandas()
0    some_string
1           None
dtype: object


I have looked around the pyarrow doc and didn't find an option to use
np.NaN for null values with to_pandas, so it's a bit hard to get
round-trip consistency.


I appreciate any thoughts on this as to how to achieve consistency here.


Thanks!

Li


Re: [C++] Adopting a library for (distributed) tracing

2021-06-08 Thread David Li
I'll have to do some more digging into that and get back to you. So
far I've been using a quick-and-dirty tool that I whipped up using
Vega-Lite but that's probably not something we want to maintain. I
tried the Chrome trace viewer ("Catapult") but it's not quite built
for this kind of trace; I hear Jaeger's trace viewer can be used
standalone but needs some setup.

Though that does raise a good point: we should eventually have
documentation on this knob and how to use it.

-David

On 2021/06/08 19:21:16, Weston Pace  wrote: 
> FWIW, I tried this out yesterday since I was profiling the execution
> of the async API reader.  It worked great so +1 from me on that basis.
> I did struggle finding a good simple visualization tool.  Do you have
> any good recommendations on that front?
> 
> On Mon, Jun 7, 2021 at 10:50 AM David Li  wrote:
> >
> > Just to give an update on where this stands:
> >
> > Upstream recently released v1.0.0-RC1 and I've updated the PR[1] to
> > use it. This contains a few fixes I submitted for the platforms our
> > various CI jobs use, as well as an explicit build flag to support
> > header-only use - I think this should alleviate any concerns over it
> > adding to our build too much. I'm hopeful this means it can make it
> > into 5.0.0, at least with minimal functionality.
> >
> > For anyone interested in using OpenTelemetry with Arrow, I hope you'll
> > have a chance to look through the PR and see if there's any places
> > where adding tracing may be useful.
> >
> > I also touched base with upstream about Python/C++ interop[2] - it
> > turns out upstream has thought about this before but doesn't have the
> > resources to pursue it at the moment, as the idea is to write an
> > API-compatible binding of the C++ library for Python (and presumably
> > R, Ruby, etc.) which is more work.
> >
> > Best,
> > David
> >
> > [1]: https://github.com/apache/arrow/pull/10260
> > [2]: https://github.com/open-telemetry/community/discussions/734
> >
> > On 2021/05/06 18:23:05, David Li  wrote:
> > > I've created ARROW-12671 [1] to track this work and filed a draft PR
> > > [2]; I'd appreciate any feedback, particularly from anyone already
> > > trying to use OpenTelemetry/Tracing/Census with Arrow.
> > >
> > > For dependencies: now we use OpenTelemetry as header-only by
> > > default. I also slimmed down the build, avoiding making the build wait
> > > on OpenTelemetry. By setting a CMake flag, you can link Arrow against
> > > OpenTelemetry, which will bundle a simple JSON-to-stderr exporter that
> > > can be toggled via environment variable.
> > >
> > > For Python: the PR includes basic integration with Flight/Python. The
> > > C++ side will start a span, then propagate it to Python. Spans in
> > > Python will not propagate back to C++, and Python/C++ need to both set
> > > up their respective exporters. I plan to poke the upstream community
> > > about if there's a good solution to this kind of issue.
> > >
> > > For ABI compatibility: this will be an issue until upstream reaches
> > > 1.0. Even currently, there's an unreleased change on their main branch
> > > which will break the current PR when it's released. Hopefully, they
> > > will reach 1.0 in the Arrow 5.0 release cycle, else, we probably want
> > > to avoid shipping this until there is a 1.0. I have confirmed that
> > > linking an application which itself links OpenTelemetry to Arrow
> > > works.
> > >
> > > As for the overhead: I measured the impact on a dataset scan recording
> > > ~900 spans per iteration and there was no discernible effect on
> > > runtime compared to an uninstrumented scan (though again, this is not
> > > that many spans).
> > >
> > > Best,
> > > David
> > >
> > > [1]: https://issues.apache.org/jira/browse/ARROW-12671
> > > [2]: https://github.com/apache/arrow/pull/10260
> > >
> > > On 2021/05/01 19:53:45, "David Li"  wrote:
> > > > Thanks everyone for all the comments. Responding to a few things:
> > > >
> > > > > It seems to me it would be fairly implementation dependent -- so each
> > > > > language implementation would choose if it made sense for them and 
> > > > > then
> > > > > implement the appropriate connection to that language's open telemetry
> > > > > ecosystem.
> > > >
> > > > Agreed - I think the important thing is to agree on using OpenTelemetry 
> > > > itself so that the various Flight implementations, for instance, can 
> > > > all contribute compatible trace data. And there will be details like 
> > > > naming of keys for extra metadata we might want to attach, or trying to 
> > > > make (some) span names consistent.
> > > >
> > > > > My main question is: does integrating OpenTracing complicate our build
> > > > > procedure?  Is it header-only as long as you use the no-op tracer?  Or
> > > > > do you have to build it and link with it nonetheless?
> > > >
> > > > I need to look into this more and will follow up. I believe we can use 
> > > > it header-only. It's fairly simple to depend on (and has no re

Re: [C++] Adopting a library for (distributed) tracing

2021-06-08 Thread Weston Pace
FWIW, I tried this out yesterday since I was profiling the execution
of the async API reader.  It worked great so +1 from me on that basis.
I did struggle finding a good simple visualization tool.  Do you have
any good recommendations on that front?

On Mon, Jun 7, 2021 at 10:50 AM David Li  wrote:
>
> Just to give an update on where this stands:
>
> Upstream recently released v1.0.0-RC1 and I've updated the PR[1] to
> use it. This contains a few fixes I submitted for the platforms our
> various CI jobs use, as well as an explicit build flag to support
> header-only use - I think this should alleviate any concerns over it
> adding to our build too much. I'm hopeful this means it can make it
> into 5.0.0, at least with minimal functionality.
>
> For anyone interested in using OpenTelemetry with Arrow, I hope you'll
> have a chance to look through the PR and see if there's any places
> where adding tracing may be useful.
>
> I also touched base with upstream about Python/C++ interop[2] - it
> turns out upstream has thought about this before but doesn't have the
> resources to pursue it at the moment, as the idea is to write an
> API-compatible binding of the C++ library for Python (and presumably
> R, Ruby, etc.) which is more work.
>
> Best,
> David
>
> [1]: https://github.com/apache/arrow/pull/10260
> [2]: https://github.com/open-telemetry/community/discussions/734
>
> On 2021/05/06 18:23:05, David Li  wrote:
> > I've created ARROW-12671 [1] to track this work and filed a draft PR
> > [2]; I'd appreciate any feedback, particularly from anyone already
> > trying to use OpenTelemetry/Tracing/Census with Arrow.
> >
> > For dependencies: now we use OpenTelemetry as header-only by
> > default. I also slimmed down the build, avoiding making the build wait
> > on OpenTelemetry. By setting a CMake flag, you can link Arrow against
> > OpenTelemetry, which will bundle a simple JSON-to-stderr exporter that
> > can be toggled via environment variable.
> >
> > For Python: the PR includes basic integration with Flight/Python. The
> > C++ side will start a span, then propagate it to Python. Spans in
> > Python will not propagate back to C++, and Python/C++ need to both set
> > up their respective exporters. I plan to poke the upstream community
> > about if there's a good solution to this kind of issue.
> >
> > For ABI compatibility: this will be an issue until upstream reaches
> > 1.0. Even currently, there's an unreleased change on their main branch
> > which will break the current PR when it's released. Hopefully, they
> > will reach 1.0 in the Arrow 5.0 release cycle, else, we probably want
> > to avoid shipping this until there is a 1.0. I have confirmed that
> > linking an application which itself links OpenTelemetry to Arrow
> > works.
> >
> > As for the overhead: I measured the impact on a dataset scan recording
> > ~900 spans per iteration and there was no discernible effect on
> > runtime compared to an uninstrumented scan (though again, this is not
> > that many spans).
> >
> > Best,
> > David
> >
> > [1]: https://issues.apache.org/jira/browse/ARROW-12671
> > [2]: https://github.com/apache/arrow/pull/10260
> >
> > On 2021/05/01 19:53:45, "David Li"  wrote:
> > > Thanks everyone for all the comments. Responding to a few things:
> > >
> > > > It seems to me it would be fairly implementation dependent -- so each
> > > > language implementation would choose if it made sense for them and then
> > > > implement the appropriate connection to that language's open telemetry
> > > > ecosystem.
> > >
> > > Agreed - I think the important thing is to agree on using OpenTelemetry 
> > > itself so that the various Flight implementations, for instance, can all 
> > > contribute compatible trace data. And there will be details like naming 
> > > of keys for extra metadata we might want to attach, or trying to make 
> > > (some) span names consistent.
> > >
> > > > My main question is: does integrating OpenTracing complicate our build
> > > > procedure?  Is it header-only as long as you use the no-op tracer?  Or
> > > > do you have to build it and link with it nonetheless?
> > >
> > > I need to look into this more and will follow up. I believe we can use it 
> > > header-only. It's fairly simple to depend on (and has no required 
> > > dependencies), but it is a synchronous build step (you must build it to 
> > > have its headers available) - perhaps that could be resolved upstream or 
> > > I am configuring CMake wrongly. Right now, I've linked in OpenTelemetry 
> > > to provide a few utilities (e.g. logging data to stdout as JSON), but 
> > > that could be split out into a libarrow_tracing.so if we keep them.
> > >
> > > > Also, are there ABI issues that may complicate integration into
> > > > applications that were compiled against another version of OpenTracing?
> > >
> > > Upstream already seems to be considering ABI compatibility. However, 
> > > until they reach 1.0, of course they need not keep any promises, and that 
>

C++ Segmentation Fault RecordBatchReader::ReadNext in CentOS only

2021-06-08 Thread Rares Vernica
Hello,

We recently migrated our C++ Arrow code from 0.16 to 3.0.0. The code works
fine on Ubuntu, but we get a segmentation fault in CentOS while reading
Arrow Record Batch files. We can successfully read the files from Python or
Ubuntu so the files and the writer are fine.

We use Record Batch Stream Reader/Writer to read/write data to files.
Sometimes we use GZIP to compress the streams. The migration to 3.0.0 was
pretty straight forward with minimal changes to the code
https://github.com/Paradigm4/bridge/commit/03e896e84230ddb41bfef68cde5ed9b21192a0e9
We have an extensive test suite and all is good on Ubuntu. On CentOS the
write works OK but we get a segmentation fault during reading from C++. We
can successfully read the files using PyArrow. Moreover, the files written
by CentOS can be successfully read from C++ in Ubuntu.

Here is the backtrace I got from gdb when the segmentation fault occurred:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7f548c7fb700 (LWP 2649)]
0x7f545c003340 in ?? ()
(gdb) bt
#0  0x7f545c003340 in ?? ()
#1  0x7f54903377ce in arrow::ipc::ArrayLoader::GetBuffer(int,
std::shared_ptr*) () from /lib64/libarrow.so.300
#2  0x7f549034006c in arrow::Status
arrow::VisitTypeInline(arrow::DataType const&,
arrow::ipc::ArrayLoader*) () from /lib64/libarrow.so.300
#3  0x7f5490340db4 in arrow::ipc::ArrayLoader::Load(arrow::Field
const*, arrow::ArrayData*) () from /lib64/libarrow.so.300
#4  0x7f5490318b5b in
arrow::ipc::LoadRecordBatchSubset(org::apache::arrow::flatbuf::RecordBatch
const*, std::shared_ptr const&, std::vector > const*, arrow::ipc::DictionaryMemo const*,
arrow::ipc::IpcReadOptions const&, arrow::ipc::MetadataVersion,
arrow::Compression::type, arrow::io::RandomAccessFile*) () from
/lib64/libarrow.so.300
#5  0x7f549031952a in
arrow::ipc::LoadRecordBatch(org::apache::arrow::flatbuf::RecordBatch
const*, std::shared_ptr const&, std::vector > const&, arrow::ipc::DictionaryMemo const*,
arrow::ipc::IpcReadOptions const&, arrow::ipc::MetadataVersion,
arrow::Compression::type, arrow::io::RandomAccessFile*) () from
/lib64/libarrow.so.300
#6  0x7f54903197ce in arrow::ipc::ReadRecordBatchInternal(arrow::Buffer
const&, std::shared_ptr const&, std::vector > const&, arrow::ipc::DictionaryMemo const*,
arrow::ipc::IpcReadOptions const&, arrow::io::RandomAccessFile*) () from
/lib64/libarrow.so.300
#7  0x7f5490345d9c in
arrow::ipc::RecordBatchStreamReaderImpl::ReadNext(std::shared_ptr*)
() from /lib64/libarrow.so.300
#8  0x7f549109b479 in scidb::ArrowReader::readObject
(this=this@entry=0x7f548c7f7d80,
name="index/0", reuse=reuse@entry=true, arrowBatch=std::shared_ptr (empty)
0x0) at XIndex.cpp:104
#9  0x7f549109cb0a in scidb::XIndex::load (this=this@entry=0x7f545c003ab0,
driver=std::shared_ptr (count 3, weak 0) 0x7f545c003e70, query=warning:
RTTI symbol not found for class 'std::_Sp_counted_ptr_inplace, (__gnu_cxx::_Lock_policy)2>'
warning: RTTI symbol not found for class
'std::_Sp_counted_ptr_inplace,
(__gnu_cxx::_Lock_policy)2>'
std::shared_ptr (count 7, weak 7) 0x7f546c005330) at XIndex.cpp:286

I also tried Arrow 4.0.0. The code compiled just fine and the behavior was
the same, with the same backtrace.

The code where the segmentation fault occurs is trying to read a GZIP
compressed Record Batch Stream. The file is 144 bytes and has only one
column with three int64 values.

> file 0
0: gzip compressed data, from Unix

> stat 0
  File: ‘0’
  Size: 144   Blocks: 8  IO Block: 4096   regular file
Device: 10302h/66306d Inode: 33715444Links: 1
Access: (0644/-rw-r--r--)  Uid: ( 1001/   scidb)   Gid: ( 1001/   scidb)
Context: unconfined_u:object_r:user_tmp_t:s0
Access: 2021-06-08 04:42:28.653548604 +
Modify: 2021-06-08 04:14:14.638927052 +
Change: 2021-06-08 04:40:50.221279208 +
 Birth: -

In [29]: s = pyarrow.input_stream('/tmp/bridge/foo/index/0',
compression='gzip')
In [30]: b = pyarrow.RecordBatchStreamReader(s)
In [31]: t = b.read_all()
In [32]: t.columns
Out[32]:
[
 [
   [
 0,
 5,
 10
   ]
 ]]

I removed the GZIP compression in both the writer and the reader but the
issue persists. So I don't think it is because of the compression.

Here is the ldd on the library file which contains the reader and writers
that use the Arrow library. It is built on a CentOS 7 with the g++ 4.9.2
compiler.

> ldd libbridge.so
linux-vdso.so.1 =>  (0x7fffe4f1)
libarrow.so.300 => /lib64/libarrow.so.300 (0x7f8a38908000)
libaws-cpp-sdk-s3.so => /opt/aws/lib64/libaws-cpp-sdk-s3.so
(0x7f8a384b3000)
libm.so.6 => /lib64/libm.so.6 (0x7f8a381b1000)
librt.so.1 => /lib64/librt.so.1 (0x7f8a37fa9000)
libdl.so.2 => /lib64/libdl.so.2 (0x7f8a37da5000)
libstdc++.so.6 => /lib64/libstdc++.so.6 (0x7f8a37a9e000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x7f8a37888000)
libc.so.6 => /lib64/libc.so.6 (0x7f8a374ba000)
libcrypto.so.10 => /lib64/libcrypto.so.10 (0x7f8a37057000)
libssl.so.10 

Re: [C++][DISCUSS] Implementing interpreted (non-compiled) tests for compute functions

2021-06-08 Thread Benjamin Kietzman
I've added https://issues.apache.org/jira/browse/ARROW-13013
to track moving kernel unit tests to Python since that seems easily
doable and worthwhile.
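
For illustration only (not necessarily the shape ARROW-13013 will end up
taking), such a Python-level kernel test could be as small as a table of
cases driven through pyarrow.compute:

import pyarrow as pa
import pyarrow.compute as pc
import pytest

# Each case: kernel name, list of input columns, expected output column
CASES = [
    ("add", [[1, 2, None], [10, 20, 30]], [11, 22, None]),
    ("equal", [["a", "b", None], ["a", "c", None]], [True, False, None]),
]

@pytest.mark.parametrize("name,args,expected", CASES)
def test_kernel(name, args, expected):
    result = pc.call_function(name, [pa.array(a) for a in args])
    assert result.to_pylist() == expected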

On Sun, May 16, 2021 at 3:35 PM Wes McKinney  wrote:

> I agree there are pros and cons here (up front investment hopefully
> yielding future productivity gains). If a test harness and format
> meeting the requirements could be created without too much pain (I'm
> thinking less than a week of full time effort, including refactoring
> some existing tests), it would save developers a lot of time going
> forward. I look for example at the large number of hand-coded
> functional tests found in this file as one example to use to guide the
> effort:
>
>
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_arithmetic_test.cc
>
> On Sat, May 15, 2021 at 5:02 AM Antoine Pitrou  wrote:
> >
> >
> > I think people who think this would be beneficial should try to devise a
> > text format to represent compute test data.  As Eduardo pointed out,
> > there are various complications that need to be catered for.
> >
> > To me, it's not obvious that building the necessary infrastructure in
> > C++ to ingest that text format will be more pleasant than our current
> > way of writing tests.  As a data point, the JSON integration code in C++
> > is really annoying to maintain.
> >
> > Regards
> >
> > Antoine.
> >
> >
> > Le 15/05/2021 à 00:03, Wes McKinney a écrit :
> > > In C++, we have the "ArrayFromJSON" function which is an even simpler
> > > way of specifying input data compared with the integration tests.
> > > That's one possible starting point.
> > >
> > > The "interpreted tests" could be all specified and driven by minimal
> > > dependency Python code, as one possible way to approach things.
> > >
> > > On Fri, May 14, 2021 at 1:57 PM Jorge Cardoso Leitão
> > >  wrote:
> > >>
> > >> Hi,
> > >>
> > >> (this problem also exists in Rust, btw)
> > >>
> > >> Couldn't we use something like we do for our integration tests?
> Create a
> > >> separate binary that would allow us to call e.g.
> > >>
> > >> test-compute --method equal --json-file  --arg "column1" --arg
> > >> "column2" --expected "column3"
> > >>
> > >> (or simply pass the input via stdin)
> > >>
> > >> and then use Python to call the binary?
> > >>
> > >> The advantage I see here is that we would compile the binary with
> flags to
> > >> disable unnecessary code, use debug, etc, thereby reducing compile
> time if
> > >> the kernel needs
> > >> to be changed.
> > >>
> > >> IMO our "json integration" format is a reliable way of passing data
> across,
> > >> it is very easy to read and write, and all our implementations can
> already
> > >> read it for integration tests.
> > >>
> > >> wrt to the "cross implementation", the equality operation seems a
> natural
> > >> candidate for across implementations checks, as that one has important
> > >> implications in all our integration tests. filter, take, slice,
> boolean ops
> > >> may also be easy to agree upon. "add" and the like are a bit more
> difficult
> > >> due to how overflow should be handled (abort vs saturate vs None), but
> > >> nothing that we can't take. ^_^
> > >>
> > >> Best,
> > >> Jorge
> > >>
> > >> On Fri, May 14, 2021 at 8:25 PM David Li  wrote:
> > >>
> > >>> I think even if it's not (easily) generalizable across languages,
> it'd
> > >>> still be a win for C++ (and hopefully languages that bind to
> > >>> C++). Also, I don't think they're meant to completely replace
> > >>> language-specific tests, but rather complement them, and make it
> > >>> easier to add and maintain tests in the overwhelmingly common case.
> > >>>
> > >>> I do feel it's somewhat painful to write these kinds of tests in C++,
> > >>> largely because of the iteration time and the difficulty of repeating
> > >>> tests across various configurations. I also think this could be an
> > >>> opportunity to leverage things like Hypothesis/property-based testing
> > >>> or perhaps fuzzing to make the kernels even more robust.
> > >>>
> > >>> -David
> > >>>
> > >>> On 2021/05/14 18:09:45, Eduardo Ponce  wrote:
> >  Another aspect to keep in mind is that some tests require internal
> > >>> options
> >  to be changed before executing the compute functions (e.g., check
> > >>> overflow,
> >  allow NaN comparisons, change validity bits, etc.). Also, there are
> tests
> >  that take randomized inputs and others make use of the min/max
> values for
> >  each specific data type. Certainly, these details can be generalized
> > >>> across
> >  languages/testing frameworks but not without careful treatment.
> > 
> >  Moreover, each language implementation still needs to test
> >  language-specific or internal functions, so having a meta test
> framework
> >  will not necessarily get rid of language-specific tests.
> > 
> >  ~Eduardo
> > 
> >  On Fri, May 14, 2021 at 1:56 PM Weston Pace 
> > >>> wrote:
> > 
> > >

Re: [ANNOUNCE] New Arrow committer: Kazuaki Ishizaki

2021-06-08 Thread Kazuaki Ishizaki
Thanks all for your messages and help. I look forward to working together
with the community.

Best regards,
Kazuaki Ishizaki

Eduardo Ponce  wrote on 2021/06/09 00:03:35:

> From: Eduardo Ponce 
> To: dev@arrow.apache.org
> Date: 2021/06/09 00:04
> Subject: [EXTERNAL] Re: [ANNOUNCE] New Arrow committer: Kazuaki Ishizaki
> 
> Congratulations!!
> 
> ~Eduardo
> 
> On Mon, Jun 7, 2021 at 11:06 PM Fan Liya  wrote:
> 
> > Congratulations, Kazuaki!
> >
> > Best,
> > Liya Fan
> >
> > On Tue, Jun 8, 2021 at 7:59 AM Rok Mihevc  
wrote:
> >
> > > Congrats!
> > >
> > > On Tue, Jun 8, 2021 at 1:36 AM Micah Kornfield 

> > > wrote:
> > >
> > > > Congrats!
> > > >
> > > > On Monday, June 7, 2021, Bryan Cutler  wrote:
> > > >
> > > > > Congratulations!!
> > > > >
> > > > > On Sun, Jun 6, 2021, 7:28 PM Sutou Kouhei 
> > wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > On behalf of the Arrow PMC, I'm happy to announce that
> > > > > > Kazuaki Ishizaki has accepted an invitation to become a
> > > > > > committer on Apache Arrow. Welcome, and thank you for your
> > > > > > contributions!
> > > > > >
> > > > > >
> > > > > > Thanks,
> > > > > > --
> > > > > > kou
> > > > > >
> > > > >
> > > >
> > >
> >




Re: [ANNOUNCE] New Arrow committer: Kazuaki Ishizaki

2021-06-08 Thread Eduardo Ponce
Congratulations!!

~Eduardo

On Mon, Jun 7, 2021 at 11:06 PM Fan Liya  wrote:

> Congratulations, Kazuaki!
>
> Best,
> Liya Fan
>
> On Tue, Jun 8, 2021 at 7:59 AM Rok Mihevc  wrote:
>
> > Congrats!
> >
> > On Tue, Jun 8, 2021 at 1:36 AM Micah Kornfield 
> > wrote:
> >
> > > Congrats!
> > >
> > > On Monday, June 7, 2021, Bryan Cutler  wrote:
> > >
> > > > Congratulations!!
> > > >
> > > > On Sun, Jun 6, 2021, 7:28 PM Sutou Kouhei 
> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > On behalf of the Arrow PMC, I'm happy to announce that
> > > > > Kazuaki Ishizaki has accepted an invitation to become a
> > > > > committer on Apache Arrow. Welcome, and thank you for your
> > > > > contributions!
> > > > >
> > > > >
> > > > > Thanks,
> > > > > --
> > > > > kou
> > > > >
> > > >
> > >
> >
>


Re: Arrow Dataset API on Ceph

2021-06-08 Thread Jayjeet Chakraborty
Hi Yibo,

Thanks a lot for your interest in our work. Please refer to this [1] guide to 
deploy a complete environment on a cluster of nodes. Regarding your comment 
about a Ceph patch, the arrow object class that we implement is actually a 
plugin and does not require the Ceph source tree for building or maintaining 
it. It only requires the rados-objclass-dev package as a dependency. We provide 
the CMake option ARROW_CLS to allow an optional build of the object class plugin.
Please let us know if you have any questions or comments. We really appreciate 
you taking the time to look into our project. Thanks.

Best,
Jayjeet

[1] 
https://github.com/uccross/skyhookdm-arrow/blob/arrow-master/cpp/src/arrow/adapters/arrow-rados-cls/docs/deploy.md

On 2021/06/07 10:36:08, Yibo Cai  wrote: 
> Hi Jayjeet,
> 
> It is exciting to see a real world computational storage solution built
> upon Arrow and Ceph. Amazing work!
> 
> We are interested in this project (I'm from the Arm open source software
> team focusing on storage and big data OSS), and would like to reproduce
> your work first, then evaluate performance on the Arm platform.
> 
> I went through your Arrow PR, it looks great. IIUC, there should be a
> corresponding Ceph patch implementing the object class with Arrow.
> 
> I wonder what the best approach is to deploy a complete environment for a
> quick evaluation. Any comment is welcome. Thanks.
> 
> Yibo
> 
> On 6/2/21 3:42 AM, Jayjeet Chakraborty wrote:
> > Dear Arrow Community,
> > 
> > In our previous discussion, we planned on implementing a new Dataset API
> > like InMemoryDataset to interact with objects containing IPC data stored in
> > Ceph/RADOS. We had implemented this design and raised a
> > PR. But when we started adding
> > the dataset discovery functionality, we found ourselves reimplementing
> > filesystem abstractions and their metadata management. We closed the original
> > PR and raised a new PR where
> > we redesigned our implementation to use the Ceph filesystem as our file I/O
> > interface since it provides fast metadata support via the Ceph metadata
> > servers (MDS). We also decided to store data using one of the file formats
> > supported by Arrow. One of our driving use cases favored Parquet.
> > 
> > Since we perform the scan operation inside the storage layer using Ceph
> > Object class methods, which need to be invoked directly on objects, we
> > utilize the striping strategy information provided by CephFS to translate
> > a filename in CephFS to an object id in RADOS. To be able to have this
> > one-to-one mapping,
> > we split Parquet files in a manner similar to how Spark splits Parquet
> > files for HDFS and ensure that each fragment is backed by a single RADOS
> > object.
> > 
> > We are planning a new PR in which we extend the FileFormat interface to
> > create a RadosParquetFileFormat interface that offloads Parquet file scan
> > operations to the RADOS layer in
> > Ceph. Since we now utilize a filesystem interface, we can just use the
> > FileSystemDataset API and plug in our new format to offload scan
> > operations. We have also added Python bindings for the new APIs that we
> > implemented. In all, our patch only consists of around 3,000 LoC and
> > introduces new dependencies to Ceph’s librados and object class SDK only
> > (that can be disabled via cmake flags).
> > 
> > We have added an architecture document with our PR which describes the
> > overall architecture along with the life of a dataset scan using
> > RadosParquet. Additionally, we recently wrote up a paper describing our
> > design and implementation along with some initial benchmarks. We plan
> > to raise a PR to upstream our
> > format to apache/arrow soon and hence look forward to your comments and
> > thoughts on this new feature. Please let us know if you have any questions.
> > Thank you.
> > 
> > Best regards,
> > 
> > Jayjeet Chakraborty
> > 
> > On 2020/09/15 18:06:56, Micah Kornfield  wrote:
> >> gmock is already a dependency.  We haven't upgraded gmock/gtest in a
> > while,
> >> we might want to consider doing that (but this is orthogonal).
> >>
> >> On Tue, Sep 15, 2020 at 10:16 AM Antoine Pitrou 
> > wrote:
> >>
> >>>
> >>> Hi Ivo,
> >>>
> >>> You can open a JIRA once you've got a PR ready.  No need to do it before
> >>> you think you're ready for submission.
> >>>
> >>> AFAIK, gmock is already a dependency.
> >>>
> >>> Regards
> >>>
> >

Re: [C++][Discuss] Switch to C++17

2021-06-08 Thread Antoine Pitrou


Hello,

Note the change in the message topic :-)
We now have a draft PR up to switch the C++ standard level to C++17.
This allows very nice simplifications in the code, especially the use
of elegant constructs that can replace some cumbersome uses of
std::enable_if, SFINAE and other pain points.

https://github.com/apache/arrow/pull/10414

It seems we were finally able to overcome the main platform
compatibility (CI) hurdles, though some effort will probably be
necessary to squash all regressions in that area.

I haven't seen any opposition previously in this thread, so if you are
really concerned by this, it would be better to speak up quickly, as
otherwise we may decide to move forward with the change.

Best regards

Antoine.


On Thu, 27 May 2021 10:03:03 +0200
Antoine Pitrou  wrote:
> Hello,
> 
> It seems the only two platforms that constrained us to C++11 will not be 
> supported anymore (those platforms are RTools 3.5 for R packages, and 
> manylinux1 for Python packages).
> 
> It would be beneficial to bump our C++ requirement to C++14.  There is 
> an issue open listing benefits:
> https://issues.apache.org/jira/browse/ARROW-12816
> 
> An additional benefit is that some useful third-party libraries for us 
> may or will require C++14, including in their headers.
> 
> Is anyone opposed to doing the switch?  Please speak up.
> 
> Best regards
> 
> Antoine.
> 





Complex Number support in Arrow

2021-06-08 Thread Simon Perkins
Greetings Apache Dev Mailing List

I'm interested in adding complex number support to Arrow. The use case is
Radio Astronomy data, which is represented by complex values.

xref https://issues.apache.org/jira/browse/ARROW-638
xref https://github.com/apache/arrow/pull/10452

It's fairly easy to support Complex Numbers as a Python Extension -- see,
for example, how I've done it here using a list(float{32,64}):

https://github.com/ska-sa/dask-ms/blob/a5bd8538ea3de9fabb8fe74e89c3a75c4043f813/daskms/experimental/arrow/extension_types.py#L144-L173

The above seems to work with the standard NumPy complex memory layout
(consecutive pairs of [real, imag] values) and should work with the C++
std::complex layout. Note that C complex and C++ std::complex should also
have the same layout https://stackoverflow.com/a/10540346.
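
For reference, a minimal sketch of such an extension type in pyarrow (the
"example.complex64" name and the use of fixed-size (real, imag) pairs of
float32 are illustrative choices here, not exactly what dask-ms does):

import numpy as np
import pyarrow as pa

class Complex64Type(pa.ExtensionType):
    def __init__(self):
        # Store each complex value as a fixed-size pair of float32 (real, imag)
        super().__init__(pa.list_(pa.float32(), 2), "example.complex64")

    def __arrow_ext_serialize__(self):
        return b""  # no parameters to persist

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        return Complex64Type()

pa.register_extension_type(Complex64Type())

# Reinterpret a NumPy complex64 array as interleaved float32 (real, imag) pairs
values = np.array([1 + 2j, 3 - 4j], dtype=np.complex64)
storage = pa.FixedSizeListArray.from_arrays(pa.array(values.view(np.float32)), 2)
arr = pa.ExtensionArray.from_storage(Complex64Type(), storage)
print(arr.type, arr.to_pylist())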

However, this constrains this representation of Complex Numbers to
dask-ms only. I think that it would be better to add support for this at a
base level in Arrow, especially since this will open up the ability for
other packages to understand the Complex Number Type. For example, it would
be useful to:

   1. Have a clearly defined Pandas -> Arrow -> Parquet -> Arrow -> Pandas
   roundtrip. Currently there's no Pandas -> Arrow conversion for
   np.complex{64, 128}.
   2. Support complex number types in query engines like DataFusion and
   BlazingSQL, if only initially via selection on indexing columns.


I started up a PR in https://github.com/apache/arrow/pull/10452 adding
Complex Numbers as a first-class Arrow type, although I note that
https://issues.apache.org/jira/browse/ARROW-638?focusedCommentId=16912456&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16912456
suggests implementing this as a C++ Extension Type on a first pass. Initial
experiments suggest this is pretty doable -- I've got some test cases
running already.

I have some questions going forward:

   - Adding first class complex types seems to involve modifying
   cpp/src/arrow/ipc/feather.fbs which may change the protocol and introduce
   breaking changes. I'm not sure about this and seek advice on how invasive
   this approach is and whether it's worth pursuing.
   - list(float{32,64}) seems to work fine as an ExtensionType, but I'd
   imagine a struct([real, imag]) might offer more in terms of affordance to
   the user. I'd imagine the underlying memory layout would be the same.
   - I don't have a clear understanding of whether adding either a
   First-Class or ExtensionType involves supporting numeric operations on that
   type (e.g. Complex Exponential, Absolutes, Min or Max operations) or
   whether Arrow is merely concerned with the underlying data representation.

Thanks for considering this.
  Simon Perkins