Greetings, Apache Dev Mailing List,

I'm interested in adding complex number support to Arrow. The use case is
Radio Astronomy data, which is represented by complex values.

xref https://issues.apache.org/jira/browse/ARROW-638
xref https://github.com/apache/arrow/pull/10452

It's fairly easy to support Complex Numbers as a Python Extension -- see,
for example, how I've done it here using a list(float{32,64}):

https://github.com/ska-sa/dask-ms/blob/a5bd8538ea3de9fabb8fe74e89c3a75c4043f813/daskms/experimental/arrow/extension_types.py#L144-L173

The above seems to work with the standard NumPy complex memory layout
(consecutive pairs of [real, imag] values) and should work with the C++
std::complex layout. Note that C complex and C++ std::complex should also
have the same layout (see https://stackoverflow.com/a/10540346).
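As a quick illustration of the NumPy layout mentioned above, the sketch
below reinterprets a complex array as rows of [real, imag] floats without
copying. The helper name complex_as_pairs is made up for this example.

```python
import numpy as np


def complex_as_pairs(values):
    """View a 1-D complex array as an (n, 2) array of [real, imag] floats."""
    values = np.ascontiguousarray(values)
    # complex64 -> float32, complex128 -> float64
    float_dtype = values.real.dtype
    return values.view(float_dtype).reshape(-1, 2)


z = np.array([1 + 2j, 3 - 4j], dtype=np.complex64)
pairs = complex_as_pairs(z)
print(pairs.dtype, pairs.shape)
print(np.shares_memory(pairs, z))
```

Because the result is a view rather than a copy, the same buffer can be
handed to anything that expects interleaved float pairs.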

However, this confines the representation of Complex Numbers to dask-ms
alone. I think that it would be better to add support for this at a
base level in Arrow, especially since this will open up the ability for
other packages to understand the Complex Number Type. For example, it would
be useful to:

   1. Have a clearly defined Pandas -> Arrow -> Parquet -> Arrow -> Pandas
   roundtrip. Currently there's no Pandas -> Arrow conversion for
   np.complex{64, 128}.
   2. Support complex number types in query engines like DataFusion and
   BlazingSQL, if only initially via selection on indexing columns.


I opened a PR at https://github.com/apache/arrow/pull/10452 adding
Complex Numbers as a first-class Arrow type, although I note that
https://issues.apache.org/jira/browse/ARROW-638?focusedCommentId=16912456&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16912456
suggests implementing this as a C++ Extension Type on a first pass. Initial
experiments suggest this is pretty doable -- I've got some test cases
running already.

I have some questions going forward:

   - Adding first class complex types seems to involve modifying
   cpp/src/arrow/ipc/feather.fbs which may change the protocol and introduce
   breaking changes. I'm not sure about this and seek advice on how invasive
   this approach is and whether it's worth pursuing.
   - list(float{32,64}) seems to work fine as an ExtensionType, but I'd
   imagine a struct([real, imag]) might offer more in terms of affordance to
   the user. I'd imagine the underlying memory layout would be the same.
   - I don't have a clear understanding of whether adding either a
   First-Class type or an ExtensionType involves supporting numeric
   operations on that type (e.g. Complex Exponential, Absolute Value, Min or
   Max operations) or whether Arrow is merely concerned with the underlying
   data representation.

Thanks for considering this.
  Simon Perkins
