Hi Athanassios, I asked to move this discussion here because we use the dev@ and user@ mailing lists for discussions (this is explained in the GitHub issue template https://github.com/apache/arrow/blob/master/.github/ISSUE_TEMPLATE.md).

In the issue you cited inconsistent behavior with dictionary_encode -- we don't consider this to be inconsistent, see this Jupyter notebook

https://gist.github.com/wesm/2e29b7724571d5251051189846bfa99c

NumPy coerces None to NaN in numpy.array when a floating-point dtype is requested. In pyarrow.array, None becomes null for all data types. However, NaN is not a null sentinel in Apache Arrow like it is in pandas, so it is treated as a valid floating point value in algorithms like dictionary_encode. Given that, if you need null and NaN to be handled equivalently in your system, you may indeed need to maintain some custom code if there isn't anything in the project that does precisely what you need.

- Wes

On Mon, Jan 27, 2020 at 8:55 AM Athanassios I. Hatzis <[email protected]> wrote:
>
> Hi, recently I have started experimenting with PyArrow for the needs of my
> TRIADB project. Kudos to Wes and his team on leading one of the best
> open-source IT projects in data engineering. Definitely a wise decision to
> continue the success story of Pandas on the right track!
>
> At this stage I am trying to make a new release of TRIADB that will handle
> metadata management and fast ingestion of data in memory for
> transformations and basic query operations.
>
> Secondary index, dictionary encoding and adjacency lists are a core part of
> TRIADB project, that is the reason I posted the issue with
> Array.dictionary_encode method (https://github.com/apache/arrow/issues/6284).
> Isn't my example and description clear? What exactly would you like me to
> elaborate on?
>
> I also noticed that there is NumPy integration and you can convert easily
> from NumPy to Arrow but the reverse direction has several limitations. For
> example I cannot create view for StringArray (NotImplementedError: NumPy
> array view is only supported for primitive types). But string() (utf8) is
> in the list of your primitive types. Any plans for supporting this type
> with NumPy soon?
>
> Kind regards
> Athanassios
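
As a rough illustration of the null versus NaN distinction described in the reply above, here is a toy sketch in plain Python. It is not Arrow's implementation and makes no pyarrow calls. The helper names (is_nan, nan_to_none, toy_dictionary_encode) are hypothetical, and collapsing every NaN into a single dictionary entry is a choice made for this sketch only. It shows one possible shape for the "custom code" mentioned above: normalizing NaN to None before encoding, so that both end up as null.

```python
import math

# Private sentinel so that every NaN maps to the same dictionary slot.
# (NaN != NaN in Python, so NaN itself is an unreliable dict key.)
_NAN_KEY = object()


def is_nan(value):
    """True only for float NaN; None and non-float values are not NaN."""
    return isinstance(value, float) and math.isnan(value)


def nan_to_none(values):
    """Hypothetical helper: return a new list with each NaN replaced by None."""
    return [None if is_nan(v) else v for v in values]


def toy_dictionary_encode(values):
    """Toy dictionary encoding (an assumption-laden sketch, not Arrow's code).

    None is treated as null: it gets index None and no dictionary entry.
    NaN is treated as an ordinary value: it gets its own dictionary entry.
    Returns (dictionary, indices).
    """
    dictionary = []
    positions = {}
    indices = []
    for v in values:
        if v is None:
            indices.append(None)  # null slot
            continue
        key = _NAN_KEY if is_nan(v) else v
        if key not in positions:
            positions[key] = len(dictionary)
            dictionary.append(v)
        indices.append(positions[key])
    return dictionary, indices


data = [1.0, float("nan"), None, 1.0, float("nan")]

# NaN kept as a value: it occupies a dictionary entry, None stays null.
dictionary, indices = toy_dictionary_encode(data)

# NaN normalized to None first: NaN and None are now handled the same way.
clean_dictionary, clean_indices = toy_dictionary_encode(nan_to_none(data))
```

The list returned by nan_to_none could then be handed to pyarrow.array, where, as noted above, None becomes null for all data types.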
