Re: Indexing, encoding, transformations and processing with PyArrow - GitHub 6284

Wes McKinney Mon, 27 Jan 2020 08:27:18 -0800

hi Athanassios,

I asked to move this discussion here because we use the dev@ and user@
mailing lists for discussions (this is explained in the GitHub issue
template https://github.com/apache/arrow/blob/master/.github/ISSUE_TEMPLATE.md).


In the issue you cited inconsistent behavior with dictionary_encode.
We don't consider this to be inconsistent; see this Jupyter notebook:

https://gist.github.com/wesm/2e29b7724571d5251051189846bfa99c

NumPy coerces None to NaN in numpy.array when a floating-point dtype
is requested. In pyarrow.array, None becomes null for all data types.
However, NaN is not a null sentinel in Apache Arrow like it is in
pandas, so it is treated as a valid floating point value in algorithms
like dictionary_encode. Given that, if you need null and NaN to be
handled equivalently in your system, you may indeed need to maintain
some custom code if nothing in the project does precisely what you
need.

- Wes


On Mon, Jan 27, 2020 at 8:55 AM Athanassios I. Hatzis
<[email protected]> wrote:
>
> Hi, recently I have started experimenting with PyArrow for the needs
> of my TRIADB project. Kudos to Wes and his team on leading one of the
> best open-source IT projects in data engineering. Definitely a wise
> decision to continue the success story of Pandas on the right track!
>
> At this stage I am trying to make a new release of TRIADB that will
> handle metadata management and fast ingestion of data in memory for
> transformations and basic query operations.
>
> Secondary indexes, dictionary encoding and adjacency lists are a core
> part of the TRIADB project; that is the reason I posted the issue with
> the Array.dictionary_encode method
> (https://github.com/apache/arrow/issues/6284). Aren't my example and
> description clear? What exactly would you like me to elaborate on?
>
> I also noticed that there is NumPy integration and you can convert
> easily from NumPy to Arrow, but the reverse direction has several
> limitations. For example, I cannot create a view for a StringArray
> (NotImplementedError: NumPy array view is only supported for primitive
> types). But string() (utf8) is in the list of your primitive types.
> Any plans for supporting this type with NumPy soon?
>
> Kind regards
> Athanassios
>
>
