Re: compute::is_in rejects duplicates in value_set

Wes McKinney Mon, 26 Apr 2021 13:31:49 -0700

In principle I don't see an issue with having duplicates in the value set.
Could you open a Jira issue?


On Mon, Apr 26, 2021 at 3:27 PM Niranda Perera <niranda.per...@gmail.com>
wrote:

> Hi all,
>
> In the Arrow release-4.0.0 branch, the compute::is_in operation rejects
> duplicate values in the value_set [1]. This was not the case in Arrow 2.0.
>
> I was wondering whether this strict restriction is required. Ultimately,
> a hash set is created from the value_set values, and there is no harm
> in having duplicates while doing so, is there?
>
> PS: I understand that the parameter name "value_set" indicates that the
> values need to be unique, but from a usability perspective this could be
> relaxed, IMO. See, e.g., Pandas isin [2].
>
> I would like to know your thoughts on this.
>
> Best
>
> [1] https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_set_lookup.cc#L53
> [2] https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isin.html
> --
> Niranda Perera
> https://niranda.dev/
> @n1r44 <https://twitter.com/N1R44>
>
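
To illustrate the hash-set argument in the quoted message, here is a minimal
sketch in plain Python. It is not Arrow's implementation, and the helper name
is_in_sketch is hypothetical. It assumes only that the lookup builds a hash set
from the value set, in which case duplicates collapse and the result is the
same with or without them.

```python
def is_in_sketch(values, value_set):
    """Return one boolean per element of `values`.

    Hypothetical helper for illustration only; not Arrow's API.
    """
    # Building a hash set deduplicates value_set as a side effect,
    # so repeated entries cannot change the outcome of the lookup.
    lookup = set(value_set)
    return [v in lookup for v in values]


values = [1, 2, 3, 4]
with_dupes = is_in_sketch(values, [2, 2, 4, 4, 4])
without_dupes = is_in_sketch(values, [2, 4])

# Both calls mark exactly the elements 2 and 4 as present.
print(with_dupes == without_dupes)
```

Under that assumption, rejecting duplicates is a validation choice rather than
something the lookup itself needs.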
