Hi all,

In the Arrow release-4.0.0 branch, the compute::is_in operation rejects duplicate values in the value_set [1]. This was not the case in Arrow 2.0.
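
To make the question concrete, here is a minimal sketch in plain Python of the lenient semantics I have in mind. It is not Arrow's implementation; the helper name `is_in` and the sample data are only illustrative. The point is that once the lookup values go into a hash set, duplicates collapse and the result is the same as for a de-duplicated input.

```python
def is_in(values, value_set):
    """Return one boolean per element of `values`: True if it occurs in `value_set`.

    Hypothetical helper for illustration only; it mirrors the idea of a
    hash-set lookup, not Arrow's actual kernel.
    """
    lookup = set(value_set)  # duplicates in value_set collapse here
    return [v in lookup for v in values]


values = [1, 2, 3, 4]
with_dupes = [2, 2, 4, 4, 4]  # a "value_set" that contains duplicates
deduped = [2, 4]              # the same lookup values, made unique

result_dupes = is_in(values, with_dupes)
result_deduped = is_in(values, deduped)
```

Both calls give the same membership mask, so under these semantics the duplicates cost only a few redundant hash insertions.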
I was wondering whether this strict restriction is required. Ultimately, a hash set would be created from the value_set values, and there is no harm in having duplicates while doing so, is there?

PS: I understand that the parameter name "value_set" indicates that the values need to be unique, but from a usability perspective this could be relaxed, IMO. pandas isin [2], for example, does not require unique values.

I would like to know your thoughts on this.

Best,

[1] https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_set_lookup.cc#L53
[2] https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isin.html

--
Niranda Perera
https://niranda.dev/
@n1r44 <https://twitter.com/N1R44>