xingyu-long commented on PR #46566: URL: https://github.com/apache/arrow/pull/46566#issuecomment-2915131235
Thanks for the suggestion @AlenkaF @raulcd I just updated the code. btw, I observed two things while I am writing tests for this matter 1. it seems filter cannot apply for both side, i.e same field for both table/schema, this was implemented in c++ side https://github.com/apache/arrow/blob/0c1896854c060afb1754f2817f5a8821efd4ebcb/cpp/src/arrow/acero/hash_join_node.cc#L449-L451 is this intended behavior? for example, let's assume that we have two tables which have same schema (`id` and `name`), and we'd like to join them by id and then filter name with certain pattern. so without exposing this API to python, we probably need to maintain a big intermediate state of temp join and then apply the filter on top of it. but if we can apply the filter on both tables first before we joining two tables, it would be more efficient? that's why I'd like to confirm what's the expected behavior for this filter in c++ implementation. 2. I tried to exercise join with filter, I saw following surprise. (assuming we use filter only on one side) ```python3 In [54]: import pandas as pd ...: import pyarrow as pa ...: df1 = pd.DataFrame({'id': [1, 2, 3], ...: 'year': [2020, 2022, 2019]}) ...: df2 = pd.DataFrame({'id': [3, 4], ...: 'n_legs': [5, 100], ...: 'animal': ["Brittle stars", "Centipede"]}) ...: t1 = pa.Table.from_pandas(df1) ...: t2 = pa.Table.from_pandas(df2) In [55]: t1.join(t2, 'id', join_type="right outer").combine_chunks() Out[55]: pyarrow.Table year: int64 id: int64 n_legs: int64 animal: string ---- year: [[2019,null]] id: [[3,4]] n_legs: [[5,100]] animal: [["Brittle stars","Centipede"]] # and then we apply filter expression with intended mismatch here In [56]: t1.join(t2, 'id', join_type="right outer", filter_expression=pc.equal(pc.field("n_legs"), 200)).combine_chunks() Out[56]: pyarrow.Table year: int64 id: int64 n_legs: int64 animal: string ---- year: [[null,null]] id: [[3,4]] n_legs: [[5,100]] animal: [["Brittle stars","Centipede"]] ``` it seems we didn't return empty, instead, we return the `right outer`? it seems the join type takes higher priority than filter operation for the final result? btw, it seems fine with inner join type. ```python3 In [57]: t1.join(t2, 'id', join_type="inner", filter_expression=pc.equal(pc.field("n_legs"), 200)).combine_chunks() Out[57]: pyarrow.Table id: int64 year: int64 n_legs: int64 animal: string ---- id: [] year: [] n_legs: [] animal: [] ``` this one seems like a bug to me, but I am not sure, @AlenkaF @raulcd could you provide some feedback on these two questions? Thanks! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org