xingyu-long commented on PR #46566:
URL: https://github.com/apache/arrow/pull/46566#issuecomment-2915131235

   Thanks for the suggestion @AlenkaF @raulcd  I just updated the code.
   
   btw, I observed two things while I am writing tests for this matter
   
   1. it seems filter cannot apply for both side, i.e same field for both 
table/schema, this was implemented in c++ side
   
https://github.com/apache/arrow/blob/0c1896854c060afb1754f2817f5a8821efd4ebcb/cpp/src/arrow/acero/hash_join_node.cc#L449-L451
   
   is this intended behavior?  
   
   
   for example, let's assume that we have two tables which have same schema 
(`id` and `name`), and we'd like to join them by id and then filter name with 
certain pattern. so without exposing this API to python, we probably need to 
maintain a big intermediate state of temp join and then apply the filter on top 
of it. 
   
   
   but if we can apply the filter on both tables first before we joining two 
tables, it would be more efficient? that's why I'd like to confirm what's the 
expected behavior for this filter in c++ implementation. 
   
   
   
   2. I tried to exercise join with filter, I saw following surprise. (assuming 
we use filter only on one side)
   
   ```python3
   In [54]: import pandas as pd
       ...: import pyarrow as pa
       ...: df1 = pd.DataFrame({'id': [1, 2, 3],
       ...:                     'year': [2020, 2022, 2019]})
       ...: df2 = pd.DataFrame({'id': [3, 4],
       ...:                     'n_legs': [5, 100],
       ...:                     'animal': ["Brittle stars", "Centipede"]})
       ...: t1 = pa.Table.from_pandas(df1)
       ...: t2 = pa.Table.from_pandas(df2)
   
   In [55]: t1.join(t2, 'id', join_type="right outer").combine_chunks()
   Out[55]:
   pyarrow.Table
   year: int64
   id: int64
   n_legs: int64
   animal: string
   ----
   year: [[2019,null]]
   id: [[3,4]]
   n_legs: [[5,100]]
   animal: [["Brittle stars","Centipede"]]
   
   # and then we apply filter expression with intended mismatch here
   In [56]: t1.join(t2, 'id', join_type="right outer", 
filter_expression=pc.equal(pc.field("n_legs"), 200)).combine_chunks()
   Out[56]:
   pyarrow.Table
   year: int64
   id: int64
   n_legs: int64
   animal: string
   ----
   year: [[null,null]]
   id: [[3,4]]
   n_legs: [[5,100]]
   animal: [["Brittle stars","Centipede"]] 
   ```
   
   it seems we didn't return empty, instead, we return the `right outer`?  it 
seems the join type takes higher priority than filter   operation for the final 
result? 
   
   
   btw, it seems fine with inner join type.
   
   ```python3
   In [57]: t1.join(t2, 'id', join_type="inner", 
filter_expression=pc.equal(pc.field("n_legs"), 200)).combine_chunks()
   Out[57]:
   pyarrow.Table
   id: int64
   year: int64
   n_legs: int64
   animal: string
   ----
   id: []
   year: []
   n_legs: []
   animal: []
   ```
   
   this one seems like a bug to me, but I am not sure, @AlenkaF @raulcd  could 
you provide some feedback on these two questions? Thanks!
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to