[GitHub] [arrow-datafusion] alamb opened a new issue, #6635: Support multi column IN lists like `(c1, c2) IN ((c1, c2), ,,,)`

via GitHub Sun, 11 Jun 2023 06:11:11 -0700


alamb opened a new issue, #6635:
URL: https://github.com/apache/arrow-datafusion/issues/6635


   ### Is your feature request related to a problem or challenge?
   
   A user on slack was using datafusion to query parquet files from S3: 
https://the-asf.slack.com/archives/C01QUFS30TD/p1686147411917959
   
   They reported that the following predicate got 1000x slower when it had 
100,000 distinct filter values:
   ```sql
   WHERE
   (rate.code = 'SC1' and rate.type = 'UT1' and rate.operation = 'RO1') 
   OR 
   (rate.code = 'SC2' and rate.type = 'UT2' and rate.operation = 'RO2')
   ...
   ```
   
   However, when it was rewritten as an inlist it went much faster:
   ```sql
   WHERE (rate.code || '_' || rate.type || '_' || rate.operation) IN 
('SC1_UT1_RO1', 'SC2_UT2_RO2', ...)
   ```
   
   However, this rewrite is not general (for example, if `code`, `type` or 
`operation` contain `_` characters.
   
   SQL supports this type of predicate natively with "multi-column inlists" 
that look like;
   
   ```sql
   WHERE (rate.code,  rate.type, rate.operation) IN (('SC1', 'UT1', 'RO1'), 
('SC2', 'UT2', 'RO2') ...)
   ```
   
    Substrait supports this kind of predicate too, which I take as some 
evidence it is widely used
   
   
https://substrait.io/expressions/specialized_record_expressions/#or-list-equality-expression
   
   ```
   > A specialized structure that is often used is a large list of possible 
values. In SQL, these are typically large IN lists. They can be composed from 
one or more fields. There are two common patterns, single value and multi 
value. In pseudocode they are represented as:
   ```
   
   ### Describe the solution you'd like
   
   I would like multi-column predicates to work in DataFusion. 
   
   today, they result in an "unimplemented" error:
   
   ```sql
   ❯ create table foo (x int, y varchar);
   0 rows in set. Query took 0.001 seconds.
   ❯ insert into foo values (1, 'a'), (2, 'b'), (3, 'c');
   +-------+
   | count |
   +-------+
   | 3     |
   +-------+
   1 row in set. Query took 0.002 seconds.
   ❯ select * from foo where (x, y) IN ((1,'a'), (2, 'b'), (5, 'e'));
   This feature is not implemented: Unsupported ast node in sqltorel: 
Tuple([Value(Number("1", false)), Value(SingleQuotedString("a"))])
   ```
   
   Here is an example showing this feature working in posgres:
   
   ```sql
   postgres=# create table foo (x int, y varchar);
   CREATE TABLE
   postgres=# insert into foo values (1, 'a'), (2, 'b'), (3, 'c');
   INSERT 0 3
   postgres=# select * from foo;
    x | y
   ---+---
    1 | a
    2 | b
    3 | c
   (3 rows)
   postgres=# select * from foo where (x, y) IN ((1,'a'), (2, 'b'), (5, 'e'));
    x | y
   ---+---
    1 | a
    2 | b
   (2 rows)
   ```
   
   ### Describe alternatives you've considered
   
   The existing InList structure looks like this: 
https://docs.rs/datafusion-expr/26.0.0/datafusion_expr/expr/struct.InList.html
   
   ```rust
   pub struct InList {
       pub expr: Box<Expr>,
       pub list: Vec<Expr>,
       pub negated: bool,
   }
   ```
   
   I am not sure how best to implement this. One idea is to simply special case 
multi-inputs, something like 
   
   ```rust
   struct MultIInList {
       pub exprs: Vec<Expr>,
       pub lists: Vec<Vec<Expr>>,
       pub negated: bool,
   }
   ```
   
   However, my preferred approach would be to support `StructArray`s in 
`InList` and then implement a rewrite from
   
   ```sql
   (col1, col2, ...) in (('a1', 'a2', ..), ('b1', 'b2', ..), ...)
   ```
   into
   
   ```
   struct(col1, col2) In ({ col1: 'a', col2: 'a2', ..}, {col1: 'b1', col2: 
'b2', ..}, ...)
   ```
   
   While likely more complicated this approach would then support structs in 
INLISTs directly which I think will be more and more valuable over time
   
   
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow-datafusion] alamb opened a new issue, #6635: Support multi column IN lists like `(c1, c2) IN ((c1, c2), ,,,)`

Reply via email to