karlovnv commented on issue #7000:
URL: https://github.com/apache/datafusion/issues/7000#issuecomment-2094813305

   > Here is one idea for doing so: #9403
   
   I have been thinking about the join problem for the case where the left 
table may not be columnar.
   
   For instance, consider the `Events` and `Users` tables.
   `Events` is a columnar table with 10^9 rows.
   `Users` has only 10^6 rows.
   
   In that case, the `Users` table can be treated as a row-based table with a 
persistent (or in-memory only) hash (or B*-tree) index. 
   
   We can get a performance boost in several ways:
   
   1. Introduce a dictionary feature: treat the `Users` table as a dictionary 
(as in 
[ClickHouse](https://clickhouse.com/docs/en/sql-reference/dictionaries)).
   
   > ClickHouse supports special functions for working with dictionaries that 
can be used in queries. It is easier and more efficient to use dictionaries 
with functions than a JOIN with reference tables.
   
   We are currently experimenting with UDFs like this:
   ```
   select timestamp, 
     e.user_id,  
     get_dict_utf8("Users", "Id", "Name", e.user_id) as user_name 
   from events e
   ```
   
   But this is not a general solution, which leads us to the next approach.
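
   To make the dictionary idea concrete, here is a minimal std-only sketch of 
what such a lookup does for one batch of keys. The `Dictionary` type and its 
methods are hypothetical names for illustration; this is not the actual 
`get_dict_utf8` UDF and it does not use any DataFusion API.

   ```rust
   use std::collections::HashMap;

   /// Hypothetical in-memory dictionary: maps a key (e.g. `Users.Id`) to one
   /// attribute (e.g. `Users.Name`). Not a DataFusion type.
   pub struct Dictionary {
       values: HashMap<u64, String>,
   }

   impl Dictionary {
       /// Build the dictionary once from the small, row-based table.
       pub fn from_rows(rows: &[(u64, &str)]) -> Self {
           let values = rows
               .iter()
               .map(|(id, name)| (*id, name.to_string()))
               .collect();
           Dictionary { values }
       }

       /// Resolve a whole column of keys in one call, as a scalar UDF would
       /// for a batch. Keys without a match yield `None` (NULL in SQL terms).
       pub fn lookup_column(&self, keys: &[u64]) -> Vec<Option<&str>> {
           keys.iter()
               .map(|k| self.values.get(k).map(String::as_str))
               .collect()
       }
   }
   ```

   The large `Events` side is only scanned; no hash table is built from it, and 
the small side is indexed once up front.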
   
   2. Introduce a row-based table provider with its own special stream type, 
`LookupRecordBatchStream`.
   The main idea is to let `HashJoinStream` request data on demand:
   `get_items_from_table_by_ids(join_on: RecordBatch) → 
Result<SendableRecordBatchStream>`
   
   This approach may also be useful in the future for joining columnar data 
with another relational source such as Postgres, by loading portions of the 
joined table on demand by a list of ids.
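
   A rough sketch of the lookup-by-ids flow, using plain Rust collections 
instead of `RecordBatch` streams. The `LookupTable` trait, `InMemoryUsers`, and 
`join_batch` are hypothetical names chosen for this example and are not part of 
DataFusion.

   ```rust
   use std::collections::{HashMap, HashSet};

   /// Hypothetical row of the small table: (id, name).
   type UserRow = (u64, String);

   /// Hypothetical provider interface: return only the rows whose id was
   /// requested. A real provider could use a hash/B-tree index or query a
   /// remote database.
   pub trait LookupTable {
       fn get_items_by_ids(&self, ids: &[u64]) -> Vec<UserRow>;
   }

   /// Toy in-memory provider backed by a hash index.
   pub struct InMemoryUsers {
       pub index: HashMap<u64, String>,
   }

   impl LookupTable for InMemoryUsers {
       fn get_items_by_ids(&self, ids: &[u64]) -> Vec<UserRow> {
           ids.iter()
               .filter_map(|id| self.index.get(id).map(|name| (*id, name.clone())))
               .collect()
       }
   }

   /// Inner-join one batch of probe-side keys against the lookup table.
   /// Keys are deduplicated first so each id is requested at most once per batch.
   pub fn join_batch(event_user_ids: &[u64], users: &dyn LookupTable) -> Vec<(u64, String)> {
       let mut seen = HashSet::new();
       let unique: Vec<u64> = event_user_ids
           .iter()
           .copied()
           .filter(|id| seen.insert(*id))
           .collect();
       // Fetch only the rows this batch needs, then index them for the probe.
       let fetched: HashMap<u64, String> =
           users.get_items_by_ids(&unique).into_iter().collect();
       event_user_ids
           .iter()
           .filter_map(|id| fetched.get(id).map(|name| (*id, name.clone())))
           .collect()
   }
   ```

   Here the columnar side drives the join batch by batch, and the row-based 
side is never scanned in full.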
   
   3. Cache indices that were built during JOIN execution, or use an external, 
user-provided index.
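
   One possible shape for such a cache, as a std-only sketch. `IndexCache` and 
`get_or_build` are hypothetical names; the sketch assumes unique join keys on 
the build side and ignores invalidation when the underlying table changes.

   ```rust
   use std::collections::HashMap;
   use std::sync::Arc;

   /// Join key -> row position in the build-side table.
   type HashIndex = HashMap<u64, usize>;

   /// Hypothetical cache of build-side hash indices, keyed by table name.
   #[derive(Default)]
   pub struct IndexCache {
       indices: HashMap<String, Arc<HashIndex>>,
       /// How many times an index was actually built (for illustration).
       pub builds: usize,
   }

   impl IndexCache {
       /// Return the cached index for `table`, building it from `keys` only
       /// on a cache miss.
       pub fn get_or_build(&mut self, table: &str, keys: &[u64]) -> Arc<HashIndex> {
           if let Some(index) = self.indices.get(table) {
               return Arc::clone(index);
           }
           let index: Arc<HashIndex> = Arc::new(
               keys.iter()
                   .enumerate()
                   .map(|(row, key)| (*key, row))
                   .collect(),
           );
           self.indices.insert(table.to_string(), Arc::clone(&index));
           self.builds += 1;
           index
       }
   }
   ```

   A second join against the same table then reuses the shared index instead 
of rebuilding it.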
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.