karlovnv commented on issue #7000: URL: https://github.com/apache/datafusion/issues/7000#issuecomment-2094813305
> Here is one idea for doing so: #9403

I have been thinking about the join issue for the case where the left table may not be columnar.

For instance, consider the `Events` and `Users` tables. `Events` is a columnar table consisting of 10^9 rows, while `Users` has only 10^6 rows.

In that case, the `Users` table may be treated as a row-based table with a persistent (or in-memory only) hash (or b*-tree) index.

We can achieve a performance boost using different approaches:

1. Introduce a Dictionary feature. Treat the `Users` table as a dictionary (like in [clickhouse](https://clickhouse.com/docs/en/sql-reference/dictionaries))

   > ClickHouse supports special functions for working with dictionaries that can be used in queries. It is easier and more efficient to use dictionaries with functions than a JOIN with reference tables.

   Right now we are experimenting with UDFs like

   ```
   select timestamp,
          e.user_id,
          get_dict_utf8("Users", "Id", "Name", e.user_id) as user_name
   from events e
   ```

   But this is not a general solution, which leads us to the next approach.

2. Introduce a row-based table provider with its own special type of LookupRecordBatchStream

   The main idea is to add the ability to provide data to HashJoinStream on request: `get_items_from_table_by_ids(join_on: RecordBatch) → Result<SendableRecordBatchStream>`

   This approach may also be useful in the future for joining columnar data with another relational source such as Postgres (by loading portions of the joined table's data on demand, given a list of ids).

3. Cache indices that have been built during JOIN execution, or use an external user-provided index
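To make the on-demand lookup from approach 2 more concrete, here is a minimal sketch in plain Rust using only the standard library. It is not DataFusion code: `UserRow`, `LookupTable`, `HashIndexedUsers`, `get_items_by_ids` and `lookup_join_batch` are all hypothetical names chosen for illustration, and plain slices and `Vec`s stand in for `RecordBatch` and `SendableRecordBatchStream`. It only shows the shape of the idea: the probe side sends a batch of join keys, and the row-based table answers from its hash index.

```rust
use std::collections::HashMap;

/// Hypothetical row-based record for the small `Users` table (not a DataFusion type).
#[derive(Debug, Clone, PartialEq)]
struct UserRow {
    id: u64,
    name: String,
}

/// Hypothetical trait standing in for a lookup-capable table provider.
trait LookupTable {
    /// For each requested key, returns the matching row if one exists,
    /// in the same order as `ids`.
    fn get_items_by_ids(&self, ids: &[u64]) -> Vec<Option<UserRow>>;
}

/// In-memory row store with a hash index on `id` (assumed to be unique).
struct HashIndexedUsers {
    index: HashMap<u64, UserRow>,
}

impl HashIndexedUsers {
    /// Builds the hash index once, up front, instead of on every join.
    fn new(rows: Vec<UserRow>) -> Self {
        let index = rows.into_iter().map(|r| (r.id, r)).collect();
        Self { index }
    }
}

impl LookupTable for HashIndexedUsers {
    fn get_items_by_ids(&self, ids: &[u64]) -> Vec<Option<UserRow>> {
        ids.iter().map(|id| self.index.get(id).cloned()).collect()
    }
}

/// Inner-joins one probe-side batch of `(timestamp, user_id)` events
/// against the lookup table, returning `(timestamp, user_id, user_name)`.
fn lookup_join_batch(
    events: &[(i64, u64)],
    users: &dyn LookupTable,
) -> Vec<(i64, u64, String)> {
    // Collect the join keys of this batch and ask the table for them in one request.
    let ids: Vec<u64> = events.iter().map(|(_, user_id)| *user_id).collect();
    let matches = users.get_items_by_ids(&ids);

    // Keep only the events whose key was found (inner-join semantics).
    events
        .iter()
        .zip(matches)
        .filter_map(|(&(ts, user_id), m)| m.map(|u| (ts, user_id, u.name)))
        .collect()
}
```

Calling `lookup_join_batch` once per incoming `Events` batch keeps the large columnar side streaming, while the small side is only touched for the ids that actually occur. A provider backed by Postgres could, in principle, implement the same trait by issuing a query for the requested ids.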
