[GitHub] [incubator-doris] huangmengbin opened a new issue #6098: [Enhancement] Improve the performance of RowCursor::init_scan_key

GitBox Sat, 26 Jun 2021 05:22:13 -0700


huangmengbin opened a new issue #6098:
URL: https://github.com/apache/incubator-doris/issues/6098



   **Is your feature request related to a problem? Please describe.**
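   - Every call to `RowCursor::init_scan_key` in `olap/row_cursor` builds a new `Schema` from the `TabletSchema` passed in and the length of `scan_keys`.
   - In practice, `init_scan_key` is always called inside a for loop, and the `Schema` built on each iteration is identical.
   - `RowCursor` currently never modifies its `Schema`, and only exposes it to callers as a const pointer.
   - For a query on a table with many columns such as "where a in (...) and b in (...) ......", if the resulting Cartesian product is large (some use cases do need a large `max_scan_key_num`), this leads to a large amount of wasted memory allocation and heavy lock contention.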
   
   **Describe the solution you'd like**
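   - Build the `Schema` once outside the loop, wrap it in a `shared_ptr`, and pass it to each `init_scan_key` call inside the loop, so that all the `RowCursor`s share a single `Schema`. This reduces the overhead of object construction and destruction, memory allocation and deallocation, and lock contention.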
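
A minimal sketch of the shared-`Schema` idea, not the actual Doris code. `TabletSchema`, `Schema` and `RowCursor` here are simplified, hypothetical stand-ins; the `init_scan_key` signature, the `build_scan_key_cursors` helper and the `g_schema_constructions` counter are assumptions made for illustration only.

```cpp
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for Doris's TabletSchema: only the column count matters here.
struct TabletSchema {
    std::size_t num_columns;
};

// Counts Schema constructions so the saving is observable (illustration only).
int g_schema_constructions = 0;

// Hypothetical stand-in for Doris's Schema.
class Schema {
public:
    Schema(const TabletSchema& tablet_schema, std::size_t num_key_columns)
            : _num_columns(tablet_schema.num_columns), _num_key_columns(num_key_columns) {
        ++g_schema_constructions;
    }
    std::size_t num_columns() const { return _num_columns; }
    std::size_t num_key_columns() const { return _num_key_columns; }

private:
    std::size_t _num_columns;
    std::size_t _num_key_columns;
};

// Hypothetical stand-in for RowCursor: it receives a ready-made, immutable
// Schema instead of building its own on every init_scan_key call.
class RowCursor {
public:
    void init_scan_key(std::shared_ptr<const Schema> schema) { _schema = std::move(schema); }
    const Schema* schema() const { return _schema.get(); }

private:
    std::shared_ptr<const Schema> _schema;
};

// Builds the Schema once outside the loop and shares it across all cursors.
std::vector<RowCursor> build_scan_key_cursors(const TabletSchema& tablet_schema,
                                              std::size_t scan_key_length,
                                              std::size_t num_scan_keys) {
    auto shared_schema = std::make_shared<const Schema>(tablet_schema, scan_key_length);
    std::vector<RowCursor> cursors(num_scan_keys);
    for (RowCursor& cursor : cursors) {
        cursor.init_scan_key(shared_schema); // copies the pointer, not the Schema
    }
    return cursors;
}
```

In this sketch, building 1000 cursors constructs exactly one `Schema`. Holding the schema as `std::shared_ptr<const Schema>` relies on the observation above that `RowCursor` never modifies it.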
   
   **Describe alternatives you've considered**
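   - In `olap/schema` there is a `std::vector<Field*> _cols` whose size equals the number of columns in the `TabletSchema`. It holds every column covered by `_col_ids`, and the remaining slots are filled with nullptr. Presumably this trades space for time.
   - In the scenario above, the number of columns in the `TabletSchema` is often far larger than the size of `scan_keys`. There is therefore no need to give the `Schema` one slot per `TabletSchema` column; it only needs to be large enough to hold the largest value in `_col_ids`, that is, `_cols.size() >= max(_col_ids) + 1`.
   - Then add a special-case check to `Schema::column(ColumnId cid)`: if cid is beyond the trimmed size, return nullptr directly. This reduces memory usage and is transparent to outside code.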
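
A minimal sketch of the trimmed `_cols` alternative, under stated assumptions. `TrimmedSchema`, `Field` and `slot_count` are hypothetical names, and the sketch owns its fields through `std::unique_ptr` rather than the raw `Field*` used by the real `Schema`, purely to keep the example self-contained.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

using ColumnId = std::uint32_t;

// Hypothetical stand-in for Doris's Field.
struct Field {
    ColumnId id;
};

// Hypothetical schema whose slot vector is sized by max(_col_ids) + 1
// instead of by the total number of columns in the tablet schema.
class TrimmedSchema {
public:
    explicit TrimmedSchema(const std::vector<ColumnId>& col_ids) : _col_ids(col_ids) {
        if (_col_ids.empty()) {
            return;
        }
        const ColumnId max_id = *std::max_element(_col_ids.begin(), _col_ids.end());
        _cols.resize(static_cast<std::size_t>(max_id) + 1); // unused slots stay nullptr
        for (ColumnId cid : _col_ids) {
            _cols[cid] = std::make_unique<Field>(Field{cid});
        }
    }

    // Special-case check: ids beyond the trimmed size behave like unused slots.
    const Field* column(ColumnId cid) const {
        if (cid >= _cols.size()) {
            return nullptr;
        }
        return _cols[cid].get();
    }

    std::size_t slot_count() const { return _cols.size(); }

private:
    std::vector<ColumnId> _col_ids;
    std::vector<std::unique_ptr<Field>> _cols;
};
```

With `_col_ids = {0, 2}` this sketch allocates three slots no matter how many columns the table has, and `column(50)` returns nullptr just as an unused slot would.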
   
   
   **Additional context**
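   - In general, the larger `NumScanKeys`, the more columns, and the more `_scan_ranges` there are, the greater the benefit.
   - If the first approach is accepted, the second will not be pursued for now, since it brings a smaller performance gain than the first.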
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



