[GitHub] [arrow-datafusion] ParadoxShmaradox opened a new issue, #2845: [Question] Optimize multiple reads on same DataFrame

GitBox Wed, 06 Jul 2022 11:03:07 -0700


ParadoxShmaradox opened a new issue, #2845:
URL: https://github.com/apache/arrow-datafusion/issues/2845


   Hey,
   
   I have a scenario where I have to run the same filter expression, with different values each time, against the same RecordBatch.
   
   For example:
   
   ```
   let c2: Vec<RecordBatch> = ....
   let provider = datafusion::datasource::MemTable::try_new(c2[0].schema(), vec![c2])
       .map_err(|e| {
           log::error!("Error MemTable {}", e);
           e
       })
       .unwrap();
   
   let ctx = SessionContext::new();
   
   // register_table expects an Arc<dyn TableProvider>, so wrap the MemTable
   ctx.register_table("t", std::sync::Arc::new(provider)).unwrap();
   let df = ctx.table("t").unwrap();
   
   let expr: Expr = get_expression(id, from_time, to_time);
   
   let df = df.filter(expr).unwrap();
   
   let res = df.collect().await.unwrap();
   ctx.deregister_table("t").unwrap();
   ```
   
   It is pretty fast: a few ms on an 80 MiB in-memory array when filtering on 2 columns.
   I might run 1000 queries on the same MemTable and was wondering whether anything could be optimized:
   
   - Pre-computing an execution plan on the MemTable, if that is cost-effective.
   - Is SessionContext thread-safe and shareable between multiple threads, and can anything be optimized across executions?
   - Somehow creating an index, if that is cost-effective (I am not sure whether an index is created by one of the calls above, or whether indexes are supported at all).
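
To make the index and sharing ideas concrete, here is roughly what I have in mind, as a minimal std-only sketch. It does not use any DataFusion API: `Row`, `TimeIndex`, and `run_queries` are made-up names, and I am assuming each query filters on an id plus a half-open time range `[from_time, to_time)`.

```rust
use std::sync::Arc;
use std::thread;

/// Hypothetical row layout standing in for the two filtered columns.
#[derive(Debug, Clone, PartialEq)]
pub struct Row {
    pub id: u32,
    pub time: i64,
}

/// Rows sorted by `time` once, so each query can binary-search its range
/// instead of scanning every row.
pub struct TimeIndex {
    rows: Vec<Row>,
}

impl TimeIndex {
    /// Pay the sorting cost a single time, up front.
    pub fn build(mut rows: Vec<Row>) -> Self {
        rows.sort_by_key(|r| r.time);
        TimeIndex { rows }
    }

    /// Rows with the given `id` and `from_time <= time < to_time`.
    pub fn query(&self, id: u32, from_time: i64, to_time: i64) -> Vec<Row> {
        let start = self.rows.partition_point(|r| r.time < from_time);
        let end = self.rows.partition_point(|r| r.time < to_time);
        if start >= end {
            return Vec::new(); // empty or inverted time range
        }
        self.rows[start..end]
            .iter()
            .filter(|r| r.id == id)
            .cloned()
            .collect()
    }
}

/// Run many queries in parallel over one shared, read-only index and
/// return the number of matching rows per query, in input order.
pub fn run_queries(index: Arc<TimeIndex>, queries: Vec<(u32, i64, i64)>) -> Vec<usize> {
    let handles: Vec<_> = queries
        .into_iter()
        .map(|(id, from_time, to_time)| {
            let index = Arc::clone(&index);
            thread::spawn(move || index.query(id, from_time, to_time).len())
        })
        .collect();
    handles
        .into_iter()
        .map(|h| h.join().expect("query thread panicked"))
        .collect()
}
```

The point is only that the expensive setup happens once and the result is shared immutably between threads. Whether DataFusion can do something equivalent for a MemTable is exactly what I am asking.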
   
   Thanks!
   
   
   
   
   

