[GitHub] [arrow-datafusion] JasonLi-cn commented on issue #2199: Morsel-Driven Parallelism Using Rayon

GitBox Fri, 30 Sep 2022 00:48:33 -0700


JasonLi-cn commented on issue #2199:
URL: https://github.com/apache/arrow-datafusion/issues/2199#issuecomment-1263228061


   1. binary code
   
   ```rust
   use datafusion::arrow::record_batch::RecordBatch;
   use datafusion::arrow::util::pretty::print_batches;
   use datafusion::error::Result;
   use datafusion::prelude::*;
   use datafusion::scheduler::Scheduler;
   use futures::TryStreamExt;
   use std::env;
   
   #[tokio::main]
   async fn main() -> Result<()> {
       let name = "test_table";
       let mut args = env::args();
       args.next();
       let table_path = args.next().expect("parquet file");
       let sql = &args.next().expect("sql");
       let using_scheduler = args.next().is_some();
   
       // create local session context
       let config = SessionConfig::new()
           .with_information_schema(true)
           .with_target_partitions(4);
       let context = SessionContext::with_config(config);
   
       // register parquet file with the execution context
       context
           .register_parquet(name, &table_path, ParquetReadOptions::default())
           .await?;
   
       // Plan the query once, before the timer starts.
       let task = context.task_ctx();
       let query = context.sql(sql).await?;
       let plan = query.create_physical_plan().await?;
   
       println!("Start query, using scheduler {}", using_scheduler);
       let now = std::time::Instant::now();
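       let results = if using_scheduler {
           // Run the pre-built physical plan on a scheduler with 4 threads.
           let scheduler = Scheduler::new(4);
           let stream = scheduler.schedule(plan, task).unwrap().stream();
           let results: Vec<RecordBatch> = stream.try_collect().await.unwrap();
           results
       } else {
           // Note: this path plans the SQL again, so its timing includes planning.
           context.sql(sql).await?.collect().await?
       };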
       let elapsed = now.elapsed().as_millis();
       println!("End query, elapsed {} ms", elapsed);
       print_batches(&results)?;
       Ok(())
   }
   
   /// Execute sql
   async fn plan_and_collect(
       context: &SessionContext,
       sql: &str,
   ) -> Result<Vec<RecordBatch>> {
       context.sql(sql).await?.collect().await
   }
   ```
   
   2. test data
   
   - format: parquet
   - number of files: 4
   - rows: 16405852 * 4 = 65623408
   - number of columns: 6
   - schema: uint32, uint32, uint32, uint32, string, uint32
   
   3. test result
   
   SQL queries:
   ```sql
   select count(distinct column0) from test_table;
   select * from test_table order by column5 limit 10;
   ```
   The performance is similar with and without the Scheduler! Is there a problem with how I am using it?
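
   Since the numbers above come from a single timed run per path, one way to make the comparison less noisy would be to repeat each path several times and compare medians. The following is only a minimal sketch using the standard library; `time_runs` and `median` are hypothetical helper names, not part of DataFusion, and the closure passed in stands in for whichever query path is being measured (driving the async query from a closure is not shown here).

   ```rust
   use std::time::{Duration, Instant};

   /// Hypothetical helper: calls `f` once as an untimed warm-up, then
   /// `runs` more times, returning the elapsed time of each timed call.
   fn time_runs<T>(runs: usize, mut f: impl FnMut() -> T) -> Vec<Duration> {
       let _ = f(); // warm-up call, not recorded
       (0..runs)
           .map(|_| {
               let start = Instant::now();
               let _ = f();
               start.elapsed()
           })
           .collect()
   }

   /// Hypothetical helper: returns the median sample, or `None` if there
   /// are no samples. For an even count it returns the upper of the two
   /// middle values.
   fn median(mut samples: Vec<Duration>) -> Option<Duration> {
       if samples.is_empty() {
           return None;
       }
       samples.sort();
       Some(samples[samples.len() / 2])
   }
   ```

   With these helpers, each path would be measured as `median(time_runs(n, || ...))` for the same `n`, so that a one-off cost such as a cold file cache is less likely to dominate either number.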
   
   @tustvold 
   

