Hi Matthew, I went ahead and created a PR to add this to our benchmark suite. I will aim to finish this over the weekend.
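In the meantime, for numbers that are easier to compare against `%timeit`, here is a minimal std-only sketch of a repeated-timing loop. The `time_runs` and `summarize` helpers and the in-memory workload are hypothetical stand-ins for illustration; they are not part of DataFusion or of the benchmark suite, and the closure would need to be replaced with the actual query.

```rust
use std::time::{Duration, Instant};

/// Summary of repeated timings, in the spirit of IPython's %timeit output.
#[derive(Debug)]
struct TimingStats {
    runs: usize,
    mean: Duration,
    std_dev: Duration,
}

/// Compute the mean and (population) standard deviation of samples in seconds.
fn summarize(samples: &[f64]) -> TimingStats {
    let n = samples.len() as f64;
    let mean = samples.iter().sum::<f64>() / n;
    let variance = samples.iter().map(|s| (s - mean).powi(2)).sum::<f64>() / n;
    TimingStats {
        runs: samples.len(),
        mean: Duration::from_secs_f64(mean),
        std_dev: Duration::from_secs_f64(variance.sqrt()),
    }
}

/// Call `work` once as an untimed warm-up, then time `runs` further calls.
/// Hypothetical helper for illustration only.
fn time_runs<F: FnMut()>(runs: usize, mut work: F) -> TimingStats {
    assert!(runs > 0, "need at least one timed run");
    work(); // warm-up, excluded from the statistics

    let mut samples = Vec::with_capacity(runs);
    for _ in 0..runs {
        let start = Instant::now();
        work();
        samples.push(start.elapsed().as_secs_f64());
    }
    summarize(&samples)
}

fn main() {
    // Placeholder workload (an assumption): count rows where userId == 1
    // in an in-memory column instead of running a real DataFusion query.
    let user_ids: Vec<u32> = (0..100_000).map(|i| i % 700).collect();

    let stats = time_runs(20, || {
        let hits = user_ids.iter().filter(|&&id| id == 1).count();
        std::hint::black_box(hits); // keep the optimizer from dropping the work
    });

    println!(
        "{:?} ± {:?} per run (mean ± std. dev. of {} runs)",
        stats.mean, stats.std_dev, stats.runs
    );
}
```

Building with `--release` matters here, since debug builds of Rust code can be much slower than optimized ones.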
On Thu, Dec 10, 2020 at 3:11 PM Matthew Turner <[email protected]> wrote:

> Hello,
>
> I’ve been playing around with DataFusion to explore the feasibility of
> replacing current python/pandas data processing jobs with Rust/datafusion.
> Ultimately, looking to improve performance / decrease cost.
>
> I was doing some simple tests to start to measure performance differences
> on a simple task (read a csv[1] and filter it).
>
> Reading the csv datafusion seemed to outperform pandas by around 30% which
> was nice.
>
> * Rust took around 20-25ms to read the csv (compared to 32ms from pandas)
>
> However, when filtering the data I was surprised to see that pandas was
> way faster.
>
> * Rust took around 500-600ms to filter the csv (compared to 1ms from pandas)
>
> My code for each is below. I know I should be running the DataFusion
> times through something similar to Python's %timeit but I didn’t have that
> immediately accessible and I ran many times to confirm it was roughly
> consistent.
>
> Is this performance expected? Or am I using datafusion incorrectly?
>
> Any insight is much appreciated!
>
> [Rust]
> ```
> use datafusion::error::Result;
> use datafusion::prelude::*;
> use std::time::Instant;
>
> #[tokio::main]
> async fn main() -> Result<()> {
>     let start = Instant::now();
>
>     let mut ctx = ExecutionContext::new();
>
>     let ratings_csv = "ratings_small.csv";
>
>     let df = ctx.read_csv(ratings_csv, CsvReadOptions::new()).unwrap();
>     println!("Read CSV Duration: {:?}", start.elapsed());
>
>     let q_start = Instant::now();
>     let results = df
>         .filter(col("userId").eq(lit(1)))?
>         .collect()
>         .await
>         .unwrap();
>     println!("Filter duration: {:?}", q_start.elapsed());
>
>     println!("Duration: {:?}", start.elapsed());
>
>     Ok(())
> }
> ```
>
> [Python]
> ```
> In [1]: df = pd.read_csv("ratings_small.csv")
> 32.4 ms ± 210 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
>
> In [2]: df.query("userId==1")
> 1.16 ms ± 24.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
> ```
>
> [1]: https://www.kaggle.com/rounakbanik/the-movies-dataset?select=ratings.csv
>
> *Matthew M. Turner*
> Email: [email protected]
> Phone: (908)-868-2786
