Oops. I managed to hit a magical key combination that sent the mail prematurely. Let's try that again.
Hi Matthew,

I went ahead and created a PR to add this to our benchmark suite. I will aim to finish this over the weekend.

https://github.com/apache/arrow/pull/8896

If I run in debug mode with "cargo run --bin movies" I get a timing of 526 ms, which seems similar to the timing you are seeing. If I run in release mode with "cargo run --release --bin movies" then the time drops down to 21 ms.

Are you running in release mode?

Thanks,

Andy.

On Thu, Dec 10, 2020 at 10:46 PM Andy Grove <[email protected]> wrote:

> Hi Matthew,
>
> I went ahead and created a PR to add this to our benchmark suite. I will
> aim to finish this over the weekend.
>
> On Thu, Dec 10, 2020 at 3:11 PM Matthew Turner <[email protected]> wrote:
>
>> Hello,
>>
>> I’ve been playing around with DataFusion to explore the feasibility of
>> replacing current python/pandas data processing jobs with Rust/datafusion.
>> Ultimately, looking to improve performance / decrease cost.
>>
>> I was doing some simple tests to start to measure performance differences
>> on a simple task (read a csv[1] and filter it).
>>
>> Reading the csv datafusion seemed to outperform pandas by around 30%
>> which was nice.
>>
>> *Rust took around 20-25ms to read the csv (compared to 32ms from pandas)
>>
>> However, when filtering the data I was surprised to see that pandas was
>> way faster.
>>
>> *Rust took around 500-600ms to filter the csv (compared to 1ms from pandas)
>>
>> My code for each is below. I know I should be running the DataFusion
>> times through something similar to pythons %timeit but I didn’t have that
>> immediately accessible and I ran many times to confirm it was roughly
>> consistent.
>>
>> Is this performance expected? Or am I using datafusion incorrectly?
>>
>> Any insight is much appreciated!
>>
>> [Rust]
>>
>> ```
>> use datafusion::error::Result;
>> use datafusion::prelude::*;
>> use std::time::Instant;
>>
>> #[tokio::main]
>> async fn main() -> Result<()> {
>>     let start = Instant::now();
>>
>>     let mut ctx = ExecutionContext::new();
>>
>>     let ratings_csv = "ratings_small.csv";
>>
>>     let df = ctx.read_csv(ratings_csv, CsvReadOptions::new()).unwrap();
>>     println!("Read CSV Duration: {:?}", start.elapsed());
>>
>>     let q_start = Instant::now();
>>     let results = df
>>         .filter(col("userId").eq(lit(1)))?
>>         .collect()
>>         .await
>>         .unwrap();
>>     println!("Filter duration: {:?}", q_start.elapsed());
>>
>>     println!("Duration: {:?}", start.elapsed());
>>
>>     Ok(())
>> }
>> ```
>>
>> [Python]
>>
>> ```
>> In [1]: df = pd.read_csv("ratings_small.csv")
>> 32.4 ms ± 210 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>
>> In [2]: df.query("userId==1")
>> 1.16 ms ± 24.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>> ```
>>
>> [1]: https://www.kaggle.com/rounakbanik/the-movies-dataset?select=ratings.csv
>>
>> *Matthew M. Turner*
>> Email: [email protected]
>> Phone: (908)-868-2786
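
The original message mentions wanting something similar to Python's %timeit for the Rust side. Below is a minimal sketch of such a helper using only the standard library. The names `time_it` and `count_user` are hypothetical and do not come from DataFusion or the PR; the workload is a synthetic stand-in for the real query, so the numbers it prints say nothing about DataFusion itself. As noted above, it only gives meaningful timings when built with `--release`.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Hypothetical helper: run `f` `runs` times and return the
/// (mean, minimum) wall-clock time of a single run.
fn time_it<F: FnMut()>(runs: u32, mut f: F) -> (Duration, Duration) {
    assert!(runs > 0, "need at least one run");
    let mut total = Duration::ZERO;
    let mut min = Duration::MAX;
    for _ in 0..runs {
        let start = Instant::now();
        f();
        let elapsed = start.elapsed();
        total += elapsed;
        min = min.min(elapsed);
    }
    (total / runs, min)
}

/// Stand-in workload (not DataFusion): count rows whose user id matches.
fn count_user(user_ids: &[u32], wanted: u32) -> usize {
    user_ids.iter().filter(|&&id| id == wanted).count()
}

fn main() {
    // Synthetic data; swap in the real query when measuring DataFusion.
    let user_ids: Vec<u32> = (0..100_000).map(|i| i % 700 + 1).collect();
    let (mean, min) = time_it(100, || {
        // black_box keeps the optimizer from discarding the unused result.
        black_box(count_user(black_box(&user_ids), 1));
    });
    println!("mean: {:?}, min: {:?} over 100 runs", mean, min);
}
```

Reporting the minimum alongside the mean helps separate the cost of the work itself from one-off effects such as a cold first run.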
