GitHub user mispp closed a discussion: Performance issue when loading 6.5gb
parquet file into memory
Is it expected that loading a ~6.5 GB Parquet file into memory shows a huge
performance difference between Polars and DataFusion?
DataFusion's `.cache()` method takes ~2 minutes. Loading the same data with
Polars takes ~15 seconds.
> polars start -> 2023-07-10T22:43:25.690623200+02:00
> polars end -> 2023-07-10T22:43:40.854580400+02:00
> datafusion start -> 2023-07-10T22:43:41.363312400+02:00
> datafusion end -> 2023-07-10T22:45:32.949019300+02:00
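For reference, the gap implied by these timestamps can be worked out directly. The following is a minimal sketch using only the standard library; `time_of_day` and `elapsed_secs` are hypothetical helper names chosen for illustration, and the code assumes both timestamps fall on the same day.

```rust
use std::time::Duration;

/// Time of day expressed as a Duration since midnight (hypothetical helper).
fn time_of_day(h: u64, m: u64, s: u64, nanos: u32) -> Duration {
    Duration::new(h * 3600 + m * 60 + s, nanos)
}

/// Elapsed seconds between two same-day timestamps; assumes `end >= start`.
fn elapsed_secs(start: Duration, end: Duration) -> f64 {
    (end - start).as_secs_f64()
}

fn main() {
    // Values copied from the timestamps quoted above.
    let polars = elapsed_secs(
        time_of_day(22, 43, 25, 690_623_200),
        time_of_day(22, 43, 40, 854_580_400),
    );
    let datafusion = elapsed_secs(
        time_of_day(22, 43, 41, 363_312_400),
        time_of_day(22, 45, 32, 949_019_300),
    );

    println!("polars:     {polars:.2} s"); // 15.16 s
    println!("datafusion: {datafusion:.2} s"); // 111.59 s
    println!("ratio:      {:.1}x", datafusion / polars); // 7.4x
}
```

So the measured difference is roughly 15 seconds versus 112 seconds, a factor of about 7.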
Minimum working example is below.
Both are run with plain `cargo run`, i.e. without `--release`, in case that
makes a difference.
Code:
```
use std::io::Error;
use polars::prelude::*;
use datafusion::prelude::*;
use chrono;

#[tokio::main]
async fn main() -> Result<(), Error> {
    let _ = _dataframe2();
    let _ = _datafusion().await;
    Ok(())
}

pub async fn _datafusion() {
    let _ctx = SessionContext::new();
    let _read_options = ParquetReadOptions {
        file_extension: ".parquet",
        table_partition_cols: vec!(),
        parquet_pruning: None,
        skip_metadata: None,
    };
    let _df = _ctx
        .read_parquet("/mnt/d/Projects/testdf/data/test_data.parquet", _read_options)
        .await
        .unwrap();
    println!("datafusion start -> {:?}", chrono::offset::Local::now());
    let _cached = _df.cache().await;
    println!("datafusion end -> {:?}", chrono::offset::Local::now());
}

pub fn _dataframe2() -> Result<String, PolarsError> {
    let mut file =
        std::fs::File::open("/mnt/d/Projects/testdf/data/test_data.parquet").unwrap();
    println!("polars start -> {:?}", chrono::offset::Local::now());
    let _df = ParquetReader::new(&mut file).finish().unwrap();
    println!("polars end -> {:?}", chrono::offset::Local::now());
    Ok("done".to_string())
}
```
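The example above prints wall-clock timestamps before and after each load. An alternative way to take the same measurement is `std::time::Instant`, which is monotonic and reports the elapsed time directly. This is only a sketch: `timed` is a hypothetical helper, and the closure in `main` is a stand-in workload rather than the actual Parquet load.

```rust
use std::time::{Duration, Instant};

/// Runs `f`, prints how long it took, and returns its result with the duration.
/// `timed` is a hypothetical helper, not part of Polars or DataFusion.
fn timed<T>(label: &str, f: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let out = f();
    let elapsed = start.elapsed();
    println!("{label} took {:.3} s", elapsed.as_secs_f64());
    (out, elapsed)
}

fn main() {
    // Stand-in workload; in the example above this would wrap the
    // synchronous Polars read. An async call such as DataFusion's
    // `.cache().await` would need `Instant::now()` / `elapsed()` placed
    // around the `.await` instead of a plain closure.
    let (sum, _elapsed) = timed("sum", || (0..1_000_000u64).sum::<u64>());
    println!("sum = {sum}");
}
```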
Cargo.toml
```
[package]
name = "testdf"
version = "0.1.0"
edition = "2021"
# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html
[dependencies]
parquet = "40.0.0"
polars = { version = "0.30.0", features = [
    "lazy", "temporal", "describe", "json", "parquet", "dtype-datetime",
    "dtype-categorical", "sql", "streaming", "serde-lazy", "ipc",
    "dynamic_groupby", "sort_multiple", "rows", "dataframe_arithmetic",
    "partition_by",
] }
serde = "1.0.163"
serde_json = "1.0.96"
connectorx = { version = "0.3.1", features = ["src_postgres", "dst_arrow", "dst_arrow2"] }
datafusion = "27.0.0"
tokio = "1.0"
chrono = "0.4.26"
```
GitHub link: https://github.com/apache/datafusion/discussions/6908