Smotrov opened a new issue, #10897: URL: https://github.com/apache/datafusion/issues/10897
I'm using Rust, but I'm new to DataFusion.

I need to repartition a big dataset of several hundred GB. It is stored on S3 as multiple compressed packet files, and it should be partitioned by the value of a column.

Here is what I'm doing:

```RUST
// Define the partitioned Listing Table
let listing_options = ListingOptions::new(file_format)
    .with_table_partition_cols(part)
    .with_target_partitions(1)
    .with_file_extension(".ndjson.zst");

ctx.register_listing_table(
    "data",
    format!("s3://{BUCKET_NAME}/data_lake/data_warehouse"),
    listing_options,
    Some(schema),
    None,
)
.await?;

let df = ctx
    .sql(
        r#"
        SELECT
            SUBSTRING("OriginalRequest", 9, 3) as dep,
            *
        FROM data
        WHERE /*partitions predicates here*/
        "#,
    )
    .await?;

let s3 = AmazonS3Builder::new()
    .with_bucket_name(save_bucket_name)
    .with_region(REGION)
    .build()?;

// Register the S3 store in DataFusion context
let path = format!("s3://{save_bucket_name}");
let s3_url = Url::parse(&path).unwrap();
let arc_s3 = Arc::new(s3);
ctx.runtime_env()
    .register_object_store(&s3_url, arc_s3.clone());

// Write the data as JSON partitioned by `dep`
let output_path = "s3://my_bucket/output/json/";

// Write as JSON to S3
let options = DataFrameWriteOptions::new()
    .with_partition_by(vec!["dep".to_string()]);

let mut json_options = JsonOptions::default();
json_options.compression = CompressionTypeVariant::ZSTD;

df
    .write_json(&output_path, options, Some(json_options))
    .await?;
```

Will it swallow all the memory and fail, or will it run in a kind of streaming fashion?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org