Smotrov opened a new issue, #10897:
URL: https://github.com/apache/datafusion/issues/10897

   I'm using Rust, but I'm new to DataFusion.
   
   I need to repartition a big dataset that is hundreds of GB in size. It is
   stored on S3 as multiple compressed packet files.
   It should be partitioned by the value of a column. Here is what I'm doing:
   
   ```rust
       // Define the partitioned Listing Table
       let listing_options = ListingOptions::new(file_format)
           .with_table_partition_cols(part)
           .with_target_partitions(1)
           .with_file_extension(".ndjson.zst");
   
       ctx.register_listing_table(
           "data",
           format!("s3://{BUCKET_NAME}/data_lake/data_warehouse"),
           listing_options,
           Some(schema),
           None,
       )
       .await?;
   
       let df = ctx
           .sql(
               r#"
          SELECT 
             SUBSTRING("OriginalRequest", 9, 3) as dep, *
          FROM data 
             WHERE 
              /*partitions predicates here*/
   
          "#,
           )
           .await?;
   
       let s3 = AmazonS3Builder::new()
           .with_bucket_name(save_bucket_name)
           .with_region(REGION)
           .build()?;
   
       // Register the S3 store in DataFusion context
       let path = format!("s3://{save_bucket_name}");
       let s3_url = Url::parse(&path).unwrap();
       let arc_s3 = Arc::new(s3);
       ctx.runtime_env()
           .register_object_store(&s3_url, arc_s3.clone());
   
       // Write the data as JSON to S3, partitioned by `dep`.
       // Target the bucket whose object store was registered above.
       let output_path = format!("s3://{save_bucket_name}/output/json/");
   
       let options = DataFrameWriteOptions::new()
           .with_partition_by(vec!["dep".to_string()]);
   
       let mut json_options = JsonOptions::default();
       json_options.compression = CompressionTypeVariant::ZSTD;
   
       df
           .write_json(&output_path, options, Some(json_options))
           .await?;
   ```
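
For clarity, this is what I expect the `dep` partition key to contain. SQL `SUBSTRING(s, 9, 3)` uses 1-based positions, so it should yield up to three characters starting at the ninth one. Below is a minimal plain-Rust sketch of the same extraction; the function name `dep_key` is only illustrative and is not part of DataFusion.

```rust
/// Mirrors SQL `SUBSTRING(s, 9, 3)`: positions are 1-based, so skip the
/// first eight characters and take at most the next three.
/// Returns an empty string when the input has fewer than nine characters.
fn dep_key(original_request: &str) -> String {
    original_request.chars().skip(8).take(3).collect()
}
```

With a made-up value such as `REQUEST-ABC-0001`, this returns `ABC`, which would become the partition directory value for `dep`.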
   
   Will it swallow all memory and fail, or will it run in a kind of
   streaming fashion?
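
To make the question concrete, by "streaming" I mean roughly the shape sketched below: each record is routed to the writer for its partition as soon as it is read, so memory depends on the number of open partitions rather than on the size of the dataset. This is a plain standard-library illustration, not DataFusion code; `route_lines`, `key_of` and `open` are hypothetical names, and I am not assuming DataFusion works this way internally.

```rust
use std::collections::HashMap;
use std::io::{self, BufRead, Write};

/// Routes each input line to a writer chosen by its partition key.
/// Only one line is buffered at a time; one writer stays open per key.
fn route_lines<R, W, K, O>(
    reader: R,
    mut key_of: K,
    mut open: O,
) -> io::Result<HashMap<String, W>>
where
    R: BufRead,
    W: Write,
    K: FnMut(&str) -> String,
    O: FnMut(&str) -> io::Result<W>,
{
    let mut writers: HashMap<String, W> = HashMap::new();
    for line in reader.lines() {
        let line = line?;
        let key = key_of(&line);
        // Lazily open a writer the first time a key is seen.
        if !writers.contains_key(&key) {
            let writer = open(&key)?;
            writers.insert(key.clone(), writer);
        }
        let writer = writers.get_mut(&key).expect("writer was just inserted");
        writer.write_all(line.as_bytes())?;
        writer.write_all(b"\n")?;
    }
    Ok(writers)
}
```

In a real job `open` would create one output file or upload per `dep` value; in-memory `Vec<u8>` buffers are enough to try the routing logic.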


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

