zeroshade commented on issue #2084:
URL: https://github.com/apache/arrow-adbc/issues/2084#issuecomment-2297793453
> Some questions: at which point the data is considered sent to SF when adbc_ingest is used? How the backpressure is handled?

Backpressure and concurrency are handled in two ways:

1. The `RecordBatchReader` passed to `adbc_ingest` is read from a single thread that continuously calls next on the reader and pushes each record batch onto a channel. The buffer queue size (i.e. the maximum number of record batches queued for writing) is determined by the number of writers, controlled by the `adbc.snowflake.statement.ingest_writer_concurrency` option, which defaults to the number of CPUs.
2. The number of concurrent file uploads and COPY tasks on the Snowflake side is controlled by the `adbc.snowflake.statement.ingest_upload_concurrency` and `adbc.snowflake.statement.ingest_copy_concurrency` options (see the second sketch at the end of this comment for setting these).

> If I call adbc_ingest 1000 times with 4KB batches, is there a way to know how many actual parquets/copy streams were created?

My personal recommendation would be to consolidate the batches into fewer streams and call `adbc_ingest` with a single consolidated stream of those batches rather than calling it 1000 times with 4 KB batches; that would also let you keep fewer batches in memory at any one time (see the first sketch below). That said, you should be able to see how many actual Parquet files / COPY streams were created from your Snowflake monitoring, which will show you all the COPY tasks and the files uploaded to the stage if you examine the queries.
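
To make the consolidation suggestion concrete, here's a minimal sketch using the Python DBAPI bindings (`adbc_driver_snowflake`). The schema, the `small_batches()` generator, the connection string, and the table name `my_table` are all hypothetical placeholders for your own data:

```python
import pyarrow as pa
import adbc_driver_snowflake.dbapi

# Hypothetical schema and batch source standing in for your 1000 x 4 KB batches.
SCHEMA = pa.schema([("id", pa.int64()), ("payload", pa.string())])

def small_batches():
    """Placeholder generator for the many small record batches you already have."""
    for i in range(1000):
        yield pa.record_batch([pa.array([i]), pa.array(["..."])], schema=SCHEMA)

# Wrap everything in a single RecordBatchReader: the driver then sees one
# continuous stream, reads it lazily from one thread, and applies its own
# backpressure, so only a bounded number of batches are in memory at a time.
reader = pa.RecordBatchReader.from_batches(SCHEMA, small_batches())

uri = "user:password@account/database/schema"  # placeholder connection string

with adbc_driver_snowflake.dbapi.connect(uri) as conn:
    with conn.cursor() as cur:
        # One ingest call for the whole stream instead of 1000 tiny ones.
        cur.adbc_ingest("my_table", reader, mode="append")
    conn.commit()
```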
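
And if you want to tune the concurrency knobs mentioned above, something like the following should work. The string option keys come straight from the driver, but reaching the statement through `cur.adbc_statement.set_options(...)` is my assumption about the Python DBAPI escape hatch, and the numeric values are purely illustrative; adjust to however your bindings set statement options:

```python
# Fresh reader (the one in the previous sketch was already consumed).
reader = pa.RecordBatchReader.from_batches(SCHEMA, small_batches())

with adbc_driver_snowflake.dbapi.connect(uri) as conn:
    with conn.cursor() as cur:
        # Assumed escape hatch to the underlying ADBC statement; option
        # values are passed as strings.
        cur.adbc_statement.set_options(**{
            # parallel Parquet writers; also bounds the batch queue (default: CPU count)
            "adbc.snowflake.statement.ingest_writer_concurrency": "4",
            # concurrent file uploads to the Snowflake stage
            "adbc.snowflake.statement.ingest_upload_concurrency": "8",
            # concurrent COPY tasks on the Snowflake side
            "adbc.snowflake.statement.ingest_copy_concurrency": "4",
        })
        cur.adbc_ingest("my_table", reader, mode="append")
    conn.commit()
```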
