zeroshade commented on issue #2084:
URL: https://github.com/apache/arrow-adbc/issues/2084#issuecomment-2297793453

   > Some questions: at which point is the data considered sent to SF when adbc_ingest is used? How is the backpressure handled?
   
   Backpressure and concurrency are handled in two ways:
   
   1. The `RecordBatchReader` that is passed to `adbc_ingest` is read from a single thread, which continuously calls next on the reader and pushes each record batch onto a channel. The buffer queue size (i.e. the maximum number of record batches queued for writing) is determined by the number of writers, which is controlled by the `adbc.snowflake.statement.ingest_writer_concurrency` option and defaults to the number of CPUs.
   2. The number of concurrent file uploads and copy tasks on the Snowflake side, which is controlled by the `adbc.snowflake.statement.ingest_upload_concurrency` and `adbc.snowflake.statement.ingest_copy_concurrency` options. (See the sketch after this list for setting these options on a statement.)
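
   To make that concrete, here is a minimal Go sketch of setting those options on a bulk-ingest statement. The table name, ingest mode, option values, and the Arrow module version in the import paths are placeholders rather than recommendations; only the option keys come from the list above.

```go
package example

import (
	"context"

	"github.com/apache/arrow-adbc/go/adbc"
	"github.com/apache/arrow/go/v17/arrow/array"
)

// ingestWithTuning bulk-ingests a record stream into MY_TABLE (placeholder
// name) with the writer/upload/copy concurrency options set explicitly.
func ingestWithTuning(ctx context.Context, cnxn adbc.Connection, reader array.RecordReader) error {
	stmt, err := cnxn.NewStatement()
	if err != nil {
		return err
	}
	defer stmt.Close()

	// Standard ADBC bulk-ingest options: target table and mode.
	if err := stmt.SetOption(adbc.OptionKeyIngestTargetTable, "MY_TABLE"); err != nil {
		return err
	}
	if err := stmt.SetOption(adbc.OptionKeyIngestMode, adbc.OptionValueIngestModeCreate); err != nil {
		return err
	}

	// Writer concurrency also determines the buffer size of the internal
	// record channel; upload/copy concurrency bound the Snowflake-side
	// parallelism. The numeric values here are purely illustrative.
	if err := stmt.SetOption("adbc.snowflake.statement.ingest_writer_concurrency", "4"); err != nil {
		return err
	}
	if err := stmt.SetOption("adbc.snowflake.statement.ingest_upload_concurrency", "8"); err != nil {
		return err
	}
	if err := stmt.SetOption("adbc.snowflake.statement.ingest_copy_concurrency", "4"); err != nil {
		return err
	}

	// The driver reads `reader` from a single goroutine and pushes each batch
	// onto a bounded channel, so slow writes/uploads backpressure the reads.
	if err := stmt.BindStream(ctx, reader); err != nil {
		return err
	}
	_, err = stmt.ExecuteUpdate(ctx)
	return err
}
```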
   
   > If I call adbc_ingest 1000 times with 4KB batches, is there a way to know 
how many actual parquets/copy streams were created?
   
   My personal recommendation would be to consolidate the batches into fewer streams and call `adbc_ingest` with a single consolidated stream of those batches rather than calling it 1000 times with 4KB batches; that would also let you keep fewer batches in memory at any one time. That said, you should be able to see how many actual parquet files / copy streams were created from your Snowflake monitoring, which will show you all of the copy tasks and the files uploaded to the stage if you examine the queries.
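
   As a rough sketch of that consolidation (same caveats as above: table name, mode, and import versions are placeholders), this wraps the small batches in a single `RecordReader` and runs one ingest instead of one per batch. A slice-backed reader is just the simplest illustration; to actually keep fewer batches in memory you would want a `RecordReader` implementation that produces batches lazily as they arrive.

```go
package example

import (
	"context"

	"github.com/apache/arrow-adbc/go/adbc"
	"github.com/apache/arrow/go/v17/arrow"
	"github.com/apache/arrow/go/v17/arrow/array"
)

// ingestConsolidated streams many small record batches through one statement
// execution instead of issuing one adbc_ingest call per batch.
func ingestConsolidated(ctx context.Context, cnxn adbc.Connection, schema *arrow.Schema, batches []arrow.Record) error {
	// Wrapping all of the small batches in a single RecordReader lets one
	// ingest pipeline (writers / uploads / copy tasks) handle them, instead
	// of spinning up a pipeline for every call.
	reader, err := array.NewRecordReader(schema, batches)
	if err != nil {
		return err
	}
	defer reader.Release()

	stmt, err := cnxn.NewStatement()
	if err != nil {
		return err
	}
	defer stmt.Close()

	if err := stmt.SetOption(adbc.OptionKeyIngestTargetTable, "MY_TABLE"); err != nil {
		return err
	}
	if err := stmt.SetOption(adbc.OptionKeyIngestMode, adbc.OptionValueIngestModeAppend); err != nil {
		return err
	}

	if err := stmt.BindStream(ctx, reader); err != nil {
		return err
	}
	_, err = stmt.ExecuteUpdate(ctx)
	return err
}
```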

