joellubi commented on code in PR #38385: URL: https://github.com/apache/arrow/pull/38385#discussion_r1388725396
########## format/FlightSql.proto: ########## @@ -1778,6 +1794,47 @@ message CommandPreparedStatementUpdate { bytes prepared_statement_handle = 1; } +/* + * Represents a bulk ingestion request. Used in the command member of FlightDescriptor + * for the the RPC call DoPut to cause the server load the contents of the stream's + * FlightData into the target destination. + */ +message CommandStatementIngest { + option (experimental) = true; + + // Describes the behavior for loading bulk data. + enum IngestMode { + // Ingestion behavior unspecified. + INGEST_MODE_UNSPECIFIED = 0; + // Create the target table. Fail if the target table already exists. + INGEST_MODE_CREATE = 1; + // Append to an existing target table. Fail if the target table does not exist. + INGEST_MODE_APPEND = 2; + // Drop the target table if it exists. Then follow INGEST_MODE_CREATE behavior. + INGEST_MODE_REPLACE = 3; + // Create the target table if it does not exist. Then follow INGEST_MODE_APPEND behavior. + INGEST_MODE_CREATE_APPEND = 4; + } + + // The ingestion behavior. + IngestMode mode = 1; + // The table to load data into. + string table = 2; + // The db_schema of the destination table to load data into. If unset, a backend-specific default may be used. + optional string schema = 3; + // The catalog of the destination table to load data into. If unset, a backend-specific default may be used. + optional string catalog = 4; + // Use a temporary table. + optional bool temporary = 5; Review Comment: I took some time to map out how this could work in the various ingestion scenarios and ran into some trouble with certain edge cases. After reviewing how several DB engines I'm familiar with handle these scenarios, I can see the benefit of making an opinionated choice regarding the semantics. Based on my observations, there are at least **two** qualities of temporary tables that most clients would expect: 1. They are dropped automatically by the DB at the end of the session. The scope of the session could be a connection, transaction, etc. The specific scope seems to be backend-specific, and sometimes configurable. 2. Another quality that seems common to implementations I've seen is that you treat temp tables _exactly the same_ as a "regular" table after they've been created, for the duration of the session. This means that the only time a client would specify that a table is temporary is during creation (i.e. `CREATE` vs `CREATE TEMP`). For this reason I take back my statement implying that `temporary` is unrelated to table creation. To the contrary it seems that table creation is the only place where `temporary` should have an effect. This is based on DBs I'm familiar with, so I would appreciate your perspectives on whether these expectations are broadly applicable. Putting these observations together, I would propose the following semantics when used with `temporary`: - `INGEST_MODE_CREATE`: Create a temp table. Fail if it already exists. Should create even if a non-temp table with the same name already exists. - `INGEST_MODE_APPEND`: Ignore the temporary argument. Append to the table specified. The backend can tell whether it's temporary or not. (I'm ok with returning an error here instead if there's a good reason to do so) - `INGEST_MODE_REPLACE`: Drop the table if it exists. Then follow INGEST_MODE_CREATE behavior. Fail if the target is not a temp table (might help avoid unexpected data loss). - `INGEST_MODE_CREATE_APPEND`: Create a temp table if it does not exist. Then follow INGEST_MODE_APPEND behavior. Thoughts on these conclusions? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org