zeroshade commented on PR #421:
URL: https://github.com/apache/arrow-go/pull/421#issuecomment-3005699724

   > But the schema message would go on every request, not sure if I follow the 
logic here. The idea is to avoid sending the schema on every request.
   
   My suggestion wasn't about avoiding sending the schema on every request.
Instead, the backend would apply whatever logic it currently uses to validate
the schema of the first record batch to the Schema message instead, and could
then skip validating the record batch schemas (since the schema was already
validated via the Schema message). It just shifts the validation from the
first record batch message to the Schema message; nothing else about the
logic would change. As I said, according to the spec, leaving off the Schema
message is technically an invalid IPC stream.
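
   To make that concrete, here's a rough sketch of the backend side
(`validateAndConsume` and `expectedSchema` are hypothetical names for
illustration, not anything in this PR):

    ```go
    // assumes imports: "fmt", "io", and the arrow and ipc packages
    // from github.com/apache/arrow-go/v18
    func validateAndConsume(stream io.Reader, expectedSchema *arrow.Schema) error {
            rdr, err := ipc.NewReader(stream)
            if err != nil {
                    return err
            }
            defer rdr.Release()

            // validate once against the Schema message at the head of the stream...
            if !expectedSchema.Equal(rdr.Schema()) {
                    return fmt.Errorf("schema mismatch: got %s, want %s", rdr.Schema(), expectedSchema)
            }

            // ...then consume the record batches without re-validating their schemas
            for rdr.Next() {
                    // process rdr.Record() ...
            }
            return rdr.Err()
    }
    ```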
   
   > To add another point, Pyarrow allows for reading the schema and
   > recordbatches separately in IPC format:
   > https://cloud.google.com/bigquery/docs/write-api-streaming#arrow-format
   
   You can already do the equivalent of that Python code in Go, though I guess
the issue you run into is the lack of padding handling. If we simply add a new
method to the Payload struct, we can achieve the exact same logic. This PR
could instead just be the following:
   
    ```go
    // a drawback to this is having to use bytes.Buffer to get the raw bytes
    // if you aren't already using an io.Writer.
    func (p *Payload) WritePayload(w io.Writer) (int, error) {
            return writeIPCPayload(w, *p)
    }

    // alternatively, if we just want the raw bytes, we can do
    func (p *Payload) SerializedBytes() ([]byte, error) {
            var b bytes.Buffer
            if _, err := writeIPCPayload(&b, *p); err != nil {
                    return nil, err
            }
            return b.Bytes(), nil
    }
    ```
   
   Then you can write the Go equivalent of the pyarrow example you provided,
without needing an entire new writer.
   
    ```go
    // assumes imports: "errors" plus the arrow, array, ipc, and memory
    // packages from github.com/apache/arrow-go/v18
    func appendRows(tbl arrow.Table, projectID, datasetID, tableID string) error {
            // create request etc....

            schemaPayload := ipc.GetSchemaPayload(tbl.Schema(), memory.DefaultAllocator)
            serializedSchemaBytes, err := schemaPayload.SerializedBytes()
            if err != nil {
                    return err
            }
            // do whatever you want with the byte slice for the schema

            rdr := array.NewTableReader(tbl, tbl.NumRows())
            defer rdr.Release()

            // the pyarrow example only uses the first record batch; you'd
            // probably use for rdr.Next() to loop over all the batches instead
            // (see the loop sketch below), but I'll mirror the pyarrow example
            // for now
            if !rdr.Next() {
                    return errors.New("table has no record batches")
            }
            payload, err := ipc.GetRecordBatchPayload(rdr.Record())
            if err != nil {
                    return err
            }

            serializedRecordBytes, err := payload.SerializedBytes()
            if err != nil {
                    return err
            }
            // do whatever you like with the serializedRecordBytes

            // ....
    }
    ```
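
   If you do want every batch instead of mirroring the single-batch pyarrow
example, the body of `appendRows` after creating the reader would just be the
following loop (same assumptions as above):

    ```go
            var batches [][]byte
            for rdr.Next() {
                    payload, err := ipc.GetRecordBatchPayload(rdr.Record())
                    if err != nil {
                            return err
                    }
                    serialized, err := payload.SerializedBytes()
                    if err != nil {
                            return err
                    }
                    // collect (or send) the serialized bytes for each batch
                    batches = append(batches, serialized)
            }
    ```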

