zeroshade commented on issue #37976:
URL: https://github.com/apache/arrow/issues/37976#issuecomment-1743557463
So, a couple of things first:
You can check `if fw, ok := dt.(arrow.FixedWidthDataType); ok { return
fw.Bytes() }`, which gets you the bytes per element for a fixed-width data
type without needing the full type switch you're doing.
Data types also have a `Layout()` method, which returns a `DataTypeLayout`
whose `Buffers` field is a slice of `BufferSpec` objects. If a spec's `Kind` is
`KindFixedWidth`, its `ByteWidth` member holds the byte size. Again, this lets
you get the info without the type switch or explicit per-data-type checks.
You're also not including the size of the null bitmaps in your computation,
which may be non-negligible.
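To put a rough number on the bitmap overhead, here is a minimal sketch using only the standard library. The helper names `validityBitmapBytes` and `estimateFixedWidthColumnBytes` are hypothetical, not part of the Arrow API, and the estimate assumes one validity bit per value with no allocator padding or alignment:

```go
package main

import "fmt"

// validityBitmapBytes returns the bytes needed for a validity (null) bitmap
// covering numRows values: one bit per value, rounded up to whole bytes.
// Allocator padding and alignment are ignored (assumption).
func validityBitmapBytes(numRows int64) int64 {
	return (numRows + 7) / 8
}

// estimateFixedWidthColumnBytes is a hypothetical helper that adds the data
// buffer and, if present, the validity bitmap of a fixed-width column whose
// elements occupy a whole number of bytes.
func estimateFixedWidthColumnBytes(numRows, bytesPerElem int64, hasNulls bool) int64 {
	size := numRows * bytesPerElem
	if hasNulls {
		size += validityBitmapBytes(numRows)
	}
	return size
}

func main() {
	// 1,000,000 int64 values: 8,000,000 data bytes + 125,000 bitmap bytes.
	fmt.Println(estimateFixedWidthColumnBytes(1_000_000, 8, true)) // 8125000
	// 1,000,000 int8 values: the bitmap adds 12.5% on top of the data.
	fmt.Println(estimateFixedWidthColumnBytes(1_000_000, 1, true)) // 1125000
}
```

For narrow types the bitmap is a noticeable fraction of the column, so leaving it out can push a chunk over the intended byte budget.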
Now, the reason you're getting that error is that you are creating multiple
writers on the same stream. You need only one writer to write the stream.
Instead of this:
```go
chunkSize := 4 * 1024 * 1024 // Bytes
recordChunks := sliceRecordByBytes(transformedRecord, chunkSize)
chunkSchema := recordChunks[0].Schema()
currentChunk := make([]arrow.Array, 0)
for _, rec := range recordChunks {
	for i := 0; i < int(rec.NumCols()); i++ {
		column := rec.Column(i)
		currentChunk = append(currentChunk, column)
	}
	// Create a Flight writer
	writeChunkToStream(server, chunkSchema, currentChunk)
	currentChunk = nil
}
```
You should do this:
```go
// Create a single writer for the whole stream; it sends the schema once.
rw := flight.NewRecordWriter(server,
	ipc.WithSchema(transformedRecord.Schema()))
defer rw.Close()

chunkSize := 4 * 1024 * 1024 // Bytes
recordChunks := sliceRecordByBytes(transformedRecord, chunkSize)
// Release every slice once the stream has been written.
defer func() {
	for _, chunk := range recordChunks {
		chunk.Release()
	}
}()

// Write each slice through the same writer.
for _, slice := range recordChunks {
	if err := rw.Write(slice); err != nil {
		return err
	}
}
```
Every time you create a writer, the first thing it does is send a Schema
message, so you don't want multiple writers. You just want to write the slices
to the one writer separately. If you wanted, you could even combine these steps:
instead of creating *all* the slices and then sending them one by one, you
could find where you're going to slice, write that slice, call `Release` on
it, and then find the next slice. Rinse and repeat. That way you don't need
a slice of records, and you have fewer allocations.
Just an idea
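The slice-write-release loop could be sketched roughly as follows. This is a standard-library-only illustration: `forEachChunk` is a hypothetical helper, and it assumes you have already worked out how many rows fit in one chunk. With a real record and writer, the callback would take the slice via `rec.NewSlice(start, end)`, write it with `rw.Write`, and then call `Release` on it, as noted in the comments:

```go
package main

import "fmt"

// forEachChunk walks [0, numRows) in steps of rowsPerChunk and calls fn with
// each half-open row range [start, end). It stops at the first error.
func forEachChunk(numRows, rowsPerChunk int64, fn func(start, end int64) error) error {
	if rowsPerChunk <= 0 {
		return fmt.Errorf("rowsPerChunk must be positive, got %d", rowsPerChunk)
	}
	for start := int64(0); start < numRows; start += rowsPerChunk {
		end := start + rowsPerChunk
		if end > numRows {
			end = numRows // last chunk may be shorter
		}
		if err := fn(start, end); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	// With a real record and writer, the callback body would be roughly:
	//   slice := rec.NewSlice(start, end)
	//   defer slice.Release()
	//   return rw.Write(slice)
	err := forEachChunk(10, 4, func(start, end int64) error {
		fmt.Printf("write rows [%d, %d)\n", start, end)
		return nil
	})
	if err != nil {
		fmt.Println("error:", err)
	}
}
```

Only one slice is alive at a time, so there is no `[]arrow.Record` to build up or to release afterwards.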