joellubi commented on code in PR #2106:
URL: https://github.com/apache/arrow-adbc/pull/2106#discussion_r1735191482
##########
go/adbc/driver/snowflake/bulk_ingestion.go:
##########
@@ -595,3 +678,39 @@ func (bp *bufferPool) PutBuffer(buf *bytes.Buffer) {
buf.Reset()
bp.Pool.Put(buf)
}
+
+type fileSet struct {
+ mu sync.RWMutex
+ data map[string]struct{}
+}
Review Comment:
I actually started with `sync.Map` but changed to `map` + `sync.RMutex`
because I thought it was causing a bug. Turned out to be something else and
`sync.Map` works just fine.
I just pushed a commit. I left it wrapped in the `fileSet` struct because
using a naked `sync.Map` and having to remember to use the basename everywhere
+ writing the `Len()` iteration inline feels a little awkward IMHO. Let me know
what you think.
##########
go/adbc/driver/snowflake/bulk_ingestion.go:
##########
@@ -334,21 +347,32 @@ func writeParquet(
parquetProps *parquet.WriterProperties,
arrowProps pqarrow.ArrowWriterProperties,
) error {
+
+ // don't start a new parquet file unless there is at least one record
available
+ rec, ok := <-in
+ if !ok {
+ return ErrNoRecordsInStream
+ }
+
+ // initialize writer
pqWriter, err := pqarrow.NewFileWriter(schema, w, parquetProps,
arrowProps)
if err != nil {
return err
}
defer pqWriter.Close()
- var bytesWritten int64
- for rec := range in {
- if rec.NumRows() == 0 {
- rec.Release()
- continue
- }
+ // write first record
+ bytesWritten, err := writeRecordToParquet(pqWriter, rec)
+ if err != nil {
+ return err
+ }
+ if targetSize > 0 && bytesWritten >= int64(targetSize) {
+ return nil
+ }
Review Comment:
I revisited this and cleaned it up a bit. Between the refactor and added
comments it should be more clear what's going on here.
Basically I wanted to return early and avoid any of the parquet file setup
if there were no records left in the channel. Initially I grabbed the first
record in order to check, but then I needed to write that to the file before I
could continue to range over the remaining records.
Same idea after the refactor, just less duplication and explicitly
initializing the parquet writer on the first loop iteration.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]