zeroshade commented on issue #506: URL: https://github.com/apache/arrow-go/issues/506#issuecomment-3299556479
So a couple things: Looking at the memory you posted > 3141.12MB 20.20% 20.20% 3141.12MB 20.20% github.com/apache/arrow-go/v18/parquet/internal/gen-go/parquet.NewColumnMetaData (inline) 2053.06MB 13.20% 33.40% 2053.06MB 13.20% github.com/apache/arrow-go/v18/parquet/metadata.(*ColumnChunkMetaDataBuilder).Finish 2022.75MB 13.01% 46.41% 2022.75MB 13.01% github.com/apache/arrow-go/v18/parquet/internal/gen-go/parquet.NewStatistics (inline) 1683.18MB 10.82% 57.23% 6348.41MB 40.82% github.com/apache/arrow-go/v18/parquet/file.(*columnWriter).Close 1630.68MB 10.49% 67.71% 5779.84MB 37.16% github.com/apache/arrow-go/v18/parquet/metadata.(*RowGroupMetaDataBuilder).NextColumnChunk 1389.17MB 8.93% 76.65% 14417.33MB 92.70% github.com/apache/arrow-go/v18/parquet/file.(*rowGroupWriter).NextColumn Nearly all of this memory is coming from the Column metadata, with 909 columns that makes sense. In addition, when you call `Write(record)` it is going to create a new row group every time unless you use `WriteBuffered`, and every time it creates a row group it's going to build and construct a bunch of metadata for all 909 columns. Because Parquet metadata goes at the end of the file it's going to keep it in memory until you finish and write it out by closing the file. The RowGroupWriter has `TotalCompressedBytes` and `TotalBytesWritten` methods which you could use to track when to break up row groups potentially. > I tried every 500000 records,close the filewirter,and recreate a new filewrite,the memory will release significantly.In this way,will create many small files,and the important,I am not sure when the memory is full,may be 300000 or 1000000.Ideally, The memory is released,when the batch writing finished If every record is ~1.7KB, then 500000 is ~830MB before compression and encoding. It also might make sense to disable dictionary encoding depending on the cardinality of your columns (use `WithDictionaryDefault(false)` when building the properties). Depending on your workflows, that might be a perfectly reasonable size for your parquet files, not a "small" file. The memory for the records will be released after the batches are written and flushed to the file (once the GC reclaims it), but you're still going to build up a large amount of metadata when you're writing that many row groups and columns. Can I see the `forceRowGroupFlush` function? Also what batch size are you using? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
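Here is a rough sketch of the buffered approach I mean. It is untested and assumes you are going through `pqarrow.FileWriter`; the `writeAll` name, the `recs` channel, and the 128MB threshold are just placeholders, and the `RowGroupTotal*` wrappers are the pqarrow-level counterparts of the `RowGroupWriter` methods mentioned above (if you are on the low-level `file` writer, call them on the `RowGroupWriter` directly):

```go
package example

import (
	"io"

	"github.com/apache/arrow-go/v18/arrow"
	"github.com/apache/arrow-go/v18/parquet"
	"github.com/apache/arrow-go/v18/parquet/pqarrow"
)

// writeAll buffers incoming records into the current row group and only
// starts a new row group once it has grown past a size threshold, instead
// of producing one row group (and one set of 909-column metadata) per
// Write call. Untested sketch.
func writeAll(schema *arrow.Schema, recs <-chan arrow.Record, out io.Writer, props *parquet.WriterProperties) error {
	w, err := pqarrow.NewFileWriter(schema, out, props, pqarrow.DefaultWriterProps())
	if err != nil {
		return err
	}

	// Arbitrary target: roughly 128MB per row group.
	const rowGroupTarget = int64(128 << 20)

	for rec := range recs {
		// WriteBuffered appends to the current (buffered) row group rather
		// than starting a new row group for every record batch.
		if err := w.WriteBuffered(rec); err != nil {
			rec.Release()
			w.Close()
			return err
		}
		rec.Release() // let the GC reclaim the record's buffers once written

		// Rough size estimate for the current row group: bytes already
		// written plus compressed bytes still buffered. Which counter is the
		// better signal can depend on buffered vs. unbuffered mode, so verify
		// against your profile.
		if w.RowGroupTotalBytesWritten()+w.RowGroupTotalCompressedBytes() >= rowGroupTarget {
			w.NewBufferedRowGroup()
		}
	}

	// Close flushes the final row group and writes the footer metadata, which
	// is when the accumulated per-row-group metadata finally leaves memory.
	return w.Close()
}
```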

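And a minimal sketch of writer properties with dictionary encoding disabled by default. The column path `"some_low_cardinality_col"` is just a placeholder, and if I remember correctly `WithDictionaryFor` lets you turn dictionaries back on for individual columns where the cardinality is low enough to benefit:

```go
package example

import "github.com/apache/arrow-go/v18/parquet"

// newWriterProps is a hypothetical helper showing the knobs mentioned above:
// dictionary encoding off by default, selectively re-enabled for a
// (placeholder) low-cardinality column.
func newWriterProps() *parquet.WriterProperties {
	return parquet.NewWriterProperties(
		parquet.WithDictionaryDefault(false),
		// "some_low_cardinality_col" is a placeholder column path.
		parquet.WithDictionaryFor("some_low_cardinality_col", true),
	)
}
```

You would then pass those props into whatever constructs the `pqarrow.FileWriter` (e.g. the `writeAll` sketch above).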