zeroshade commented on issue #506:
URL: https://github.com/apache/arrow-go/issues/506#issuecomment-3299556479

So, a couple of things:
   
Looking at the memory profile you posted:
   
```
     flat  flat%   sum%        cum   cum%
 3141.12MB 20.20% 20.20%  3141.12MB 20.20%  github.com/apache/arrow-go/v18/parquet/internal/gen-go/parquet.NewColumnMetaData (inline)
 2053.06MB 13.20% 33.40%  2053.06MB 13.20%  github.com/apache/arrow-go/v18/parquet/metadata.(*ColumnChunkMetaDataBuilder).Finish
 2022.75MB 13.01% 46.41%  2022.75MB 13.01%  github.com/apache/arrow-go/v18/parquet/internal/gen-go/parquet.NewStatistics (inline)
 1683.18MB 10.82% 57.23%  6348.41MB 40.82%  github.com/apache/arrow-go/v18/parquet/file.(*columnWriter).Close
 1630.68MB 10.49% 67.71%  5779.84MB 37.16%  github.com/apache/arrow-go/v18/parquet/metadata.(*RowGroupMetaDataBuilder).NextColumnChunk
 1389.17MB  8.93% 76.65% 14417.33MB 92.70%  github.com/apache/arrow-go/v18/parquet/file.(*rowGroupWriter).NextColumn
```
   
Nearly all of this memory is coming from the column metadata, which makes sense with 909 columns. In addition, `Write(record)` creates a new row group every time it is called unless you use `WriteBuffered`, and every new row group builds its own metadata for all 909 columns. Because Parquet metadata goes at the end of the file, it all has to be held in memory until you finish and write it out by closing the file. The RowGroupWriter has `TotalCompressedBytes` and `TotalBytesWritten` methods which you could use to decide when to break up row groups; see the sketch below.
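
Here's a minimal sketch of that approach, assuming you're writing arrow records through `pqarrow.FileWriter` (the `writeRecords` helper name and the 128MB threshold in the usage note are just illustrative):

```go
package main

import (
	"os"

	"github.com/apache/arrow-go/v18/arrow"
	"github.com/apache/arrow-go/v18/parquet"
	"github.com/apache/arrow-go/v18/parquet/pqarrow"
)

// writeRecords streams records into a single parquet file. WriteBuffered keeps
// appending to the current row group instead of creating a new row group (and
// 909 columns worth of metadata) per record, and we only start a new row group
// once roughly targetBytes of compressed data has accumulated.
func writeRecords(schema *arrow.Schema, recs []arrow.Record, path string, targetBytes int64) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()

	w, err := pqarrow.NewFileWriter(schema, f,
		parquet.NewWriterProperties(), pqarrow.DefaultWriterProps())
	if err != nil {
		return err
	}

	for _, rec := range recs {
		if err := w.WriteBuffered(rec); err != nil {
			return err
		}
		// Check how much compressed data is buffered in the current row
		// group and break it once it crosses the threshold.
		if w.RowGroupTotalCompressedBytes() >= targetBytes {
			w.NewBufferedRowGroup()
		}
	}
	// Closing the writer flushes the final row group and writes the footer
	// metadata; this is when the accumulated metadata finally leaves memory.
	return w.Close()
}
```

Calling something like `writeRecords(schema, recs, "out.parquet", 128<<20)` would target roughly 128MB row groups while still producing a single file.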
   
> I tried closing the file writer every 500000 records and recreating a new one, and the memory is released significantly. But this creates many small files, and more importantly I'm not sure when memory will run out; it might be at 300000 or 1000000 records. Ideally, the memory would be released when the batch writing finishes.
   
If every record is ~1.7KB, then 500000 records is ~830MB before encoding and compression. Depending on your workflows, that might be a perfectly reasonable size for your parquet files, not a "small" file. It might also make sense to disable dictionary encoding depending on the cardinality of your columns (use `WithDictionaryDefault(false)` when building the writer properties); a sketch follows below. The memory for the records themselves is released after the batches are written and flushed to the file (once the GC reclaims it), but you're still going to build up a large amount of metadata when you're writing that many row groups and columns.
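
Something like this for the properties (the `low_card_col` column name is just a placeholder for a column where dictionary encoding still pays off):

```go
package main

import (
	"github.com/apache/arrow-go/v18/parquet"
)

// newWriterProps disables dictionary encoding by default and re-enables it
// only for a hypothetical low-cardinality column.
func newWriterProps() *parquet.WriterProperties {
	return parquet.NewWriterProperties(
		parquet.WithDictionaryDefault(false),
		parquet.WithDictionaryFor("low_card_col", true),
	)
}
```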
   
Can I see the `forceRowGroupFlush` function? Also, what batch size are you using?

