[ https://issues.apache.org/jira/browse/ARROW-10417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Antoine Pitrou updated ARROW-10417:
-----------------------------------
    Fix Version/s:     (was: 4.0.0)
                   5.0.0

> [Python][C++] Possible Memory Leak in RecordBatchStreamWriter and RecordBatchFileWriter
> ---------------------------------------------------------------------------------------
>
>                 Key: ARROW-10417
>                 URL: https://issues.apache.org/jira/browse/ARROW-10417
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 0.15.1, 2.0.0
>         Environment: This is the config for my worker node:
>             resources:
>               - cpus: 1
>               - maxMemoryMb: 4096
>               - reservedMemoryMb: 2048
>            Reporter: Shouheng Yi
>            Priority: Major
>             Fix For: 5.0.0
>
>         Attachments: Screen Shot 2020-10-28 at 9.43.32 PM.png, Screen Shot 2020-10-28 at 9.43.40 PM.png, Screen Shot 2020-10-29 at 9.22.58 AM.png, arrow-10417-memtest-1.png
>
>
> There might be a memory leak in {{RecordBatchStreamWriter}}: memory resources are not released. The process always hits the memory limit and then starts swapping to virtual memory. See the picture below:
> !Screen Shot 2020-10-28 at 9.43.32 PM.png!
> This was the code:
> {code:python}
> import tempfile
> import os
> import sys
>
> import pyarrow as pa
>
> B = 1
> KB = 1024 * B
> MB = 1024 * KB
>
> schema = pa.schema(
>     [
>         pa.field("a_string", pa.string()),
>         pa.field("an_int", pa.int32()),
>         pa.field("a_float", pa.float32()),
>         pa.field("a_list_of_floats", pa.list_(pa.float32())),
>     ]
> )
>
> nrows_in_a_batch = 1000
> nbatches_in_a_table = 1000
> column_arrays = [
>     ["string"] * nrows_in_a_batch,
>     [123] * nrows_in_a_batch,
>     [456.789] * nrows_in_a_batch,
>     [range(1000)] * nrows_in_a_batch,
> ]
>
>
> def main(sys_args) -> None:
>     batch = pa.RecordBatch.from_arrays(column_arrays, schema=schema)
>     table = pa.Table.from_batches([batch] * nbatches_in_a_table, schema=schema)
>     with tempfile.TemporaryDirectory() as tmpdir:
>         filename_template = "file-{n}.arrow"
>         i = 0
>         while True:
>             path = os.path.join(tmpdir, filename_template.format(n=i))
>             i += 1
>             with pa.OSFile(path, "w") as sink:
>                 with pa.RecordBatchStreamWriter(sink, schema) as writer:
>                     writer.write_table(table)
>             print(f"pa.total_allocated_bytes(): {pa.total_allocated_bytes() / MB} mb")
>
>
> if __name__ == "__main__":
>     main(sys.argv[1:])
> {code}
> Strangely enough, printing {{total_allocated_bytes}} looked normal:
> {code:python}
> pa.total_allocated_bytes(): 3.95556640625 mb
> pa.total_allocated_bytes(): 3.95556640625 mb
> pa.total_allocated_bytes(): 3.95556640625 mb
> pa.total_allocated_bytes(): 3.95556640625 mb
> pa.total_allocated_bytes(): 3.95556640625 mb
> {code}
> Am I using {{RecordBatchStreamWriter}} incorrectly? If not, how can I release the resources?
> [Updates 10/29/2020]
> I tested on {{pyarrow==2.0.0}} and still see the same issue.
> !Screen Shot 2020-10-29 at 9.22.58 AM.png!
> I changed {{RecordBatchStreamWriter}} to {{RecordBatchFileWriter}} in my code, i.e.:
> {code:python}
> ...
>             with pa.OSFile(path, "w") as sink:
>                 with pa.RecordBatchFileWriter(sink, schema) as writer:
>                     writer.write_table(table)
>             print(f"pa.total_allocated_bytes(): {pa.total_allocated_bytes() / MB} mb")
> ...
> {code}
> I observed the same memory profile. I'm wondering if it is caused by [WriteRecordBatch|https://github.com/apache/arrow/blob/maint-0.15.x/cpp/src/arrow/ipc/writer.cc#L594] not being able to release memory.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)