timothydijamco commented on PR #45340: URL: https://github.com/apache/arrow/pull/45340#issuecomment-2638179566
Thanks for this PR; it seems like a good idea to try out.

In response to https://github.com/apache/arrow/issues/45287#issuecomment-2633425981: I didn't observe any difference between a version of Arrow with only the metadata-clearing patch (#45330) and a version of Arrow with the metadata-clearing patch plus the patch in this PR.

### Synthetic data

500 files, each with 1 row and 10,000 columns with 200-character-long column names

#### Peak memory

Performing a "scan" or loading the table into memory

| | Only metadata-clearing (#45330) | With metadata-clearing (#45330) and schema deduplication (#45340) |
|---|---|---|
| One "scan" (pull batches from `scanner->RecordBatchReader()` until exhausted) | 1.49GB | 1.48GB |
| One `scanner->ToTable()` | 1.57GB | 1.55GB |

#### Memory profiles

Performing two "scans"

| Only metadata-clearing (#45330) | With metadata-clearing (#45330) and schema deduplication (#45340) |
|---|---|
| <img width="999" alt="Screen Shot 2025-02-05 at 5 17 31 PM" src="https://github.com/user-attachments/assets/a7e3c47a-9aba-4efd-aee7-b106a6ba4cac" /> | <img width="1001" alt="Screen Shot 2025-02-05 at 5 17 48 PM" src="https://github.com/user-attachments/assets/8c21d257-a25a-4e09-91e7-9ecda3f4dbb4" /> |

### Real data

I also ran both the "scan" and "load table" use cases on a variety of real datasets we have internally (data size ranges from <1GB to 40GB, column count from hundreds to thousands, file count from 1 to hundreds) and did not observe any memory usage difference between the two Arrow versions there either.
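
For a sense of scale, the synthetic dataset described above (500 files, 10,000 columns, 200-character column names) can be sized with some quick arithmetic. This is only a rough sketch, not a model of where Arrow actually keeps these strings: it assumes 1 byte per column-name character, counts nothing but the names themselves, and the helper `column_name_bytes` is a hypothetical function written for illustration.

```python
# Back-of-the-envelope sizing of the synthetic dataset's column names.
# Assumptions (not from Arrow itself): 1 byte per character, and in the
# non-deduplicated case every file's schema holds its own copy of the names.
NUM_FILES = 500
NUM_COLUMNS = 10_000
NAME_LENGTH = 200


def column_name_bytes(num_files, num_columns, name_length, deduplicated=False):
    """Bytes spent on column-name strings alone (hypothetical helper)."""
    per_schema = num_columns * name_length
    copies = 1 if deduplicated else num_files
    return per_schema * copies


per_file = column_name_bytes(1, NUM_COLUMNS, NAME_LENGTH)
all_files = column_name_bytes(NUM_FILES, NUM_COLUMNS, NAME_LENGTH)
shared = column_name_bytes(NUM_FILES, NUM_COLUMNS, NAME_LENGTH, deduplicated=True)

print(f"one schema:            {per_file / 1e6:.0f} MB")
print(f"one copy per file:     {all_files / 1e9:.2f} GB")
print(f"single shared schema:  {shared / 1e6:.0f} MB")
```

Under these assumptions the names come to about 2 MB per schema and about 1 GB if all 500 copies are alive at once. The numbers only give context for the dataset; they do not say which component holds the memory in the profiles above.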
