Re: [PR] Optimize reads of record batches by pushing limit to file level [iceberg-python]

2024-09-01 Thread via GitHub
soumya-ghosh closed pull request #1057: Optimize reads of record batches by pushing limit to file level. URL: https://github.com/apache/iceberg-python/pull/1057 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

Re: [PR] Optimize reads of record batches by pushing limit to file level [iceberg-python]

2024-09-01 Thread via GitHub
soumya-ghosh commented on PR #1057: URL: https://github.com/apache/iceberg-python/pull/1057#issuecomment-2323414028 Yep, closed.

Re: [PR] Optimize reads of record batches by pushing limit to file level [iceberg-python]

2024-09-01 Thread via GitHub
kevinjqliu commented on PR #1057: URL: https://github.com/apache/iceberg-python/pull/1057#issuecomment-2323398340 @soumya-ghosh since #1043 is merged, can we close this PR?

Re: [PR] Optimize reads of record batches by pushing limit to file level [iceberg-python]

2024-08-14 Thread via GitHub
kevinjqliu commented on PR #1057: URL: https://github.com/apache/iceberg-python/pull/1057#issuecomment-2290148441 Thanks @soumya-ghosh. I think #1043 is addressing this same issue. Can we use this PR to standardize a test suite for the read path to ensure the optimization is applied?
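
[Editor's note: a test suite for this read-path optimization could verify, independently of file format, that no more batches are pulled from the source than the limit requires. The sketch below is a minimal self-contained illustration, not the actual pyiceberg test suite; `CountingBatches` and `take_limited` are hypothetical stand-ins for the real scan plumbing.]

```python
class CountingBatches:
    """Wraps an iterable of batches and counts how many are actually pulled."""

    def __init__(self, batches):
        self._batches = batches
        self.pulled = 0

    def __iter__(self):
        for batch in self._batches:
            self.pulled += 1
            yield batch


def take_limited(batches, limit):
    """Collect rows until `limit` is reached, then stop pulling batches."""
    rows = []
    for batch in batches:
        rows.extend(batch[: limit - len(rows)])
        if len(rows) >= limit:
            break
    return rows


# With five 10-row batches and limit=15, only 2 of the 5 batches
# should ever be pulled from the source.
source = CountingBatches([[0] * 10 for _ in range(5)])
assert len(take_limited(source, 15)) == 15
assert source.pulled == 2
```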

Re: [PR] Optimize reads of record batches by pushing limit to file level [iceberg-python]

2024-08-14 Thread via GitHub
soumya-ghosh commented on PR #1057: URL: https://github.com/apache/iceberg-python/pull/1057#issuecomment-2290024214 @sungwy @kevinjqliu I misunderstood the functioning of `_task_to_record_batches` and thus ended up making unnecessary changes; thanks for the review comments. After running some

Re: [PR] Optimize reads of record batches by pushing limit to file level [iceberg-python]

2024-08-14 Thread via GitHub
soumya-ghosh commented on code in PR #1057: URL: https://github.com/apache/iceberg-python/pull/1057#discussion_r1717613511

Review context in pyiceberg/io/pyarrow.py (@@ -1366,6 +1373,7 @@, inside `project_table`):
    case_sensitive,
    table_metadata.name_mapping(),

Re: [PR] Optimize reads of record batches by pushing limit to file level [iceberg-python]

2024-08-13 Thread via GitHub
kevinjqliu commented on code in PR #1057: URL: https://github.com/apache/iceberg-python/pull/1057#discussion_r1716133239

Review context in pyiceberg/io/pyarrow.py (@@ -1194,6 +1194,7 @@, inside `_task_to_record_batches`):
    case_sensitive: bool,
    name_mapping: Optional[NameMapping] = None,

Re: [PR] Optimize reads of record batches by pushing limit to file level [iceberg-python]

2024-08-13 Thread via GitHub
sungwy commented on PR #1057: URL: https://github.com/apache/iceberg-python/pull/1057#issuecomment-2287395213 Hi @soumya-ghosh - thank you for picking this issue up! I'm working on refactoring this part of the code base, and I have a different, but similar approach for pushing the limit down

Re: [PR] Optimize reads of record batches by pushing limit to file level [iceberg-python]

2024-08-13 Thread via GitHub
soumya-ghosh commented on PR #1057: URL: https://github.com/apache/iceberg-python/pull/1057#issuecomment-2287206429 @kevinjqliu any thoughts on this implementation? Is this what you had in mind? I have tested on a file of approx. 50 MB and verified that fewer batches are scanned with this approach.
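
[Editor's note: the optimization under discussion — stop reading record batches once the requested row limit is satisfied — can be sketched as follows. This is a simplified illustration under stated assumptions, not the pyiceberg implementation; `limited_batches` is a hypothetical helper operating on plain lists in place of Arrow record batches.]

```python
from typing import Iterable, Iterator, List


def limited_batches(batches: Iterable[List[int]], limit: int) -> Iterator[List[int]]:
    """Yield batches until `limit` rows have been produced, truncating the
    final batch; batches past the limit are never pulled from the source."""
    remaining = limit
    for batch in batches:
        if remaining <= 0:
            break
        if len(batch) > remaining:
            # The limit falls inside this batch: slice and stop.
            yield batch[:remaining]
            remaining = 0
        else:
            yield batch
            remaining -= len(batch)
```

For example, with four 10-row batches and a limit of 25, the third batch is truncated to 5 rows and the fourth batch is never materialized, which is what "fewer batches are scanned" refers to above.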