adriangb commented on PR #15700:
URL: https://github.com/apache/datafusion/pull/15700#issuecomment-3156469785

   I'm trying this out for our compaction system and can't get my sort to work
   without hitting memory limits. Note that I'm using `datafusion-cli` and am
   not sure whether it has a disk manager, etc. configured. If I can't get it
   working there, it's probably not obvious how to configure `datafusion-cli`,
   so I think it's a fair question:
   
   In `q.sql`:
   
   ```sql
   -- About 6.32 GB of parquet compressed (~ 10 x compression ratio)
   -- Split into ~60 files of ~100 MB each
   CREATE EXTERNAL TABLE t1
   STORED AS PARQUET
   LOCATION '/Users/adriangb/Downloads/data/day=2025-08-05/';
   
   SET datafusion.execution.sort_spill_reservation_bytes = 0;
   
   COPY (
       SELECT *
       FROM t1
       ORDER BY deployment_environment, kind, service_name, trace_id
   )
   TO '/Users/adriangb/Downloads/out.parquet';
   ```
   
   ```shell
   ❯ ./target/release/datafusion-cli --mem-pool-type 'fair' --memory-limit '1g' -f q.sql
   DataFusion CLI v49.0.0
   0 row(s) fetched. 
   Elapsed 0.244 seconds.
   
   0 row(s) fetched. 
   Elapsed 0.000 seconds.
   
   +---------------+-------------------------------+
   | plan_type     | plan                          |
   +---------------+-------------------------------+
   | physical_plan | ┌───────────────────────────┐ |
   |               | │        DataSinkExec       │ |
   |               | └─────────────┬─────────────┘ |
   |               | ┌─────────────┴─────────────┐ |
   |               | │  SortPreservingMergeExec  │ |
   |               | │    --------------------   │ |
   |               | │ deployment_environment ASC│ |
   |               | │    NULLS LAST, kind ASC   │ |
   |               | │         NULLS LAST,       │ |
   |               | │        service_name       │ |
   |               | │       ASC NULLS LAST,     │ |
   |               | │     trace_id ASC NULLS    │ |
   |               | │            LAST           │ |
   |               | └─────────────┬─────────────┘ |
   |               | ┌─────────────┴─────────────┐ |
   |               | │          SortExec         │ |
   |               | │    --------------------   │ |
   |               | │ deployment_environment@35 │ |
   |               | │   ASC NULLS LAST, kind@6  │ |
   |               | │       ASC NULLS LAST,     │ |
   |               | │       service_name@27     │ |
   |               | │       ASC NULLS LAST,     │ |
   |               | │       trace_id@4 ASC      │ |
   |               | │         NULLS LAST        │ |
   |               | └─────────────┬─────────────┘ |
   |               | ┌─────────────┴─────────────┐ |
   |               | │       DataSourceExec      │ |
   |               | │    --------------------   │ |
   |               | │         files: 68         │ |
   |               | │      format: parquet      │ |
   |               | └───────────────────────────┘ |
   |               |                               |
   +---------------+-------------------------------+
   1 row(s) fetched. 
   Elapsed 0.254 seconds.
   
   Not enough memory to continue external sort. Consider increasing the memory limit, or decreasing sort_spill_reservation_bytes
   caused by
   Resources exhausted: Additional allocation failed with top memory consumers (across reservations) as:
     ExternalSorter[10]#25(can spill: true) consumed 78.2 MB,
     ExternalSorter[11]#27(can spill: true) consumed 77.2 MB,
     ExternalSorter[7]#19(can spill: true) consumed 75.7 MB.
   Error: Failed to allocate additional 90.1 MB for ExternalSorter[6] with 0.0 B already allocated for this reservation - 82.2 MB remain available for the total pool
   ```
   
   I can maybe share the data under some sort of NDA, but honestly it's not
   that interesting: it's just a lot of random data.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

