Re: [I] Reduce number of tokio blocking threads in SortExec spill [datafusion]

2025-04-12 Thread via GitHub


alamb closed issue #15323: Reduce number of tokio blocking threads in SortExec 
spill
URL: https://github.com/apache/datafusion/issues/15323


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Reduce number of tokio blocking threads in SortExec spill [datafusion]

2025-04-06 Thread via GitHub


rluvaton commented on issue #15323:
URL: https://github.com/apache/datafusion/issues/15323#issuecomment-2781673294

   I created a draft PR with a solution and would appreciate your opinion:
   - #15608





Re: [I] Reduce number of tokio blocking threads in SortExec spill [datafusion]

2025-04-06 Thread via GitHub


alamb commented on issue #15323:
URL: https://github.com/apache/datafusion/issues/15323#issuecomment-2781577131

   > Even if you use a global tokio runtime and set the number of blocking
   > threads to 1000, for example, there can be 1001 spill files. The problem
   > is the same.
   
   At some point the system is going to be I/O bound, so having more blocking
   threads doing I/O isn't going to help throughput and will likely consume
   non-trivial time context switching between them.
   
   I think a better solution is to manage more carefully how many files are
   being spilled or read at any time. This will be more complicated (we'll
   likely have to do multiple merge phases, etc.), but I think it is a better
   approach in the long run.
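
A minimal sketch of the "limit how many files are read at any time" idea, using only the Rust standard library rather than tokio or DataFusion's real spill code. The function name `process_with_workers`, the queue of numeric file ids, and the "processing" step (which only counts files) are all hypothetical stand-ins. The point is just that the number of threads is capped by a setting, not by the number of spill files.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};
use std::thread;

/// Processes `num_files` hypothetical spill files using at most `max_workers`
/// threads. Returns `(threads_spawned, files_processed)`.
///
/// Assumption: "processing" a file is a placeholder here; a real reader would
/// open the file and decode record batches instead of just counting.
pub fn process_with_workers(num_files: usize, max_workers: usize) -> (usize, usize) {
    // Shared work queue of file ids; workers pull from it until it is empty.
    let queue: Arc<Mutex<VecDeque<usize>>> =
        Arc::new(Mutex::new((0..num_files).collect()));

    // Never spawn more threads than there are files to read.
    let workers = max_workers.min(num_files);

    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let queue = Arc::clone(&queue);
            thread::spawn(move || {
                let mut processed = 0usize;
                loop {
                    // The lock guard is dropped at the end of this statement,
                    // so the queue is not held while "reading" the file.
                    let next = queue.lock().unwrap().pop_front();
                    match next {
                        Some(_file_id) => processed += 1, // placeholder for real I/O
                        None => break,
                    }
                }
                processed
            })
        })
        .collect();

    let threads_spawned = handles.len();
    let files_processed: usize = handles.into_iter().map(|h| h.join().unwrap()).sum();
    (threads_spawned, files_processed)
}
```

With 1001 files and a cap of 10, this spawns 10 threads and still processes all 1001 files. It does not by itself solve the merge problem, since a sort-preserving merge needs all of its inputs open at once, which is why multiple merge phases come into the picture.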





Re: [I] Reduce number of tokio blocking threads in SortExec spill [datafusion]

2025-04-06 Thread via GitHub


rluvaton commented on issue #15323:
URL: https://github.com/apache/datafusion/issues/15323#issuecomment-2781460121

   > Comet currently creates a new tokio runtime per plan but there is a
   > proposal to move to a global tokio runtime (per executor) instead.
   >
   > [apache/datafusion-comet#1590](https://github.com/apache/datafusion-comet/issues/1590)
   
   Even if you use a global tokio runtime and set the number of blocking
   threads to 1000, for example, there can be 1001 spill files. The problem is
   the same.





Re: [I] Reduce number of tokio blocking threads in SortExec spill [datafusion]

2025-04-06 Thread via GitHub


andygrove commented on issue #15323:
URL: https://github.com/apache/datafusion/issues/15323#issuecomment-2781458966

   > I have a working version locally and will create a PR soon. Just one
   > problem: I don't think we can know the number of blocking threads tokio
   > is configured with.
   >
   > This is important because, for example, Comet sets this to 10 by default,
   > and the tokio default is 512, IIRC.
   >
   > The working version can be improved with optimizations such as
   > prefetching, but it will be good enough for now and we can iterate further.
   
   Comet currently creates a new tokio runtime per plan, but there is a
   proposal to move to a global tokio runtime (per executor) instead.
   
   https://github.com/apache/datafusion-comet/issues/1590





Re: [I] Reduce number of tokio blocking threads in SortExec spill [datafusion]

2025-04-06 Thread via GitHub


rluvaton commented on issue #15323:
URL: https://github.com/apache/datafusion/issues/15323#issuecomment-2781454412

   I have a working version locally and will create a PR soon. Just one
   problem: I don't think I can know the number of blocking threads tokio is
   configured with.
   
   This is important because, for example, Comet sets this to 10 by default,
   and the tokio default is 512, IIRC.
   
   The working version can be improved with optimizations such as prefetching,
   but it will be good enough for now and we can iterate further.





Re: [I] Reduce number of tokio blocking threads in SortExec spill [datafusion]

2025-04-03 Thread via GitHub


alamb commented on issue #15323:
URL: https://github.com/apache/datafusion/issues/15323#issuecomment-2776820608

   > I think I have the same problem but in `AggregateExec` when using
   > `row_hash`, as it spills as well and uses `SortPreservingMergeStream`.
   >
   > I think the solution should actually be in `SortPreservingMergeStream`
   > rather than `SpillFileManager`, no? Although it does not spawn blocking
   > threads, it should support multiple merge levels.
   
   I am not familiar enough with the code to know off the top of my head.
   
   I do think having hash and sort use the same code path (that we can then go
   optimize a lot) sounds like a great idea.





Re: [I] Reduce number of tokio blocking threads in SortExec spill [datafusion]

2025-04-03 Thread via GitHub


rluvaton commented on issue #15323:
URL: https://github.com/apache/datafusion/issues/15323#issuecomment-2776605858

   I think I have the same problem but in `AggregateExec` when using
   `row_hash`, as it spills as well and uses `SortPreservingMergeStream`.
   
   I think the solution should actually be in `SortPreservingMergeStream`
   rather than `SpillFileManager`, no? Although it does not spawn blocking
   threads, it should support multiple merge levels.
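
To make "multiple merge levels" concrete, here is a rough sketch over in-memory sorted runs, assuming plain `Vec<i64>` runs in place of spill files and record batches. It is not how `SortPreservingMergeStream` is implemented; `merge_runs` and `multi_level_merge` are hypothetical names, and the k-way merge uses a `BinaryHeap` from the standard library. No level ever has more than `fan_in` inputs open in a single merge.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Merges already-sorted runs into one sorted run with a min-heap (k-way merge).
/// Heap entries are (value, run index, position within that run).
fn merge_runs(runs: &[Vec<i64>]) -> Vec<i64> {
    let mut heap = BinaryHeap::new();
    for (run_idx, run) in runs.iter().enumerate() {
        if let Some(&first) = run.first() {
            heap.push(Reverse((first, run_idx, 0usize)));
        }
    }

    let mut out = Vec::with_capacity(runs.iter().map(Vec::len).sum());
    while let Some(Reverse((value, run_idx, pos))) = heap.pop() {
        out.push(value);
        // Refill the heap from the run we just consumed from, if it has more.
        if let Some(&next) = runs[run_idx].get(pos + 1) {
            heap.push(Reverse((next, run_idx, pos + 1)));
        }
    }
    out
}

/// Repeatedly merges groups of at most `fan_in` runs until one run remains.
/// Each loop iteration corresponds to one merge level.
pub fn multi_level_merge(mut runs: Vec<Vec<i64>>, fan_in: usize) -> Vec<i64> {
    assert!(fan_in >= 2, "fan_in must be at least 2 to make progress");
    while runs.len() > 1 {
        runs = runs.chunks(fan_in).map(merge_runs).collect();
    }
    // Zero input runs yields an empty result.
    runs.pop().unwrap_or_default()
}
```

In a real operator each intermediate run would be written back to disk, which is where the extra I/O of a multi-level merge comes from.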





Re: [I] Reduce number of tokio blocking threads in SortExec spill [datafusion]

2025-03-22 Thread via GitHub


alamb commented on issue #15323:
URL: https://github.com/apache/datafusion/issues/15323#issuecomment-2744217807

   Makes sense -- with 183 spill files, we probably would need to merge in
   stages.
   
   For example, starting with 183 spill files:
   1. run 10 jobs, each merging about 18 files into one (results in 10 files)
   2. run the final merge of 10 files
   
   This results in 2x the I/O (each row has to be read and written twice), but
   it would at least be possible to parallelize the merges in the earlier step.
   
   I think @2010YOUY01 was starting to look into a SpillFileManager -- this is
   the kind of behavior I would imagine being part of such a thing.
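
The staged plan above can be sketched as simple arithmetic. This assumes every merge job takes at most `fan_in` input files and that each pass rewrites every row once; `plan_merge_passes` is a hypothetical helper, not a DataFusion API.

```rust
/// Returns the number of files left after each merge pass when every merge
/// job combines at most `fan_in` files. The length of the result is the
/// number of passes, and therefore how many times each row is read from a
/// spill file.
pub fn plan_merge_passes(num_files: usize, fan_in: usize) -> Vec<usize> {
    assert!(fan_in >= 2, "fan_in must be at least 2 to make progress");
    let mut remaining = num_files;
    let mut passes = Vec::new();
    while remaining > 1 {
        // Ceiling division: a partially filled last group still needs a job.
        remaining = (remaining + fan_in - 1) / fan_in;
        passes.push(remaining);
    }
    passes
}
```

For 183 files, a fan-in of 19 gives the two-pass shape (183 → 10 → 1), while a fan-in of 10 needs three passes (183 → 19 → 2 → 1). A smaller fan-in keeps fewer files open per merge at the cost of more I/O.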





Re: [I] Reduce number of tokio blocking threads in SortExec spill [datafusion]

2025-03-21 Thread via GitHub


andygrove commented on issue #15323:
URL: https://github.com/apache/datafusion/issues/15323#issuecomment-2743795290

   > Do you see too many threads when writing the spill files or when reading?
   
   This is when reading, during the merge operation.
   
   > In the merge phase, each spill file will be wrapped by a stream backed by
   > a blocking thread (see
   > [read_spill_as_stream](https://github.com/apache/datafusion/blob/46.0.1/datafusion/physical-plan/src/spill.rs#L44-L55)),
   > so we'll spawn at least 183 blocking threads when there are 183 spill
   > files to merge.





Re: [I] Reduce number of tokio blocking threads in SortExec spill [datafusion]

2025-03-20 Thread via GitHub


alamb commented on issue #15323:
URL: https://github.com/apache/datafusion/issues/15323#issuecomment-2741967802

   Do you see too many threads when writing the spill files or when reading?

