Re: [Python/C++] Parallel Sort Possible?

2022-05-24 Thread Sasha Krassovsky
Hi Cedric, yes, it is definitely possible. There are roughly two popular approaches: parallel merge sort and parallel bucket sort. Parallel merge sort sorts individual batches and then merges them according to some schedule. Parallel bucket sort samples the input data and does range partitioning into
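
The parallel merge sort idea described in this reply (sort individual batches, then merge them) can be sketched in plain Python. This is only an illustration under assumptions, not how Arrow implements or plans to implement sorting: the function name `parallel_merge_sort` and the `num_chunks` parameter are hypothetical, the "schedule" here is a single k-way merge via `heapq.merge`, and a thread pool is used for simplicity even though CPython threads do not run pure-Python sorting in parallel (a `ProcessPoolExecutor` or a native library would be needed for a real speedup).

```python
from concurrent.futures import ThreadPoolExecutor
import heapq


def parallel_merge_sort(values, num_chunks=4):
    """Hypothetical sketch: sort chunks independently, then merge the runs."""
    if not values:
        return []
    # Split the input into roughly equal batches (ceiling division).
    chunk_size = -(-len(values) // num_chunks)
    chunks = [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]
    # Sort each batch on its own worker.
    with ThreadPoolExecutor(max_workers=num_chunks) as pool:
        sorted_runs = list(pool.map(sorted, chunks))
    # Merge all sorted runs in one k-way merge.
    return list(heapq.merge(*sorted_runs))
```

For example, `parallel_merge_sort([5, 3, 1, 4, 2], num_chunks=2)` sorts `[5, 3, 1]` and `[4, 2]` separately and then merges the two sorted runs.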

[Python/C++] Parallel Sort Possible?

2022-05-24 Thread Cedric Yau
I've noticed that calling pyarrow.Table.sort_indices[1] or pyarrow.compute.array_sort_indices[2], which Table.sort_indices is based on, maxes out a single CPU core. Are there any ways to scale sorting beyond a single CPU? It looks like there is a custom Radix Sort implemented[3]

Re: Arrow compute/dataset design doc missing

2022-05-24 Thread Weston Pace
There are a few levels of loops: two at the moment and three in the future. Some are fused and some are not. What we have right now is in its early stages and is not ideal, and there are people investigating and working on improvements. I can speak a little bit about where we want to go. An example may

Re: Arrow compute/dataset design doc missing

2022-05-24 Thread Shawn Yang
Hi Ion, thank you for your reply, which recaps the history of Arrow compute. Those links are very valuable for helping me understand Arrow compute internals. I took a quick look at those documents and will take a deeper look at them later. I have another question: does Arrow compute support loop fusion,