eric-maynard opened a new pull request, #21: URL: https://github.com/apache/polaris-tools/pull/21
This implements a new scenario, `WeightedWorkloadOnTreeDataset`, that lets you configure multiple **distributions** used to weight reads & writes against the catalog. Compared with `ReadUpdateTreeDataset`, this allows us to understand how performance changes when reads or writes frequently hit the same tables.

### Sampling

The distributions are defined in the config file like so:

```
# Distributions for readers
# Each distribution will have `count` threads assigned to it
# mean / variance describe the properties of the normal distribution
# Readers will read a random table in the table space based on sampling
# Default: [{ count = 8, mean = 0.3, variance = 0.0278 }]
readers = [
  { count = 8, mean = 0.3, variance = 0.0278 }
]
```

`count` is the number of threads that will sample from the distribution, while `mean` and `variance` describe the Gaussian distribution to sample from. Sampled values are expected to fall between 0 and 1.0; when a sample falls outside that range, the distribution is repeatedly **resampled** until a value in range is obtained. For an extreme example, refer to the following:

<img width="400" alt="Screenshot 2025-04-30 at 1 27 43 AM" src="https://github.com/user-attachments/assets/d77e98f1-7a94-463d-be82-0c47bbda92a1" />

In this case, about 50% of samples should fall below 0.0 and therefore be resampled.

Once a value between 0 and 1 is obtained, it is mapped to a table, where 1.0 is the highest table (e.g. T_2048) in the tree dataset and 0.0 is T_0.
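The resample-and-map behavior described above can be sketched roughly as follows. This is an illustrative Python sketch, not the code in this PR: the function names `sample_unit_interval` and `to_table_index`, the `max_attempts` guard, and the linear floor mapping from the sampled value to a table index are all assumptions made for the example.

```python
import math
import random


def sample_unit_interval(mean, variance, rng, max_attempts=100_000):
    """Draw from a Gaussian with the given mean and variance,
    resampling until the value lands in [0, 1].

    max_attempts is a hypothetical safety guard for configurations
    whose mass lies almost entirely outside [0, 1].
    """
    std_dev = math.sqrt(variance)  # random.gauss takes a standard deviation
    for _ in range(max_attempts):
        value = rng.gauss(mean, std_dev)
        if 0.0 <= value <= 1.0:
            return value
    raise RuntimeError("no sample landed in [0, 1]; check mean and variance")


def to_table_index(value, max_table_index):
    """Map a value in [0, 1] to a table index: 0.0 -> 0, 1.0 -> max_table_index.

    The linear floor mapping is an assumption; the PR only states the endpoints.
    """
    return int(value * max_table_index)


if __name__ == "__main__":
    rng = random.Random()
    # Default reader distribution from the config above.
    sample = sample_unit_interval(mean=0.3, variance=0.0278, rng=rng)
    print(f"T_{to_table_index(sample, 2048)}")
```

With the default `mean = 0.3`, most reads in this sketch would cluster around the tables roughly 30% of the way through the table space, while a distribution centered at 0.0 would discard about half of its draws before settling on tables near T_0.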