This is an automated email from the ASF dual-hosted git repository.

guanmingchiu pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/mahout.git

commit d317c34c13232c0444edc9f1692950cedeec705b
Author: Ping <[email protected]>
AuthorDate: Wed Dec 31 22:35:17 2025 +0800

    [QDP] Rename benchmark throughput script and refresh docs (#776)

    Signed-off-by: 400Ping <[email protected]>
---
 qdp/DEVELOPMENT.md                                 |  4 +-
 qdp/Makefile                                       |  2 +-
 qdp/qdp-python/benchmark/README.md                 | 83 +++++++++++++++++++++-
 .../qdp-python/benchmark/benchmark_throughput.md   | 10 ++-
 ...oader_throughput.py => benchmark_throughput.py} |  2 +-
 5 files changed, 93 insertions(+), 8 deletions(-)

diff --git a/qdp/DEVELOPMENT.md b/qdp/DEVELOPMENT.md
index e8664802e..d28a3ffcf 100644
--- a/qdp/DEVELOPMENT.md
+++ b/qdp/DEVELOPMENT.md
@@ -168,7 +168,7 @@ You can also run individual tests manually from the `qdp-python/benchmark/` dire
 
 ```sh
 # Benchmark test for dataloader throughput
-python benchmark_dataloader_throughput.py
+python benchmark_throughput.py
 
 # E2E test
 python benchmark_e2e.py
@@ -194,7 +194,7 @@ A: Check available GPUs with `nvidia-smi`. Verify GPU visibility with `echo $CUD
 
 ### Q: Benchmark tests fail or produce unexpected results
 
-A: Ensure all dependencies are installed with `uv pip install -r benchmark/requirements.txt`. Check GPU memory availability using `nvidia-smi`. If you don't need qiskit/pennylane comparisons, uninstall them as mentioned in the [E2e test section](#e2e-tests).
+A: Ensure all dependencies are installed with `uv sync --group benchmark` (from `qdp/qdp-python`). Check GPU memory availability using `nvidia-smi`. If you don't need qiskit/pennylane comparisons, uninstall them as mentioned in the [E2e test section](#e2e-tests).
 
 ### Q: Pre-commit hooks fail
 
diff --git a/qdp/Makefile b/qdp/Makefile
index 51f37a551..53572ccf8 100644
--- a/qdp/Makefile
+++ b/qdp/Makefile
@@ -48,7 +48,7 @@ install_benchmark:
 benchmark: install install_benchmark
 	@echo "Running e2e benchmark tests..."
 	uv run python qdp-python/benchmark/benchmark_e2e.py
-	uv run python qdp-python/benchmark/benchmark_dataloader_throughput.py
+	uv run python qdp-python/benchmark/benchmark_throughput.py
 
 run_nvtx_profile:
 	$(eval EXAMPLE ?= nvtx_profile)
diff --git a/qdp/qdp-python/benchmark/README.md b/qdp/qdp-python/benchmark/README.md
index f8f413d41..6fcef290e 100644
--- a/qdp/qdp-python/benchmark/README.md
+++ b/qdp/qdp-python/benchmark/README.md
@@ -1,5 +1,86 @@
-<!-- TODO: benchmark docs -->
+# Benchmarks
 
+This directory contains Python benchmarks for Mahout QDP. There are two main
+scripts:
+
+- `benchmark_e2e.py`: end-to-end latency from disk to GPU VRAM (includes IO,
+  normalization, encoding, transfer, and a dummy forward pass).
+- `benchmark_throughput.py`: DataLoader-style throughput benchmark
+  that measures vectors/sec across Mahout, PennyLane, and Qiskit.
+
+## Quick Start
+
+From the repo root:
+
+```bash
+cd qdp
+make benchmark
+```
+
+This installs the QDP Python package (if needed), installs benchmark
+dependencies, and runs both benchmarks.
+
+## Manual Setup
+
+```bash
+cd qdp/qdp-python
+uv sync --group benchmark
+```
+
+Then run benchmarks with `uv run python ...` or activate the virtual
+environment and use `python ...`.
+
+## E2E Benchmark (Disk -> GPU)
+
+```bash
+cd qdp/qdp-python/benchmark
+python benchmark_e2e.py
+```
+
+Additional options:
+
+```bash
+python benchmark_e2e.py --qubits 16 --samples 200 --frameworks mahout-parquet mahout-arrow
+python benchmark_e2e.py --frameworks all
+```
+
+Notes:
+
+- `--frameworks` accepts a space-separated list or `all`.
+  Options: `mahout-parquet`, `mahout-arrow`, `pennylane`, `qiskit`.
+- The script writes `final_benchmark_data.parquet` and
+  `final_benchmark_data.arrow` in the current working directory and overwrites
+  them on each run.
+- If multiple frameworks run, the script compares output states for
+  correctness at the end.
+
+## DataLoader Throughput Benchmark
+
+Simulates a typical QML training loop by continuously loading batches of 64
+vectors (default). Goal: demonstrate that QDP can saturate GPU utilization and
+avoid the "starvation" often seen in hybrid training loops.
+
+See `qdp/qdp-python/benchmark/benchmark_throughput.md` for details and example
+output.
+
+```bash
+cd qdp/qdp-python/benchmark
+python benchmark_throughput.py --qubits 16 --batches 200 --batch-size 64 --prefetch 16
+python benchmark_throughput.py --frameworks mahout,pennylane
+```
+
+Notes:
+
+- `--frameworks` is a comma-separated list or `all`.
+  Options: `mahout`, `pennylane`, `qiskit`.
+- Throughput is reported in vectors/sec (higher is better).
+
+## Dependency Notes
+
+- Qiskit and PennyLane are optional. If they are not installed, their benchmark
+  legs are skipped automatically.
+- For Mahout-only runs, you can uninstall the competitor frameworks:
+  `uv pip uninstall qiskit pennylane`.
 
 ### We can also run benchmarks on colab notebooks(without owning a GPU)
 
diff --git a/docs/benchmarks/dataloader_throughput.md b/qdp/qdp-python/benchmark/benchmark_throughput.md
similarity index 86%
rename from docs/benchmarks/dataloader_throughput.md
rename to qdp/qdp-python/benchmark/benchmark_throughput.md
index 242340c26..ba26f0e60 100644
--- a/docs/benchmarks/dataloader_throughput.md
+++ b/qdp/qdp-python/benchmark/benchmark_throughput.md
@@ -2,6 +2,10 @@
 
 This benchmark mirrors the `qdp-core/examples/dataloader_throughput.rs` pipeline and compares Mahout (QDP) against PennyLane and Qiskit on the same workload. It streams batches from a CPU-side producer, encodes amplitude states on GPU, and reports vectors-per-second.
 
+Goal: simulate a typical QML training loop by continuously loading batches of
+64 vectors (default), showing that QDP can keep GPU utilization high and avoid
+the "starvation" often seen in hybrid training loops.
+
 ## Workload
 
 - Qubits: 16 (vector length `2^16`)
@@ -15,11 +19,11 @@ This benchmark mirrors the `qdp-core/examples/dataloader_throughput.rs` pipeline
 # QDP-only Rust example
 cargo run -p qdp-core --example dataloader_throughput --release
 
-# Cross-framework comparison (requires deps in qdp/benchmark/requirements.txt)
-python qdp/benchmark/benchmark_dataloader_throughput.py --qubits 16 --batches 200 --batch-size 64 --prefetch 16
+# Cross-framework comparison (requires benchmark deps)
+python qdp/qdp-python/benchmark/benchmark_throughput.py --qubits 16 --batches 200 --batch-size 64 --prefetch 16
 
 # Run only Mahout + PennyLane legs
-python qdp/benchmark/benchmark_dataloader_throughput.py --frameworks mahout,pennylane
+python qdp/qdp-python/benchmark/benchmark_throughput.py --frameworks mahout,pennylane
 ```
 
 ## Example Output
diff --git a/qdp/qdp-python/benchmark/benchmark_dataloader_throughput.py b/qdp/qdp-python/benchmark/benchmark_throughput.py
similarity index 99%
rename from qdp/qdp-python/benchmark/benchmark_dataloader_throughput.py
rename to qdp/qdp-python/benchmark/benchmark_throughput.py
index 9ce974084..0d7916fec 100644
--- a/qdp/qdp-python/benchmark/benchmark_dataloader_throughput.py
+++ b/qdp/qdp-python/benchmark/benchmark_throughput.py
@@ -24,7 +24,7 @@ The workload mirrors the `qdp-core/examples/dataloader_throughput.rs` pipeline:
 - Encode vectors into amplitude states on GPU and run a tiny consumer op.
 
 Run:
-    python qdp/benchmark/benchmark_dataloader_throughput.py --qubits 16 --batches 200 --batch-size 64
+    python qdp/benchmark/benchmark_throughput.py --qubits 16 --batches 200 --batch-size 64
 """
 
 import argparse
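
Both benchmarks touched by this commit encode each input vector as an amplitude state. With 16 qubits, the vector length is `2^16`, and an amplitude state must have unit L2 norm. The following is a minimal CPU-only sketch of that normalization arithmetic. It assumes short inputs are zero-padded up to `2^n`. `amplitude_encode` is a hypothetical helper written for illustration; it is not the QDP encoder or any Mahout API.

```python
import math


def amplitude_encode(values, num_qubits):
    """Hypothetical helper: zero-pad `values` to 2**num_qubits and L2-normalize.

    The squared amplitudes of the result sum to 1, as a quantum state requires.
    Zero-padding is an assumption of this sketch, not taken from QDP.
    """
    dim = 2 ** num_qubits
    if len(values) > dim:
        raise ValueError(f"{len(values)} values do not fit in {num_qubits} qubits")
    padded = list(values) + [0.0] * (dim - len(values))
    norm = math.sqrt(sum(v * v for v in padded))
    if norm == 0.0:
        raise ValueError("cannot encode an all-zero vector")
    return [v / norm for v in padded]


print(amplitude_encode([3.0, 4.0], num_qubits=2))  # [0.6, 0.8, 0.0, 0.0]
```

The benchmarks perform this step on the GPU in batches. The sketch only illustrates the per-vector arithmetic.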

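
The DataLoader throughput benchmark streams batches from a CPU-side producer, encodes them, and reports vectors/sec. The `--prefetch` flag suggests a bounded buffer between producer and consumer. Below is a rough, stdlib-only sketch of that measurement pattern, under two assumptions: the buffer is a `queue.Queue` of depth `prefetch`, and plain-Python L2 normalization stands in for GPU encoding. The function names are hypothetical, and this is not how `benchmark_throughput.py` is implemented.

```python
import math
import queue
import random
import threading
import time


def produce_batches(q, num_batches, batch_size, dim, seed=0):
    """Hypothetical CPU-side producer: enqueue random batches, then a sentinel."""
    rng = random.Random(seed)
    for _ in range(num_batches):
        batch = [[rng.random() for _ in range(dim)] for _ in range(batch_size)]
        q.put(batch)  # blocks while `prefetch` batches are already waiting
    q.put(None)  # sentinel: tells the consumer that production is done


def measure_throughput(num_qubits=4, num_batches=20, batch_size=64, prefetch=16):
    """Consume batches from a bounded queue and return (vectors, vectors/sec).

    Parameter names only mirror the CLI flags; L2 normalization is a CPU
    stand-in for the GPU amplitude-encoding step.
    """
    dim = 2 ** num_qubits
    q = queue.Queue(maxsize=prefetch)
    producer = threading.Thread(
        target=produce_batches, args=(q, num_batches, batch_size, dim)
    )
    start = time.perf_counter()
    producer.start()
    vectors = 0
    checksum = 0.0
    while True:
        batch = q.get()  # the consumer waits ("starves") here if the producer lags
        if batch is None:
            break
        for vec in batch:
            norm = math.sqrt(sum(x * x for x in vec)) or 1.0
            encoded = [x / norm for x in vec]  # stand-in encode step
            checksum += encoded[0]  # tiny consumer op so the result is used
        vectors += len(batch)
    producer.join()
    elapsed = max(time.perf_counter() - start, 1e-9)
    return vectors, vectors / elapsed


if __name__ == "__main__":
    count, rate = measure_throughput()
    print(f"{count} vectors, {rate:.0f} vectors/sec")
```

Once `prefetch` batches are queued, the producer blocks, so memory use stays bounded. If the producer falls behind, the consumer instead waits on `q.get()`. That waiting is the kind of starvation the README's throughput section describes.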