This is an automated email from the ASF dual-hosted git repository.

guanmingchiu pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/mahout.git
commit 09ca3bf7d2b80de2c450580e22f1ca711f47b1f1
Author: Ping <[email protected]>
AuthorDate: Mon Jan 5 18:43:03 2026 +0800

    [QDP] add scaling test (Latency vs. Qubits) (#778)

    * The Scaling Test (Latency vs. Qubits)

    Signed-off-by: 400Ping <[email protected]>

    * fix pre-commit

    Signed-off-by: 400Ping <[email protected]>

    * [Chore] make initialization clearer & clearfy doc

    Signed-off-by: 400Ping <[email protected]>

    ---------

    Signed-off-by: 400Ping <[email protected]>
---
 qdp/DEVELOPMENT.md                            |   9 +-
 qdp/qdp-python/benchmark/README.md            |  23 +-
 qdp/qdp-python/benchmark/benchmark_latency.md |  80 ++++++
 qdp/qdp-python/benchmark/benchmark_latency.py | 369 ++++++++++++++++++++++++++
 4 files changed, 477 insertions(+), 4 deletions(-)

diff --git a/qdp/DEVELOPMENT.md b/qdp/DEVELOPMENT.md
index d28a3ffcf..2abe05860 100644
--- a/qdp/DEVELOPMENT.md
+++ b/qdp/DEVELOPMENT.md
@@ -167,11 +167,14 @@ uv pip uninstall qiskit pennylane
 You can also run individual tests manually from the `qdp-python/benchmark/` directory:
 
 ```sh
-# Benchmark test for dataloader throughput
-python benchmark_throughput.py
-
 # E2E test
 python benchmark_e2e.py
+
+# Benchmark test for Data-to-State latency
+python benchmark_latency.py
+
+# Benchmark test for dataloader throughput
+python benchmark_throughput.py
 ```
 
 ## Troubleshooting
diff --git a/qdp/qdp-python/benchmark/README.md b/qdp/qdp-python/benchmark/README.md
index 6fcef290e..d0ea49b29 100644
--- a/qdp/qdp-python/benchmark/README.md
+++ b/qdp/qdp-python/benchmark/README.md
@@ -1,12 +1,13 @@
 # Benchmarks
 
-This directory contains Python benchmarks for Mahout QDP. There are two main
+This directory contains Python benchmarks for Mahout QDP. There are three main
 scripts:
 
 - `benchmark_e2e.py`: end-to-end latency from disk to GPU VRAM (includes IO,
   normalization, encoding, transfer, and a dummy forward pass).
 - `benchmark_throughput.py`: DataLoader-style throughput benchmark that measures
   vectors/sec across Mahout, PennyLane, and Qiskit.
+- `benchmark_latency.py`: Data-to-State latency benchmark (CPU RAM -> GPU VRAM).
 
 ## Quick Start
 
@@ -54,6 +55,26 @@ Notes:
 - If multiple frameworks run, the script compares output states for
   correctness at the end.
 
+## Data-to-State Latency Benchmark
+
+```bash
+cd qdp/qdp-python/benchmark
+python benchmark_latency.py --qubits 16 --batches 200 --batch-size 64 --prefetch 16
+python benchmark_latency.py --frameworks mahout,pennylane
+```
+
+Notes:
+
+- `--frameworks` is a comma-separated list or `all`.
+  Options: `mahout`, `pennylane`, `qiskit-init`, `qiskit-statevector`.
+- The latency test reports average milliseconds per vector.
+- Flags:
+  - `--qubits`: controls vector length (`2^qubits`).
+  - `--batches`: number of host-side batches to stream.
+  - `--batch-size`: vectors per batch; raises total samples (`batches * batch-size`).
+  - `--prefetch`: CPU queue depth; higher values help keep the pipeline fed.
+- See `qdp/qdp-python/benchmark/benchmark_latency.md` for details and example output.
+
 ## DataLoader Throughput Benchmark
 
 Simulates a typical QML training loop by continuously loading batches of 64
diff --git a/qdp/qdp-python/benchmark/benchmark_latency.md b/qdp/qdp-python/benchmark/benchmark_latency.md
new file mode 100644
index 000000000..e9a97d7a9
--- /dev/null
+++ b/qdp/qdp-python/benchmark/benchmark_latency.md
@@ -0,0 +1,80 @@
+# Data-to-State Latency Benchmark
+
+This benchmark isolates the "Data-to-State" pipeline (CPU RAM -> GPU VRAM) and
+compares Mahout (QDP) against PennyLane and Qiskit baselines:
+
+- Qiskit Initialize (`qiskit-init`): circuit-based state preparation.
+- Qiskit Statevector (`qiskit-statevector`): raw data loading baseline.
+
+The primary metric is average time-to-state in milliseconds (lower is better).
+
+## Workload
+
+- Qubits: 16 (vector length `2^16`)
+- Batches: 200
+- Batch size: 64
+- Prefetch depth: 16 (CPU producer queue)
+
+## Running
+
+```bash
+# Latency test (CPU RAM -> GPU VRAM)
+python qdp/qdp-python/benchmark/benchmark_latency.py --qubits 16 \
+    --batches 200 --batch-size 64 --prefetch 16
+
+# Run only selected frameworks
+python qdp/qdp-python/benchmark/benchmark_latency.py --frameworks mahout,pennylane
+```
+
+## Example Output
+
+```
+Generating 12800 samples of 16 qubits...
+  Batch size : 64
+  Vector length: 65536
+  Batches    : 200
+  Prefetch   : 16
+  Frameworks : pennylane, qiskit-init, qiskit-statevector, mahout
+  Generated 12800 samples
+  PennyLane/Qiskit format: 6400.00 MB
+  Mahout format: 6400.00 MB
+
+======================================================================
+DATA-TO-STATE LATENCY BENCHMARK: 16 Qubits, 12800 Samples
+======================================================================
+
+[PennyLane] Full Pipeline (DataLoader -> GPU)...
+  Total Time: 26.1952 s (2.047 ms/vector)
+
+[Qiskit Initialize] Full Pipeline (DataLoader -> GPU)...
+  Total Time: 975.8720 s (76.243 ms/vector)
+
+[Qiskit Statevector] Full Pipeline (DataLoader -> GPU)...
+  Total Time: 115.5840 s (9.030 ms/vector)
+
+[Mahout] Full Pipeline (DataLoader -> GPU)...
+  Total Time: 11.5384 s (0.901 ms/vector)
+
+======================================================================
+LATENCY (Lower is Better)
+Samples: 12800, Qubits: 16
+======================================================================
+Mahout                  0.901 ms/vector
+PennyLane               2.047 ms/vector
+Qiskit Statevector      9.030 ms/vector
+Qiskit Initialize      76.243 ms/vector
+----------------------------------------------------------------------
+Speedup vs PennyLane:       2.27x
+Speedup vs Qiskit Init:      84.61x
+Speedup vs Qiskit Statevec:      10.02x
+```
+
+## Notes
+
+- Latency numbers are average milliseconds per vector across the full run.
+- PennyLane and Qiskit timings include CPU-side state preparation; Mahout timing
+  includes CPU->GPU encode + DLPack handoff.
+- Missing frameworks are auto-skipped; use `--frameworks` to control the legs.
+- Requires a CUDA-capable GPU (`torch.cuda.is_available()` must be true).
+- Results vary by device, driver versions, and system load; re-run on target
+  hardware for representative numbers.
diff --git a/qdp/qdp-python/benchmark/benchmark_latency.py b/qdp/qdp-python/benchmark/benchmark_latency.py
new file mode 100644
index 000000000..bd6903f62
--- /dev/null
+++ b/qdp/qdp-python/benchmark/benchmark_latency.py
@@ -0,0 +1,369 @@
+#!/usr/bin/env python3
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""
+Data-to-State latency benchmark: CPU RAM -> GPU VRAM.
+
+Run:
+    python qdp/qdp-python/benchmark/benchmark_latency.py --qubits 16 \
+        --batches 200 --batch-size 64 --prefetch 16
+"""
+
+from __future__ import annotations
+
+import argparse
+import queue
+import threading
+import time
+
+import numpy as np
+import torch
+
+from mahout_qdp import QdpEngine
+
+BAR = "=" * 70
+SEP = "-" * 70
+FRAMEWORK_CHOICES = ("pennylane", "qiskit-init", "qiskit-statevector", "mahout")
+FRAMEWORK_LABELS = {
+    "mahout": "Mahout",
+    "pennylane": "PennyLane",
+    "qiskit-init": "Qiskit Initialize",
+    "qiskit-statevector": "Qiskit Statevector",
+}
+
+try:
+    import pennylane as qml
+
+    HAS_PENNYLANE = True
+except ImportError:
+    HAS_PENNYLANE = False
+
+try:
+    from qiskit import QuantumCircuit, transpile
+    from qiskit_aer import AerSimulator
+    from qiskit.quantum_info import Statevector
+
+    HAS_QISKIT = True
+except ImportError:
+    HAS_QISKIT = False
+
+
+def sync_cuda() -> None:
+    if torch.cuda.is_available():
+        torch.cuda.synchronize()
+
+
+def build_sample(seed: int, vector_len: int) -> np.ndarray:
+    mask = np.uint64(vector_len - 1)
+    scale = 1.0 / vector_len
+    idx = np.arange(vector_len, dtype=np.uint64)
+    mixed = (idx + np.uint64(seed)) & mask
+    return mixed.astype(np.float64) * scale
+
+
+def prefetched_batches(
+    total_batches: int, batch_size: int, vector_len: int, prefetch: int
+):
+    q: queue.Queue[np.ndarray | None] = queue.Queue(maxsize=prefetch)
+
+    def producer():
+        for batch_idx in range(total_batches):
+            base = batch_idx * batch_size
+            batch = [build_sample(base + i, vector_len) for i in range(batch_size)]
+            q.put(np.stack(batch))
+        q.put(None)
+
+    threading.Thread(target=producer, daemon=True).start()
+
+    while True:
+        batch = q.get()
+        if batch is None:
+            break
+        yield batch
+
+
+def normalize_batch(batch: np.ndarray) -> np.ndarray:
+    norms = np.linalg.norm(batch, axis=1, keepdims=True)
+    norms[norms == 0] = 1.0
+    return batch / norms
+
+
+def parse_frameworks(raw: str) -> list[str]:
+    if raw.lower() == "all":
+        return list(FRAMEWORK_CHOICES)
+
+    selected: list[str] = []
+    for part in raw.split(","):
+        name = part.strip().lower()
+        if not name:
+            continue
+        if name not in FRAMEWORK_CHOICES:
+            raise ValueError(
+                f"Unknown framework '{name}'. Choose from: "
+                f"{', '.join(FRAMEWORK_CHOICES)} or 'all'."
+            )
+        if name not in selected:
+            selected.append(name)
+
+    return selected if selected else list(FRAMEWORK_CHOICES)
+
+
+def run_mahout(num_qubits: int, total_batches: int, batch_size: int, prefetch: int):
+    try:
+        engine = QdpEngine(0)
+    except Exception as exc:
+        print(f"[Mahout] Init failed: {exc}")
+        return 0.0, 0.0
+
+    sync_cuda()
+    start = time.perf_counter()
+    processed = 0
+
+    for batch in prefetched_batches(
+        total_batches, batch_size, 1 << num_qubits, prefetch
+    ):
+        normalized = normalize_batch(batch)
+        for sample in normalized:
+            qtensor = engine.encode(sample.tolist(), num_qubits, "amplitude")
+            _ = torch.utils.dlpack.from_dlpack(qtensor)
+            processed += 1
+
+    sync_cuda()
+    duration = time.perf_counter() - start
+    latency_ms = (duration / processed) * 1000 if processed > 0 else 0.0
+    print(f"  Total Time: {duration:.4f} s ({latency_ms:.3f} ms/vector)")
+    return duration, latency_ms
+
+
+def run_pennylane(num_qubits: int, total_batches: int, batch_size: int, prefetch: int):
+    if not HAS_PENNYLANE:
+        print("[PennyLane] Not installed, skipping.")
+        return 0.0, 0.0
+
+    dev = qml.device("default.qubit", wires=num_qubits)
+
+    @qml.qnode(dev, interface="torch")
+    def circuit(inputs):
+        qml.AmplitudeEmbedding(
+            features=inputs, wires=range(num_qubits), normalize=True, pad_with=0.0
+        )
+        return qml.state()
+
+    sync_cuda()
+    start = time.perf_counter()
+    processed = 0
+
+    for batch in prefetched_batches(
+        total_batches, batch_size, 1 << num_qubits, prefetch
+    ):
+        batch_cpu = torch.tensor(batch, dtype=torch.float64)
+        try:
+            state_cpu = circuit(batch_cpu)
+        except Exception:
+            state_cpu = torch.stack([circuit(x) for x in batch_cpu])
+        _ = state_cpu.to("cuda", dtype=torch.complex64)
+        processed += len(batch_cpu)
+
+    sync_cuda()
+    duration = time.perf_counter() - start
+    latency_ms = (duration / processed) * 1000 if processed > 0 else 0.0
+    print(f"  Total Time: {duration:.4f} s ({latency_ms:.3f} ms/vector)")
+    return duration, latency_ms
+
+
+def run_qiskit_init(
+    num_qubits: int, total_batches: int, batch_size: int, prefetch: int
+):
+    if not HAS_QISKIT:
+        print("[Qiskit] Not installed, skipping.")
+        return 0.0, 0.0
+
+    backend = AerSimulator(method="statevector")
+    sync_cuda()
+    start = time.perf_counter()
+    processed = 0
+
+    for batch in prefetched_batches(
+        total_batches, batch_size, 1 << num_qubits, prefetch
+    ):
+        normalized = normalize_batch(batch)
+        for vec in normalized:
+            qc = QuantumCircuit(num_qubits)
+            qc.initialize(vec, range(num_qubits))
+            qc.save_statevector()
+            t_qc = transpile(qc, backend)
+            state = backend.run(t_qc).result().get_statevector().data
+            _ = torch.tensor(state, device="cuda", dtype=torch.complex64)
+            processed += 1
+
+    sync_cuda()
+    duration = time.perf_counter() - start
+    latency_ms = (duration / processed) * 1000 if processed > 0 else 0.0
+    print(f"  Total Time: {duration:.4f} s ({latency_ms:.3f} ms/vector)")
+    return duration, latency_ms
+
+
+def run_qiskit_statevector(
+    num_qubits: int, total_batches: int, batch_size: int, prefetch: int
+):
+    if not HAS_QISKIT:
+        print("[Qiskit] Not installed, skipping.")
+        return 0.0, 0.0
+
+    sync_cuda()
+    start = time.perf_counter()
+    processed = 0
+
+    for batch in prefetched_batches(
+        total_batches, batch_size, 1 << num_qubits, prefetch
+    ):
+        normalized = normalize_batch(batch)
+        for vec in normalized:
+            state = Statevector(vec)
+            _ = torch.tensor(state.data, device="cuda", dtype=torch.complex64)
+            processed += 1
+
+    sync_cuda()
+    duration = time.perf_counter() - start
+    latency_ms = (duration / processed) * 1000 if processed > 0 else 0.0
+    print(f"  Total Time: {duration:.4f} s ({latency_ms:.3f} ms/vector)")
+    return duration, latency_ms
+
+
+def main():
+    parser = argparse.ArgumentParser(
+        description="Benchmark Data-to-State latency across frameworks."
+    )
+    parser.add_argument(
+        "--qubits",
+        type=int,
+        default=16,
+        help="Number of qubits (power-of-two vector length).",
+    )
+    parser.add_argument("--batches", type=int, default=200, help="Total batches.")
+    parser.add_argument("--batch-size", type=int, default=64, help="Vectors per batch.")
+    parser.add_argument(
+        "--prefetch", type=int, default=16, help="CPU-side prefetch depth."
+    )
+    parser.add_argument(
+        "--frameworks",
+        type=str,
+        default="all",
+        help=(
+            "Comma-separated list of frameworks to run "
+            "(pennylane,qiskit-init,qiskit-statevector,mahout) or 'all'."
+        ),
+    )
+    args = parser.parse_args()
+
+    if not torch.cuda.is_available():
+        raise SystemExit("CUDA device not available; GPU is required.")
+
+    try:
+        frameworks = parse_frameworks(args.frameworks)
+    except ValueError as exc:
+        parser.error(str(exc))
+
+    total_vectors = args.batches * args.batch_size
+    vector_len = 1 << args.qubits
+
+    print(f"Generating {total_vectors} samples of {args.qubits} qubits...")
+    print(f"  Batch size : {args.batch_size}")
+    print(f"  Vector length: {vector_len}")
+    print(f"  Batches    : {args.batches}")
+    print(f"  Prefetch   : {args.prefetch}")
+    print(f"  Frameworks : {', '.join(frameworks)}")
+    bytes_per_vec = vector_len * 8
+    print(f"  Generated {total_vectors} samples")
+    print(
+        f"  PennyLane/Qiskit format: {total_vectors * bytes_per_vec / (1024 * 1024):.2f} MB"
+    )
+    print(f"  Mahout format: {total_vectors * bytes_per_vec / (1024 * 1024):.2f} MB")
+    print()
+
+    print(BAR)
+    print(
+        f"DATA-TO-STATE LATENCY BENCHMARK: {args.qubits} Qubits, {total_vectors} Samples"
+    )
+    print(BAR)
+
+    t_pl = l_pl = 0.0
+    t_q_init = l_q_init = 0.0
+    t_q_sv = l_q_sv = 0.0
+    t_mahout = l_mahout = 0.0
+
+    if "pennylane" in frameworks:
+        print()
+        print("[PennyLane] Full Pipeline (DataLoader -> GPU)...")
+        t_pl, l_pl = run_pennylane(
+            args.qubits, args.batches, args.batch_size, args.prefetch
+        )
+
+    if "qiskit-init" in frameworks:
+        print()
+        print("[Qiskit Initialize] Full Pipeline (DataLoader -> GPU)...")
+        t_q_init, l_q_init = run_qiskit_init(
+            args.qubits, args.batches, args.batch_size, args.prefetch
+        )
+
+    if "qiskit-statevector" in frameworks:
+        print()
+        print("[Qiskit Statevector] Full Pipeline (DataLoader -> GPU)...")
+        t_q_sv, l_q_sv = run_qiskit_statevector(
+            args.qubits, args.batches, args.batch_size, args.prefetch
+        )
+
+    if "mahout" in frameworks:
+        print()
+        print("[Mahout] Full Pipeline (DataLoader -> GPU)...")
+        t_mahout, l_mahout = run_mahout(
+            args.qubits, args.batches, args.batch_size, args.prefetch
+        )
+
+    print()
+    print(BAR)
+    print("LATENCY (Lower is Better)")
+    print(f"Samples: {total_vectors}, Qubits: {args.qubits}")
+    print(BAR)
+
+    latency_results = []
+    if l_pl > 0:
+        latency_results.append((FRAMEWORK_LABELS["pennylane"], l_pl))
+    if l_q_init > 0:
+        latency_results.append((FRAMEWORK_LABELS["qiskit-init"], l_q_init))
+    if l_q_sv > 0:
+        latency_results.append((FRAMEWORK_LABELS["qiskit-statevector"], l_q_sv))
+    if l_mahout > 0:
+        latency_results.append((FRAMEWORK_LABELS["mahout"], l_mahout))
+
+    latency_results.sort(key=lambda x: x[1])
+
+    for name, latency in latency_results:
+        print(f"{name:18s} {latency:10.3f} ms/vector")
+
+    if l_mahout > 0:
+        print(SEP)
+        if l_pl > 0:
+            print(f"Speedup vs PennyLane: {l_pl / l_mahout:10.2f}x")
+        if l_q_init > 0:
+            print(f"Speedup vs Qiskit Init: {l_q_init / l_mahout:10.2f}x")
+        if l_q_sv > 0:
+            print(f"Speedup vs Qiskit Statevec: {l_q_sv / l_mahout:10.2f}x")
+
+
+if __name__ == "__main__":
+    main()

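For reviewers skimming the patch, the producer/consumer pattern at the heart of `prefetched_batches` can be exercised on its own. The sketch below is a simplified standalone copy, not the committed code: the sample generator is reduced to a trivial constant fill so it runs without Mahout, torch, or a GPU, while keeping the bounded queue and `None` sentinel that let generation overlap with consumption.

```python
import queue
import threading

import numpy as np


def prefetched_batches(total_batches, batch_size, vector_len, prefetch):
    """Yield batches produced on a background thread via a bounded queue."""
    # maxsize bounds how far the producer can run ahead of the consumer,
    # capping host RAM at roughly `prefetch` in-flight batches.
    q = queue.Queue(maxsize=prefetch)

    def producer():
        for batch_idx in range(total_batches):
            base = batch_idx * batch_size
            # Simplified stand-in for the real sample generator.
            batch = [np.full(vector_len, float(base + i)) for i in range(batch_size)]
            q.put(np.stack(batch))
        q.put(None)  # sentinel: tells the consumer the stream is finished

    threading.Thread(target=producer, daemon=True).start()

    while True:
        batch = q.get()
        if batch is None:
            break
        yield batch


batches = list(prefetched_batches(total_batches=4, batch_size=2, vector_len=8, prefetch=2))
print(len(batches), batches[0].shape)  # 4 batches, each of shape (2, 8)
```

Because the queue blocks on `put` once full and on `get` once empty, no explicit locking is needed; the daemon flag ensures the producer thread does not keep the process alive if the consumer exits early.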