This is an automated email from the ASF dual-hosted git repository.

guanmingchiu pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/mahout.git
commit 09ca3bf7d2b80de2c450580e22f1ca711f47b1f1
Author: Ping <[email protected]>
AuthorDate: Mon Jan 5 18:43:03 2026 +0800

    [QDP] add scaling test (Latency vs. Qubits) (#778)

    * The Scaling Test (Latency vs. Qubits)

    Signed-off-by: 400Ping <[email protected]>

    * fix pre-commit

    Signed-off-by: 400Ping <[email protected]>

    * [Chore] make initialization clearer & clearfy doc

    Signed-off-by: 400Ping <[email protected]>

    ---------

    Signed-off-by: 400Ping <[email protected]>
---
 qdp/DEVELOPMENT.md                            |   9 +-
 qdp/qdp-python/benchmark/README.md            |  23 +-
 qdp/qdp-python/benchmark/benchmark_latency.md |  80 ++++++
 qdp/qdp-python/benchmark/benchmark_latency.py | 369 ++++++++++++++++++++++++++
 4 files changed, 477 insertions(+), 4 deletions(-)

diff --git a/qdp/DEVELOPMENT.md b/qdp/DEVELOPMENT.md
index d28a3ffcf..2abe05860 100644
--- a/qdp/DEVELOPMENT.md
+++ b/qdp/DEVELOPMENT.md
@@ -167,11 +167,14 @@ uv pip uninstall qiskit pennylane
 You can also run individual tests manually from the `qdp-python/benchmark/` directory:
 
 ```sh
-# Benchmark test for dataloader throughput
-python benchmark_throughput.py
-
 # E2E test
 python benchmark_e2e.py
+
+# Benchmark test for Data-to-State latency
+python benchmark_latency.py
+
+# Benchmark test for dataloader throughput
+python benchmark_throughput.py
 ```
 
 ## Troubleshooting
diff --git a/qdp/qdp-python/benchmark/README.md b/qdp/qdp-python/benchmark/README.md
index 6fcef290e..d0ea49b29 100644
--- a/qdp/qdp-python/benchmark/README.md
+++ b/qdp/qdp-python/benchmark/README.md
@@ -1,12 +1,13 @@
 # Benchmarks
 
-This directory contains Python benchmarks for Mahout QDP. There are two main
+This directory contains Python benchmarks for Mahout QDP. There are three main
 scripts:
 
 - `benchmark_e2e.py`: end-to-end latency from disk to GPU VRAM (includes IO,
   normalization, encoding, transfer, and a dummy forward pass).
 - `benchmark_throughput.py`: DataLoader-style throughput benchmark that measures
   vectors/sec across Mahout, PennyLane, and Qiskit.
+- `benchmark_latency.py`: Data-to-State latency benchmark (CPU RAM -> GPU VRAM).
 
 ## Quick Start
 
@@ -54,6 +55,26 @@ Notes:
 - If multiple frameworks run, the script compares output states for
   correctness at the end.
 
+## Data-to-State Latency Benchmark
+
+```bash
+cd qdp/qdp-python/benchmark
+python benchmark_latency.py --qubits 16 --batches 200 --batch-size 64 --prefetch 16
+python benchmark_latency.py --frameworks mahout,pennylane
+```
+
+Notes:
+
+- `--frameworks` is a comma-separated list or `all`.
+  Options: `mahout`, `pennylane`, `qiskit-init`, `qiskit-statevector`.
+- The latency test reports average milliseconds per vector.
+- Flags:
+  - `--qubits`: controls vector length (`2^qubits`).
+  - `--batches`: number of host-side batches to stream.
+  - `--batch-size`: vectors per batch; raises total samples (`batches * batch-size`).
+  - `--prefetch`: CPU queue depth; higher values help keep the pipeline fed.
+- See `qdp/qdp-python/benchmark/benchmark_latency.md` for details and example output.
+
 ## DataLoader Throughput Benchmark
 
 Simulates a typical QML training loop by continuously loading batches of 64
diff --git a/qdp/qdp-python/benchmark/benchmark_latency.md b/qdp/qdp-python/benchmark/benchmark_latency.md
new file mode 100644
index 000000000..e9a97d7a9
--- /dev/null
+++ b/qdp/qdp-python/benchmark/benchmark_latency.md
@@ -0,0 +1,80 @@
+# Data-to-State Latency Benchmark
+
+This benchmark isolates the "Data-to-State" pipeline (CPU RAM -> GPU VRAM) and
+compares Mahout (QDP) against PennyLane and Qiskit baselines:
+
+- Qiskit Initialize (`qiskit-init`): circuit-based state preparation.
+- Qiskit Statevector (`qiskit-statevector`): raw data loading baseline.
+
+The primary metric is average time-to-state in milliseconds (lower is better).
+
+## Workload
+
+- Qubits: 16 (vector length `2^16`)
+- Batches: 200
+- Batch size: 64
+- Prefetch depth: 16 (CPU producer queue)
+
+## Running
+
+```bash
+# Latency test (CPU RAM -> GPU VRAM)
+python qdp/qdp-python/benchmark/benchmark_latency.py --qubits 16 \
+    --batches 200 --batch-size 64 --prefetch 16
+
+# Run only selected frameworks
+python qdp/qdp-python/benchmark/benchmark_latency.py --frameworks mahout,pennylane
+```
+
+## Example Output
+
+```
+Generating 12800 samples of 16 qubits...
+  Batch size : 64
+  Vector length: 65536
+  Batches    : 200
+  Prefetch   : 16
+  Frameworks : pennylane, qiskit-init, qiskit-statevector, mahout
+  Generated 12800 samples
+  PennyLane/Qiskit format: 6400.00 MB
+  Mahout format: 6400.00 MB
+
+======================================================================
+DATA-TO-STATE LATENCY BENCHMARK: 16 Qubits, 12800 Samples
+======================================================================
+
+[PennyLane] Full Pipeline (DataLoader -> GPU)...
+  Total Time: 26.1952 s (2.047 ms/vector)
+
+[Qiskit Initialize] Full Pipeline (DataLoader -> GPU)...
+  Total Time: 975.8720 s (76.243 ms/vector)
+
+[Qiskit Statevector] Full Pipeline (DataLoader -> GPU)...
+  Total Time: 115.5840 s (9.030 ms/vector)
+
+[Mahout] Full Pipeline (DataLoader -> GPU)...
+  Total Time: 11.5384 s (0.901 ms/vector)
+
+======================================================================
+LATENCY (Lower is Better)
+Samples: 12800, Qubits: 16
+======================================================================
+Mahout                  0.901 ms/vector
+PennyLane               2.047 ms/vector
+Qiskit Statevector      9.030 ms/vector
+Qiskit Initialize      76.243 ms/vector
+----------------------------------------------------------------------
+Speedup vs PennyLane:       2.27x
+Speedup vs Qiskit Init:      84.61x
+Speedup vs Qiskit Statevec:      10.02x
+```
+
+## Notes
+
+- Latency numbers are average milliseconds per vector across the full run.
+- PennyLane and Qiskit timings include CPU-side state preparation; Mahout timing
+  includes CPU->GPU encode + DLPack handoff.
+- Missing frameworks are auto-skipped; use `--frameworks` to control the legs.
+- Requires a CUDA-capable GPU (`torch.cuda.is_available()` must be true).
+- Results vary by device, driver versions, and system load; re-run on target
+  hardware for representative numbers.
diff --git a/qdp/qdp-python/benchmark/benchmark_latency.py b/qdp/qdp-python/benchmark/benchmark_latency.py
new file mode 100644
index 000000000..bd6903f62
--- /dev/null
+++ b/qdp/qdp-python/benchmark/benchmark_latency.py
@@ -0,0 +1,369 @@
+#!/usr/bin/env python3
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""
+Data-to-State latency benchmark: CPU RAM -> GPU VRAM.
+
+Run:
+    python qdp/qdp-python/benchmark/benchmark_latency.py --qubits 16 \
+        --batches 200 --batch-size 64 --prefetch 16
+"""
+
+from __future__ import annotations
+
+import argparse
+import queue
+import threading
+import time
+
+import numpy as np
+import torch
+
+from mahout_qdp import QdpEngine
+
+BAR = "=" * 70
+SEP = "-" * 70
+FRAMEWORK_CHOICES = ("pennylane", "qiskit-init", "qiskit-statevector", "mahout")
+FRAMEWORK_LABELS = {
+    "mahout": "Mahout",
+    "pennylane": "PennyLane",
+    "qiskit-init": "Qiskit Initialize",
+    "qiskit-statevector": "Qiskit Statevector",
+}
+
+try:
+    import pennylane as qml
+
+    HAS_PENNYLANE = True
+except ImportError:
+    HAS_PENNYLANE = False
+
+try:
+    from qiskit import QuantumCircuit, transpile
+    from qiskit_aer import AerSimulator
+    from qiskit.quantum_info import Statevector
+
+    HAS_QISKIT = True
+except ImportError:
+    HAS_QISKIT = False
+
+
+def sync_cuda() -> None:
+    if torch.cuda.is_available():
+        torch.cuda.synchronize()
+
+
+def build_sample(seed: int, vector_len: int) -> np.ndarray:
+    mask = np.uint64(vector_len - 1)
+    scale = 1.0 / vector_len
+    idx = np.arange(vector_len, dtype=np.uint64)
+    mixed = (idx + np.uint64(seed)) & mask
+    return mixed.astype(np.float64) * scale
+
+
+def prefetched_batches(
+    total_batches: int, batch_size: int, vector_len: int, prefetch: int
+):
+    q: queue.Queue[np.ndarray | None] = queue.Queue(maxsize=prefetch)
+
+    def producer():
+        for batch_idx in range(total_batches):
+            base = batch_idx * batch_size
+            batch = [build_sample(base + i, vector_len) for i in range(batch_size)]
+            q.put(np.stack(batch))
+        q.put(None)
+
+    threading.Thread(target=producer, daemon=True).start()
+
+    while True:
+        batch = q.get()
+        if batch is None:
+            break
+        yield batch
+
+
+def normalize_batch(batch: np.ndarray) -> np.ndarray:
+    norms = np.linalg.norm(batch, axis=1, keepdims=True)
+    norms[norms == 0] = 1.0
+    return batch / norms
+
+
+def parse_frameworks(raw: str) -> list[str]:
+    if raw.lower() == "all":
+        return list(FRAMEWORK_CHOICES)
+
+    selected: list[str] = []
+    for part in raw.split(","):
+        name = part.strip().lower()
+        if not name:
+            continue
+        if name not in FRAMEWORK_CHOICES:
+            raise ValueError(
+                f"Unknown framework '{name}'. Choose from: "
+                f"{', '.join(FRAMEWORK_CHOICES)} or 'all'."
+            )
+        if name not in selected:
+            selected.append(name)
+
+    return selected if selected else list(FRAMEWORK_CHOICES)
+
+
+def run_mahout(num_qubits: int, total_batches: int, batch_size: int, prefetch: int):
+    try:
+        engine = QdpEngine(0)
+    except Exception as exc:
+        print(f"[Mahout] Init failed: {exc}")
+        return 0.0, 0.0
+
+    sync_cuda()
+    start = time.perf_counter()
+    processed = 0
+
+    for batch in prefetched_batches(
+        total_batches, batch_size, 1 << num_qubits, prefetch
+    ):
+        normalized = normalize_batch(batch)
+        for sample in normalized:
+            qtensor = engine.encode(sample.tolist(), num_qubits, "amplitude")
+            _ = torch.utils.dlpack.from_dlpack(qtensor)
+            processed += 1
+
+    sync_cuda()
+    duration = time.perf_counter() - start
+    latency_ms = (duration / processed) * 1000 if processed > 0 else 0.0
+    print(f"  Total Time: {duration:.4f} s ({latency_ms:.3f} ms/vector)")
+    return duration, latency_ms
+
+
+def run_pennylane(num_qubits: int, total_batches: int, batch_size: int, prefetch: int):
+    if not HAS_PENNYLANE:
+        print("[PennyLane] Not installed, skipping.")
+        return 0.0, 0.0
+
+    dev = qml.device("default.qubit", wires=num_qubits)
+
+    @qml.qnode(dev, interface="torch")
+    def circuit(inputs):
+        qml.AmplitudeEmbedding(
+            features=inputs, wires=range(num_qubits), normalize=True, pad_with=0.0
+        )
+        return qml.state()
+
+    sync_cuda()
+    start = time.perf_counter()
+    processed = 0
+
+    for batch in prefetched_batches(
+        total_batches, batch_size, 1 << num_qubits, prefetch
+    ):
+        batch_cpu = torch.tensor(batch, dtype=torch.float64)
+        try:
+            state_cpu = circuit(batch_cpu)
+        except Exception:
+            state_cpu = torch.stack([circuit(x) for x in batch_cpu])
+        _ = state_cpu.to("cuda", dtype=torch.complex64)
+        processed += len(batch_cpu)
+
+    sync_cuda()
+    duration = time.perf_counter() - start
+    latency_ms = (duration / processed) * 1000 if processed > 0 else 0.0
+    print(f"  Total Time: {duration:.4f} s ({latency_ms:.3f} ms/vector)")
+    return duration, latency_ms
+
+
+def run_qiskit_init(
+    num_qubits: int, total_batches: int, batch_size: int, prefetch: int
+):
+    if not HAS_QISKIT:
+        print("[Qiskit] Not installed, skipping.")
+        return 0.0, 0.0
+
+    backend = AerSimulator(method="statevector")
+    sync_cuda()
+    start = time.perf_counter()
+    processed = 0
+
+    for batch in prefetched_batches(
+        total_batches, batch_size, 1 << num_qubits, prefetch
+    ):
+        normalized = normalize_batch(batch)
+        for vec in normalized:
+            qc = QuantumCircuit(num_qubits)
+            qc.initialize(vec, range(num_qubits))
+            qc.save_statevector()
+            t_qc = transpile(qc, backend)
+            state = backend.run(t_qc).result().get_statevector().data
+            _ = torch.tensor(state, device="cuda", dtype=torch.complex64)
+            processed += 1
+
+    sync_cuda()
+    duration = time.perf_counter() - start
+    latency_ms = (duration / processed) * 1000 if processed > 0 else 0.0
+    print(f"  Total Time: {duration:.4f} s ({latency_ms:.3f} ms/vector)")
+    return duration, latency_ms
+
+
+def run_qiskit_statevector(
+    num_qubits: int, total_batches: int, batch_size: int, prefetch: int
+):
+    if not HAS_QISKIT:
+        print("[Qiskit] Not installed, skipping.")
+        return 0.0, 0.0
+
+    sync_cuda()
+    start = time.perf_counter()
+    processed = 0
+
+    for batch in prefetched_batches(
+        total_batches, batch_size, 1 << num_qubits, prefetch
+    ):
+        normalized = normalize_batch(batch)
+        for vec in normalized:
+            state = Statevector(vec)
+            _ = torch.tensor(state.data, device="cuda", dtype=torch.complex64)
+            processed += 1
+
+    sync_cuda()
+    duration = time.perf_counter() - start
+    latency_ms = (duration / processed) * 1000 if processed > 0 else 0.0
+    print(f"  Total Time: {duration:.4f} s ({latency_ms:.3f} ms/vector)")
+    return duration, latency_ms
+
+
+def main():
+    parser = argparse.ArgumentParser(
+        description="Benchmark Data-to-State latency across frameworks."
+    )
+    parser.add_argument(
+        "--qubits",
+        type=int,
+        default=16,
+        help="Number of qubits (power-of-two vector length).",
+    )
+    parser.add_argument("--batches", type=int, default=200, help="Total batches.")
+    parser.add_argument("--batch-size", type=int, default=64, help="Vectors per batch.")
+    parser.add_argument(
+        "--prefetch", type=int, default=16, help="CPU-side prefetch depth."
+    )
+    parser.add_argument(
+        "--frameworks",
+        type=str,
+        default="all",
+        help=(
+            "Comma-separated list of frameworks to run "
+            "(pennylane,qiskit-init,qiskit-statevector,mahout) or 'all'."
+        ),
+    )
+    args = parser.parse_args()
+
+    if not torch.cuda.is_available():
+        raise SystemExit("CUDA device not available; GPU is required.")
+
+    try:
+        frameworks = parse_frameworks(args.frameworks)
+    except ValueError as exc:
+        parser.error(str(exc))
+
+    total_vectors = args.batches * args.batch_size
+    vector_len = 1 << args.qubits
+
+    print(f"Generating {total_vectors} samples of {args.qubits} qubits...")
+    print(f"  Batch size : {args.batch_size}")
+    print(f"  Vector length: {vector_len}")
+    print(f"  Batches    : {args.batches}")
+    print(f"  Prefetch   : {args.prefetch}")
+    print(f"  Frameworks : {', '.join(frameworks)}")
+    bytes_per_vec = vector_len * 8
+    print(f"  Generated {total_vectors} samples")
+    print(
+        f"  PennyLane/Qiskit format: {total_vectors * bytes_per_vec / (1024 * 1024):.2f} MB"
+    )
+    print(f"  Mahout format: {total_vectors * bytes_per_vec / (1024 * 1024):.2f} MB")
+    print()
+
+    print(BAR)
+    print(
+        f"DATA-TO-STATE LATENCY BENCHMARK: {args.qubits} Qubits, {total_vectors} Samples"
+    )
+    print(BAR)
+
+    t_pl = l_pl = 0.0
+    t_q_init = l_q_init = 0.0
+    t_q_sv = l_q_sv = 0.0
+    t_mahout = l_mahout = 0.0
+
+    if "pennylane" in frameworks:
+        print()
+        print("[PennyLane] Full Pipeline (DataLoader -> GPU)...")
+        t_pl, l_pl = run_pennylane(
+            args.qubits, args.batches, args.batch_size, args.prefetch
+        )
+
+    if "qiskit-init" in frameworks:
+        print()
+        print("[Qiskit Initialize] Full Pipeline (DataLoader -> GPU)...")
+        t_q_init, l_q_init = run_qiskit_init(
+            args.qubits, args.batches, args.batch_size, args.prefetch
+        )
+
+    if "qiskit-statevector" in frameworks:
+        print()
+        print("[Qiskit Statevector] Full Pipeline (DataLoader -> GPU)...")
+        t_q_sv, l_q_sv = run_qiskit_statevector(
+            args.qubits, args.batches, args.batch_size, args.prefetch
+        )
+
+    if "mahout" in frameworks:
+        print()
+        print("[Mahout] Full Pipeline (DataLoader -> GPU)...")
+        t_mahout, l_mahout = run_mahout(
+            args.qubits, args.batches, args.batch_size, args.prefetch
+        )
+
+    print()
+    print(BAR)
+    print("LATENCY (Lower is Better)")
+    print(f"Samples: {total_vectors}, Qubits: {args.qubits}")
+    print(BAR)
+
+    latency_results = []
+    if l_pl > 0:
+        latency_results.append((FRAMEWORK_LABELS["pennylane"], l_pl))
+    if l_q_init > 0:
+        latency_results.append((FRAMEWORK_LABELS["qiskit-init"], l_q_init))
+    if l_q_sv > 0:
+        latency_results.append((FRAMEWORK_LABELS["qiskit-statevector"], l_q_sv))
+    if l_mahout > 0:
+        latency_results.append((FRAMEWORK_LABELS["mahout"], l_mahout))
+
+    latency_results.sort(key=lambda x: x[1])
+
+    for name, latency in latency_results:
+        print(f"{name:18s} {latency:10.3f} ms/vector")
+
+    if l_mahout > 0:
+        print(SEP)
+        if l_pl > 0:
+            print(f"Speedup vs PennyLane: {l_pl / l_mahout:10.2f}x")
+        if l_q_init > 0:
+            print(f"Speedup vs Qiskit Init: {l_q_init / l_mahout:10.2f}x")
+        if l_q_sv > 0:
+            print(f"Speedup vs Qiskit Statevec: {l_q_sv / l_mahout:10.2f}x")
+
+
+if __name__ == "__main__":
+    main()

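For reviewers skimming the patch, the producer/consumer pattern at the heart of `prefetched_batches` can be exercised on its own. The sketch below is a simplified standalone copy, not the committed code: the sample generator is reduced to a trivial constant fill so it runs without Mahout, torch, or a GPU, while keeping the bounded queue and `None` sentinel that let generation overlap with consumption.

```python
import queue
import threading

import numpy as np


def prefetched_batches(total_batches, batch_size, vector_len, prefetch):
    """Yield batches produced on a background thread via a bounded queue."""
    # maxsize bounds how far the producer can run ahead of the consumer,
    # capping host RAM at roughly `prefetch` in-flight batches.
    q = queue.Queue(maxsize=prefetch)

    def producer():
        for batch_idx in range(total_batches):
            base = batch_idx * batch_size
            # Simplified stand-in for the real sample generator.
            batch = [np.full(vector_len, float(base + i)) for i in range(batch_size)]
            q.put(np.stack(batch))
        q.put(None)  # sentinel: tells the consumer the stream is finished

    threading.Thread(target=producer, daemon=True).start()

    while True:
        batch = q.get()
        if batch is None:
            break
        yield batch


batches = list(prefetched_batches(total_batches=4, batch_size=2, vector_len=8, prefetch=2))
print(len(batches), batches[0].shape)  # 4 batches, each of shape (2, 8)
```

Because the queue blocks on `put` once full and on `get` once empty, no explicit locking is needed; the daemon flag ensures the producer thread does not keep the process alive if the consumer exits early.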