I was curious and had a bit of free time, so I spent a little while with
the devil's friend (i.e. a coding chatbot) and got the attached file.
This implements the "frontend" of method 2 from the previous email, where
we take a PDF file, scrape out of it what we need to look at, and then
compare the results of this scraping.

The script I got is neat and very small (I haven't tested it for
correctness, but eyeballing it, it seems to me to land somewhere between
"fairly close" and "fine").

Here's what it does:
 - open a pdf and for each page:
   - print out a classified log of all the graphics/character elements
(like a UDPS file; note that at the moment the output is just a text dump)
       - the chatbot says this extraction method uses page-absolute
coordinates (this will need to be double-checked)
   - unpack all the fonts and emit a list of per-glyph hashes built to
normalize away several immaterial differences (glyph names, curve order,
that kind of stuff)
      - it also normalizes CV locations; fwiw, I think we should actually
remove that particular step
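To make the per-glyph hashing idea concrete, here is a standalone sketch of the normalization step on toy polygons (not the fontTools pen machinery from the attached script): quantize the coordinates, fix the winding, then rotate to a canonical start point, so the same contour hashes identically no matter where it starts or which way it runs.

```python
import hashlib


def normalize_points(pts, quantize=64):
    """Quantize coords, force one winding, then rotate to a canonical
    start point, so equivalent contours serialize identically."""
    q = [(round(x * quantize) / quantize, round(y * quantize) / quantize)
         for (x, y) in pts]
    if not q:
        return ()
    # signed area (shoelace variant): positive means clockwise in y-up
    # coordinates; reverse so every contour winds counter-clockwise
    area = sum((x2 - x1) * (y2 + y1)
               for (x1, y1), (x2, y2) in zip(q, q[1:] + q[:1]))
    if area > 0:
        q.reverse()
    # rotate so the lexicographically smallest point comes first
    i = min(range(len(q)), key=lambda k: q[k])
    return tuple(q[i:] + q[:i])


def contour_hash(pts):
    return hashlib.sha1(str(normalize_points(pts)).encode()).hexdigest()


# same unit square, different start point and opposite direction
a = [(0, 0), (1, 0), (1, 1), (0, 1)]
b = [(1, 1), (1, 0), (0, 0), (0, 1)]
```

Both `a` and `b` normalize to the identical tuple and therefore hash the same, which is the whole point of the exercise.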

As I type this, I realize I forgot to dump the color of things (I somehow
remembered the color space information for the images, but not the color
of the glyphs or other graphics elements).
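For when I get around to it: I believe the dicts that get_drawings() returns carry "color" (stroke) and "fill" keys as RGB float triples (or None), so a small formatter would do. Sketch below over a hand-built dict of that shape, untested against real PyMuPDF output:

```python
def describe_color(d):
    """Format stroke/fill colors from a get_drawings()-shaped path dict.
    Colors are assumed to be RGB float triples in [0, 1], or None."""
    def fmt(c):
        if c is None:
            return "none"
        # scale each channel to 0..255 and print as a hex color
        return "#%02x%02x%02x" % tuple(round(v * 255) for v in c)
    return f"stroke={fmt(d.get('color'))} fill={fmt(d.get('fill'))}"


# hand-built dict shaped like one get_drawings() entry (illustrative)
path = {"type": "fs", "color": (1.0, 0.0, 0.0), "fill": None}
```

This would slot into the graphics loop by appending `describe_color(d)` to what gets handed to the serializer.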

The way I see it, for 32 mins of investment it's an excellent start (and
an excellent use of my time, in the sense that there is no way I would
have learned how to use both mupdf _and_ fonttools in 30 mins, whereas
now I have a blueprint of where to go and can see clearly what to
investigate further when things don't work out).

The thing that by far I like the most is how small the script is. :-)

We could join it with the previous part, using JSON (per Han-Wen's
suggestion) instead of my UDPS thing to stitch the two, and we'd have a
regtesting core in our hands.
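Fwiw, here's a sketch of what the JSON flavor could look like: a drop-in for the Serializer in the attached script that collects one record per element and emits a single canonical JSON document per run. Only two callbacks shown, the rest would follow the same pattern; all names illustrative:

```python
import json


class JSONSerializer:
    """Sketch of a JSON-emitting drop-in for the attached Serializer:
    accumulates records per page instead of printing text lines."""

    def __init__(self):
        self.pages = []
        self._cur = None

    def page_start(self, page_num):
        self._cur = {"page": page_num, "elements": []}

    def page_end(self, page_num):
        self.pages.append(self._cur)
        self._cur = None

    def glyph(self, font, gid, bbox):
        self._cur["elements"].append(
            {"kind": "glyph", "font": font, "gid": gid, "bbox": list(bbox)})

    def font(self, font_name, embedded, subset, file_ref, fingerprint,
             glyph_count):
        self._cur["elements"].append(
            {"kind": "font", "name": font_name, "embedded": embedded,
             "subset": subset, "fingerprint": fingerprint,
             "glyphs": glyph_count})

    def dumps(self):
        # sort_keys so two runs over identical PDFs compare byte-for-byte
        return json.dumps(self.pages, sort_keys=True)


# usage sketch: what dump_pdf_items would drive
s = JSONSerializer()
s.page_start(1)
s.glyph("F1", 3, [0, 0, 10, 12])
s.page_end(1)
```

Then `dump_pdf_items("file.pdf", JSONSerializer())` would hand the diff tool one canonical JSON document per PDF instead of a text dump.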

Separately, I was kinda curious about the font extraction bit in its own
right: I seem to recall a past discussion where someone (possibly Werner)
explained to me that sometimes we join PDFs that use the same font but in
different subsets, that we had trouble remerging these subsets after the
fact, and that maybe ghostscript didn't have a simple method to help us
through this? It seems to me this hashing approach could be a way to
achieve that remerging.
IIRC the duplicate subsets made our files concerningly larger, but
unfortunately I've completely forgotten the exact context in which this
explanation came up. I'm pretty sure it was 2-3 years ago :-(
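If the hashes work out, the matching step of such a remerge could be as simple as the hypothetical helper below, operating on the glyph_hashes dicts the attached script already computes (one per subset); everything here is illustrative:

```python
def shared_glyphs(hashes_a, hashes_b):
    """Map each glyph name in subset A to the glyph names in subset B
    whose outlines hash identically, i.e. the candidates for dedup
    when remerging two subsets of the same base font."""
    by_hash = {}
    for name, h in hashes_b.items():
        by_hash.setdefault(h, []).append(name)
    return {name: by_hash[h]
            for name, h in hashes_a.items() if h in by_hash}


# two imaginary subsets of the same base font (hashes made up)
a = {"A": "h1", "B": "h2", "C": "h3"}
b = {"A": "h1", "C": "h3", "D": "h4"}
```

Here A and C would be flagged as duplicated outlines; a merger could then keep one copy of each and remap the second subset's references.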

Anyways, this seems like a bit more progress on the regtesting
discussion; I hope it can be helpful/useful.

L


-- 
Luca Fascione
#!/usr/bin/env python3
# SPDX-License-Identifier: MIT
# Copyright Luca Fascione 2025
# ----------------------------

import fitz  # PyMuPDF
import hashlib
import io
from fontTools.ttLib import TTFont
from fontTools.pens.basePen import BasePen

# -------------------------------
# Glyph hashing with normalization
# -------------------------------


class ContourCollectorPen(BasePen):
    def __init__(self, glyphSet):
        super().__init__(glyphSet)
        self.contours = []
        self.current = []

    def _moveTo(self, p):
        if self.current:
            self.contours.append(self.current)
        self.current = [("M", p)]

    def _lineTo(self, p):
        self.current.append(("L", p))

    def _qCurveToOne(self, p1, p2):
        self.current.append(("Q", p1, p2))

    def _curveToOne(self, p1, p2, p3):
        self.current.append(("C", p1, p2, p3))

    def _closePath(self):
        if self.current:
            self.contours.append(self.current)
            self.current = []

    def _endPath(self):
        if self.current:
            self.contours.append(self.current)
            self.current = []


def normalize_contour(contour, quantize=64):
    """Normalize contour: round coords, rotate start, force clockwise."""
    pts = []
    for cmd in contour:
        if cmd[0] == "M":
            pts.append(cmd[1])
        elif cmd[0] == "L":
            pts.append(cmd[1])
        elif cmd[0] == "Q":
            pts.append(cmd[1]); pts.append(cmd[2])
        elif cmd[0] == "C":
            pts.append(cmd[1]); pts.append(cmd[2]); pts.append(cmd[3])

    qpts = [(round(x*quantize)/quantize, round(y*quantize)/quantize) for (x, y) in pts]

    if not qpts:
        return []

    # orientation: signed area via the shoelace variant
    # (with this formula, positive means clockwise in y-up font coordinates)
    area = 0
    for i in range(len(qpts)):
        x1, y1 = qpts[i]
        x2, y2 = qpts[(i+1) % len(qpts)]
        area += (x2 - x1) * (y2 + y1)
    if area > 0:  # clockwise: reverse to force counter-clockwise
        qpts.reverse()

    # rotate start point (after the winding fix, so a direction-reversed
    # copy of the same contour lands on the same canonical start)
    min_idx = min(range(len(qpts)), key=lambda i: qpts[i])
    qpts = qpts[min_idx:] + qpts[:min_idx]

    return tuple(qpts)


def hash_glyph_outline(ttfont, glyph_name):
    glyphSet = ttfont.getGlyphSet()
    glyph = glyphSet[glyph_name]
    pen = ContourCollectorPen(glyphSet)
    glyph.draw(pen)

    normalized_contours = []
    for contour in pen.contours:
        nc = normalize_contour(contour)
        if nc:
            normalized_contours.append(nc)

    def bbox(c):
        xs = [p[0] for p in c]; ys = [p[1] for p in c]
        return (min(xs), min(ys), max(xs), max(ys))
    normalized_contours.sort(key=bbox)

    serialized = str(normalized_contours).encode("utf-8")
    return hashlib.sha1(serialized).hexdigest()


def font_fingerprint(ttfont):
    """Compute per-glyph hashes and a font-level hash."""
    cmap = ttfont.getBestCmap() or {}
    glyph_hashes = {}
    for codepoint, gname in cmap.items():
        try:
            h = hash_glyph_outline(ttfont, gname)
            glyph_hashes[gname] = h
        except Exception:
            # Some glyphs may not have outlines (spaces etc.)
            continue
    # font-level fingerprint: hash of sorted glyph hashes
    all_hashes = sorted(glyph_hashes.values())
    serialized = "|".join(all_hashes).encode("utf-8")
    font_hash = hashlib.sha1(serialized).hexdigest()
    return glyph_hashes, font_hash


# -------------------------------
# Serializer (pluggable output)
# -------------------------------

class Serializer:
    def page_start(self, page_num):
        print(f"=== Page {page_num} ===")

    def page_end(self, page_num):
        print(f"=== End Page {page_num} ===\n")

    def glyph(self, font, gid, bbox):
        print(f"Glyph: font={font}, gid={gid}, bbox={bbox}")

    def graphics(self, kind, items, bbox):
        print(f"Graphics: kind={kind}, items={items}, bbox={bbox}")

    def image(self, xref, name, size, cs, bpc):
        print(f"Image xref={xref}, name={name}, size={size}, cs={cs}, bpc={bpc}")

    def font(self, font_name, embedded, subset, file_ref, fingerprint, glyph_count):
        print(f"Font: {font_name}, embedded={embedded}, subset={subset}, "
              f"file_ref={file_ref}, fingerprint={fingerprint}, glyphs={glyph_count}")

    def annotation(self, annot_type, bbox):
        print(f"Annotation: type={annot_type}, bbox={bbox}")

    def formfield(self, field_type, field_name, rect):
        print(f"Form field: {field_type}, name={field_name}, rect={rect}")


# -------------------------------
# PDF walker
# -------------------------------

def dump_pdf_items(pdf_path, serializer=None):
    if serializer is None:
        serializer = Serializer()

    doc = fitz.open(pdf_path)

    for page_num, page in enumerate(doc, start=1):
        serializer.page_start(page_num)

        # --- TEXT & GLYPHS ---
        rawdict = page.get_text("rawdict")
        for block in rawdict["blocks"]:
            if block["type"] != 0:
                continue
            for line in block["lines"]:
                for span in line["spans"]:
                    fontname = span["font"]
                    for ch in span["chars"]:
                        bbox = ch["bbox"]
                        gid = ch.get("gid", None)
                        serializer.glyph(fontname, gid, bbox)

        # --- GRAPHICS ---
        for d in page.get_drawings():
            kind = d["type"]
            items = [(item[0], item[1:]) for item in d["items"]]
            serializer.graphics(kind, items, d['rect'])

        # --- IMAGES ---
        for img in page.get_images(full=True):
            # get_images(full=True) tuples are (xref, smask, width, height,
            # bpc, colorspace, alt_colorspace, name, filter, referencer)
            xref, smask, w, h, bpc, cs = img[:6]
            name = img[7]
            serializer.image(xref, name, (w, h), cs, bpc)

        # --- FONTS ---
        for font in page.get_fonts(full=True):
            # get_fonts(full=True) tuples are
            # (xref, ext, type, basefont, name, encoding, referencer)
            font_xref, ext, ftype, basefont, ref_name, encoding = font[:6]
            # subset fonts carry a six-letter "ABCDEF+" tag on the base name
            subset = len(basefont) > 6 and basefont[6] == "+"
            # extract_font returns (basefont, ext, subtype, buffer);
            # an empty buffer means the font is not embedded
            base_name, fext, fsubtype, font_bytes = doc.extract_font(font_xref)
            embedded = bool(font_bytes)
            fingerprint, glyph_count = None, 0
            if embedded:
                try:
                    tt = TTFont(io.BytesIO(font_bytes))
                    glyph_hashes, font_hash = font_fingerprint(tt)
                    fingerprint, glyph_count = font_hash, len(glyph_hashes)
                except Exception as e:
                    fingerprint = f"error:{e}"
            serializer.font(basefont, embedded, subset, ext, fingerprint, glyph_count)

        # --- ANNOTATIONS / FORMS ---
        for annot in page.annots() or []:
            serializer.annotation(annot.type[1], annot.rect)
        for widget in page.widgets():
            serializer.formfield(widget.field_type, widget.field_name, widget.rect)

        serializer.page_end(page_num)


if __name__ == "__main__":
    dump_pdf_items("example.pdf")
