Branch: refs/heads/main
Home: https://github.com/WebKit/WebKit
Commit: 5dcb53048a480d428fb17002cbe45fb315ffeef3
https://github.com/WebKit/WebKit/commit/5dcb53048a480d428fb17002cbe45fb315ffeef3
Author: Wenson Hsieh <[email protected]>
Date: 2026-06-08 (Mon, 08 Jun 2026)
Changed paths:
M LayoutTests/fast/text-extraction/debug-text-extraction-basic-expected.txt
M LayoutTests/fast/text-extraction/debug-text-extraction-basic.html
M Source/WebCore/PAL/pal/cocoa/NaturalLanguageSoftLink.h
M Source/WebCore/PAL/pal/cocoa/NaturalLanguageSoftLink.mm
M Source/WebCore/page/text-extraction/TextExtraction.cpp
M Source/WebCore/page/text-extraction/TextExtractionTypes.h
A Source/WebKit/Platform/classifier/cocoa/TextExtractionTokenizer.h
A Source/WebKit/Platform/classifier/cocoa/TextExtractionTokenizer.mm
M Source/WebKit/Shared/TextExtractionToStringConversion.cpp
M Source/WebKit/Shared/WebCoreArgumentCoders.serialization.in
M Source/WebKit/SourcesCocoa.txt
M Source/WebKit/WebKit.xcodeproj/project.pbxproj
Log Message:
-----------
[AutoFill Debugging] Fall back to extracting class names/id in the absence of
any other attributes
https://bugs.webkit.org/show_bug.cgi?id=316489
rdar://178909853
Reviewed by Abrar Rahman Protyasha.
Opportunistically surface class names and ID attributes in the html, textTree, and JSON output
formats when an element would otherwise show up as nothing more than `uid=…`. However, only do this
if the class name or ID is likely to convey any semantic meaning; to achieve this…
1. In the web process, we apply a cheap entropy filter (the existing `isCandidateClassOrId`
   predicate) to populate two new fields on `TextExtraction::Item`: `Vector<String> classNames`,
   capped at 5, and `String idAttribute`.
2. In the UI process, the new `TextExtractionTokenizer` helper class further filters those
   candidates against `NLEmbedding` word vocabularies for `{en, de, es, fr, it, pt}`, but only
   when the element carries no other semantic signal. The gate suppresses the hint when any of the
   following are present: `accessibilityRole`, `title`, `ariaAttributes`, `clientAttributes`, an
   editable `label`/`placeholder`/`name`/`pattern`, `altText` on images, `name`/`autocomplete` on
   forms, `completedURL` on links, `origin` on iframes, or text children with trimmed length > 2.
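
The suppression gate in step 2 can be sketched roughly as follows. This is an illustrative
approximation, not the code from this commit: `ItemSignals`, `trimmedLength`, and
`hasOtherSemanticSignal` are hypothetical names, and the per-element-type fields (editable
label/placeholder, image `altText`, link `completedURL`, and so on) are flattened into a single
list of strings for brevity.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical, flattened stand-in for the per-item data the gate inspects.
struct ItemSignals {
    std::string accessibilityRole;
    std::string title;
    std::vector<std::string> ariaAttributes;
    std::vector<std::string> clientAttributes;
    // Assumption: editable label/placeholder/name/pattern, image altText, form
    // name/autocomplete, link completedURL, and iframe origin all land here.
    std::vector<std::string> typeSpecificAttributes;
    std::vector<std::string> textChildren;
};

// Length of the text after stripping leading and trailing ASCII whitespace.
static size_t trimmedLength(const std::string& text)
{
    auto isSpace = [](unsigned char c) { return std::isspace(c) != 0; };
    auto first = std::find_if_not(text.begin(), text.end(), isSpace);
    auto last = std::find_if_not(text.rbegin(), text.rend(), isSpace).base();
    return first < last ? static_cast<size_t>(last - first) : 0;
}

// Returns true when the class/id hint should be suppressed.
bool hasOtherSemanticSignal(const ItemSignals& item)
{
    if (!item.accessibilityRole.empty() || !item.title.empty())
        return true;
    if (!item.ariaAttributes.empty() || !item.clientAttributes.empty())
        return true;
    auto nonEmpty = [](const std::string& value) { return !value.empty(); };
    if (std::any_of(item.typeSpecificAttributes.begin(), item.typeSpecificAttributes.end(), nonEmpty))
        return true;
    // Text children count as a signal only when longer than 2 characters once trimmed.
    return std::any_of(item.textChildren.begin(), item.textChildren.end(),
        [](const std::string& text) { return trimmedLength(text) > 2; });
}
```

With this sketch, an element whose only content is the text " ok " still qualifies for a
class/id hint, while one containing " Buy " does not.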
The tokenizer splits on camelCase boundaries and HTML-attribute-safe symbols (-, _, :, ., /,
whitespace), drops pure-digit segments, and accepts the input when recognized tokens cover strictly
more than half of the total non-digit token characters across all six embeddings (English first,
early-exit per language). This "mostly recognized" rule lets common abbreviations like `btn` ride
along inside identifiers like `userMenuBtn` while still rejecting hashed/random class names. When
both id and class survive, id wins.
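
A minimal, portable sketch of the tokenizing and "mostly recognized" rule described above. It
is not the `TextExtractionTokenizer` implementation: plain `std::set` word lists stand in for the
`NLEmbedding` vocabularies, the free functions `tokenize` and `isMostlyRecognized` are
illustrative, and it assumes that a token counts as recognized as soon as any one vocabulary
(checked in order, English first) contains it.

```cpp
#include <algorithm>
#include <cctype>
#include <set>
#include <string>
#include <vector>

// Splits on separators (-, _, :, ., /, whitespace) and on lower-to-upper camelCase
// boundaries, lowercases each token, and drops segments made only of digits.
std::vector<std::string> tokenize(const std::string& identifier)
{
    static const std::string separators = "-_:./";
    std::vector<std::string> tokens;
    std::string current;
    auto flush = [&] {
        bool allDigits = std::all_of(current.begin(), current.end(),
            [](unsigned char c) { return std::isdigit(c) != 0; });
        if (!current.empty() && !allDigits)
            tokens.push_back(current);
        current.clear();
    };
    for (size_t i = 0; i < identifier.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(identifier[i]);
        if (std::isspace(c) || separators.find(static_cast<char>(c)) != std::string::npos) {
            flush();
            continue;
        }
        bool camelBoundary = i > 0 && std::isupper(c)
            && std::islower(static_cast<unsigned char>(identifier[i - 1]));
        if (camelBoundary)
            flush();
        current += static_cast<char>(std::tolower(c));
    }
    flush();
    return tokens;
}

// Accepts the identifier when recognized tokens cover strictly more than half of
// all token characters. `vocabularies` is ordered by priority (English first).
bool isMostlyRecognized(const std::string& identifier,
    const std::vector<std::set<std::string>>& vocabularies)
{
    size_t total = 0;
    size_t recognized = 0;
    for (const auto& token : tokenize(identifier)) {
        total += token.size();
        for (const auto& vocabulary : vocabularies) {
            if (vocabulary.count(token)) {
                recognized += token.size();
                break; // Early exit: no need to consult the remaining languages.
            }
        }
    }
    return recognized * 2 > total;
}
```

For `userMenuBtn` with a vocabulary containing `user` and `menu`, 8 of 11 token characters are
recognized, so the identifier is accepted even though `btn` is not a word. An identifier that is
exactly half recognized is rejected because the comparison is strict.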
* LayoutTests/fast/text-extraction/debug-text-extraction-basic-expected.txt:
* LayoutTests/fast/text-extraction/debug-text-extraction-basic.html:
Cover both the surfacing paths (basic class hint, id beats class, half-recognized class) and the
gate suppression paths (text > 2 chars, aria-label, input placeholder).
* Source/WebCore/PAL/pal/cocoa/NaturalLanguageSoftLink.h:
* Source/WebCore/PAL/pal/cocoa/NaturalLanguageSoftLink.mm:
* Source/WebCore/page/text-extraction/TextExtraction.cpp:
(WebCore::TextExtraction::extractRecursive):
Populates `classNames` (capped at 5) and `idAttribute` on the per-item `TextExtraction::Item`,
gated on the existing `isCandidateClassOrId` entropy heuristic.
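
As an illustration of the capped collection, the sketch below keeps at most five class names that
pass a caller-supplied predicate. The function name `collectCandidateClassNames` is hypothetical,
the predicate merely stands in for `isCandidateClassOrId` (whose logic is not reproduced here),
and it is an assumption that the cap applies to the first candidates in document order.

```cpp
#include <functional>
#include <string>
#include <vector>

// Keeps the first `maxCount` class names accepted by `isCandidate`, in order.
std::vector<std::string> collectCandidateClassNames(
    const std::vector<std::string>& classList,
    const std::function<bool(const std::string&)>& isCandidate,
    size_t maxCount = 5)
{
    std::vector<std::string> result;
    for (const auto& name : classList) {
        if (result.size() >= maxCount)
            break; // Cap reached; ignore any remaining class names.
        if (isCandidate(name))
            result.push_back(name);
    }
    return result;
}
```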
* Source/WebCore/page/text-extraction/TextExtractionTypes.h:
* Source/WebKit/Platform/classifier/cocoa/TextExtractionTokenizer.h: Added.
* Source/WebKit/Platform/classifier/cocoa/TextExtractionTokenizer.mm: Added.
(WebKit::TextExtractionTokenizer::isMostlyRecognized):
Implements the "more than half of token characters are recognized" rule.
* Source/WebKit/Shared/TextExtractionToStringConversion.cpp:
(WebKit::recognizedClassesAndIdForItem):
Implements the suppression gate and id-over-class priority. Returns empty when the gate suppresses
the hint, in which case neither classes nor id are surfaced.
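
The priority rule can be sketched as below. This is a simplified illustration and does not mirror
the signature of `recognizedClassesAndIdForItem`: `RecognizedHint` and `pickHint` are hypothetical
names, the gate result is passed in as a boolean, and `isRecognized` stands in for the tokenizer's
"mostly recognized" check.

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical result type: at most one of the two members is populated.
struct RecognizedHint {
    std::string idAttribute;
    std::vector<std::string> classNames;
};

RecognizedHint pickHint(bool suppressedByGate, const std::string& idAttribute,
    const std::vector<std::string>& classNames,
    const std::function<bool(const std::string&)>& isRecognized)
{
    if (suppressedByGate)
        return { }; // Neither classes nor id are surfaced.
    if (!idAttribute.empty() && isRecognized(idAttribute))
        return { idAttribute, { } }; // The id wins over any surviving classes.
    RecognizedHint hint;
    for (const auto& name : classNames) {
        if (isRecognized(name))
            hint.classNames.push_back(name);
    }
    return hint;
}
```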
* Source/WebKit/Shared/WebCoreArgumentCoders.serialization.in:
* Source/WebKit/SourcesCocoa.txt:
* Source/WebKit/WebKit.xcodeproj/project.pbxproj:
Canonical link: https://commits.webkit.org/314753@main