Branch: refs/heads/main
  Home:   https://github.com/WebKit/WebKit
  Commit: 5dcb53048a480d428fb17002cbe45fb315ffeef3
      
https://github.com/WebKit/WebKit/commit/5dcb53048a480d428fb17002cbe45fb315ffeef3
  Author: Wenson Hsieh <[email protected]>
  Date:   2026-06-08 (Mon, 08 Jun 2026)

  Changed paths:
    M LayoutTests/fast/text-extraction/debug-text-extraction-basic-expected.txt
    M LayoutTests/fast/text-extraction/debug-text-extraction-basic.html
    M Source/WebCore/PAL/pal/cocoa/NaturalLanguageSoftLink.h
    M Source/WebCore/PAL/pal/cocoa/NaturalLanguageSoftLink.mm
    M Source/WebCore/page/text-extraction/TextExtraction.cpp
    M Source/WebCore/page/text-extraction/TextExtractionTypes.h
    A Source/WebKit/Platform/classifier/cocoa/TextExtractionTokenizer.h
    A Source/WebKit/Platform/classifier/cocoa/TextExtractionTokenizer.mm
    M Source/WebKit/Shared/TextExtractionToStringConversion.cpp
    M Source/WebKit/Shared/WebCoreArgumentCoders.serialization.in
    M Source/WebKit/SourcesCocoa.txt
    M Source/WebKit/WebKit.xcodeproj/project.pbxproj

  Log Message:
  -----------
  [AutoFill Debugging] Fall back to extracting class names/id in the absence of any other attributes
https://bugs.webkit.org/show_bug.cgi?id=316489
rdar://178909853

Reviewed by Abrar Rahman Protyasha.

Opportunistically surface class names and ID attributes in the html, textTree, and JSON output
formats when an element would otherwise show up as nothing more than `uid=…`. However, only do this
if the class name or ID is likely to convey any semantic meaning; to achieve this…

1.  In the web process, we apply a cheap entropy filter (the existing `isCandidateClassOrId`
    predicate) to populate two new fields on `TextExtraction::Item`: `Vector<String> classNames`,
    capped at 5, and `String idAttribute`.

2.  In the UI process, the new `TextExtractionTokenizer` helper class further filters those
    candidates against `NLEmbedding` word vocabularies for `{en, de, es, fr, it, pt}`, but only
    when the element carries no other semantic signal. The gate suppresses the hint when any of the
    following is present: `accessibilityRole`, `title`, `ariaAttributes`, `clientAttributes`, an
    editable `label`/`placeholder`/`name`/`pattern`, `altText` on images, `name`/`autocomplete` on
    forms, `completedURL` on links, `origin` on iframes, or text children with trimmed length > 2.
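
The gate in step 2 can be sketched roughly as follows. This is a minimal illustration, not the
code from this commit: `ItemSketch`, `trimmedLength`, and `hasOtherSemanticSignal` are hypothetical
names, the struct is a flattened stand-in for `TextExtraction::Item`, and checking each text child
separately (rather than their combined length) is an assumption.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical, flattened stand-in for TextExtraction::Item. Each field
// mirrors one signal listed in the commit message; the real type differs.
struct ItemSketch {
    std::string accessibilityRole;
    std::string title;
    std::vector<std::string> ariaAttributes;
    std::vector<std::string> clientAttributes;
    bool hasEditableHint = false; // editable label/placeholder/name/pattern
    bool hasFormHint = false;     // form name/autocomplete
    std::string altText;          // images
    std::string completedURL;     // links
    std::string origin;           // iframes
    std::vector<std::string> textChildren;
};

// Length of the text after stripping leading and trailing whitespace.
static size_t trimmedLength(const std::string& text)
{
    size_t begin = 0;
    size_t end = text.size();
    while (begin < end && std::isspace(static_cast<unsigned char>(text[begin])))
        ++begin;
    while (end > begin && std::isspace(static_cast<unsigned char>(text[end - 1])))
        --end;
    return end - begin;
}

// Returns true when the element already carries a semantic signal, in which
// case the class/id hint would be suppressed.
bool hasOtherSemanticSignal(const ItemSketch& item)
{
    if (!item.accessibilityRole.empty() || !item.title.empty())
        return true;
    if (!item.ariaAttributes.empty() || !item.clientAttributes.empty())
        return true;
    if (item.hasEditableHint || item.hasFormHint)
        return true;
    if (!item.altText.empty() || !item.completedURL.empty() || !item.origin.empty())
        return true;
    // Assumption: each text child is checked on its own.
    for (const auto& text : item.textChildren) {
        if (trimmedLength(text) > 2)
            return true;
    }
    return false;
}
```

Under this sketch, an element whose only text child is `"  ok "` (trimmed length 2) still qualifies
for a class/id hint, while one containing `" Buy "` (trimmed length 3) does not.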

The tokenizer splits on camelCase boundaries and HTML-attribute-safe symbols (`-`, `_`, `:`, `.`,
`/`, whitespace), drops pure-digit segments, and accepts the input when recognized tokens cover
strictly more than half of the total non-digit token characters across all six embeddings (English
first, early-exit per language). This "mostly recognized" rule lets common abbreviations like `btn`
ride along inside identifiers like `userMenuBtn` while still rejecting hashed/random class names.
When both id and class survive, id wins.
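
As a rough illustration of the splitting and the "mostly recognized" rule described above, consider
the sketch below. It is not the `TextExtractionTokenizer` implementation: plain
`std::unordered_set` word lists stand in for the `NLEmbedding` vocabularies, the function names
`tokenize` and `isMostlyRecognized` are used here as free functions for illustration only, tokens
are lowercased before lookup (an assumption), and the denominator is taken to be the total length
of all tokens that are not purely digits (also an assumption).

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>
#include <unordered_set>
#include <vector>

// Stand-in for one language's vocabulary (the commit uses NLEmbedding).
using Vocabulary = std::unordered_set<std::string>;

// Splits on separators and lowercase-to-uppercase (camelCase) boundaries,
// lowercases tokens, and drops segments made only of digits.
std::vector<std::string> tokenize(const std::string& input)
{
    std::vector<std::string> tokens;
    std::string current;
    auto flush = [&] {
        bool allDigits = std::all_of(current.begin(), current.end(),
            [](unsigned char c) { return std::isdigit(c) != 0; });
        if (!current.empty() && !allDigits)
            tokens.push_back(current);
        current.clear();
    };
    for (size_t i = 0; i < input.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(input[i]);
        if (c == '-' || c == '_' || c == ':' || c == '.' || c == '/' || std::isspace(c)) {
            flush();
            continue;
        }
        if (std::isupper(c) && i > 0 && std::islower(static_cast<unsigned char>(input[i - 1])))
            flush();
        current.push_back(static_cast<char>(std::tolower(c)));
    }
    flush();
    return tokens;
}

// Accepts the input when recognized tokens cover strictly more than half of
// the characters in all surviving tokens. Vocabularies are consulted in order
// and the lookup stops at the first one that knows the token.
bool isMostlyRecognized(const std::string& input, const std::vector<Vocabulary>& vocabularies)
{
    size_t total = 0;
    size_t recognized = 0;
    for (const auto& token : tokenize(input)) {
        total += token.size();
        bool known = std::any_of(vocabularies.begin(), vocabularies.end(),
            [&](const Vocabulary& vocabulary) { return vocabulary.count(token) > 0; });
        if (known)
            recognized += token.size();
    }
    return recognized * 2 > total;
}
```

With a toy English vocabulary containing `user` and `menu`, `userMenuBtn` splits into `user`,
`menu`, `btn`; 8 of 11 characters are recognized, so the identifier is accepted even though `btn`
is not a known word. An input whose recognized share is exactly half is rejected, since the rule
is strict.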

* LayoutTests/fast/text-extraction/debug-text-extraction-basic-expected.txt:
* LayoutTests/fast/text-extraction/debug-text-extraction-basic.html:

Cover both the surfacing paths (basic class hint, id beats class, half-recognized class) and the
gate suppression paths (text > 2 chars, aria-label, input placeholder).

* Source/WebCore/PAL/pal/cocoa/NaturalLanguageSoftLink.h:
* Source/WebCore/PAL/pal/cocoa/NaturalLanguageSoftLink.mm:

* Source/WebCore/page/text-extraction/TextExtraction.cpp:
(WebCore::TextExtraction::extractRecursive):

Populates `classNames` (capped at 5) and `idAttribute` on the per-item `TextExtraction::Item`,
gated on the existing `isCandidateClassOrId` entropy heuristic.

* Source/WebCore/page/text-extraction/TextExtractionTypes.h:
* Source/WebKit/Platform/classifier/cocoa/TextExtractionTokenizer.h: Added.
* Source/WebKit/Platform/classifier/cocoa/TextExtractionTokenizer.mm: Added.
(WebKit::TextExtractionTokenizer::isMostlyRecognized):

Implements the "more than half of token characters are recognized" rule.

* Source/WebKit/Shared/TextExtractionToStringConversion.cpp:
(WebKit::recognizedClassesAndIdForItem):

Implements the suppression gate and id-over-class priority. Returns empty when the gate suppresses
the hint, in which case neither classes nor id is surfaced.
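
The id-over-class priority described above might look roughly like the following. This is a sketch
under stated assumptions, not the body of `recognizedClassesAndIdForItem`: `HintSketch` and
`recognizedHint` are hypothetical names, the gate result is passed in as a plain boolean, and the
recognizer is an arbitrary callback standing in for the tokenizer check.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical result type: at most one of the two members is populated.
struct HintSketch {
    std::string id;
    std::vector<std::string> classNames;
    bool isEmpty() const { return id.empty() && classNames.empty(); }
};

HintSketch recognizedHint(bool gateSuppresses,
    const std::string& idAttribute,
    const std::vector<std::string>& classNames,
    const std::function<bool(const std::string&)>& isRecognized)
{
    // Another semantic signal is present: surface neither id nor classes.
    if (gateSuppresses)
        return { };

    // A recognized id wins over any class names.
    if (!idAttribute.empty() && isRecognized(idAttribute))
        return { idAttribute, { } };

    // Otherwise keep only the class names that pass the recognizer.
    HintSketch hint;
    for (const auto& name : classNames) {
        if (isRecognized(name))
            hint.classNames.push_back(name);
    }
    return hint;
}
```

Returning an empty result for the suppressed case lets callers treat "suppressed" and "nothing
recognized" the same way when deciding whether to emit a hint.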

* Source/WebKit/Shared/WebCoreArgumentCoders.serialization.in:
* Source/WebKit/SourcesCocoa.txt:
* Source/WebKit/WebKit.xcodeproj/project.pbxproj:

Canonical link: https://commits.webkit.org/314753@main


