On 5/2/12 10:50 AM, Tab Atkins Jr. wrote:
On Wed, May 2, 2012 at 9:59 AM, Charles Pritchard <ch...@jumis.com> wrote:
There has been some discussion on the w3c/whatwg mailing lists about how far
we can mark up content with linguistic tags, such as marking word and/or
sentence boundaries.

In my authoring of web apps, I often write a short manual into a hidden div,
so that the vocabulary of my application can be processed by translation
services such as Google translate. Having content in the DOM seems the most
appropriate way to handle translation.

I'd like the group to consider the costs/benefits/alternatives to a "lang-"
attribute.
Such as <span lang-role="sentence">This is a sentence.</span>

The data- and aria- attributes have worked out well. We may want to make
room for one more.

Such a structure could be used to mark up typical subject/object/verb and
clause sections; it could also be used to mark up poetic texts and the
defined meanings of content.

http://www.omegawiki.org/Expression:orange
This is an <span lang-meaning="DefinedMeaning:orange_(5821)">orange</span>.
Now this, this is <span lang-meaning="DefinedMeaning:orange_(5822)">orange</span>.

In most cases there's no need to define sentence boundaries, meanings or
anything else. But it'd sure be nice to have the ability to do so in a
standard manner.

I'd recommend role, meaning and prosody/pronunciation as the primary
targets. Character markup may be something to consider as it's come up in
SVG (rotate) and in CSS before. Doing a span for each character is not
practical, so we'd want a shorthand much as SVG has shorthand for rotate.

Do you expect outside services to do anything useful with this
information?  If not, the data-* attributes seem appropriate.

Yes, that's the primary reason. "services such as Google translate".

If you do expect that, have you evaluated the existing mechanisms for
embedding custom data in the page and found them wanting? If so, how?

1. Google Translate can be loose with some markup, to the point where the translated content ends up outside the span tag.

Such as: <div id="one">My potato is <span>hot</span></div>.

2. Some words can be ambiguous to the point that even a human reader may not know what the meaning is. It'd be great to have a mechanism to disambiguate.

3. Speech markup is cool, I like it, but we can have something a little lighter or even have some interplay with prosody.
<span>You say <span>potato</span>, I say <span>potato</span></span>.
(poteitoe, potahtoe)

4. CSS markup has come up a few times for sentence, word and character boundaries. Language is not static, it is very much human, and enabling humans to mark up their language is what HTML is all about.

I'll put some effort in later this week to dig up a few threads on the CSS requests.

5. Services should never touch data-*; I've had to put all my content into markup anyway. I've had to add id attributes so I can identify it when it's translated by the UA or other service. Since I've done all that work, it'd be really nice to have some more options to add in, such as disambiguation, part of speech and occasionally, pronunciation and translation suggestions.
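
To make the idea concrete, here is a rough sketch of how an outside service might pick up such attributes. It is only an illustration: lang-role and lang-meaning are the names proposed in this thread, not an existing standard, and LangAttrCollector and collect_lang_attrs are hypothetical helpers written for this example using Python's standard html.parser module.

```python
from html.parser import HTMLParser


class LangAttrCollector(HTMLParser):
    """Collect text inside elements carrying the proposed lang-* attributes.

    Hypothetical helper: lang-* is only a proposal, not a standard.
    """

    def __init__(self):
        super().__init__()
        # One entry per open element: (tag, lang-* attributes, text chunks).
        self.stack = []
        # Results as (attribute name, attribute value, element text).
        self.found = []

    def handle_starttag(self, tag, attrs):
        lang_attrs = {k: v for k, v in attrs if k.startswith("lang-")}
        self.stack.append((tag, lang_attrs, []))

    def handle_data(self, data):
        # Text belongs to every open element that has lang-* attributes.
        for _, lang_attrs, chunks in self.stack:
            if lang_attrs:
                chunks.append(data)

    def handle_endtag(self, tag):
        # Pop until the matching open tag, so unclosed elements
        # (for example <br>) do not leave the stack out of step.
        while self.stack:
            open_tag, lang_attrs, chunks = self.stack.pop()
            for name, value in lang_attrs.items():
                self.found.append((name, value, "".join(chunks)))
            if open_tag == tag:
                break


def collect_lang_attrs(html):
    """Return (name, value, text) tuples for every lang-* attribute found."""
    collector = LangAttrCollector()
    collector.feed(html)
    collector.close()
    return collector.found


if __name__ == "__main__":
    sample = (
        'This is an <span lang-meaning="DefinedMeaning:orange_(5821)">'
        "orange</span>."
    )
    for name, value, text in collect_lang_attrs(sample):
        print(name, value, text)
```

A translation service could use the collected meaning or role as a hint before translating the element text, while leaving data-* alone.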

-Charles
