On 5/31/2015 5:33 AM, Chris-as-John wrote:

Yes, Asmus, good post. But I don’t think HTML, even a subset, is really the right solution.

The longer I think about this, the more it seems that what is needed is something like an "abstract" format: a specification of the capabilities to be supported and the types of properties needed to support them in an extensible way. HTML and CSS would possibly become an implementation of such a specification.

There would still be a place for a character set, that is Unicode, as an efficient way to implement the most basic and most standard features of text contents, but perhaps with some extension mechanism that can handle whatever goes beyond it.

The first level of extension is support for recent (or rare) code points in the character set (additional fonts, etc., as you mention).

The next level of extension could be support for collections of custom entities that are not available as character sets (stickers and the like).

And finally, there would have to be a way to deal with "one-offs", such as actual images that do not form categorizable sets, but are used in an ad-hoc manner and behave like custom characters.

And so on.

It should be possible to describe all of this in a way that allows it to be mapped to HTML and CSS or to any other rich text format -- the goal, after all, is to make such "inline text" as widely and effortlessly interchangeable as plain text is today (or at least nearly so).

By keeping the specification abstract, you could accommodate both SGML-like formats, where ASCII-string markup is intermixed with the text, and pure text buffers with placeholder code points and links to external data.
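
As a rough illustration of that last point, here is a minimal Python sketch of one abstract in-memory model being written out both ways. Everything in it is an assumption made for illustration: the names `CustomEntity`, `to_markup` and `to_buffer`, the use of an `<img>` tag for the markup form, and the choice of U+FFFC (OBJECT REPLACEMENT CHARACTER) as the placeholder code point.

```python
from dataclasses import dataclass
from html import escape
from typing import List, Tuple, Union


@dataclass(frozen=True)
class CustomEntity:
    """Hypothetical custom 'character' that lives outside the character set."""
    resource: str  # link to external glyph data (illustrative only)
    fallback: str  # plain-text stand-in for renderers that cannot fetch it


# Abstract model: a run of ordinary strings and custom entities.
InlineText = List[Union[str, CustomEntity]]

# Assumption: U+FFFC marks where an external object sits in the buffer.
OBJECT_PLACEHOLDER = "\uFFFC"


def to_markup(text: InlineText) -> str:
    """SGML-like form: ASCII markup intermixed with the (escaped) text."""
    parts = []
    for item in text:
        if isinstance(item, CustomEntity):
            parts.append('<img src="%s" alt="%s">' % (
                escape(item.resource, quote=True),
                escape(item.fallback, quote=True)))
        else:
            parts.append(escape(item))
    return "".join(parts)


def to_buffer(text: InlineText) -> Tuple[str, List[str]]:
    """Pure text buffer with placeholder code points plus a list of links."""
    chars, links = [], []
    for item in text:
        if isinstance(item, CustomEntity):
            chars.append(OBJECT_PLACEHOLDER)
            links.append(item.resource)
        else:
            chars.append(item)
    return "".join(chars), links


message = ["1 < 2 ", CustomEntity("example.org/stickers/cat", "[cat]")]
print(to_markup(message))
print(to_buffer(message))
```

The two functions carry the same information; only the serialization differs, which is the property an abstract specification would need to guarantee.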

But, however bored you are with plain Unicode emoji, as long as there isn't an agreed-upon common format for rich "inline text" I see very little chance that those cute Facebook emoji will do anything other than firmly keep you in that particular ghetto.

A./

I’m reminded of the design of XML itself: a document is supposed to start with a header that declares what that XML will conform to. That header contains a unique identifier for the XML schema, which happens to be a URL. The URL is partly just a convenient unique identifier, but also, if the XML engine doesn’t know about that schema, it could go to that URL, download the schema, and check that the XML conforms to it.

Similarly, imagine a text format that had a header with something like:
\uCHARSET:facebook.com/charsets/pusheen-the-cat-emoji/,12345

Now any characters in the text that follows which start with 12345 would be interpreted with respect to that character set. What would you find at facebook.com/charsets/pusheen-the-cat-emoji/? You might find bitmaps, TrueType fonts, vector graphics, etc. You might find many, many representations of that character set that your rendering engine could cache for future use. The text format wouldn’t be reliant on today’s favorite rendering technology, whether bitmap, TrueType fonts, or whatever.

Right now, if you go to a website that references Unicode characters that your platform doesn’t know about, you see nothing. If a format like this existed, character sets would be infinitely extensible, everybody on earth could see characters even if their platform wasn’t previously aware of them, and the format would be independent of today’s rendering technologies.

Let’s face it, HTML5 changes every few years, and I don’t think anybody wants the fundamental textual representation dependent on an entire layout engine. And also the whole range of what HTML5 can do, even some subset, is too much information.

You don’t necessarily want your text to embed the actual character set. Perhaps that might be a useful option, but I think most people would want to uniquely identify the character set, in a way that an engine can download it, but without defining the actual details itself. Of course, certain charsets would probably become pervasive enough that platforms would just include them for convenience: emoji from major messaging platforms, say, or maybe characters related to specialised domains like, I don’t know, mapping or specialised work domains or whatever. But without having to be subservient to the central Unicode committee.
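
To make the header idea concrete, here is a minimal Python sketch of how an engine might read such a line. The `\uCHARSET:<url>,<prefix>` syntax is only the hypothetical one from the example above, and the names `CharsetDecl` and `parse_charset_header` are made up for illustration; nothing here is an existing format or library.

```python
import re
from typing import NamedTuple, Optional


class CharsetDecl(NamedTuple):
    url: str     # where a renderer could look up and cache glyph resources
    prefix: int  # numeric prefix that selects this character set in the text


# Hypothetical header syntax from the example: \uCHARSET:<url>,<prefix>
_HEADER = re.compile(r"^\\uCHARSET:(?P<url>[^,\s]+),(?P<prefix>\d+)$")


def parse_charset_header(line: str) -> Optional[CharsetDecl]:
    """Return the declaration carried by a header line, or None if it is not one."""
    match = _HEADER.match(line.strip())
    if match is None:
        return None
    return CharsetDecl(url=match.group("url"), prefix=int(match.group("prefix")))


decl = parse_charset_header(
    r"\uCHARSET:facebook.com/charsets/pusheen-the-cat-emoji/,12345")
print(decl)
```

The sketch only identifies the character set; fetching, caching and choosing among bitmap, font or vector representations would be left to the rendering engine, as described above.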

As someone who is a keen user of Facebook Messenger, and who sees them bring out a new set of emoji almost every week, I think the world will soon be totally bored with the plain basic emoji that Unicode has defined.


—
Chris


On Sun, May 31, 2015 at 9:06 PM, Asmus Freytag (t) <asmus-...@ix.netcom.com> wrote:

    Reading this discussion, I agree with your reductio ad absurdum
    of infinitely nested HTML.

    But I think you are onto something with your hypothetical example
    of the "subset that works in ALL textual situations".

    There's clearly a use case for something like it, and I believe
    many people would intuitively agree on a set of features for it.

    What people seem to have in mind is something like "inline" text.
    Something beyond a mere stream of plain text (with effectively
    every character rendered visibly), but still limited in important
    ways by general behavior of inline text: a string of it, laid out,
    must wrap and line break, any objects included in it must behave
    like characters (albeit of custom width, height and appearance),
    and so on. Paragraph formatting, stacked layout, header levels and
    all those good things would not be available.

    With such a subset clearly defined, many quirky limitations might
    no longer be necessary; any container that today only takes plain
    text could be upgraded to take "inline text". I can see some
    inline containers retaining a nesting limitation, but I could
    imagine that it is possible to arrive at a consistent definition
    of such inline format.

    Going further, I can't shake the impression that without a clean
    definition of an inline text format along those lines, any
    attempts at making stickers and similar solutions "stick" are
    doomed to failure.

    The interesting thing in defining such a format is not how to
    represent it in HTML or CSS syntax, but in describing what feature
    sets it must (minimally) support. Doing it that way would free
    existing implementations of rich text to map native formats onto
    that minimally required subset and to add them to their format
    translators for HTML or whatever else they use for interchange.

    Only with a definition can you ever hope to develop a processing
    model. It won't be as simple as for plain text strings, but it
    should be able to support common abstractions (like iteration by
    logical unit). It would have to support the management of external
    resources - if the inline format allows images, custom fonts, etc.
    one would need a way to manage references to them in the local
    context.

    If your skeptical position proves correct in that this is
    something that turns out to not be tractable, then I think you've
    provided conclusive proof why stickers won't happen and why
    encoding emoji was the only sensible decision Unicode could have
    taken.

    A./

    On 5/30/2015 7:14 AM, John wrote:

    Hmm, these "once entities" of which you speak, do they require
    JavaScript? Because I'm not sure what we are looking for here is
    static documents requiring a full programming language.

    But let's say for a moment that HTML5 can, or could, do the job
    here. Then to make the dream come true that you could just cut
    and paste text that happened to contain a custom character to
    somewhere else, and nothing untoward would happen, would mean
    that everything in the computing universe should allow full-blown
    HTML. So every Java Swing component, every Apple GUI component,
    every .NET component, every Windows component, every browser,
    every Android and iOS component would allow text entry of HTML
    entities. OK, so let's say everyone agrees with this course of
    action; now the universal text format is HTML.

    But in this new world where anywhere that previously you could
    input text, you can now input full-blown HTML, does that actually
    make sense? Does it make sense that you can, for example, put
    full-blown HTML inside an H1 tag in HTML itself? That's a lot of
    recursion going on there. Or in an MS Excel cell? Or interspersed
    in some otherwise fairly regular text in a Word document?

    I suppose someone could define a strict limited subset of HTML to
    be that subset that makes sense in ALL textual situations. That
    subset would be something like just defining things that act like
    characters, and not like a full-blown rendering engine. But who
    would define that subset? Not the HTML groups, because their
    mandate is to define full-blown rendering engines. It would be
    more likely to be something like the Unicode group.

    And also, in this brave new world where HTML5 is the new standard
    text format, what would the binary format of it be? I mean, if I
    have the string of Unicode characters <IMG, would that be an HTML5
    image definition that should be rendered as such? Or would it be
    text that happens to contain a less-than sign, I, M and G? It
    would have to be the former I guess, and thereby there would no
    longer be a Unicode symbol for the mathematical less-than sign.
    Rather there would be a Unicode symbol for opening an HTML tag,
    and the text code for less-than would be &lt; Never again would a
    computer store < to mean less than. Do we want HTML to be so
    pervasive? Not sure it deserves that.
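
The cost described here can be seen in a short Python sketch using the standard library's `html.escape` and `html.unescape`. It only illustrates the escaping convention HTML already uses; it does not propose a storage format.

```python
from html import escape, unescape

# If every text buffer were HTML, the literal characters "<IMG" could not be
# stored as-is without being read as the start of a tag.
literal = "<IMG"
stored = escape(literal)
print(stored)  # &lt;IMG

# Escaping is reversible, but every consumer now has to know to undo it.
assert unescape(stored) == literal

# The same applies to a plain mathematical comparison.
print(escape("3 > 2"))  # 3 &gt; 2
```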

    And from a programmer's point of view, they want to be able to
    iterate over an array of characters and treat each one the same
    way, regardless of whether it is a custom character or not.
    Without that kind of programmatic abstraction, the whole thing
    can never gain traction. I don't think full-blown HTML embedded
    in your text can fulfill that. A very strictly defined subset
    possibly could. Sure, HTML5 can RENDER stuff adequately, if the
    only aim of the game is to provide a correct rendering. But to be
    able to actually treat particular images embedded as characters,
    and have some programming library see that abstraction
    consistently, I'm not sure I'm convinced that is possible. Not
    without nailing down exactly what HTML elements in what
    particular circumstances constitute a "character".
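
A minimal Python sketch of the kind of abstraction being asked for: a string-like container whose units are either ordinary code points or custom characters, all visited by the same loop. The names `CustomChar` and `InlineString` are hypothetical, and treating one code point as one unit is a simplification (it ignores combining sequences).

```python
from dataclasses import dataclass
from typing import Iterator, List, Union


@dataclass(frozen=True)
class CustomChar:
    """Hypothetical image or sticker that behaves like a single character."""
    name: str


# One logical unit: an ordinary code point or a custom character.
Unit = Union[str, CustomChar]


class InlineString:
    """String-like sequence in which every unit is handled the same way."""

    def __init__(self, units: List[Unit]) -> None:
        self._units = list(units)

    def __len__(self) -> int:
        return len(self._units)

    def __iter__(self) -> Iterator[Unit]:
        return iter(self._units)

    def __getitem__(self, index: int) -> Unit:
        return self._units[index]


text = InlineString(list("hi ") + [CustomChar("pusheen")] + list("!"))

# The caller's loop does not branch on whether a unit is custom or not.
for position, unit in enumerate(text):
    print(position, unit)
```

Length, indexing and iteration all count the custom character as exactly one unit, which is the behaviour a markup-based representation would have to pin down explicitly.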

    I guess in summary, yes, we have the technology already to render
    anything. But I don't think the whole standards framework does
    anything to allow the computing universe to actually exchange
    custom characters as if they were just any other text. Someone
    would actually have to work on a standard to do that, not just
    point to HTML5.


    On Saturday, 30 May 2015 at 5:08 am, Philippe Verdy
    <verd...@wanadoo.fr>, wrote:


        2015-05-29 4:37 GMT+02:00 John <idou...@gmail.com>:

            "Today the world goes very well with HTML(5) which is now
            the best markup language for documents (including for
            inserting embedded images that don’t require any external
            request)"
            If I had a large document that reused a particular
            character thousands of times, would this HTML markup
            require embedding that character thousands of times, or
            could I define the character once at the beginning of the
            sequence, and then refer back to it in a space efficient way?


        HTML(5) allows defining *once* entities for images that can
        then be reused thousands of times without repeating their
        definition. You can do this as well with CSS styles: just
        define a class for a small element. This element may still be
        an "image", but the semantics are carried by the class you
        assign to it. You are not required to provide an external
        source URL for that image if the CSS style provides the content.

        You may also use PUAs for the same purpose (however I have
        not seen how CSS allows styling individual characters in
        text elements, as these characters are not elements and
        there's no defined selector for pseudo-elements matching a
        single character). PUAs are perfectly usable in the situation
        where you have embedded a custom font in your document for
        assigning glyphs to characters (you can still do that, but I
        would avoid TrueType/OpenType for this purpose and would
        instead use the SVG font format, which is valid in CSS, for
        defining a collection of glyphs).

        If the document is not restricted to be standalone, of course
        you can use links to an external shared CSS stylesheet and to
        this SVG font referenced by the stylesheet. With such an
        approach, you don't even need to use classes on elements; you
        use plain text with very compact PUAs. It's up to you to
        decide if the document must be standalone (embedding
        everything it needs) or must use external references for
        missing definitions; HTML allows both (and SVG as well when
        it contains plain-text elements).
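
For the PUA route, a small Python sketch of how a program could find private-use code points in otherwise plain text, using the standard `unicodedata` module (general category `Co` is "private use"). The function name and sample string are made up for illustration; what glyph a PUA code point maps to still depends entirely on the font or stylesheet the document points to.

```python
import unicodedata
from typing import List


def private_use_positions(text: str) -> List[int]:
    """Return the indexes of code points whose general category is Co."""
    return [i for i, ch in enumerate(text)
            if unicodedata.category(ch) == "Co"]


# U+E000 is the first code point of the BMP Private Use Area.
sample = "cat: \ue000!"
print(private_use_positions(sample))  # [5]
```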



