On Fri, 08 Jul 2011 00:33:14 +0200, Ian Hickson <i...@hixie.ch> wrote:

On Wed, 8 Jun 2011, Tomasz Jamroszczak wrote:

I've been looking into the Microdata specification and it struck me that
the crawling algorithm is very complex for what are simple ideas. I think
that, first of all, the specification should explain what the algorithm
is supposed to do, before listing the exact steps to be carried out.

Yeah. Turns out the algorithms involved here are quite badly broken.

It was intended to expose the microdata graph as completely as possible
while dropping anything that would introduce a loop, at the point where
the first repetition would start (so A->B->C=>A would break at the =),
in the API, in the JSON, and in the conformance rules. I didn't do a good
job speccing that, though!

I've fixed the algorithms to make sense (I hope).

http://www.whatwg.org/specs/web-apps/current-work/multipage/microdata.html#the-properties-of-an-item

I had a look at this to verify that it is black-box-equivalent to what Opera has implemented, and only discovered one issue:

<div itemprop=""> should not be added to the .properties collection, because it names no properties. My bad for suggesting that the criterion should be the presence of an itemprop attribute; it should be an itemprop attribute containing at least one token. Can you update the spec to match?

(I implemented the spec'd algorithm pedantically in <https://gitorious.org/microdatajs/microdatajs/commit/217cc34e7e679e2e4ea3e670a0dcdd155a7b9800> for verification; it passes the unit tests with said modification.)
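
To make the proposed criterion concrete, here is a minimal Python sketch. It is neither the spec text nor the microdatajs code; the helper names property_names and is_property, and the plain dict of attributes, are made up for illustration. It assumes tokens are separated by the HTML space characters (space, tab, LF, FF, CR).

```python
import re

# HTML space characters: space, tab, line feed, form feed, carriage return.
_HTML_SPACES = re.compile(r"[ \t\n\f\r]+")


def property_names(itemprop_value):
    """Split an itemprop value into its unique tokens, in document order."""
    names = []
    for token in _HTML_SPACES.split(itemprop_value):
        if token and token not in names:
            names.append(token)
    return names


def is_property(attributes):
    """Hypothetical check; `attributes` maps attribute names to values.

    An element counts as a property only if its itemprop attribute
    contains at least one token, not merely if the attribute is present.
    """
    value = attributes.get("itemprop")
    return value is not None and len(property_names(value)) > 0
```

With this check, an element with itemprop="" or a whitespace-only itemprop is skipped, while itemprop="name" is kept.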



On Wed, 29 Jun 2011, Philip Jägenstedt wrote:

Note also that other algorithms defined in terms of items and their
properties need to handle loopiness in some way. That's currently RDF,
vCard and iCal conversion. Perhaps something like "loopy item" could be
defined and those algorithms could skip loopy items wherever they occur?
Simply failing is also an acceptable solution, IMO.

I fixed vCard with a patch that just outputs "AGENT;TYPE=VCARD:ERROR" in
the case of a loop. (Can only happen if the input is non-conforming, so it
doesn't matter if the output is non-conforming.)

WFM

The vEvent stuff was already loop-safe.

The JSON algorithm now ends the crawl when it hits a loop, and replaces
the offending duplicate item with the string "ERROR".

WFM
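
For reference, the loop handling described above can be sketched roughly as follows in Python. This is only an illustration under my own assumptions, not the spec's algorithm: items are modelled as dicts with an optional "type" and a "properties" mapping from names to lists of values, and the function name item_to_json is made up.

```python
import json


def item_to_json(item, path=None):
    """Convert an item to a JSON-ready dict, replacing any item that
    would repeat on the current path with the string "ERROR"."""
    if path is None:
        path = []
    # Track items by identity; only ancestors on this path count as a loop.
    path = path + [id(item)]
    result = {"properties": {}}
    if item.get("type"):
        result["type"] = item["type"]
    for name, values in item["properties"].items():
        converted = []
        for value in values:
            if isinstance(value, dict):
                if id(value) in path:
                    converted.append("ERROR")  # first repetition: stop here
                else:
                    converted.append(item_to_json(value, path))
            else:
                converted.append(value)
        result["properties"][name] = converted
    return result


# A -> B -> C => A: the crawl stops at the last edge.
a = {"type": "A", "properties": {}}
b = {"type": "B", "properties": {}}
c = {"type": "C", "properties": {}}
a["properties"]["next"] = [b]
b["properties"]["next"] = [c]
c["properties"]["next"] = [a]
output = json.dumps(item_to_json(a))
```

Because each branch gets its own copy of the path, an item that is merely referenced twice (without being its own ancestor) is serialized twice rather than flagged.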

The RDF algorithm preserves the loops, since doing so is possible with
RDF. Turns out the algorithm almost did this already, looks like it was an
oversight.

WFM, but note step 3: "Add a mapping from the item item to the subject subject in memory, if there isn't one already." Step 1 guarantees that there is no entry for item, so step 3 can be unconditional.
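
A rough Python sketch of why the check in step 3 is redundant, under simplifying assumptions that are mine and not the spec's: every subject is a blank node, items are dicts with a "properties" mapping, and the names generate_triples, memory and counter are illustrative only.

```python
import itertools


def generate_triples(item, memory, triples, counter):
    """Emit (subject, predicate, object) tuples for an item, reusing the
    subject of any item already seen so that loops are preserved."""
    # Step 1: if the item is already mapped, return its subject.
    if id(item) in memory:
        return memory[id(item)]
    # Step 2: otherwise mint a new blank node for it.
    subject = "_:b%d" % next(counter)
    # Step 3: step 1 returned early for known items, so there is no
    # existing entry here and the mapping can be added unconditionally.
    memory[id(item)] = subject
    for name, values in item["properties"].items():
        for value in values:
            if isinstance(value, dict):
                obj = generate_triples(value, memory, triples, counter)
            else:
                obj = value
            triples.append((subject, name, obj))
    return subject


# Two items that refer to each other: the loop survives as two triples.
alice = {"properties": {}}
bob = {"properties": {}}
alice["properties"]["knows"] = [bob]
bob["properties"]["knows"] = [alice]
triples = []
root = generate_triples(alice, {}, triples, itertools.count())
```

The mapping has to be stored before the properties are crawled; otherwise the nested reference back to the first item would recurse forever.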



On Wed, 29 Jun 2011, Philip Jägenstedt wrote:

Indeed, multiple types doesn't work at all if you want to mix different
types. I was assuming that the use case was to extend types, kind of
like http://schema.org/Person/Governor. However, it doesn't work all
that well even in that case, since there's no way to know which type is
the extension of the other and which properties exist only on the
extended type.

I don't really understand this use case. Can you elaborate on the problem
that needs solving here?

It's whatever problem <http://schema.org/docs/extension.html> is trying to solve, which is something like "allow people to geek out with more specific vocabularies without interfering with search results". I whined a bit in <http://groups.google.com/group/schemaorg-discussion/browse_thread/thread/6de3a1761b115271>, the short story being:

* extensibility encoded with a microsyntax in the URL, making it not-so-opaque
* such URLs make the DOM API less useful

Perhaps bending Microdata to accommodate this is not the best idea. If I were schema.org, I would just encourage people to do this:

<div itemscope itemtype="http://schema.org/Person">
  <div id="wrapper">
    <div itemprop="name">Arnold</div>
    <div itemscope itemtype="http://example.com/Governor" itemref="wrapper">
      <div itemprop="state">California</div>
    </div>
  </div>
</div>

Making extensions unsightly is probably a good thing, to discourage people from going too crazy with it. This way it's also clear which properties only apply to the extended type.

--
Philip Jägenstedt
Core Developer
Opera Software
