We were recently using Tika to process HTML pages that might have Open Graph 
meta tags.

The issue is that these tags get stripped out, and also aren't put into the 
metadata map.

The reason is that Open Graph uses RDFa; see:

http://stackoverflow.com/questions/2704942/html-validation-error-for-property-attribute/2705090#2705090

Since <meta property="xxx" content="yyy" /> isn't valid for XHTML 1.0, these 
tags can't be emitted.

But we could put them into the metadata map, by adding another test in the 
HtmlHandler code that currently has:

            if ("META".equals(name) && atts.getValue("content") != null) {
                // TIKA-478: For cases where we have either a name or
                // "http-equiv", assume that XHTMLContentHandler will emit
                // these in the <head>, thus passing them through safely.
                if (atts.getValue("http-equiv") != null) {
                    addHtmlMetadata(
                            atts.getValue("http-equiv"),
                            atts.getValue("content"));
                } else if (atts.getValue("name") != null) {
                    // Record the meta tag in the metadata
                    addHtmlMetadata(
                            atts.getValue("name"),
                            atts.getValue("content"));
                }

If we also catch the case where there is no name=xxx attribute but there is a 
property=xxx attribute, that would take a tag like:

<meta property="og:url" content="http://www.imdb.com/title/tt0117500/" />

and put it into the metadata map as "og:url" => 
"http://www.imdb.com/title/tt0117500/"

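To make the proposal concrete, here is a rough standalone sketch of the extra branch. It is not the actual HtmlHandler code: the class MetaPropertySketch, the recordMeta and meta helpers, and the plain Map standing in for Tika's metadata object are all made up for illustration. The only real APIs used are the JDK's SAX Attributes classes. The property test goes last, so existing http-equiv and name handling would stay unchanged.

```java
import java.util.LinkedHashMap;
import java.util.Map;

import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;

/** Hypothetical stand-in for the META branch in HtmlHandler; not Tika code. */
final class MetaPropertySketch {

    /** Records a META tag's content under http-equiv, name, or property, in that order. */
    static void recordMeta(String name, Attributes atts, Map<String, String> metadata) {
        String content = atts.getValue("content");
        if (!"META".equals(name) || content == null) {
            return; // not a META tag, or nothing to record
        }
        if (atts.getValue("http-equiv") != null) {
            metadata.put(atts.getValue("http-equiv"), content);
        } else if (atts.getValue("name") != null) {
            metadata.put(atts.getValue("name"), content);
        } else if (atts.getValue("property") != null) {
            // Proposed addition: RDFa-style tags such as Open Graph
            metadata.put(atts.getValue("property"), content);
        }
    }

    /** Builds SAX attributes for a tag of the form <meta KEY="VALUE" content="CONTENT" />. */
    static Attributes meta(String key, String value, String content) {
        AttributesImpl atts = new AttributesImpl();
        atts.addAttribute("", key, key, "CDATA", value);
        atts.addAttribute("", "content", "content", "CDATA", content);
        return atts;
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new LinkedHashMap<>();
        recordMeta("META",
                meta("property", "og:url", "http://www.imdb.com/title/tt0117500/"),
                metadata);
        System.out.println(metadata);
    }
}
```
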
Thoughts on this?

Thanks,

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr


