We were recently using Tika to process HTML pages that might have Open Graph meta tags.
The issue is that these tags get stripped out and also aren't put into the metadata map. The reason is that Open Graph uses RDFa; see http://stackoverflow.com/questions/2704942/html-validation-error-for-property-attribute/2705090#2705090

Since <meta property="xxx" content="yyy" /> isn't valid XHTML 1.0, these tags can't be emitted. But we could put them into the metadata map by adding another test in the HtmlHandler code, which currently has:

    if ("META".equals(name) && atts.getValue("content") != null) {
        // TIKA-478: For cases where we have either a name or
        // "http-equiv", assume that XHTMLContentHandler will emit
        // these in the <head>, thus passing them through safely.
        if (atts.getValue("http-equiv") != null) {
            addHtmlMetadata(
                    atts.getValue("http-equiv"),
                    atts.getValue("content"));
        } else if (atts.getValue("name") != null) {
            // Record the meta tag in the metadata
            addHtmlMetadata(
                    atts.getValue("name"),
                    atts.getValue("content"));
        }

If we also catch the case where there is no name=xxx attribute but there is a property=xxx attribute, then a tag like:

    <meta property="og:url" content="http://www.imdb.com/title/tt0117500/" />

would be put into the metadata map as:

    "og:url" => "http://www.imdb.com/title/tt0117500/"

Thoughts on this?

Thanks,

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr
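
For reference, here is a rough, standalone sketch of what the extra test proposed above might look like. It is not the actual HtmlHandler code: the class MetaPropertySketch, its metadata map, the attrs helper, and this version of addHtmlMetadata are hypothetical stand-ins so the example runs on its own. The only new logic is the final "property" branch, which is reached only when neither http-equiv nor name is present.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical stand-in for the relevant part of HtmlHandler.
public class MetaPropertySketch {
    // Stand-in for the Tika metadata map (assumption: simple String -> String).
    private final Map<String, String> metadata = new LinkedHashMap<>();

    // Stand-in for HtmlHandler.addHtmlMetadata; the real method may do more.
    private void addHtmlMetadata(String name, String value) {
        if (name != null && value != null) {
            metadata.put(name, value);
        }
    }

    public void handleMeta(String name, Attributes atts) {
        if ("META".equals(name) && atts.getValue("content") != null) {
            if (atts.getValue("http-equiv") != null) {
                addHtmlMetadata(atts.getValue("http-equiv"), atts.getValue("content"));
            } else if (atts.getValue("name") != null) {
                addHtmlMetadata(atts.getValue("name"), atts.getValue("content"));
            } else if (atts.getValue("property") != null) {
                // Proposed addition: record RDFa-style tags such as Open Graph
                // in the metadata map, even though they can't be emitted as XHTML.
                addHtmlMetadata(atts.getValue("property"), atts.getValue("content"));
            }
        }
    }

    public Map<String, String> getMetadata() {
        return metadata;
    }

    // Helper to build SAX attributes from name/value pairs (illustration only).
    static Attributes attrs(String... nameValuePairs) {
        AttributesImpl atts = new AttributesImpl();
        for (int i = 0; i + 1 < nameValuePairs.length; i += 2) {
            atts.addAttribute("", nameValuePairs[i], nameValuePairs[i], "CDATA",
                    nameValuePairs[i + 1]);
        }
        return atts;
    }

    public static void main(String[] args) {
        MetaPropertySketch sketch = new MetaPropertySketch();
        sketch.handleMeta("META", attrs(
                "property", "og:url",
                "content", "http://www.imdb.com/title/tt0117500/"));
        System.out.println(sketch.getMetadata());
    }
}
```

Putting the property check last keeps the existing behavior for tags that already have http-equiv or name, so only tags that are currently dropped would be affected.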