Hey Ken,

Super +1, this sounds like a great idea.
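For reference, here is a minimal standalone sketch of the extra branch described in the quoted message. It is not the actual HtmlHandler code: the class name MetaPropertySketch, the attrs helper, and the plain Map standing in for Tika's metadata object are all assumptions made for illustration. Only the if / else-if chain mirrors the quoted snippet, with a hypothetical third branch for property=xxx.

```java
import java.util.LinkedHashMap;
import java.util.Map;

import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical stand-in for the relevant part of HtmlHandler.
public class MetaPropertySketch {

    // Assumption: a plain map stands in for Tika's metadata object.
    private final Map<String, String> metadata = new LinkedHashMap<>();

    // Mirrors the quoted check, plus a fallback to the "property" attribute.
    public void handleMeta(String name, Attributes atts) {
        String content = atts.getValue("content");
        if (!"META".equals(name) || content == null) {
            return;
        }
        if (atts.getValue("http-equiv") != null) {
            addHtmlMetadata(atts.getValue("http-equiv"), content);
        } else if (atts.getValue("name") != null) {
            addHtmlMetadata(atts.getValue("name"), content);
        } else if (atts.getValue("property") != null) {
            // Proposed case: no name=xxx, but property=xxx (e.g. Open Graph).
            addHtmlMetadata(atts.getValue("property"), content);
        }
    }

    // Simplified: the real method may do more than store the pair.
    private void addHtmlMetadata(String key, String value) {
        metadata.put(key, value);
    }

    public Map<String, String> getMetadata() {
        return metadata;
    }

    // Test helper: builds SAX attributes from alternating name/value strings.
    public static Attributes attrs(String... pairs) {
        AttributesImpl a = new AttributesImpl();
        for (int i = 0; i + 1 < pairs.length; i += 2) {
            a.addAttribute("", pairs[i], pairs[i], "CDATA", pairs[i + 1]);
        }
        return a;
    }

    public static void main(String[] args) {
        MetaPropertySketch handler = new MetaPropertySketch();
        handler.handleMeta("META", attrs(
                "property", "og:url",
                "content", "http://www.imdb.com/title/tt0117500/"));
        // Prints {og:url=http://www.imdb.com/title/tt0117500/}
        System.out.println(handler.getMetadata());
    }
}
```

Because the property check comes last, tags that already carry http-equiv or name keep their current behaviour, and only tags that were previously dropped are affected.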

Cheers,
Chris

On Sep 22, 2011, at 6:23 PM, Ken Krugler wrote:

> We were recently using Tika to process HTML pages that might have Open Graph
> meta tags.
>
> The issue is that these tags get stripped out, and also aren't put into the
> metadata map.
>
> The reason why is that Open Graph uses RDFa
>
> http://stackoverflow.com/questions/2704942/html-validation-error-for-property-attribute/2705090#2705090
>
> Since <meta property="xxx" content="yyy" /> isn't valid for XHTML 1.0, these
> tags can't be emitted.
>
> But we could put them into the metadata map, by adding another test in the
> HtmlHandler code that currently has:
>
>     if ("META".equals(name) && atts.getValue("content") != null) {
>         // TIKA-478: For cases where we have either a name or
>         // "http-equiv", assume that XHTMLContentHandler will emit
>         // these in the <head>, thus passing them through safely.
>         if (atts.getValue("http-equiv") != null) {
>             addHtmlMetadata(
>                     atts.getValue("http-equiv"),
>                     atts.getValue("content"));
>         } else if (atts.getValue("name") != null) {
>             // Record the meta tag in the metadata
>             addHtmlMetadata(
>                     atts.getValue("name"),
>                     atts.getValue("content"));
>         }
>
> If we catch the case of having no name=xxx attribute, but there is a
> property=xxx, then that would take a tag like:
>
> <meta property="og:url" content="http://www.imdb.com/title/tt0117500/" />
>
> and put it into the metadata map as "og:url" =>
> "http://www.imdb.com/title/tt0117500/"
>
> Thoughts on this?
>
> Thanks,
>
> -- Ken
>
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> custom big data solutions & training
> Hadoop, Cascading, Mahout & Solr
>
>

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++