Hey Ken,

Super +1, this sounds like a great idea.
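
For reference, here is a rough sketch of the extra branch you describe below. It is not a patch against the real HtmlHandler: the class name MetaTagSketch, the recordMeta and attrs helpers, and the plain Map standing in for Tika's metadata object are all made up for illustration, and metadata.put is a simplification of whatever addHtmlMetadata actually does.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical stand-in for the meta handling in HtmlHandler, so the
// example runs on the JDK alone (no Tika dependency).
final class MetaTagSketch {

    // Mirrors the quoted test, plus one new "property" branch.
    static void recordMeta(String name, Attributes atts, Map<String, String> metadata) {
        String content = atts.getValue("content");
        if (!"META".equals(name) || content == null) {
            return;
        }
        if (atts.getValue("http-equiv") != null) {
            metadata.put(atts.getValue("http-equiv"), content);
        } else if (atts.getValue("name") != null) {
            metadata.put(atts.getValue("name"), content);
        } else if (atts.getValue("property") != null) {
            // Proposed case: no name=xxx, but an RDFa-style property=xxx.
            metadata.put(atts.getValue("property"), content);
        }
    }

    // Test helper (an assumption, not Tika API): builds SAX attributes
    // from alternating key/value strings.
    static Attributes attrs(String... keyValues) {
        AttributesImpl atts = new AttributesImpl();
        for (int i = 0; i + 1 < keyValues.length; i += 2) {
            atts.addAttribute("", keyValues[i], keyValues[i], "CDATA", keyValues[i + 1]);
        }
        return atts;
    }
}
```

Because the new branch comes last, a tag that carries both name and property is still recorded under its name, so existing behavior should be unchanged.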

Cheers,
Chris

On Sep 22, 2011, at 6:23 PM, Ken Krugler wrote:

> We were recently using Tika to process HTML pages that might have Open Graph 
> meta tags.
> 
> The issue is that these tags get stripped out, and also aren't put into the 
> metadata map.
> 
> The reason is that Open Graph uses RDFa:
> 
> http://stackoverflow.com/questions/2704942/html-validation-error-for-property-attribute/2705090#2705090
> 
> Since <meta property="xxx" content="yyy" /> isn't valid for XHTML 1.0, these 
> tags can't be emitted.
> 
> But we could put them into the metadata map, by adding another test in the 
> HtmlHandler code that currently has:
> 
>            if ("META".equals(name) && atts.getValue("content") != null) {
>                // TIKA-478: For cases where we have either a name or
>                // "http-equiv", assume that XHTMLContentHandler will emit
>                // these in the <head>, thus passing them through safely.
>                if (atts.getValue("http-equiv") != null) {
>                    addHtmlMetadata(
>                            atts.getValue("http-equiv"),
>                            atts.getValue("content"));
>                } else if (atts.getValue("name") != null) {
>                    // Record the meta tag in the metadata
>                    addHtmlMetadata(
>                            atts.getValue("name"),
>                            atts.getValue("content"));
>                }
> 
> If we also catch the case where there is no name=xxx attribute but there is a 
> property=xxx attribute, then that would take a tag like:
> 
> <meta property="og:url" content="http://www.imdb.com/title/tt0117500/" />
> 
> and put it into the metadata map as "og:url" => 
> "http://www.imdb.com/title/tt0117500/"
> 
> Thoughts on this?
> 
> Thanks,
> 
> -- Ken
> 
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> custom big data solutions & training
> Hadoop, Cascading, Mahout & Solr
> 
> 
> 


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
