[jira] [Commented] (ANY23-247) FIX Attribute name "itemscope" associated with an element type "html" must be followed by the ' = ' character.

ASF GitHub Bot (JIRA) Fri, 25 Mar 2016 15:01:25 -0700

    [ 
https://issues.apache.org/jira/browse/ANY23-247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212476#comment-15212476
 ]


ASF GitHub Bot commented on ANY23-247:
--------------------------------------

Github user lewismc commented on the pull request:

    https://github.com/apache/any23/pull/17#issuecomment-201537723
  
    Hi @ansell, OK, I've added the correct rule and fix, as well as a test to 
verify that empty itemscope values are identified and fixed. 
    Whilst debugging this, however, I found that the core issue persists. The 
reason is that ```RDFa11Extractor extends BaseRDFExtractor```, which inherits the 
[parser function inputstream 
parameter](https://github.com/apache/any23/blob/63ba2fc82966cc056a2e475af849154d0dfdcf93/core/src/main/java/org/apache/any23/extractor/rdf/BaseRDFExtractor.java#L105).
 This input stream is not the 'fixed' stream but the raw document. 
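    To illustrate the distinction, here is a minimal sketch of what handing a 
'fixed' stream to the parser could look like. It is not the code from the pull 
request: the class ```ItemScopeFixSketch``` and the methods 
```fixBareItemscope``` and ```fixedStream``` are hypothetical names, the input 
is assumed to be UTF-8, and the regex is a simplification that does not 
distinguish attributes from ordinary text content.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the Any23 API.
final class ItemScopeFixSketch {

    // Matches "itemscope" preceded by whitespace, not followed by "=",
    // and terminated by whitespace, "/" or ">".
    private static final Pattern BARE_ITEMSCOPE = Pattern.compile(
            "(\\sitemscope)(?!\\s*=)(?=[\\s/>])", Pattern.CASE_INSENSITIVE);

    // Appends ="" to every bare itemscope so that a strict XML parser accepts it.
    static String fixBareItemscope(String html) {
        return BARE_ITEMSCOPE.matcher(html).replaceAll("$1=\"\"");
    }

    // Reads the raw document fully and returns a new stream over the fixed markup.
    // Assumption: the document is UTF-8 encoded.
    static InputStream fixedStream(InputStream raw) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int read;
        while ((read = raw.read(chunk)) != -1) {
            buffer.write(chunk, 0, read);
        }
        String html = new String(buffer.toByteArray(), StandardCharsets.UTF_8);
        return new ByteArrayInputStream(
                fixBareItemscope(html).getBytes(StandardCharsets.UTF_8));
    }
}
```

    The point of the sketch is only that the parser has to consume the stream 
returned by something like ```fixedStream```, not the original one; whichever 
extractor receives the raw input stream directly will still hit the error.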
    The only ways around this that I can think of are for us to 
     * refactor the 
[RDFa1.1Extractor](https://github.com/apache/any23/blob/master/core/src/main/java/org/apache/any23/extractor/rdfa/RDFa11Extractor.java)
 such that it extends 
[TagSoupDomExtractor](https://github.com/apache/any23/blob/63ba2fc82966cc056a2e475af849154d0dfdcf93/api/src/main/java/org/apache/any23/extractor/Extractor.java#L60)
 as opposed to (eventually) the 
[ContentExtractor](https://github.com/apache/any23/blob/63ba2fc82966cc056a2e475af849154d0dfdcf93/api/src/main/java/org/apache/any23/extractor/Extractor.java#L44),
 or
     * undertake a mass refactoring which essentially removes the 
[ContentExtractor](https://github.com/apache/any23/blob/63ba2fc82966cc056a2e475af849154d0dfdcf93/api/src/main/java/org/apache/any23/extractor/Extractor.java#L44)
 altogether... this would provide us with a much more flexible and adaptable 
extraction framework IMHO.
    
    What do you think?


> FIX Attribute name "itemscope" associated with an element type "html" must be 
> followed by the ' = ' character.
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: ANY23-247
>                 URL: https://issues.apache.org/jira/browse/ANY23-247
>             Project: Apache Any23
>          Issue Type: Improvement
>    Affects Versions: 1.1
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>             Fix For: 1.2
>
>
> In the following markup
> {code}
> <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" 
> "http://www.w3.org/TR/html4/loose.dtd">
> <html xmlns="http://www.w3.org/1999/xhtml" 
> xmlns:og="http://opengraphprotocol.org/schema/" 
> xmlns:fb="http://www.facebook.com/2008/fbml" version="HTML+RDFa 1.0" 
> xml:lang="en" itemscope itemtype="http://schema.org/Product">
> <head>
> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
> <meta http-equiv="X-UA-Compatible" content="IE=edge" />
> <meta name="generator" content="ToolTwist" />
> ...
> {code}
> Due to the absence of any subsequent value for *itemscope*, we get the 
> following error in our web server logs
> {code}
> [Fatal Error] :2:185: Attribute name "itemscope" associated with an element 
> type "html" must be followed by the ' = ' character.
> {code}
> Although the markup semantics are incorrect, Any23 should simply check 
> whether the itemscope value is null and, if so, add *=""*. There is a 
> precedent for us doing something like this before; I just can't find the 
> ticket right now!
> The code we need to add belongs in one of the following files:
> core/src/main/java/org/apache/any23/extractor/microdata/ItemScope.java
> core/src/main/java/org/apache/any23/extractor/microdata/MicrodataParser.java



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
