[ https://issues.apache.org/jira/browse/ANY23-154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Lewis John McGibbney updated ANY23-154:
---------------------------------------

    Attachment: XOYRVIbK.part
                neeraj.nowfloats.com.htm

I attach the source HTML and a report from any23.org (any23-0.7.0-incubating) which shows that in this case no microdata extractors are called for the markup.
This is truly an open issue, and we need to determine why the microdata extractors are not recognizing the embedded structure and are not being called to parse it out.

> Not able to extract microdata in few test cases
> -----------------------------------------------
>
>                 Key: ANY23-154
>                 URL: https://issues.apache.org/jira/browse/ANY23-154
>             Project: Apache Any23
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.7.0
>         Environment: Windows 7 32bit
> JDK 1.6.0_38
> Intel Core 2 duo and 4GB RAM
>            Reporter: Kunal P
>             Fix For: 0.9.0
>
>         Attachments: neeraj.nowfloats.com.htm, XOYRVIbK.part
>
>
> We are using the Apache Any23 API to extract microdata from a given web page
> as part of an internal project.
> We have some test cases where the API is not able to parse the microdata.
> www.neeraj.nowfloats.com (The web page is not following schema.org standards
> strictly)
> I am giving a snippet of the HTML code here.
> <div id="someid" itemprop="offer" itemscope
> itemtype="http://schema.org/Offer">
> <div ... ></div>
> </div>
> Because the tag contains itemscope as well as itemprop, it declares the
> microdata item to be a child (a property) of some parent microdata item.
> However, the given <div id="someid"> tag has no parent microdata specification.
> The method used for extracting ItemScopes is as follows:
> import org.apache.any23.extractor.microdata.ItemScope;
> import org.apache.any23.extractor.microdata.MicrodataParser;
> import org.apache.any23.extractor.microdata.MicrodataParserReport;
> import org.w3c.dom.Document;
> // getDomDocument is our own helper that builds a DOM from the HTML string
> Document dom = getDomDocument(html);
> MicrodataParserReport report = MicrodataParser.getMicrodata(dom);
> ItemScope[] items = report.getDetectedItemScopes();
> Here, items does not contain any ItemScope for the above test case.
> In such a scenario, how can we extract microdata from the page using the Any23 API?
> Is there any way to relax the criterion of itemprop and itemscope not
> appearing in the same tag so that we get the data from the web page?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
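
The relaxation asked about in the report can be illustrated without Any23 at all. The sketch below is not Any23 code and makes no claim about how Any23 selects item scopes internally. It assumes the rule from the HTML microdata specification that a top-level item is an element with itemscope and without itemprop, and it adds a hypothetical "relaxed" mode that also accepts an element carrying both attributes when no ancestor has itemscope. The class and method names (RelaxedItemScopeFinder, findItemScopes, hasItemScopeAncestor) are invented for illustration. It uses only the JDK's XML parser, so the sample markup is written as well-formed XHTML (itemscope=""), and the inner span stands in for the elided child of the original snippet. Real pages would need a tolerant HTML parser, and itemref is ignored here.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Any23.
final class RelaxedItemScopeFinder {

    /** Parses well-formed XHTML with the JDK parser (assumption: input is valid XML). */
    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XHTML", e);
        }
    }

    /** Returns true if any ancestor element carries an itemscope attribute. */
    static boolean hasItemScopeAncestor(Node node) {
        for (Node p = node.getParentNode(); p != null; p = p.getParentNode()) {
            if (p instanceof Element && ((Element) p).hasAttribute("itemscope")) {
                return true;
            }
        }
        return false;
    }

    /**
     * Collects item-scope elements in document order.
     * Strict mode: itemscope present and itemprop absent.
     * Relaxed mode: additionally accepts itemscope + itemprop elements
     * that have no itemscope ancestor ("orphaned" property items).
     */
    static List<Element> findItemScopes(Document dom, boolean relaxed) {
        List<Element> result = new ArrayList<Element>();
        NodeList all = dom.getElementsByTagName("*");
        for (int i = 0; i < all.getLength(); i++) {
            Element e = (Element) all.item(i);
            if (!e.hasAttribute("itemscope")) {
                continue;
            }
            boolean strictTopLevel = !e.hasAttribute("itemprop");
            boolean orphan = e.hasAttribute("itemprop") && !hasItemScopeAncestor(e);
            if (strictTopLevel || (relaxed && orphan)) {
                result.add(e);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // XHTML rendering of the reported snippet; the span is illustrative.
        String xhtml = "<html><body>"
                + "<div id=\"someid\" itemprop=\"offer\" itemscope=\"\""
                + " itemtype=\"http://schema.org/Offer\">"
                + "<span itemprop=\"name\">Sample offer</span>"
                + "</div></body></html>";
        Document dom = parse(xhtml);
        System.out.println("strict:  " + findItemScopes(dom, false).size());
        System.out.println("relaxed: " + findItemScopes(dom, true).size());
    }
}
```

On the sample markup, strict mode finds no item scope and relaxed mode finds the someid element, which matches the symptom described in the report. Whether Any23 should adopt such a relaxation is a separate design question for the issue.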