[ 
https://issues.apache.org/jira/browse/TIKA-980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13445415#comment-13445415
 ] 

Markus Jelsma commented on TIKA-980:
------------------------------------

No. The Any23 parser is DOM-based and the MicrodataContentHandler is SAX-based, 
so the two are very different. I reused the static *_TAGS sets, the checks on 
those sets, and of course the Item* classes. The ref feature (itemref) is still 
missing; it can refer to properties that appear later in the document.
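
To illustrate why forward references are awkward in a SAX-based handler, here 
is a minimal sketch. It assumes the ref feature means the Microdata itemref 
attribute, and it is not the code from the attached patches: the class 
ItemRefSketch, its fields, and the choice to key scopes by itemtype are all 
hypothetical. The idea is to record references and candidate properties while 
the events stream by, and to resolve them only once the document has ended.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch, not the MicrodataContentHandler from the patches.
class ItemRefSketch extends DefaultHandler {
    // id -> itemprop name, for every element that carries both attributes
    private final Map<String, String> propsById = new LinkedHashMap<>();
    // itemtype of a scope -> ids listed in its itemref attribute
    private final Map<String, List<String>> pendingRefs = new LinkedHashMap<>();

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        String id = atts.getValue("id");
        String prop = atts.getValue("itemprop");
        if (id != null && prop != null) {
            propsById.put(id, prop);
        }
        String refs = atts.getValue("itemref");
        if (atts.getValue("itemscope") != null && refs != null) {
            // The referenced ids may not have been seen yet, so only record them.
            List<String> ids = new ArrayList<>();
            for (String ref : refs.trim().split("\\s+")) {
                ids.add(ref);
            }
            pendingRefs.put(atts.getValue("itemtype"), ids);
        }
    }

    // Resolve the recorded references; only meaningful after the parse has finished.
    Map<String, List<String>> resolve() {
        Map<String, List<String>> resolved = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> scope : pendingRefs.entrySet()) {
            List<String> names = new ArrayList<>();
            for (String id : scope.getValue()) {
                String name = propsById.get(id);
                if (name != null) {
                    names.add(name);
                }
            }
            resolved.put(scope.getKey(), names);
        }
        return resolved;
    }

    // Parse well-formed XHTML and return itemtype -> referenced property names.
    static Map<String, List<String>> parse(String xhtml) {
        try {
            ItemRefSketch handler = new ItemRefSketch();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.resolve();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

A DOM-based extractor can simply look up the referenced element by id when it 
meets the itemref, which is why this part does not carry over directly. The 
sketch only collects property names; a real handler would also need to buffer 
the property values.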

The other problems are the linked issue about HTML5 elements, META tags being 
used in the body by large websites, and not being able to read the attributes 
of the BODY tag.
                
> MicrodataContentHandler for Apache Tika
> ---------------------------------------
>
>                 Key: TIKA-980
>                 URL: https://issues.apache.org/jira/browse/TIKA-980
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Markus Jelsma
>            Assignee: Ken Krugler
>             Fix For: 1.3
>
>         Attachments: TIKA-980-1.3-1.patch, TIKA-980-1.3-2.patch, 
> TIKA-980-1.3-3.patch
>
>
> ContentHandler for Apache Tika capable of building a data structure 
> containing Microdata item scopes and item properties. The Item* classes are 
> borrowed from the Apache Any23 project and are slightly modified to 
> accommodate this SAX-based extractor vs the original DOM-based extractor.
> The provided unit test outputs two item scopes about the Europe and NA 
> ApacheCon events and each has a nested property.
