[jira] Resolved: (TIKA-140) HTML parser unable to extract text

Jukka Zitting (JIRA) Mon, 22 Sep 2008 16:04:36 -0700

     [ https://issues.apache.org/jira/browse/TIKA-140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Jukka Zitting resolved TIKA-140.
--------------------------------

    Resolution: Fixed

Resolved in a somewhat different manner in revision 698028.

Instead of adding the special "*" wildcard to the XPath matcher, I created a 
new XHTMLDowngradeHandler decorator class that makes sure that all incoming 
(X)HTML is uniformly structured.
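
For illustration, a decorator of this kind might look roughly like the sketch below. It is not the code committed in revision 698028. The class name DowngradeSketch, the helper elementNames(), and the specific normalization rules (drop namespaces, lower-case element and attribute names) are assumptions made for the example. It uses only the JDK's SAX classes, so that a downstream matcher sees the same plain element names whether the input was HTML or namespaced XHTML.

```java
import java.io.StringReader;
import java.util.Locale;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical sketch of a "downgrading" SAX decorator; not Tika's actual class.
class DowngradeSketch extends XMLFilterImpl {

    DowngradeSketch(ContentHandler target) {
        // XMLFilterImpl forwards every SAX event to this handler.
        setContentHandler(target);
    }

    // Strip any prefix and lower-case the name (assumed normalization rule).
    private static String plain(String localName, String qName) {
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        int colon = name.indexOf(':');
        if (colon >= 0) {
            name = name.substring(colon + 1);
        }
        return name.toLowerCase(Locale.ROOT);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        // Copy the attributes without namespaces and with lower-case names.
        AttributesImpl plainAtts = new AttributesImpl();
        for (int i = 0; i < atts.getLength(); i++) {
            String attName = plain(atts.getLocalName(i), atts.getQName(i));
            plainAtts.addAttribute("", attName, attName, atts.getType(i), atts.getValue(i));
        }
        String name = plain(localName, qName);
        super.startElement("", name, name, plainAtts);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        String name = plain(localName, qName);
        super.endElement("", name, name);
    }

    // Namespace prefix mappings are swallowed so the target never sees them.
    @Override
    public void startPrefixMapping(String prefix, String uri) {
    }

    @Override
    public void endPrefixMapping(String prefix) {
    }

    // Demo helper: parse a string and record "{namespace}localName" per start tag.
    static String elementNames(String xml) throws Exception {
        StringBuilder seen = new StringBuilder();
        ContentHandler recorder = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                seen.append('{').append(uri).append('}').append(localName).append(' ');
            }
        };
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();
        reader.setContentHandler(new DowngradeSketch(recorder));
        reader.parse(new InputSource(new StringReader(xml)));
        return seen.toString().trim();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(elementNames(
                "<HTML xmlns=\"http://www.w3.org/1999/xhtml\"><Body><P>hi</P></Body></HTML>"));
    }
}
```

With a wrapper like this in front of it, a matcher written for plain paths such as /html/body should not need a wildcard to cope with namespaced or differently cased elements, which appears to be the motivation for preferring a decorator over the "*" approach.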

> HTML parser unable to extract text 
> -----------------------------------
>
>                 Key: TIKA-140
>                 URL: https://issues.apache.org/jira/browse/TIKA-140
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.2-incubating
>            Reporter: julien nioche
>            Assignee: Jukka Zitting
>             Fix For: 0.2-incubating
>
>         Attachments: 1.html, anynamespace.diff
>
>
> At revision 648732, the attached file is not parsed properly by the current
> HTML parser, which returns an empty string when calling
> ParseUtils.getStringContent(). Saving the same document as .txt from
> Firefox gives some text.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
