[jira] [Commented] (TIKA-241) Rar archive support

2011-09-22 Thread Christian Goeller (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13113185#comment-13113185 ] Christian Goeller commented on TIKA-241: The original junrar still has some dependen

Re: Support for Open Graph meta tags

2011-09-22 Thread Mattmann, Chris A (388J)
Hey Ken, Super +1, this sounds like a great idea. Cheers, Chris On Sep 22, 2011, at 6:23 PM, Ken Krugler wrote: > We were recently using Tika to process HTML pages that might have Open Graph > meta tags. > > The issue is that these tags get stripped out, and also aren't put into the > metada

Support for Open Graph meta tags

2011-09-22 Thread Ken Krugler
We were recently using Tika to process HTML pages that might have Open Graph meta tags. The issue is that these tags get stripped out, and also aren't put into the metadata map. The reason why is that Open Graph uses RDFa http://stackoverflow.com/questions/2704942/html-validation-error-for-pro

[jira] [Commented] (TIKA-712) Master slide text isn't extracted

2011-09-22 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13112875#comment-13112875 ] Michael McCandless commented on TIKA-712: - Maybe, until we work this out, we should

[jira] [Commented] (TIKA-712) Master slide text isn't extracted

2011-09-22 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13112808#comment-13112808 ] Michael McCandless commented on TIKA-712: - I suppose a hackish solution would be to

[jira] [Updated] (TIKA-712) Master slide text isn't extracted

2011-09-22 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated TIKA-712: Attachment: TIKA-712-master-slide.xml Full master slide XML. > Master slide text isn't extra

[jira] [Commented] (TIKA-712) Master slide text isn't extracted

2011-09-22 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13112801#comment-13112801 ] Michael McCandless commented on TIKA-712: - Good idea! Nice how approachable OOXML i

[jira] [Resolved] (TIKA-552) Further improvements to Word .doc and .docx parsing

2011-09-22 Thread Jukka Zitting (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jukka Zitting resolved TIKA-552. Resolution: Fixed Resolving as fixed. Let's use followup issues with tighter scopes for further impr

[jira] [Resolved] (TIKA-508) HtmlParser link processing should skip usemap and codebase attributes

2011-09-22 Thread Jukka Zitting (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jukka Zitting resolved TIKA-508. Resolution: Fixed I removed codebase and the related data and classid attributes from the URL_ATTRIB

Jenkins build became unstable: Tika-trunk #642

2011-09-22 Thread Apache Jenkins Server
See

buildbot failure in ASF Buildbot on tika-trunk

2011-09-22 Thread buildbot
The Buildbot has detected a new failure on builder tika-trunk while building ASF Buildbot. Full details are available at: http://ci.apache.org/builders/tika-trunk/builds/513 Buildbot URL: http://ci.apache.org/ Buildslave for this Build: isis_ubuntu Build Reason: scheduler Build Source Stamp: [

[jira] [Commented] (TIKA-720) EBCDIC encoding not detected

2011-09-22 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13112756#comment-13112756 ] Nick Burch commented on TIKA-720: - Here's the thread (no replies yet...) on the ICU4J mailin

[jira] [Commented] (TIKA-720) EBCDIC encoding not detected

2011-09-22 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13112668#comment-13112668 ] Michael McCandless commented on TIKA-720: - Thanks Nick! I'll see if I can find some

[jira] [Commented] (TIKA-720) EBCDIC encoding not detected

2011-09-22 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13112592#comment-13112592 ] Nick Burch commented on TIKA-720: - I've spent a bit of time studying the code (which comes f

[jira] [Issue Comment Edited] (TIKA-727) Improve the outputed XHTML by HSLFExtractor

2011-09-22 Thread Jukka Zitting (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13112531#comment-13112531 ] Jukka Zitting edited comment on TIKA-727 at 9/22/11 1:11 PM: - bq

[jira] [Commented] (TIKA-727) Improve the outputed XHTML by HSLFExtractor

2011-09-22 Thread Jukka Zitting (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13112531#comment-13112531 ] Jukka Zitting commented on TIKA-727: .bq   Note that the XML serializer will automatica

[jira] [Issue Comment Edited] (TIKA-727) Improve the outputed XHTML by HSLFExtractor

2011-09-22 Thread Pablo Queixalos (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13112496#comment-13112496 ] Pablo Queixalos edited comment on TIKA-727 at 9/22/11 12:27 PM: --

[jira] [Issue Comment Edited] (TIKA-727) Improve the outputed XHTML by HSLFExtractor

2011-09-22 Thread Pablo Queixalos (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13112496#comment-13112496 ] Pablo Queixalos edited comment on TIKA-727 at 9/22/11 12:12 PM: --

[jira] [Commented] (TIKA-727) Improve the outputed XHTML by HSLFExtractor

2011-09-22 Thread Pablo Queixalos (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13112496#comment-13112496 ] Pablo Queixalos commented on TIKA-727: -- Great ! (i) The non-breaking-space entities in

[jira] [Issue Comment Edited] (TIKA-727) Improve the outputed XHTML by HSLFExtractor

2011-09-22 Thread Pablo Queixalos (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13112496#comment-13112496 ] Pablo Queixalos edited comment on TIKA-727 at 9/22/11 12:02 PM: --

[jira] [Commented] (TIKA-727) Improve the outputed XHTML by HSLFExtractor

2011-09-22 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13112476#comment-13112476 ] Nick Burch commented on TIKA-727: - Thanks for this, applied with some tweaks in r1174056. L

RE: HSLFExtractor & POI : Looking for better XHTML

2011-09-22 Thread Pablo Queixalos
Thank you for your answers. I created the related JIRA entry https://issues.apache.org/jira/browse/TIKA-727 Pablo. -Original Message- From: Nick Burch [mailto:nick.bu...@alfresco.com] Sent: Thursday, 22 September 2011 11:55 To: dev@tika.apache.org Subject: Re: HSLFExtractor & POI : Lookin

[jira] [Updated] (TIKA-727) Improve the outputed XHTML by HSLFExtractor

2011-09-22 Thread Pablo Queixalos (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pablo Queixalos updated TIKA-727: - Attachment: HSLFExtractor.java Parser implementation based on what the POI PowerPointExtractor does

[jira] [Created] (TIKA-727) Improve the outputed XHTML by HSLFExtractor

2011-09-22 Thread Pablo Queixalos (JIRA)
Improve the outputed XHTML by HSLFExtractor --- Key: TIKA-727 URL: https://issues.apache.org/jira/browse/TIKA-727 Project: Tika Issue Type: Improvement Components: parser Affects Versions

[jira] [Commented] (TIKA-241) Rar archive support

2011-09-22 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13112455#comment-13112455 ] Nick Burch commented on TIKA-241: - Would one of our Maven gurus be able to work on getting j

Re: HSLFExtractor & POI : Looking for better XHTML

2011-09-22 Thread Nick Burch
On Thu, 22 Sep 2011, Pablo Queixalos wrote: Based on the PowerPointExtractor implementation, I rewrote the HSLFExtractor parser. This new impl produces a better XHTML but uses the org.apache.poi.hslf POI model. If you wouldn't mind, please create a new JIRA entry for this, and upload your pat

RE: HSLFExtractor & POI : Looking for better XHTML

2011-09-22 Thread Pablo Queixalos
Oops, attachment was dropped. Here it is: http://dl.free.fr/mJ2N9wIBh/HSLFExtractor.java From: Pablo Queixalos [mailto:pablo.queixa...@polyspot.com] Sent: Thursday, 22 September 2011 11:34 To: dev@tika.apache.org Subject: HSLFExtractor & POI : Looking for better XHTML Hi, The XHT

HSLFExtractor & POI : Looking for better XHTML

2011-09-22 Thread Pablo Queixalos
Hi, The XHTML output of HSLFExtractor parser is not pure XHTML, it only inserts the full text into a P[aragraph] tag (including non-html carriage returns). This behavior comes from the poor capabilities that the POI PowerPointExtractor offers. Based on the PowerPointExtractor impleme

[jira] [Commented] (TIKA-241) Rar archive support

2011-09-22 Thread Christian Goeller (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13112415#comment-13112415 ] Christian Goeller commented on TIKA-241: The project can now be found here: https://