[jira] [Commented] (TIKA-1707) Upgrade to Apache POI 3.13 Beta 2
[ https://issues.apache.org/jira/browse/TIKA-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14969173#comment-14969173 ]

Tim Allison commented on TIKA-1707:
-----------------------------------

That was a bad idea. The issue was that the first run ended with \u000b, and {{split}} discards trailing empty strings, so that "paragraph break" before the next run was hidden. So, how about adding a check for {{if (line.endsWith("\u000b"))}}:

{noformat}
if (line != null) {
    boolean isfirst = true;
    for (String fragment : line.split("\\u000b")) {
        if (!isfirst) {
            xhtml.startElement("br");
            xhtml.endElement("br");
        }
        isfirst = false;
        xhtml.characters(removePBreak(fragment));
    }
    if (line.endsWith("\u000b")) {
        xhtml.startElement("br");
        xhtml.endElement("br");
    }
}
{noformat}

> Upgrade to Apache POI 3.13 Beta 2
> ---------------------------------
>
>                 Key: TIKA-1707
>                 URL: https://issues.apache.org/jira/browse/TIKA-1707
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.9
>            Reporter: Andreas Beeker
>            Assignee: Tim Allison
>         Attachments: 075166.ppt, common_sl.diff, dont_trim_and_bullets.patch
>
>
> In the not too distant future, POI 3.13 Beta 2 will be available.
> It contains quite a big change to the PowerPoint modules XSLF/HSLF, but
> thankfully Tika isn't much affected.
> Please try the patch on our trunk and report any side effects.
> As the work on the common_sl API hasn't been finished yet, there might be
> another patch for the next POI beta version.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
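The extra {{endsWith}} check proposed above is needed because Java's String.split drops trailing empty strings. The following is only a minimal standalone sketch of that behaviour. VerticalTabSplitDemo and render are hypothetical names, plain strings stand in for the xhtml SAX calls, and removePBreak is left out, so this is not the Tika code itself.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not part of Tika or POI.
class VerticalTabSplitDemo {

    // Collects the fragments and "<br>" markers that the proposed loop
    // would emit, using strings in place of SAX events.
    static List<String> render(String line) {
        List<String> out = new ArrayList<>();
        if (line == null) {
            return out;
        }
        boolean isFirst = true;
        for (String fragment : line.split("\\u000b")) {
            if (!isFirst) {
                out.add("<br>");
            }
            isFirst = false;
            out.add(fragment);
        }
        // split() discards trailing empty strings, so a run that ends with
        // a vertical tab (U+000B) needs its final break added explicitly.
        if (line.endsWith("\u000b")) {
            out.add("<br>");
        }
        return out;
    }

    public static void main(String[] args) {
        String run = "one\u000btwo\u000b";
        // The trailing separator yields no third fragment.
        System.out.println(run.split("\\u000b").length);
        System.out.println(render(run));
    }
}
```

Without the final check, a run ending in a vertical tab would produce the same output as one that does not, and the text of the next run would be appended directly to the last fragment.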
[jira] [Commented] (TIKA-1707) Upgrade to Apache POI 3.13 Beta 2
[ https://issues.apache.org/jira/browse/TIKA-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968266#comment-14968266 ]

Tim Allison commented on TIKA-1707:
-----------------------------------

Thank you! That fixed the vast majority of content diffs. There are still six PPT files where we aren't adding a break between lines, so terms are improperly concatenated across lines. Should we move {{boolean isFirst = true;}} above {{for (HSLFTextRun htr : textRuns) {}}?
[jira] [Commented] (TIKA-1707) Upgrade to Apache POI 3.13 Beta 2
[ https://issues.apache.org/jira/browse/TIKA-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968118#comment-14968118 ]

Andreas Beeker commented on TIKA-1707:
--------------------------------------

I would replace it with the empty string, using the regex escape for the line break:

{code:java}
fragment.replaceFirst("\\r$", "")
{code}

Apart from that, I've added a patch for bullet lists. Currently HSLF always returns false for super/subscript, so I need to change this in POI. Please comment if it makes sense to add further markup information.
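A small sketch of how the replacement suggested above differs from trim(): it removes only a single trailing carriage return and leaves the surrounding spaces alone. TrailingCrDemo and stripTrailingCr are hypothetical names chosen for illustration, and the sample strings are made up.

```java
// Hypothetical demo class, not part of Tika or POI.
class TrailingCrDemo {

    // Removes one carriage return at the very end of the fragment and
    // nothing else; leading and trailing spaces are preserved.
    static String stripTrailingCr(String fragment) {
        return fragment.replaceFirst("\\r$", "");
    }

    public static void main(String[] args) {
        String fragment = " bold \r";
        // Brackets make the surviving whitespace visible.
        System.out.println("[" + stripTrailingCr(fragment) + "]");
        System.out.println("[" + fragment.trim() + "]");
    }
}
```

The design point is that the spaces at the edges of a run are kept for the neighbouring run, while the paragraph-ending carriage return is still removed.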
[jira] [Commented] (TIKA-1707) Upgrade to Apache POI 3.13 Beta 2
[ https://issues.apache.org/jira/browse/TIKA-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14966888#comment-14966888 ]

Tim Allison commented on TIKA-1707:
-----------------------------------

[~kiwiwings], we found a regression in spacing around differently formatted runs in ppt (TIKA-1778). Do you see any problems if we don't {{trim}} here:

{noformat}
for (HSLFTextRun htr : htp.getTextRuns()) {
    String line = htr.getRawText();
    if (line != null) {
        for (String fragment : line.split("\\u000b")) {
            xhtml.characters(fragment.trim());
...
{noformat}

If we drop it, will we get spaces where we shouldn't?
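To illustrate the kind of spacing regression described above, here is a minimal sketch that assumes two adjacent runs whose only separating space sits at the end of the first run. RunTrimDemo and joinRuns are hypothetical names, and the run strings are invented for the example.

```java
// Hypothetical demo class, not part of Tika or POI.
class RunTrimDemo {

    // Concatenates run texts the way consecutive characters() calls would,
    // optionally trimming each run first.
    static String joinRuns(String[] runs, boolean trim) {
        StringBuilder sb = new StringBuilder();
        for (String run : runs) {
            sb.append(trim ? run.trim() : run);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Two differently formatted runs on the same line.
        String[] runs = {"Hello ", "world"};
        System.out.println(joinRuns(runs, true));
        System.out.println(joinRuns(runs, false));
    }
}
```

With trimming, the space between the two words is lost; without it, the space survives, along with any other leading or trailing whitespace the runs may carry.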
[jira] [Commented] (TIKA-1707) Upgrade to Apache POI 3.13 Beta 2
[ https://issues.apache.org/jira/browse/TIKA-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14938167#comment-14938167 ]

Hudson commented on TIKA-1707:
------------------------------

SUCCESS: Integrated in tika-trunk-jdk1.7 #860 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/860/])
TIKA-1707: upgrade to POI 3.13 (tallison: [http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1706079])
* trunk/CHANGES.txt
* trunk/tika-parsers/pom.xml
* trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/HSLFExtractor.java
* trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSLFPowerPointExtractorDecorator.java
* trunk/tika-parsers/src/test/java/org/apache/tika/parser/microsoft/PowerPointParserTest.java
[jira] [Commented] (TIKA-1707) Upgrade to Apache POI 3.13 Beta 2
[ https://issues.apache.org/jira/browse/TIKA-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698406#comment-14698406 ]

Andreas Beeker commented on TIKA-1707:
--------------------------------------

The affected test cases are OK now. I haven't run the full Tika test suite, as my JRE chokes on the 2 GB heap setting, but tika-parsers seems to be OK with 1 GB.
[jira] [Commented] (TIKA-1707) Upgrade to Apache POI 3.13 Beta 2
[ https://issues.apache.org/jira/browse/TIKA-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698383#comment-14698383 ]

Nick Burch commented on TIKA-1707:
----------------------------------

The build is hopefully working again now. If you could re-test, that'd be wonderful!
[jira] [Commented] (TIKA-1707) Upgrade to Apache POI 3.13 Beta 2
[ https://issues.apache.org/jira/browse/TIKA-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696563#comment-14696563 ]

Nick Burch commented on TIKA-1707:
----------------------------------

Looks good so far to me! One quick thing though: any chance you could review http://tika.apache.org/contribute.html#Code_Formatting and tweak your formatting settings accordingly? (I think your IDE settings aren't quite right for Tika, for imports at least.)