[jira] [Commented] (TIKA-1707) Upgrade to Apache POI 3.13 Beta 2

2015-10-22 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14969173#comment-14969173
 ] 

Tim Allison commented on TIKA-1707:
---

That was a bad idea.  

The issue was that the first run ended with \u000b, and {{split}} drops 
trailing empty strings, which hid that "paragraph break" before the next run.

So, how about adding an {{if (line.endsWith("\u000b"))}} check:
{noformat}
if (line != null) {
    boolean isfirst = true;
    for (String fragment : line.split("\\u000b")) {
        if (!isfirst) {
            xhtml.startElement("br");
            xhtml.endElement("br");
        }
        isfirst = false;
        xhtml.characters(removePBreak(fragment));
    }
    if (line.endsWith("\u000b")) {
        xhtml.startElement("br");
        xhtml.endElement("br");
    }
}
{noformat}
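
To illustrate why the extra check matters, here is a minimal standalone sketch. It assumes nothing about the real extractor: the class name {{TrailingBreakDemo}} and the {{render}} helper are hypothetical, and the literal string "<br/>" stands in for the {{xhtml.startElement("br")}} / {{xhtml.endElement("br")}} pair. It only shows that {{String.split}} with the default limit discards trailing empty strings.

```java
class TrailingBreakDemo {

    // Hypothetical stand-in for the extractor loop: renders one run of raw
    // text, appending "<br/>" where the real code would emit a br element.
    static String render(String line) {
        StringBuilder out = new StringBuilder();
        boolean isFirst = true;
        // With the default limit, split() removes trailing empty strings,
        // so a run that ends with a vertical tab loses its final fragment.
        for (String fragment : line.split("\\u000b")) {
            if (!isFirst) {
                out.append("<br/>");
            }
            isFirst = false;
            out.append(fragment);
        }
        // Restore the break that split() discarded at the end of the run.
        if (line.endsWith("\u000b")) {
            out.append("<br/>");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The trailing vertical tab does not produce a second fragment.
        System.out.println("first\u000b".split("\\u000b").length); // 1
        // Two consecutive runs: the break between them is preserved.
        System.out.println(render("first\u000b") + render("second")); // first<br/>second
        // A break in the middle of a run is handled by the loop itself.
        System.out.println(render("a\u000bb")); // a<br/>b
    }
}
```

Without the {{endsWith}} check, the first two runs above would come out as "firstsecond", which matches the concatenation seen in the remaining diffs.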

> Upgrade to Apache POI 3.13 Beta 2
> ---------------------------------
>
>                 Key: TIKA-1707
>                 URL: https://issues.apache.org/jira/browse/TIKA-1707
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.9
>            Reporter: Andreas Beeker
>            Assignee: Tim Allison
>         Attachments: 075166.ppt, common_sl.diff, dont_trim_and_bullets.patch
>
>
> In the not so far future, POI 3.13 Beta 2 will be available.
> This contains a quite big change to the Powerpoint modules XSLF/HSLF, but 
> thankfully TIKA isn't much affected.
> Please try the patch on our trunk and post side-effects.
> As the work on the common_sl api hasn't been finished yet, there might be 
> another patch for the next POI beta version.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1707) Upgrade to Apache POI 3.13 Beta 2

2015-10-21 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968266#comment-14968266
 ] 

Tim Allison commented on TIKA-1707:
---

Thank you!  That fixed the vast majority of content diffs.  There are still 6 
ppts where we aren't adding a break between lines, so terms are improperly 
concatenated across lines.  Should we move {{boolean isFirst = true;}} above 
{{for (HSLFTextRun htr : textRuns) {}}?



[jira] [Commented] (TIKA-1707) Upgrade to Apache POI 3.13 Beta 2

2015-10-21 Thread Andreas Beeker (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968118#comment-14968118
 ] 

Andreas Beeker commented on TIKA-1707:
--

I would replace it with the empty string and use the regex escape for the line 
break
{code:java}
fragment.replaceFirst("\\r$", "")
{code}
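
As a sketch of how that call behaves, the snippet below wraps it in a helper. The name {{removePBreak}} is borrowed from elsewhere in this thread, but its body here is an assumption rather than the committed implementation, and the class name {{ParagraphBreakDemo}} is hypothetical.

```java
class ParagraphBreakDemo {

    // Assumed body for removePBreak: strips a single carriage return at the
    // end of the fragment and leaves every other character untouched.
    static String removePBreak(String fragment) {
        // "\\r" is the regex escape for a carriage return; "$" anchors it
        // to the end of the input.
        return fragment.replaceFirst("\\r$", "");
    }

    public static void main(String[] args) {
        // A trailing carriage return is removed.
        System.out.println(removePBreak("Title\r").equals("Title")); // true
        // Trailing spaces survive, unlike with trim().
        System.out.println(removePBreak("word \r").equals("word ")); // true
        // A carriage return in the middle of the text is not touched.
        System.out.println(removePBreak("a\rb").equals("a\rb")); // true
    }
}
```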

Apart from that, I've added a patch for bullet lists.
Currently HSLF always returns false for super/subscript ... I need to change 
this in POI.

Please comment if it makes sense to add further markup information.



[jira] [Commented] (TIKA-1707) Upgrade to Apache POI 3.13 Beta 2

2015-10-21 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14966888#comment-14966888
 ] 

Tim Allison commented on TIKA-1707:
---

[~kiwiwings], we found a regression in spacing around differently formatted 
runs in ppt (TIKA-1778).  Do you see any problems if we don't {{trim}} here:
{noformat}
for (HSLFTextRun htr : htp.getTextRuns()) {
    String line = htr.getRawText();
    if (line != null) {
        for (String fragment : line.split("\\u000b")) {
            xhtml.characters(fragment.trim());
...
{noformat}

If we drop it, will we get spaces where we shouldn't?
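
For context, here is a small sketch of the spacing regression itself. It is not taken from the extractor: the class {{TrimDemo}}, the {{join}} helper, and the sample run strings are made up for illustration. It shows how trimming each differently formatted run glues neighbouring words together.

```java
class TrimDemo {

    // Hypothetical helper: concatenates the text of consecutive runs,
    // optionally trimming each run first, as the current loop does.
    static String join(String[] runs, boolean trim) {
        StringBuilder sb = new StringBuilder();
        for (String run : runs) {
            sb.append(trim ? run.trim() : run);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Three runs with different formatting; the spaces between words
        // live at the edges of the runs.
        String[] runs = {"Hello ", "bold", " world"};
        System.out.println(join(runs, true));  // Helloboldworld
        System.out.println(join(runs, false)); // Hello bold world
    }
}
```

One thing to keep in mind: {{String.trim()}} removes every leading and trailing character up to U+0020, which includes carriage returns and vertical tabs as well as spaces, so dropping it can expose trailing control characters that were previously stripped.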



[jira] [Commented] (TIKA-1707) Upgrade to Apache POI 3.13 Beta 2

2015-09-30 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14938167#comment-14938167
 ] 

Hudson commented on TIKA-1707:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #860 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/860/])
TIKA-1707: upgrade to POI 3.13 (tallison: 
[http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1706079])
* trunk/CHANGES.txt
* trunk/tika-parsers/pom.xml
* trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/HSLFExtractor.java
* trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSLFPowerPointExtractorDecorator.java
* trunk/tika-parsers/src/test/java/org/apache/tika/parser/microsoft/PowerPointParserTest.java




[jira] [Commented] (TIKA-1707) Upgrade to Apache POI 3.13 Beta 2

2015-08-15 Thread Andreas Beeker (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698406#comment-14698406
 ] 

Andreas Beeker commented on TIKA-1707:
--

The affected test cases are ok now ... I haven't tried the full-fledged Tika 
test suite, as my JRE chokes on the 2GB heap settings, but tika-parsers seems 
to be ok with 1GB.



[jira] [Commented] (TIKA-1707) Upgrade to Apache POI 3.13 Beta 2

2015-08-15 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698383#comment-14698383
 ] 

Nick Burch commented on TIKA-1707:
--

The build is hopefully working again now. If you could re-test, that'd be 
wonderful!



[jira] [Commented] (TIKA-1707) Upgrade to Apache POI 3.13 Beta 2

2015-08-13 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696563#comment-14696563
 ] 

Nick Burch commented on TIKA-1707:
--

Looks good so far to me!

One quick thing though - any chance you could review 
http://tika.apache.org/contribute.html#Code_Formatting and tweak your 
formatting settings accordingly? (I think your IDE isn't quite right for Tika, 
for imports at least)
