[ 
https://issues.apache.org/jira/browse/TIKA-3072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17077474#comment-17077474
 ] 

Tim Allison commented on TIKA-3072:
-----------------------------------

The BoilerpipeContentHandler has some overhead.  When I use the 
WriteOutContentHandler (writing to a string in memory), I get a string length 
of 6,237,493 characters and no heap problems with -Xmx512m, but I do run into 
heap problems with the WriteOutContentHandler at -Xmx256m.
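
For a rough sense of scale, here is a back-of-the-envelope sketch of how much 
heap the raw character data of a string that long would need.  It assumes 2 
bytes per char (the UTF-16 worst case for a Java String) and ignores any 
temporary copies made while an in-memory buffer grows; the class and method 
names are made up for illustration and are not part of Tika.

```java
// Hypothetical helper, not part of Tika: estimates the raw size of the extracted text.
class HeapEstimate {
    // String length reported in the comment above.
    static final long CHARS = 6_237_493L;

    // Assumption: 2 bytes per char (UTF-16). Compact strings on newer JVMs may use less.
    static long utf16Bytes(long chars) {
        return chars * 2;
    }

    public static void main(String[] args) {
        long bytes = utf16Bytes(CHARS);
        double mib = bytes / (1024.0 * 1024.0);
        System.out.println("raw character data: " + bytes + " bytes (~" + Math.round(mib) + " MiB)");
        // Max heap of the running JVM, i.e. the effect of the -Xmx setting.
        long maxHeapMib = Runtime.getRuntime().maxMemory() / (1024 * 1024);
        System.out.println("max heap for this JVM: " + maxHeapMib + " MiB");
    }
}
```

Under those assumptions the text itself is on the order of 12 MiB, well below 
256m.  That suggests, but does not prove, that most of the heap is going to 
buffer copies and the parser's own data structures rather than to the final 
string.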

I don't _think_ this is a bug; I think it is just an inefficiency.  If you can 
find a way to improve the code, please let us know.

> Seeing org.apache.tika.exception.TikaException: Unexpected RuntimeException 
> for an XLS file
> -------------------------------------------------------------------------------------------
>
>                 Key: TIKA-3072
>                 URL: https://issues.apache.org/jira/browse/TIKA-3072
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Muhammad Yasir Khan
>            Priority: Major
>         Attachments: 0000431.xls
>
>
> [^0000431.xls]
> {code:java}
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
> org.apache.tika.parser.microsoft.OfficeParser@5d216317
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:159)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
