[
https://issues.apache.org/jira/browse/TIKA-2542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16316845#comment-16316845
]
Manolo Caracuel commented on TIKA-2542:
---------------------------------------
[[email protected]], I tried this and I get an OOM with moderately sized
documents (from a very quick look, it seems to buffer everything in a
StringBuilder first). With my change it works OK, as it seems to stream
the generated output as it becomes available. I need to be able to parse large
files.
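
To illustrate the difference I mean, here is a minimal sketch in plain Java.
It is not tika-server code: the class and method names are hypothetical, and
the chunks stand in for text produced incrementally by a parser. The buffered
variant holds the whole document in memory before returning, while the
streaming variant hands each chunk to the output as soon as it is available.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; names are not taken from tika-server.
class StreamingSketch {

    // Buffering: every chunk is accumulated, so peak memory grows
    // with the size of the whole extracted document.
    static String extractBuffered(Iterable<String> chunks) {
        StringBuilder sb = new StringBuilder();
        for (String chunk : chunks) {
            sb.append(chunk);
        }
        return sb.toString();
    }

    // Streaming: each chunk is written to the caller's Writer immediately,
    // so this method itself never holds more than one chunk at a time.
    static void extractStreaming(Iterable<String> chunks, Writer out) throws IOException {
        for (String chunk : chunks) {
            out.write(chunk);
        }
        out.flush();
    }

    public static void main(String[] args) throws IOException {
        List<String> chunks = Arrays.asList("first page ", "second page ", "third page");

        System.out.println(extractBuffered(chunks));

        // A StringWriter is used here only so the demo is self-contained;
        // a server would pass a writer over the response stream instead.
        StringWriter out = new StringWriter();
        extractStreaming(chunks, out);
        System.out.println(out);
    }
}
```

Both variants produce the same text; the only difference is how much of it
sits in memory at once, which is what matters for large files.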
> Support in tika-server for getting plain text and metadata at the same time
> ---------------------------------------------------------------------------
>
> Key: TIKA-2542
> URL: https://issues.apache.org/jira/browse/TIKA-2542
> Project: Tika
> Issue Type: Improvement
> Components: core, server
> Affects Versions: 1.17
> Reporter: Manolo Caracuel
> Priority: Minor
> Labels: pull-request-available
> Fix For: 1.18
>
> Original Estimate: 48h
> Remaining Estimate: 48h
>
> It would be good to have a way to get a file's extracted plain text and also
> get the detected metadata. Currently you can only get the metadata if the
> request has an Accept header of text/xml or text/html, but then the text in
> the body is not plain text, as it contains HTML elements as well.
> I propose that when requesting /tika/plain with an Accept header of text/xml,
> an XHTML document is returned with the metadata in the head's meta elements
> and the plain text in the body.
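
The response shape proposed in the description could look roughly like the
output of the sketch below. This is only an illustration under assumptions:
the class name, the method names and the exact markup are hypothetical and
not taken from the Tika code base, and the sketch builds the document in
memory purely to show its shape (a real handler would presumably stream it).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of the proposed response; not Tika code.
class PlainTextWithMetadataSketch {

    // Minimal escaping so metadata values and text stay well-formed XML.
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    // Metadata goes into <meta> elements in the head,
    // the extracted plain text goes into the body with no extra markup.
    static String render(Map<String, String> metadata, String plainText) {
        StringBuilder sb = new StringBuilder();
        sb.append("<html xmlns=\"http://www.w3.org/1999/xhtml\">\n<head>\n");
        for (Map.Entry<String, String> e : metadata.entrySet()) {
            sb.append("<meta name=\"").append(escape(e.getKey()))
              .append("\" content=\"").append(escape(e.getValue()))
              .append("\"/>\n");
        }
        sb.append("</head>\n<body>").append(escape(plainText)).append("</body>\n</html>\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("Content-Type", "application/pdf");
        System.out.print(render(metadata, "Extracted text, with a < b kept as text."));
    }
}
```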