[
https://issues.apache.org/jira/browse/TIKA-3703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17677326#comment-17677326
]
Nick Burch commented on TIKA-3703:
----------------------------------
A zip file gives you compression, and most clients won't accidentally try to
buffer it in memory. JSON with base64-encoded data is negative compression
(base64 alone inflates the bytes by about a third), and it carries a high risk
of clients OOM-ing from trying to fit both the raw JSON and the parsed JSON in
memory at once.
(If it were just thumbnails then I could see some advantages to JSON, but the
unpack endpoint also works on container formats with potentially huge contents.)
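To put rough numbers on the size argument, here is a minimal sketch that packs the same bytes both ways and compares the results. The function name, the JSON key {{embedded}}, and the repetitive sample payload are assumptions made for illustration; none of this is Tika's actual output format.

```python
import base64
import io
import json
import zipfile


def compare_sizes(payload: bytes) -> dict:
    """Return the byte size of payload as-is, as base64 in JSON, and in a zip."""
    # JSON route: the bytes must be base64-encoded before they can be a string value.
    encoded = base64.b64encode(payload).decode("ascii")
    json_doc = json.dumps({"embedded": encoded})  # "embedded" is a hypothetical key

    # Zip route: a single deflate-compressed entry written to an in-memory archive.
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("embedded.bin", payload)

    return {
        "raw": len(payload),
        "json_base64": len(json_doc.encode("utf-8")),
        "zip": len(buf.getvalue()),
    }


# Highly repetitive sample data (an assumption); real embedded files vary.
sizes = compare_sizes(b"tika " * 200_000)
print(sizes)
```

The base64 figure is always about a third larger than the raw bytes, whatever the content. The zip figure depends on the payload: already-compressed content such as JPEG images will shrink far less than this repetitive sample.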
In terms of recursion, I think it should be off on the default endpoint (as it
is now), with a second endpoint that supports it. Maybe e.g. {{/unpack}} and
{{/unpack/recursive}}?
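As an illustration of the difference between the two behaviours, the sketch below lists the entries of a zip container either one level deep or recursively. It is a toy model built on Python's standard zipfile module, not how tika-server implements unpacking; the function names and the path-joining convention are hypothetical.

```python
import io
import zipfile


def list_embedded(data: bytes, recursive: bool = False, prefix: str = "") -> list[str]:
    """List entry names in a zip; descend into nested zips only if recursive."""
    names = []
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for info in zf.infolist():
            path = prefix + info.filename
            names.append(path)
            if recursive:
                # Reads the whole child into memory, which is fine for a sketch
                # but is exactly the cost a real server would want to bound.
                child = zf.read(info)
                if zipfile.is_zipfile(io.BytesIO(child)):
                    names.extend(list_embedded(child, True, path + "/"))
    return names


def make_zip(entries: dict[str, bytes]) -> bytes:
    """Build an in-memory zip from a name -> content mapping (test helper)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, content in entries.items():
            zf.writestr(name, content)
    return buf.getvalue()


inner = make_zip({"note.txt": b"hello"})
outer = make_zip({"inner.zip": inner, "readme.txt": b"top level"})

shallow = list_embedded(outer)               # default, non-recursive behaviour
deep = list_embedded(outer, recursive=True)  # opt-in recursive behaviour
print(shallow)
print(deep)
```

Keeping recursion behind a separate opt-in means the default call does a bounded amount of work, while callers who want nested contents ask for them explicitly.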
> Consider adding a frictionless data package output format
> ---------------------------------------------------------
>
> Key: TIKA-3703
> URL: https://issues.apache.org/jira/browse/TIKA-3703
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
>
> For those who want more than just text and metadata, e.g. bytes for
> thumbnails, or embedded images or embedded files or rendered pages, it would
> be great to return that data in a standard format. Our current /unpack
> endpoint uses a zip file but with our own "standard".
> I was thinking about heading down the pure json option by including these
> byte streams as base64 encoded metadata values in our current metadata
> object. Not sure which is the better way to go.
> I'm opening this issue to discuss options.
>
> Reference: [https://frictionlessdata.io/standards/#standards-toolkit]
> We'd want to make this available as an endpoint on tika-server
> ({{/v2/unpack}} or something else?) and as a command-line option in tika-app.