[ 
https://issues.apache.org/jira/browse/TIKA-3703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17677326#comment-17677326
 ] 

Nick Burch commented on TIKA-3703:
----------------------------------

A zip file gives you compression, and most clients won't accidentally try to 
buffer it in memory. JSON with base64-encoded data is negative compression 
(base64 inflates the bytes by roughly a third), and it carries a high risk of 
clients OOM-ing from trying to hold both the raw JSON and the parsed JSON in 
memory at once.

(If it was just thumbnails then I could see some advantages of JSON, but the 
endpoint also works on container formats with potentially huge contents.)
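
To put rough numbers on the size argument, here is a minimal sketch that 
compares the two encodings for the same payload. The function names and the 
{{embedded}} key are made up for illustration and are not part of Tika or of 
any proposed output format.

```python
import base64
import io
import json
import zipfile


def json_base64_size(payload: bytes) -> int:
    """Bytes needed for a JSON document carrying payload as a base64 string."""
    encoded = base64.b64encode(payload).decode("ascii")
    # "embedded" is a hypothetical key, chosen only for this sketch.
    return len(json.dumps({"embedded": encoded}).encode("utf-8"))


def zip_size(payload: bytes) -> int:
    """Bytes needed for a zip archive holding payload as a single entry."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        # "embedded.bin" is likewise a hypothetical entry name.
        archive.writestr("embedded.bin", payload)
    return len(buffer.getvalue())


if __name__ == "__main__":
    sample = b"some fairly repetitive embedded content " * 1000
    print("raw bytes:    ", len(sample))
    print("JSON + base64:", json_base64_size(sample))
    print("zip (deflate):", zip_size(sample))
```

The JSON form is always larger than the raw bytes. The zip form is only 
smaller when the payload is compressible, so already-compressed images would 
see little gain. The sketch also holds everything in memory, which is exactly 
what a real client would want to avoid for large containers.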

In terms of recursion, I think it should be off on the default endpoint (as 
now), but with another endpoint that supports it. Maybe e.g. {{/unpack}} and 
{{/unpack/recursive}}?
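
As a rough illustration of what the two endpoints would differ in, the sketch 
below lists the files embedded in a zip container either one level deep or 
recursively. It is not how tika-server implements unpacking; 
{{list_embedded}} and its {{recursive}} flag are hypothetical, and only 
nested zip files are treated as containers here.

```python
import io
import zipfile


def list_embedded(data: bytes, recursive: bool = False, prefix: str = "") -> list[str]:
    """Return the paths of files embedded in a zip container.

    With recursive=False only the top-level entries are listed. With
    recursive=True, entries that are themselves zip files are opened and
    their contents are listed as well, prefixed with the parent's path.
    """
    paths = []
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            path = prefix + info.filename
            paths.append(path)
            if recursive:
                # Simplification: the child is read fully into memory.
                child = archive.read(info)
                if zipfile.is_zipfile(io.BytesIO(child)):
                    paths.extend(list_embedded(child, True, path + "/"))
    return paths
```

Keeping recursion behind a separate flag (or a separate endpoint) means the 
default call never descends into nested containers, so its cost stays bounded 
by the top-level entries.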

> Consider adding a frictionless data package output format
> ---------------------------------------------------------
>
>                 Key: TIKA-3703
>                 URL: https://issues.apache.org/jira/browse/TIKA-3703
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>
> For those who want more than just text and metadata, e.g. bytes for 
> thumbnails, or embedded images or embedded files or rendered pages, it would 
> be great to return that data in a standard format. Our current /unpack 
> endpoint uses a zip file but with our own "standard".
> I was thinking about heading down the pure json option by including these 
> byte streams as base64 encoded metadata values in our current metadata 
> object. Not sure which is the better way to go.
> I'm opening this issue to discuss options.
>  
> Reference: [https://frictionlessdata.io/standards/#standards-toolkit]
> We'd want to make this available as an endpoint on tika-server 
> ({{/v2/unpack}} or something else?) and as a commandline option in tika-app.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
