[
https://issues.apache.org/jira/browse/TIKA-775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13485665#comment-13485665
]
Jörg Ehrlich commented on TIKA-775:
-----------------------------------
Hi Ray,
I think it would be great if Tika could also write Metadata back to files and
it would be great to start on this sooner rather than later.
But I have a couple of comments regarding your proposed implementation:
1) Right now the Parsers do both content and metadata extraction. The proposed
embedder does only Metadata embedding, which is fine because updating of
content would be out of scope for Tika.
But if we introduce separate APIs to embed just metadata, I think it would make
sense to also introduce APIs to extract only metadata. In fact, at Adobe we had
to stop using Tika to retrieve Metadata from specific file formats because it
always parses the whole content, which is simply too heavy an operation to
scale in a larger system.
So I planned to get started on a new API and adjustments to parsers to just
retrieve Metadata from files, but have not had time for this yet. I guess it
would make sense to synchronize these two new APIs, right?
Being able to just parse Metadata from files is actually also very important
for the embedding of it, which I will explain further down.
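To make point 1 a bit more concrete, here is a rough sketch of what a
metadata-only API could look like. Everything in it is hypothetical and not
part of Tika: the MetadataExtractor interface, the extractMetadata method and
the toy file format ("key=value" header lines, a blank line, then the content)
exist only for illustration. The idea is simply that the extractor stops as
soon as the metadata block ends and does not parse the content.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical interface, not part of Tika: read metadata only, never the content.
interface MetadataExtractor {
    Map<String, String> extractMetadata(InputStream stream) throws IOException;
}

// Toy format assumed for illustration: "key=value" header lines, a blank line, then content.
class HeaderMetadataExtractor implements MetadataExtractor {
    @Override
    public Map<String, String> extractMetadata(InputStream stream) throws IOException {
        Map<String, String> metadata = new LinkedHashMap<>();
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(stream, StandardCharsets.UTF_8));
        String line;
        // Stop at the first blank line: the content after it is not parsed
        // (the reader may still have buffered some of it).
        while ((line = reader.readLine()) != null && !line.isEmpty()) {
            int eq = line.indexOf('=');
            if (eq > 0) {
                // An empty value after '=' is kept as a valid value.
                metadata.put(line.substring(0, eq), line.substring(eq + 1));
            }
        }
        return metadata;
    }
}
```

A real format would of course need format-specific logic to locate its metadata
containers, but the contract for the client would stay the same.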
2) Your documentation does not really specify in detail the behavior of the
metadata update that should happen.
Does it always update all metadata in the file, i.e. does it delete properties
that are not in the Metadata object? Or does it only update those properties
that are provided in the Metadata object? How do I delete properties then? Do I
make the property empty? But in most metadata containers an empty value is a
valid property value and should not delete the property.
Where does the embedding take place? A lot of file formats have several
metadata containers with similar properties. Does the embed method update all
of them? Or just the ones the parsers were looking at? What happens in case of
inconsistencies? Do you read/write from specific fields or do you reconcile all
of them together?
What happens for properties where the file format specific fields have a fixed
length or different encodings? Do you just write as much as possible and the
rest is simply ignored?
For all such questions, you have to think about whether it makes sense to
provide the client with the ability to either configure the embedder or provide
a callback API for the client to decide if specific scenarios arise or if the
embedder should always just do a best guess for the client.
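As one possible answer to the update and delete questions above, the sketch
below separates "set" from "remove" so that an empty value can never be
mistaken for a deletion, and it leaves properties that are not mentioned
untouched. MetadataUpdate and its methods are hypothetical names I made up for
this mail, and a plain Map stands in for whatever metadata model is used; this
is only one of several reasonable semantics.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical type, not part of Tika: an explicit description of what to change.
class MetadataUpdate {
    private final Map<String, String> toSet = new LinkedHashMap<>();
    private final Set<String> toRemove = new LinkedHashSet<>();

    // Set a property; an empty string is a legal value and does NOT mean "delete".
    MetadataUpdate set(String name, String value) {
        toRemove.remove(name);
        toSet.put(name, value);
        return this;
    }

    // Deletion is its own operation, so it cannot be confused with an empty value.
    MetadataUpdate remove(String name) {
        toSet.remove(name);
        toRemove.add(name);
        return this;
    }

    // Returns a new map; properties not mentioned in the update are left untouched.
    Map<String, String> applyTo(Map<String, String> existing) {
        Map<String, String> result = new LinkedHashMap<>(existing);
        result.keySet().removeAll(toRemove);
        result.putAll(toSet);
        return result;
    }
}
```

For example, new MetadataUpdate().set("title", "").remove("rights") applied to a
file with title, author and rights would keep author as it is, keep title with
an empty value, and drop rights.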
In any such case, it is usually important for the client to get the original
metadata from the file before writing it back, so that no properties are
wrongly deleted or changed. But it is even more important for the Embedder,
as it would in most cases have to read the metadata anyway, in order to know
how to update the file properly. It usually has to check if an in-place update
of metadata can happen or if the whole file has to be restructured because the
metadata chunks have grown too large to fit where they were before.
That's why I think it would be important to have a get-only-metadata API and
Parser capabilities available before we start writing metadata back.
3) This also leads me to the topic of error recovery and safe updating of
files. I think the documentation should be more clear about what the Embedder
will do in case of an error and what is expected by the client.
There are all sorts of reasons the embedding could fail. If that happens, the
original file usually ends up being corrupt and lost to the user. So it
usually makes sense (for smaller files) to do a safe update, which means
writing the update to a new file and then swapping it with the original one
after the update was successful.
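A minimal sketch of such a safe update, using only java.nio.file, could look
like this. SafeUpdate and the Rewriter callback are hypothetical names for
illustration; the actual embedding logic would live in the callback. The
temporary file is created next to the target so that the final move stays on
the same file system.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class SafeUpdate {
    // Hypothetical callback: reads the original and writes the updated file.
    interface Rewriter {
        void rewrite(InputStream original, OutputStream updated) throws IOException;
    }

    static void update(Path target, Rewriter rewriter) throws IOException {
        // Create the temp file in the target's directory (same file system).
        Path dir = target.toAbsolutePath().getParent();
        Path temp = Files.createTempFile(dir, "embed-", ".tmp");
        try {
            try (InputStream in = Files.newInputStream(target);
                 OutputStream out = Files.newOutputStream(temp)) {
                rewriter.rewrite(in, out);
            }
            // Swap only after the rewrite finished without an exception.
            Files.move(temp, target, StandardCopyOption.REPLACE_EXISTING);
        } finally {
            // No-op after a successful move; removes the partial file after a failure.
            Files.deleteIfExists(temp);
        }
    }
}
```

If the rewrite throws, the original file is never touched. Whether the swap
itself is atomic depends on the platform; StandardCopyOption.ATOMIC_MOVE can be
requested, but its behavior when the target already exists is implementation
specific, so I left it out here.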
But what about scenarios where a partial update is possible? You often have
files where just specific metadata sections are corrupt because some tool did
not read the spec and wrote it wrongly. But the rest of the file is still ok,
so other parts could still be updated. Do you want to provide a callback API
for the client to be able to react to error scenarios and decide what it wants
to do? The embedder could do a best guess action, but that is usually quite
dangerous for the user's files.
4) I take it that the expectation is that all parsers could also potentially
implement the Embedder interface, so that both reading and writing are handled
in one place? Otherwise you probably end up with all sorts of inconsistencies between
the two implementations regarding what metadata fields are read from where and
what should be updated when, etc.
5) Why do you pass in an InputStream? That would mean the Embedder has to open
up its own OutputStream to be able to write. That would imply that Tika knows
how to properly create OutputStreams in the client's environment. Wouldn't it
be better to leave the client in control here? And why do you want to return
the InputStream?
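What I have in mind is roughly the following shape, where the client hands in
both streams and stays responsible for opening and closing them. This is a
hypothetical signature, not the interface from the patch; StreamEmbedder,
HeaderEmbedder and the toy "key=value" header format are made up for
illustration only.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Map;

// Hypothetical signature, not the interface from the patch: the client supplies both streams.
interface StreamEmbedder {
    void embed(Map<String, String> metadata, InputStream original, OutputStream updated)
            throws IOException;
}

// Toy embedder for illustration: writes "key=value" header lines and a blank line,
// then copies the original bytes unchanged.
class HeaderEmbedder implements StreamEmbedder {
    @Override
    public void embed(Map<String, String> metadata, InputStream original, OutputStream updated)
            throws IOException {
        for (Map.Entry<String, String> e : metadata.entrySet()) {
            String line = e.getKey() + "=" + e.getValue() + "\n";
            updated.write(line.getBytes(StandardCharsets.UTF_8));
        }
        updated.write('\n');
        byte[] buffer = new byte[8192];
        int n;
        while ((n = original.read(buffer)) != -1) {
            updated.write(buffer, 0, n);
        }
        // The embedder neither opens nor closes the streams; the client owns them.
    }
}
```

With this shape the client decides where the output goes (a temp file, a
repository, memory), and nothing needs to be returned.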
6) I also agree with Jukka's comments that for such an important new feature we
should put some more thought into this. I think your proposal works ok for the
external embedder scenario but I am not so sure for other scenarios.
Sorry that I did not speak up earlier. This issue has been around for quite a
while.
Regards
Jörg
> Embed Capabilities
> ------------------
>
> Key: TIKA-775
> URL: https://issues.apache.org/jira/browse/TIKA-775
> Project: Tika
> Issue Type: Improvement
> Components: general, metadata
> Affects Versions: 1.0
> Environment: The default ExternalEmbedder requires that sed be
> installed.
> Reporter: Ray Gauss II
> Labels: embed, patch
> Fix For: 1.3
>
> Attachments: embed.diff, tika-core-embed-patch.txt,
> tika-parsers-embed-patch.txt
>
>
> This patch defines and implements the concept of embedding tika metadata into
> a file stream, the reverse of extraction.
> In the tika-core project an interface defining an Embedder and a generic sed
> ExternalEmbedder implementation meant to be extended or configured are added.
> These classes are essentially a reverse flow of the existing Parser and
> ExternalParser classes.
> In the tika-parsers project an ExternalEmbedderTest unit test is added which
> uses the default ExternalEmbedder (calls sed) to embed a value placed in
> Metadata.DESCRIPTION then verify the operation by parsing the resulting
> stream.