[
https://issues.apache.org/jira/browse/TIKA-775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13485749#comment-13485749
]
Ray Gauss II commented on TIKA-775:
-----------------------------------
Hi Jörg,
Note that the embed.diff file attached to the issue is more current and
replaces the previous patch.txt files. I've also changed just a few things
since posting embed.diff, primarily around error handling. I'll post another
diff soon with Javadoc additions mentioned below.
1) I'm not sure exactly what you mean here. The Parser interface only
guarantees a parse method and supported types. It says nothing about requiring
the entire content to be extracted by the implementation. The Parser interface
also makes no specification about how the given input stream must be read or
processed, so each implementation can do that however it sees fit. Similarly,
the Embedder.embed method says nothing about requiring or preventing content
from being updated, so if a particular embedder implementation wants to update
the content itself I suppose there's no reason it couldn't.
2) This is intentionally somewhat vague (but perhaps too much so) as each
embedder may implement this slightly differently, though we should have a
suggested approach, and in general I think that approach should favor
preserving the source file's metadata unless explicitly specified. I will add
some of this to the Javadoc but for your specific questions I think the answers
would be:
- Q: Does it always update all metadata in the file, i.e. does it delete
properties that are not in the Metadata object?
- A: Embedder implementations should only attempt to update metadata fields
present in the given Metadata object and should leave all other properties in
the file untouched.
- Q: How are empty properties set?
- A: Embedder implementations should set properties as empty when the
corresponding field in the Metadata object is an empty string, i.e. ""
- Q: How do I delete properties?
- A: Embedder implementations should nullify or delete properties corresponding
to fields with a null value in the given Metadata object.
- Q: Where does the embedding take place?
- A: That's up to the embedder implementation and particular file format.
- Q: Does the embed method update properties in all metadata containers?
- A: Embedder implementations should set the property corresponding to a
particular field in the given Metadata object in all metadata containers
whenever possible and appropriate for the file format at the time. If a
particular metadata container falls out of use and/or is superseded by another
(such as IIM vs. XMP for IPTC) it is up to the implementation to decide if and
when to cease embedding in the alternate container.
- Q: What happens for properties where the file format specific fields have a
fixed length or different encodings?
- A: Embedder implementations should attempt to embed as much of the metadata
as accurately as possible. An implementation may choose a strict approach and
throw an exception if a value to be embedded exceeds the length allowed or may
choose to truncate the value.
For that last one, we could consider adding a second embed method to Embedder
that also accepts a boolean isStrict parameter. That would allow a single
implementation to operate in a mode where it throws exceptions on bad data
instead of doing something like truncating. Implementations could always
implement that themselves, so I'm not sure we need it in the interface.
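To make the suggested rules above concrete, here is a minimal sketch and not
the code in embed.diff. It assumes plain maps standing in for both the file's
properties and the Metadata object, a single hypothetical maxLength modelling a
fixed-length field, and the hypothetical isStrict flag discussed above. The
class and method names are invented for illustration only.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: maps stand in for the file's metadata container and for
// the Metadata object. None of these names come from the patch.
class EmbedSemanticsSketch {

    /**
     * Returns a copy of the existing properties with the updates applied:
     * absent key leaves the property untouched, "" sets it empty,
     * null deletes it, and an over-long value is truncated or rejected.
     */
    static Map<String, String> apply(Map<String, String> existing,
                                     Map<String, String> updates,
                                     int maxLength,
                                     boolean isStrict) {
        Map<String, String> result = new LinkedHashMap<>(existing);
        for (Map.Entry<String, String> entry : updates.entrySet()) {
            String key = entry.getKey();
            String value = entry.getValue();
            if (value == null) {
                // A null value means "delete this property".
                result.remove(key);
            } else if (value.length() > maxLength) {
                if (isStrict) {
                    throw new IllegalArgumentException(
                            "Value for '" + key + "' exceeds " + maxLength + " characters");
                }
                // Lenient mode: keep as much of the value as fits.
                result.put(key, value.substring(0, maxLength));
            } else {
                // Covers normal values and the empty string.
                result.put(key, value);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> existing = new HashMap<>();
        existing.put("title", "Old title");
        existing.put("keywords", "tika");
        Map<String, String> updates = new HashMap<>();
        updates.put("title", null); // delete
        System.out.println(apply(existing, updates, 32, true));
    }
}
```

Whether a real Metadata object can distinguish a null value from an absent
field is a separate question that this sketch sidesteps by using a map.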
3 and 5) The client is in control of the output stream as the client is
responsible for creating it and passing it to the embed method. The Embedder
needs the given input stream to read the source data and writes the final data
with metadata embedded to the given output stream. As such, consumers of the
embed method dictate what that output stream is, which will probably be a
temp file in most cases, and the client can refrain from writing to the
actual source file if it receives an exception. See the
ExternalEmbedderTest for an example of creating a temp file output stream for
the embedder to write to.
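The client-side pattern described above could look roughly like the following.
This is a sketch under stated assumptions rather than the code in
ExternalEmbedderTest: StreamEmbedder is a hypothetical, simplified stand-in for
the embed method (the Metadata argument is left out), and embedInPlace is an
invented helper name.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class SafeEmbedSketch {

    // Hypothetical stand-in for an embedder: reads the source stream and
    // writes the data with metadata embedded to the output stream.
    interface StreamEmbedder {
        void embed(InputStream in, OutputStream out) throws IOException;
    }

    /**
     * Embeds into a temp file first and replaces the source only when the
     * embedder finished without throwing.
     */
    static void embedInPlace(Path source, StreamEmbedder embedder) throws IOException {
        Path temp = Files.createTempFile("embed-", ".tmp");
        try {
            try (InputStream in = Files.newInputStream(source);
                 OutputStream out = Files.newOutputStream(temp)) {
                embedder.embed(in, out);
            }
            // Only reached on success; the source is untouched otherwise.
            Files.move(temp, source, StandardCopyOption.REPLACE_EXISTING);
        } finally {
            // Cleans up after a failure; does nothing after a successful move.
            Files.deleteIfExists(temp);
        }
    }
}
```

If the embedder throws, the exception propagates to the caller and the source
file keeps its original bytes.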
4) Yes, parser implementations could choose to implement the Embedder interface
as well. That was the reason for naming getSupportedEmbedTypes differently
than Parser's existing getSupportedTypes method.
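As a rough illustration of why the distinct method name matters, the sketch
below uses two hypothetical, heavily simplified interfaces (SimpleParser and
SimpleEmbedder, with media types as plain strings) rather than the real Tika
types. A single class can then report different type sets for parsing and for
embedding without the two methods colliding.

```java
import java.util.Set;

class DualRoleSketch {

    // Hypothetical, simplified stand-ins for the two interfaces.
    interface SimpleParser {
        Set<String> getSupportedTypes();
    }

    interface SimpleEmbedder {
        Set<String> getSupportedEmbedTypes();
    }

    // One class playing both roles: it can read more formats than it can
    // write metadata into.
    static class ImageHandler implements SimpleParser, SimpleEmbedder {
        @Override
        public Set<String> getSupportedTypes() {
            return Set.of("image/jpeg", "image/png", "image/gif");
        }

        @Override
        public Set<String> getSupportedEmbedTypes() {
            return Set.of("image/jpeg");
        }
    }
}
```

Had both interfaces declared a method named getSupportedTypes with the same
signature, the class could only return one set for both purposes.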
If the above doesn't answer your concerns I'm more than happy to flesh things
out further.
Regards,
Ray
> Embed Capabilities
> ------------------
>
> Key: TIKA-775
> URL: https://issues.apache.org/jira/browse/TIKA-775
> Project: Tika
> Issue Type: Improvement
> Components: general, metadata
> Affects Versions: 1.0
> Environment: The default ExternalEmbedder requires that sed be
> installed.
> Reporter: Ray Gauss II
> Labels: embed, patch
> Fix For: 1.3
>
> Attachments: embed.diff, tika-core-embed-patch.txt,
> tika-parsers-embed-patch.txt
>
>
> This patch defines and implements the concept of embedding tika metadata into
> a file stream, the reverse of extraction.
> In the tika-core project an interface defining an Embedder and a generic sed
> ExternalEmbedder implementation meant to be extended or configured are added.
> These classes are essentially a reverse flow of the existing Parser and
> ExternalParser classes.
> In the tika-parsers project an ExternalEmbedderTest unit test is added which
> uses the default ExternalEmbedder (calls sed) to embed a value placed in
> Metadata.DESCRIPTION then verify the operation by parsing the resulting
> stream.