Hello,
I would like to find out whether it is possible to extract only the
metadata from a document (without the content).
We are using Tika to parse and index the content of files in a JCR
repository (Jackrabbit; we are extending its indexing part) and would
like to split the process into two steps: extracting the metadata
(which will be indexed immediately) and extracting the complete file
content (whose indexing will be deferred to a dedicated background
task).
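
To make the intended split concrete, here is a minimal sketch in plain
Java. All names in it (TwoPhaseIndexer, metadataExtractor,
contentExtractor, indexLog) are hypothetical placeholders and not Tika
or Jackrabbit APIs. The assumption is that the two extractor functions
would wrap the actual parser calls, and that the list would be replaced
by the real index.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

// Hypothetical sketch: none of these names are Tika or Jackrabbit APIs.
class TwoPhaseIndexer {
    // Phase 1: cheap metadata extraction, run on the calling thread.
    private final Function<byte[], Map<String, String>> metadataExtractor;
    // Phase 2: full content extraction, run later in the background.
    private final Function<byte[], String> contentExtractor;

    // Single daemon thread standing in for the dedicated background task.
    private final ExecutorService background =
            Executors.newSingleThreadExecutor(r -> {
                Thread t = new Thread(r, "content-indexer");
                t.setDaemon(true);
                return t;
            });

    // Stand-in for the real index: one entry per indexing step.
    final List<String> indexLog =
            Collections.synchronizedList(new ArrayList<>());

    TwoPhaseIndexer(Function<byte[], Map<String, String>> metadataExtractor,
                    Function<byte[], String> contentExtractor) {
        this.metadataExtractor = metadataExtractor;
        this.contentExtractor = contentExtractor;
    }

    void index(String path, byte[] data) {
        // Metadata is extracted and indexed immediately.
        Map<String, String> metadata = metadataExtractor.apply(data);
        indexLog.add("metadata:" + path + ":" + metadata);
        // Content extraction and indexing are deferred to the background.
        background.execute(() -> {
            String text = contentExtractor.apply(data);
            indexLog.add("content:" + path + ":" + text);
        });
    }

    // Stops accepting work and waits for queued content tasks to finish.
    boolean shutdownAndWait(long seconds) {
        background.shutdown();
        try {
            return background.awaitTermination(seconds, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        // Toy extractors: byte count as "metadata", UTF-8 text as "content".
        TwoPhaseIndexer indexer = new TwoPhaseIndexer(
                data -> Collections.singletonMap(
                        "size", String.valueOf(data.length)),
                data -> new String(data, StandardCharsets.UTF_8));
        indexer.index("/docs/a.txt",
                "hello".getBytes(StandardCharsets.UTF_8));
        indexer.shutdownAndWait(5);
        indexer.indexLog.forEach(System.out::println);
    }
}
```

The open question is what to plug in as the metadata extractor so that
it does not have to parse the complete document.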
Looking at the Parser implementations for the different formats, I see
that it is not always possible to extract metadata without completely
parsing the document, although PDFParser, for example, is able to do
it without parsing the content.
I tried to find an answer on the mailing list, but have not succeeded
so far.
Has anyone had similar requirements and been able to solve this (by
extending each parser, writing a custom implementation of the content
handler, etc.)?
I would appreciate any help, as I am new to Tika.
Kind regards,
Sergiy Shyrkov