Hello,

I would like to find out whether it is possible to extract only the metadata from a document, without its content. We are using Tika to parse and index the content of files in a JCR repository (Jackrabbit; we are extending its indexing part) and would like to split the process in two: metadata extraction, whose results are indexed immediately, and extraction of the complete file content, whose indexing is postponed to a dedicated background task. Looking at the Parser implementations for different formats, I see that it is not always possible to extract metadata without parsing the whole document, but PDFParser, for example, is able to do it without parsing the content. I searched the mailing list for an answer, but have not found one so far.
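
To make the intended split more concrete, here is a rough sketch of the flow we have in mind. It uses only the JDK and does not call Tika at all: `TwoPhaseIndexer`, `metadataExtractor` and `contentExtractor` are hypothetical names, the two maps are stand-ins for the real index, and the two extractor functions are placeholders for whatever Tika-based extraction turns out to be possible.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Hypothetical two-phase indexer: metadata is indexed now, full text later. */
class TwoPhaseIndexer {
    // Stand-ins for the real index (assumption: the real one lives in Jackrabbit).
    final Map<String, Map<String, String>> metadataIndex = new ConcurrentHashMap<>();
    final Map<String, String> contentIndex = new ConcurrentHashMap<>();

    // Placeholders for the extraction steps; both must return non-null values.
    private final Function<byte[], Map<String, String>> metadataExtractor;
    private final Function<byte[], String> contentExtractor;

    // Single worker thread that handles the postponed content extraction.
    private final ExecutorService background = Executors.newSingleThreadExecutor();

    TwoPhaseIndexer(Function<byte[], Map<String, String>> metadataExtractor,
                    Function<byte[], String> contentExtractor) {
        this.metadataExtractor = metadataExtractor;
        this.contentExtractor = contentExtractor;
    }

    /** Indexes metadata in the calling thread and queues content extraction. */
    Future<?> index(String path, byte[] data) {
        // Phase 1: cheap metadata extraction, indexed immediately.
        metadataIndex.put(path, metadataExtractor.apply(data));

        // Phase 2: full content extraction, run later on the background thread.
        Runnable contentTask = () -> contentIndex.put(path, contentExtractor.apply(data));
        return background.submit(contentTask);
    }

    /** Stops accepting new work; already queued tasks still run. */
    void shutdown() {
        background.shutdown();
    }
}
```

The returned `Future` is only there so that a caller can wait for the background phase if it needs to. The open question is what to plug in as the metadata step so that it avoids a full parse of the document.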

Has anyone had similar requirements and been able to solve this (by extending each parser, creating a custom implementation of the content handler, etc.)?
I would appreciate any help, as I am new to Tika.

Kind regards
Sergiy Shyrkov