Pattern for extracting text from a rich document and an associated metadata file

Yavar Husain Wed, 04 Mar 2015 02:07:25 -0800

What is the best pattern to index the following kind of data:

HarryPotter.PDF
HarryPotter.txt


Avengers.Docx
Avengers.txt

For each of the above files, the metadata lies in a text file that has the
same name as the rich document (as can be seen above).

(1) The brute-force method I can think of is to extract the text from the
rich document, extract the metadata from the associated txt file, combine
them into one XML document and send that to Solr for indexing.
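
To make (1) concrete, here is a rough sketch in Python using only the
standard library. It pairs each rich document with the txt file of the same
base name and builds a Solr XML <add> message. Several things are
assumptions for illustration only: the metadata file is taken to hold
"key: value" lines, the field names "id" and "content" are placeholders for
whatever the schema defines, and the body text is assumed to have been
extracted already (for example by Tika) before build_add_xml is called.

```python
from pathlib import Path
import xml.etree.ElementTree as ET

# Assumption: only these extensions count as rich documents.
RICH_EXTENSIONS = {".pdf", ".docx"}


def pair_files(directory):
    """Yield (rich_document, metadata_txt) pairs that share a base name."""
    for path in sorted(Path(directory).iterdir()):
        if path.suffix.lower() in RICH_EXTENSIONS:
            meta = path.with_suffix(".txt")
            if meta.exists():  # skip rich documents without a metadata file
                yield path, meta


def parse_metadata(text):
    """Parse 'key: value' lines into a dict (assumed metadata format)."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip():
            fields[key.strip()] = value.strip()
    return fields


def build_add_xml(doc_id, body_text, metadata):
    """Build a Solr XML <add><doc> message from the text plus metadata."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    # "id" and "content" are hypothetical field names; use the schema's own.
    for name, value in [("id", doc_id), ("content", body_text), *metadata.items()]:
        field = ET.SubElement(doc, "field", name=name)
        field.text = value
    return ET.tostring(add, encoding="unicode")
```

The resulting string would then be posted to the core's /update handler.
ElementTree takes care of escaping characters such as & and < in the
extracted text, which is easy to get wrong when building the XML by hand.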

(2) Another option I can think of is to use SolrJ: programmatically read
the PDF and the txt file and send both to Solr. In that case, is it possible
to send the PDF directly to Solr without having to extract the text first in
my SolrJ program?
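
For (2), one possibility is to let Solr's extracting request handler run
Tika on the server side and pass the txt metadata along as literal.<field>
request parameters. The sketch below (Python, standard library only) shows
the shape of such a request. It assumes the handler is registered at
/update/extract in solrconfig.xml, that the core URL and field names are
placeholders, and that the metadata has already been parsed into a dict.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_extract_url(solr_core_url, doc_id, metadata, commit=False):
    """Build an /update/extract URL carrying metadata as literal.* params."""
    params = [("literal.id", doc_id)]
    # Each metadata entry becomes literal.<field>=<value>; the field names
    # are hypothetical and must exist in the schema.
    params += [(f"literal.{name}", value) for name, value in metadata.items()]
    if commit:
        params.append(("commit", "true"))
    return f"{solr_core_url.rstrip('/')}/update/extract?{urlencode(params)}"


def post_rich_document(url, path, content_type="application/octet-stream"):
    """POST the raw file bytes; Solr extracts the text on its side."""
    with open(path, "rb") as handle:
        request = Request(
            url,
            data=handle.read(),
            headers={"Content-Type": content_type},
            method="POST",
        )
    with urlopen(request) as response:
        return response.status
```

For example, build_extract_url("http://localhost:8983/solr/collection1",
"HarryPotter", {"author": "J. K. Rowling"}) gives a URL to which the PDF
bytes can be posted unchanged. In SolrJ the analogous request would
presumably go through ContentStreamUpdateRequest with the same literal.*
parameters, but I have not verified that for my setup.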

Is there something better that I can do quickly? I know that if I only had
the rich documents, I would have used the Tika-Solr integration (the
extracting request handlers) to do the job.

Any help would be appreciated.

Thanks,
Yavar
