Help storing + highlighting search results in PDF newspapers

Colin 't Hart Fri, 11 Sep 2015 08:48:43 -0700

Hi,

I'm having trouble negotiating the steep Solr learning curve...


1. I'm trying to store scanned and OCRed newspapers in PDF format into Solr
for full-text searching.
I've tried most (all?) of the examples and sample configurations that come
with Solr 5.3.0 and I can upload the PDFs.
Searching works, but for the life of me I can't get highlights in the
results.

I tried setting the "stored" attribute of the "_text_" and/or "content"
fields to "true", but that didn't help. It only increased the size of the
query response: lots of PDF data appeared in the response instead of just
the text, and the "highlighting" section of the response was still
virtually empty (it lists the matching documents, but no highlighted text
fragments).
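
For concreteness, here is a rough sketch of the kind of highlighting
request involved. It is not necessarily the exact query used above. The
core name "newspapers", the host, and the helper name are assumptions made
up for illustration; "hl", "hl.fl", "hl.snippets" and "hl.fragsize" are
standard Solr highlighting parameters. The script only builds the URL and
does not contact a server.

```python
from urllib.parse import urlencode


def build_highlight_query(term, field="content", core="newspapers",
                          host="http://localhost:8983"):
    """Build a Solr select URL that asks for highlighted fragments.

    The core name and host are hypothetical defaults; adjust to your setup.
    """
    params = {
        "q": f"{field}:{term}",   # search the extracted text field
        "hl": "true",             # switch highlighting on
        "hl.fl": field,           # field to take the fragments from
        "hl.snippets": 3,         # at most three fragments per document
        "hl.fragsize": 150,       # approximate fragment length in characters
        "wt": "json",
    }
    return f"{host}/solr/{core}/select?{urlencode(params)}"


url = build_highlight_query("flicka")
print(url)
```

Highlighting needs the field named in "hl.fl" to be stored, so an empty
"highlighting" section usually means the text being searched and the text
being stored are not in the same field.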


Can someone point me in the direction of a sample config that will work?


2. After that's working I'd like to trim this down to a minimal schema with
just

* title
* date
* volume
* number
* URL (the PDFs themselves will be made available online for viewing using
the same viewer.js that's embedded in Firefox)

as metadata (as well as the required metadata such as id and _version_).

I want to extract these metadata fields from the filenames -- I presume
that's also possible? Can someone point me to how I would go about doing
this too?
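
To illustrate the filename parsing in question, here is a minimal sketch
done on the client side before uploading. The naming scheme
(Title_YYYY-MM-DD_volN_nrN.pdf), the base URL, the example filename, and
the function name are all assumptions invented for this example; the real
files may be named differently.

```python
import re
from datetime import date

# Hypothetical naming scheme: Title-Words_YYYY-MM-DD_vol<N>_nr<N>.pdf
FILENAME_RE = re.compile(
    r"^(?P<title>.+)_(?P<date>\d{4}-\d{2}-\d{2})"
    r"_vol(?P<volume>\d+)_nr(?P<number>\d+)\.pdf$"
)


def metadata_from_filename(filename, base_url="http://example.org/pdf/"):
    """Split a filename into the metadata fields listed above."""
    match = FILENAME_RE.match(filename)
    if match is None:
        raise ValueError(f"unexpected filename format: {filename}")
    # Validate the date; raises ValueError for e.g. month 13.
    issue_date = date.fromisoformat(match.group("date"))
    return {
        "id": filename[: -len(".pdf")],
        "title": match.group("title").replace("-", " "),
        # Solr date fields expect ISO 8601 in UTC with a trailing Z.
        "date": issue_date.isoformat() + "T00:00:00Z",
        "volume": int(match.group("volume")),
        "number": int(match.group("number")),
        "url": base_url + filename,
    }


meta = metadata_from_filename("Svenska-Posten_1923-04-17_vol12_nr34.pdf")
print(meta)
```

The resulting values could then be sent along with each PDF, for instance
as "literal.<fieldname>" parameters to the extracting request handler.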


3. The newspapers are in Swedish. I've found the Swedish stopwords list;
are there any other dictionaries etc. available to assist with queries
where words take different forms, for example plurals such as "flicka"
(girl) and "flickor" (girls)?



Many thanks in advance!

Regards,

Colin
