Greetings,
I'm interested in having a server-based personal document library with
a few specific features and I'm trying to determine what the most
appropriate tools are to build it.
I have the following content which I wish to include in the archive:
1. A smallish collection of technical books in PDF format (around 100)
2. Many years of several different magazine subscriptions in PDF format
(probably another 100 - 200 PDFs)
3. Several years of personal documents which were scanned in and
converted to searchable PDF format (300 - 500 documents)
4. I also have local mirrors of several HTML-based reference sites
I'd like to have the ability to index all of this content and search it
from a web form (so that I and a few others can reach it from multiple
locations). Here are two examples of the functionality I'm looking for:
Scenario 1. "What was that software that has all the nutritional data
and hooks up to some USDA database? I know I read about it in one of my
Linux Journals last year....."
Now I'd like to be able to pull up the web form and search for "nutrition
USDA". I'd like to restrict the search to the Linux Journal magazine
PDFs (or refine the results). I'd like each search result to include a
context snippet. Finally, and most importantly, I'd like multiple
results per PDF (or all occurrences). That last point matters because it
lets me quickly find the right issue, even if an advertisement containing
those terms ran in every issue for the last year. When I click on the
desired result, the PDF is downloaded by my browser.
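To make the result shape I'm after more concrete, here is a rough Python
sketch of "all occurrences, each with context". It assumes the text of
each PDF page has already been extracted into a list of strings; the
function name find_occurrences and the context width are made up for
illustration and are not from any existing tool.

```python
import re

def find_occurrences(pages, terms, context=30):
    """Return every occurrence of any term as (page_number, matched_term, snippet).

    pages: list of page texts, assumed to be extracted from the PDF beforehand.
    terms: list of search words, matched case-insensitively.
    context: number of characters to keep on each side of a match.
    """
    pattern = re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)
    hits = []
    for page_no, text in enumerate(pages, start=1):
        for m in pattern.finditer(text):
            start = max(0, m.start() - context)
            end = min(len(text), m.end() + context)
            # Collapse whitespace so the snippet fits on one line.
            snippet = " ".join(text[start:end].split())
            hits.append((page_no, m.group(0), snippet))
    return hits
```

With one hit per occurrence rather than one per file, an advertisement
and the actual article show up as separate results with different
snippets, which is what would let me tell them apart.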
Scenario 2. "How much have I been paying for property taxes for the last
five years again?" (the bills are all scanned in)
In this case I'd like to search for my property identification number
(which is on the bills) and the results should show all the documents
that have it, with context. Clicking on results downloads the documents.
I assume this scenario is simple to achieve if scenario 1 can be done.
So in general, my question is: can this be done in a fairly
straightforward manner with Solr? Is there a more appropriate tool to be
using (e.g. Nutch)? Also, I have looked high and low for a free,
ready-made solution which can do scenario 1 but haven't been able to
find anything, so if someone knows of such a thing, please let me know.
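In case it helps frame the question, this is roughly the kind of request
I imagine the web form sending if Solr is the right tool. It is only a
sketch: the core name "library" and the field names "text" and
"collection" are hypothetical and would depend on the schema, and I'm
assuming that a filter query (fq) can restrict results to one magazine
and that the highlighting parameters (hl, hl.fl, hl.snippets) can return
several context snippets per document.

```python
from urllib.parse import urlencode

def build_search_url(base_url, query, collection, snippets=5):
    """Build a Solr select URL with a collection filter and highlighting.

    The field names "text" and "collection" are assumptions about the
    schema, not something Solr provides out of the box.
    """
    params = [
        ("q", query),                          # the user's search terms
        ("fq", f'collection:"{collection}"'),  # restrict to one magazine
        ("hl", "true"),                        # turn on highlighting
        ("hl.fl", "text"),                     # field to take snippets from
        ("hl.snippets", str(snippets)),        # max snippets per document
        ("wt", "json"),                        # ask for a JSON response
    ]
    return base_url + "?" + urlencode(params)

# Hypothetical local Solr instance and core name.
url = build_search_url(
    "http://localhost:8983/solr/library/select",
    "nutrition USDA",
    "Linux Journal",
)
```

What I can't tell from this is whether the snippets can be tied back to
page numbers, or whether each page would need to be indexed as its own
document to get that.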
Thanks!
-Matt