On Fri, Dec 12, 2008 at 08:44:49AM +0000, Andrew Marlow wrote:
> Now that I have loaded a few PDFs into my DSpace repo, I am wondering how to
> enable full text searching. The PDFs happen to be in a form that means they
> cannot be searched directly. So when I search in DSpace I get no results

Do you mean they are bags of page images (that is, scans from paper
documents or film) rather than machine-readable text?  DSpace won't be
able to index those.  The indexer needs plain text to work with.

If your PDFs *do* contain machine-readable text, then the filter-media
tool in the [DSpace]/bin directory can extract plain text into an
alternate bundle, which the indexer will look for.  In fact,
bin/filter-media will by default invoke the indexer when it has
finished.  See Architecture | Application Layer | Media Filters (in
the 1.4.x documentation) for information on how to run the filter tool.
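
As a rough illustration of running the filter tool from a script, here is
a minimal Python sketch.  It assumes a DSpace install directory of /dspace
(a placeholder; substitute your own [DSpace] path) and calls
bin/filter-media with no options, since the available options are
described in the documentation rather than here.  The helper names are
made up for this example.

```python
import subprocess
from pathlib import Path

# Assumption: placeholder install directory; replace with your [DSpace] path.
DSPACE_DIR = Path("/dspace")


def filter_media_command(dspace_dir):
    """Build the argument list for [DSpace]/bin/filter-media, with no options."""
    return [str(Path(dspace_dir) / "bin" / "filter-media")]


def run_filter_media(dspace_dir=DSPACE_DIR):
    """Run the filter tool and return its exit code (0 normally means success)."""
    cmd = filter_media_command(dspace_dir)
    return subprocess.run(cmd, check=False).returncode


if __name__ == "__main__":
    # Exit with the same status as the filter tool.
    raise SystemExit(run_filter_media())
```

If the PDFs are only page images, the tool will have no text to extract,
so searches will still come back empty afterwards.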

Another poster has advised on ways to make scanned documents in PDFs
machine-readable.

-- 
Mark H. Wood, Lead System Programmer   mw...@iupui.edu
Friends don't let friends publish revisable-form documents.
