Hello,

Now that I have loaded a few PDFs into my DSpace repo, I am wondering how to
enable full text searching. The PDFs happen to be in a form that means they
cannot be searched directly. So when I search in DSpace I get no results
returned (unless the text also appears in the abstract I entered manually).
If I could find a way to convert the PDF to HTML this might do the trick, but
if it did, I think it would be working for the wrong reasons. According to my
limited research, the proper way to enable full text search in digital
libraries is to have the documents in XML form. This raises a few DSpace
questions.

I do not actually see anywhere in DSpace where I can upload an XML file
(assuming I find a way to generate one from the PDF).

I suspect that DSpace expects to perform full text searching using HTML
rather than XML. This would work, kind of, but I think XML would work a
whole lot better because of the metadata it carries. An XML
approach would require some sort of schema. I do not know of any standards
in this area.

Have I got it right/wrong? Am I barking up the wrong tree? I think I might
need a lesson from a seasoned DSpacer on how full text searching is done
when the PDFs are not searchable. Googling, I find that other digital
libraries, e.g. those not based on DSpace, tend to approach the problem in
their own way. For example, solutions based on Mark Logic are able to take
advantage of a Mark Logic feature where it generates the XML from the PDF
when the PDF is uploaded.

-- 
Regards,

Andrew M.
_______________________________________________
DSpace-tech mailing list
DSpace-tech@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspace-tech
