Re: [Dspace-tech] searching, PDFs, HTML and XML

Brian Freels-Stendel Fri, 12 Dec 2008 12:26:15 -0800

filter-media should be able to filter these files, because the OCR text is kept
in a separate 'layer' of the PDF:  http://www.dclab.com/pdfconversion3.asp.


One caveat: there may be other ways to embed OCR information that I don't 
know about.

B--

>>> On 12/12/2008 at 1:11 PM, in message
<[email protected]>, "Thornton,
Susan M. (LARC-B702)[NCI INFORMATION SYSTEMS]" <[email protected]>
wrote:

>      Question:  If a .pdf document contains, let's say, 1 page in the
> middle of a document that contains an image (a drawing for instance), is
> filter-media going to fail on the filtering of this document or will it
> just skip the image and continue to filter what it can?  
> 
>      I have made some modifications to the DSpace 1.4.2 filter-media
> process so that if a document cannot be filtered for whatever reason
> (unreadable characters, java heap space error, etc), the bitstream_id
> for that document gets written to a local table.  Before
> MediaFilterManager.java even attempts to filter a document, it checks
> that local table to see if that bitstream_id exists in the table.  If it
> does, it will not even *attempt* to filter that document and instead
> increments a counter and a last-date-skipped column in the local table.
> A periodic report is sent to the users, who inspect each document to
> see whether it failed to OCR correctly, etc.  If appropriate, they will rescan
> the original document, delete the old document, and upload the new
> document into DSpace.  Since the new document has a new bitstream_id,
> filter-media will attempt to filter it that night and the process
> repeats.
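
A minimal sketch of the skip-table idea described above, not the actual DSpace 1.4.2 modification (which lives in MediaFilterManager.java and is not shown in this thread). It is written in Python with an in-process SQLite database purely for illustration. The table name filter_skip, its column names, and the extract_text callback are all hypothetical; only the behaviour (record a failing bitstream_id, skip it on later runs, bump a counter and a last-date-skipped column) comes from the description.

```python
import sqlite3
from datetime import date


def open_skip_table(conn):
    """Create the local skip table (hypothetical name and columns)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS filter_skip ("
        " bitstream_id INTEGER PRIMARY KEY,"
        " skip_count INTEGER NOT NULL DEFAULT 0,"
        " last_date_skipped TEXT)"
    )


def run_filter_pass(conn, bitstream_ids, extract_text, today=None):
    """Filter each bitstream unless it is already in the skip table.

    extract_text is a stand-in for the real text-extraction step and is
    expected to raise an exception when a document cannot be filtered.
    Returns the ids that were filtered successfully in this pass.
    """
    today = today or date.today().isoformat()
    filtered = []
    for bid in bitstream_ids:
        known_bad = conn.execute(
            "SELECT 1 FROM filter_skip WHERE bitstream_id = ?", (bid,)
        ).fetchone()
        if known_bad is not None:
            # Do not even attempt the filter; just record the skip.
            conn.execute(
                "UPDATE filter_skip"
                " SET skip_count = skip_count + 1, last_date_skipped = ?"
                " WHERE bitstream_id = ?",
                (today, bid),
            )
            continue
        try:
            extract_text(bid)
            filtered.append(bid)
        except Exception:
            # Any failure puts the id in the table so later runs skip it.
            conn.execute(
                "INSERT INTO filter_skip (bitstream_id) VALUES (?)", (bid,)
            )
    conn.commit()
    return filtered
```

Because a rescanned and re-uploaded document gets a new bitstream_id, it is absent from the table and is attempted again on the next run, which matches the workflow described above.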
> 
>      The best thing about this mod is that it can save hours of
> processing time, especially with documents where a previous filtering
> attempt has resulted in a Java heap space error.  Sometimes the
> filtering attempt will actually run for hours before it fails with the
> Java heap space error on a document.  Simply adding the bitstream_id for
> this document to our local table will eliminate a subsequent filtering
> attempt and filter-media runs and completes much faster.
> 
>      I would be happy to share this code with anyone who is interested.
> 
>      Please let me know if anyone can answer my question about filtering
> results with a .pdf document that contains 1 or more unfilterable
> images.
> 
> Thanks in advance,
> 
> Sue Walker-Thornton
> ConITS Contract
> NASA Langley Research Center
> Integrated Library Systems Application & Database Administrator
> 130 Research Drive
> Hampton, VA  23666
> Office: (757) 224-4074
> Fax:    (757) 224-4001
> Pager: (757) 988-2547 
> Email:  [email protected] 
> 
> 
> -----Original Message-----
> From: Shane Beers [mailto:[email protected]] 
> Sent: Friday, December 12, 2008 10:31 AM
> To: Andrew Marlow
> Cc: [email protected] 
> Subject: Re: [Dspace-tech] searching, PDFs, HTML and XML
> 
> Andrew:
> Performing OCR on a PDF document is, as far as I know, the most widely  
> used method of making a PDF searchable. Is there a specific reason you  
> do not want the PDFs to be searchable? Even the archival "standard" of  
> PDF/A (archival PDF) allows for OCR.
> 
> I use the commercial product ABBYY FineReader for a variety of  
> solutions. From their web site: "When you are converting documents for  
> editing, ABBYY FineReader 9.0 exports the results directly to your  
> favorite applications including Microsoft Word, Microsoft Excel,  
> Microsoft PowerPoint, and Adobe Acrobat/Reader. In addition,  
> recognized text can be saved in a variety of file formats, including  
> PDF, PDF/A, HTML, Microsoft Word XML, DOC/DOCX, RTF, XLS/ XLSX, PPT,  
> DBF, CSV, TXT, and LIT. "
> 
> It looks like this would be able to fit your needs. However, I would  
> be of the opinion that just performing OCR would be the most direct  
> and stable option.
> 
> Additionally, you can upload multiple bitstreams per item in DSpace.  
> The first page of the ingest process asks if the item contains  
> multiple files, and you would answer in the affirmative. You can also  
> edit an individual item's bitstreams as an admin after they are  
> already in the archive.
> 
> Shane Beers
> Digital Repository Services Librarian
> George Mason University
> [email protected] 
> http://mars.gmu.edu 
> 703-993-3742
> 
> 
> 
> On Dec 12, 2008, at 3:44 AM, Andrew Marlow wrote:
> 
>> Hello,
>>
>> Now that I have loaded a few PDFs into my DSpace repo, I am  
>> wondering how to enable full text searching. The PDFs happen to be  
>> in a form that means they cannot be searched directly. So when I  
>> search in DSpace I get no results returned (unless the text also  
>> appears in the abstract I entered manually). If I could find a way  
>> to convert the PDF to HTML this might do the trick but if it did, I  
>> think it would be working for the wrong reasons. According to my  
>> limited research, the proper way to enable full text search in  
>> digital libraries is to have the documents in XML form. This raises  
>> a few DSpace questions.
>>
>> I do not actually see anywhere in DSpace where I can upload an XML  
>> (assuming I find a way to generate one from the PDF).
>>
>> I suspect that DSpace expects to be able to perform full text  
>> searching using the HTML rather than using XML. This would work,  
>> kind of, but with XML I think it works a whole lot better due to the  
>> metadata in the XML. An XML approach would require some sort of  
>> schema. I do not know of any standards in this area.
>>
>> Have I got it right/wrong? Am I barking up the wrong tree? I think I  
>> might need a lesson from a seasoned DSpacer on how full text  
>> searching is done when the PDFs are not searchable. Googling I find  
>> that other digital libraries, e.g. those not based on DSpace, tend to  
>> approach the problem in their own way. For example, solutions based  
>> on Mark Logic are able to take advantage of a Mark Logic feature  
>> where it generates the XML from the PDF when the PDF is uploaded.
>>
>> -- 
>> Regards,
>>
>> Andrew M.
>>
>> ------------------------------------------------------------------------------
>> SF.Net email is Sponsored by MIX09, March 18-20, 2009 in Las Vegas, Nevada.
>> The future of the web can't happen without you.  Join us at MIX09 to help
>> pave the way to the Next Web now. Learn more and register at
>> http://ad.doubleclick.net/clk;208669438;13503038;i?http://2009.visitmix.com/
>> _______________________________________________
>> DSpace-tech mailing list
>> [email protected] 
>> https://lists.sourceforge.net/lists/listinfo/dspace-tech 
> 
> 

