Re: [fw-general] Zend_Lucene_Search for PDFs

Markus Wolff Thu, 01 Nov 2007 09:37:35 -0800

Alexander Veremyev wrote:

A text extraction feature has been planned since the first versions of Zend_Pdf and was estimated as “easy to implement”, but it has not been done so far. The problem lies in some special cases that increase implementation complexity: compressed or encrypted text streams and some encoding issues.
I am not sure which is preferable: to have an implementation that doesn’t work correctly in all cases, or not to have it at all (given the existing PDF-to-text conversion solutions).

Personally, I'd opt for doing the text processing before giving it to Lucene for indexing. Having worked at a media monitoring company in the past, I've had to extract text from PDFs quite often. There are excellent tools readily available for this task, but not all of them can cope with all types of PDF files. There are some eBook formats, for example, that have DRM encryption, or think of password-protected PDFs. I have often seen PDFs that contained only images scanned from print brochures or newspapers - you can't possibly extract text from those, you'd need to extract the images and send them to OCR software.

Also, once you start supporting formats other than plain text, why stop with PDF? Sooner or later, people would want Word documents. And it must work with each and every version of Word, of course. And OpenDocument... and... and...

IMHO it's unrealistic for Zend_Search_Lucene to support all possible text document formats, or even all subformats of PDF, for that matter. Thus, I think ZSL should stick to what it does best - indexing plain text. Leave text extraction to the applications.
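
To make that split concrete, here is a rough sketch of how an application might handle extraction itself and hand only plain text to the indexer. It is written in Python purely for illustration and is not Zend Framework code. It assumes the external pdftotext tool (from Xpdf/Poppler) is installed; the names extract_pdf_text, index_file and add_to_index are hypothetical, and add_to_index stands in for whatever call your indexer offers for adding a plain-text document.

```python
import shutil
import subprocess
from typing import Callable, Optional


def extract_pdf_text(path: str) -> Optional[str]:
    """Return the text of a PDF, or None if no text could be extracted.

    Assumption: the external `pdftotext` tool is available on PATH.
    """
    if shutil.which("pdftotext") is None:
        return None
    # Passing "-" as the output file makes pdftotext write to stdout.
    result = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", path, "-"],
        capture_output=True,
        encoding="utf-8",
        errors="replace",
    )
    if result.returncode != 0:
        # Typically a damaged or password-protected file.
        return None
    text = result.stdout.strip()
    # Image-only PDFs (scans) produce no text; those would need OCR instead.
    return text or None


def index_file(
    path: str,
    extract: Callable[[str], Optional[str]],
    add_to_index: Callable[[str, str], None],
) -> bool:
    """Extract text in the application, then pass plain text to the indexer.

    `add_to_index` is a hypothetical stand-in for the indexer's own
    "add document" call. Returns False if the file had to be skipped.
    """
    text = extract(path)
    if text is None:
        return False
    add_to_index(path, text)
    return True
```

The indexer never sees a PDF here, only a string. Supporting Word or OpenDocument would then just mean passing a different extract function, and files that cannot be processed (DRM, scans) are skipped by the application rather than breaking the index.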


Just my 0.02$.

CU
 Markus
