Alexander Veremyev wrote:
A text extraction feature has been planned since the first versions of Zend_Pdf and was estimated as “easy to implement”. But it hasn’t been done so far. The problem lies in some special cases that increase the implementation complexity: compressed or encrypted text streams and some encoding issues.

I am not sure which is preferable: to have an implementation that doesn’t work correctly in all cases, or not to have one at all (in view of the existing PDF-to-text conversion solutions).

Personally, I'd opt for doing the text extraction before handing the content to Lucene for indexing. Having worked at a media monitoring company in the past, I've had to extract text from PDFs quite often. There are excellent tools readily available for this task, but not all of them can cope with all types of PDF files. Some eBook formats, for example, use DRM encryption, and then there are password-protected PDFs. I have also often seen PDFs that contained only images scanned from print brochures or newspapers - you can't possibly extract text from those; you'd need to extract the images and send them to OCR software.

Also, once you start supporting formats other than plain text, why stop at PDF? Sooner or later, people will want Word documents. And it must work with each and every version of Word, of course. And OpenDocument... and... and...

IMHO it's unrealistic for Zend_Search_Lucene to support all possible text document formats, or even all subformats of PDF, for that matter. Thus, I think ZSL should stick to what it does best - indexing plain text. Leave text extraction to the applications.
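
To illustrate that division of labour, here is a rough sketch of the application side, written in Python rather than PHP purely for brevity. Everything in it is an assumption for illustration: the function names are hypothetical, `add_to_index` stands in for whatever indexing call the application uses (it is not a Zend_Search_Lucene API), and the PDF branch assumes Poppler's `pdftotext` command-line tool is installed. The point is only that the indexer ever sees plain text, and that files which cannot be converted are skipped and reported instead of breaking the run.

```python
import shutil
import subprocess
from pathlib import Path


class ExtractionError(Exception):
    """Raised when no plain text could be obtained from a file."""


def extract_plain(path):
    # Plain text needs no conversion.
    return Path(path).read_text(encoding="utf-8", errors="replace")


def extract_pdf(path):
    # Assumption: Poppler's pdftotext is installed; "-" writes the text to stdout.
    if shutil.which("pdftotext") is None:
        raise ExtractionError("pdftotext not found")
    result = subprocess.run(
        ["pdftotext", str(path), "-"], capture_output=True, text=True
    )
    if result.returncode != 0:
        # Covers, for example, encrypted files the tool cannot open.
        raise ExtractionError(result.stderr.strip() or "pdftotext failed")
    if not result.stdout.strip():
        # Typical for scanned, image-only PDFs: those would need OCR instead.
        raise ExtractionError("no text layer found")
    return result.stdout


# Hypothetical registry: the application decides which formats it supports.
EXTRACTORS = {".txt": extract_plain, ".pdf": extract_pdf}


def to_plain_text(path, extractors=EXTRACTORS):
    suffix = Path(path).suffix.lower()
    try:
        extractor = extractors[suffix]
    except KeyError:
        raise ExtractionError(f"unsupported format: {suffix}") from None
    return extractor(path)


def index_files(paths, add_to_index, extractors=EXTRACTORS):
    """Pass plain text to the indexer; return the files that were skipped."""
    skipped = []
    for path in paths:
        try:
            text = to_plain_text(path, extractors)
        except ExtractionError as exc:
            skipped.append((str(path), str(exc)))
            continue
        add_to_index(str(path), text)
    return skipped
```

Supporting another format then means registering one more extractor in the application, without touching the indexer at all.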

Just my $0.02.

CU
 Markus
