Alexander Veremyev wrote:
A text extraction feature has been planned since the first versions of
Zend_Pdf and was estimated as “easy to implement”. But it has not been
done so far. The problem lies in some special cases that increase
implementation complexity: compressed or encrypted text streams and some
encoding issues.
I am not sure which is preferable: to have an implementation that doesn’t
work correctly in all cases, or not to have one at all (given the existing
PDF-to-text conversion solutions).
Personally, I'd opt for doing the text processing before giving it to
Lucene for indexing. Having worked at a media monitoring company in the
past, I've had to extract text from PDFs quite often. There are
excellent tools readily available for this task, but not all of them can
cope with all types of PDF files. There are some eBook formats, for
example, that have DRM encryption, or think of password-protected PDFs. I
have often seen PDFs that contained only images scanned from print
brochures or newspapers - you can't possibly extract text from those;
you'd need to extract the images and send them to OCR software.
Also, once you start supporting formats other than plain text, why stop
with PDF? Sooner or later, people would want Word documents. And it must
work with each and every version of Word, of course. And OpenDocument...
and... and...
IMHO it's unrealistic for Zend_Search_Lucene to support all possible
text document formats, or even all subformats of PDF, for that matter.
Thus, I think ZSL should stick to what it does best - indexing plain
text. Leave text extraction to the applications.
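To illustrate the split I mean, here is a rough sketch, written in Python rather than PHP purely for illustration. Everything in it is an assumption made for the example: the function names, the dispatch on file extension, and the use of the external `pdftotext` command-line tool for PDFs. `add_to_index` is a hypothetical stand-in for whatever the application uses to hand plain text to the indexer.

```python
import subprocess
from pathlib import Path


def extract_plain(path):
    # Plain text needs no conversion.
    return Path(path).read_text(encoding="utf-8", errors="replace")


def extract_pdf(path):
    # Assumption: the external `pdftotext` tool is installed and on PATH.
    # Password-protected PDFs make it exit with an error (raised here as
    # CalledProcessError); image-only scans yield little or no text.
    result = subprocess.run(
        ["pdftotext", str(path), "-"],  # "-" sends the text to stdout
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# The application decides which formats it supports, not the indexer.
EXTRACTORS = {".txt": extract_plain, ".pdf": extract_pdf}


def to_plain_text(path):
    suffix = Path(path).suffix.lower()
    try:
        extractor = EXTRACTORS[suffix]
    except KeyError:
        raise ValueError(f"no extractor registered for {suffix!r}")
    return extractor(path)


def index_file(path, add_to_index):
    # The indexer only ever sees plain text.
    text = to_plain_text(path)
    if not text.strip():
        # Nothing usable, e.g. a scanned PDF that would need OCR first.
        return False
    add_to_index(str(path), text)
    return True
```

Supporting another format then means registering one more extractor in the application, and the indexing side stays untouched.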
Just my $0.02.
CU
Markus