[Nutch-general] some pdf's are not parsed

Ilya Vishnevsky Wed, 23 May 2007 06:21:33 -0700

Hi!
Some of the fetched PDF documents are not parsed. When I use SegmentReader,
the value corresponding to the key "pt" in the resulting map is empty.
For example, this happens with the following URLs:


http://www.virtualacquisitionshowcase.com/docs/DETech-Brochure.pdf

http://www.dtic.mil/ndia/22ndISB2005/thursday/fong.pdf

http://www.dsto.defence.gov.au/publications/2581/DSTO-TR-1479.pdf

http://sill-www.army.mil/FAMAG/2000/JUL_AUG_2000/JUL_AUG-2000_PAGES_36_39.pdf

http://www.dtic.mil/ndia/2001armaments/fong.pdf

At the same time, other PDF files are parsed normally.
Why does this problem occur, and how can I resolve it?

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
