Hi,

On 5/23/07, Ilya Vishnevsky <[EMAIL PROTECTED]> wrote:
> Hi!
> Some of fetched pdf-documents are not parsed. When I use SegmentReader
> the value corresponding to key "pt" in the resulting map is empty.
> For example this happens with following urls:
>
> http://www.virtualacquisitionshowcase.com/docs/DETech-Brochure.pdf
>
> http://www.dtic.mil/ndia/22ndISB2005/thursday/fong.pdf
>
> http://www.dsto.defence.gov.au/publications/2581/DSTO-TR-1479.pdf
>
> http://sill-www.army.mil/FAMAG/2000/JUL_AUG_2000/JUL_AUG-2000_PAGES_36_39.pdf
>
> http://www.dtic.mil/ndia/2001armaments/fong.pdf
>
> At the same time there are pdf-files that are parsed normally.
> Why this problem can occur and how can I resolve it?
>

What is your http.content.limit? For example, the first URL in your
list is a little over 200K. If http.content.limit is smaller than that
(by default it is 64K), Nutch truncates the content at
http.content.limit, and parse-pdf can't parse partial PDF files.

-- 
Doğacan Güney

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
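
If the limit does turn out to be the cause, one way to raise it is to override the property in conf/nutch-site.xml. The following is only a sketch: the 1 MiB value is an illustrative assumption, not a recommendation, and should be sized to the largest PDF you expect to fetch.

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: values here override nutch-default.xml -->
<configuration>
  <property>
    <name>http.content.limit</name>
    <!-- Illustrative value: 1048576 bytes (1 MiB).
         A negative value such as -1 turns truncation off. -->
    <value>1048576</value>
  </property>
</configuration>
```

Segments that were already fetched still hold the truncated content, so the affected URLs would most likely need to be fetched again before parse-pdf can succeed on them.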
