Hi,

Configure crawl-urlfilter.txt. You want to add something like

+\.pdf$

I guess another way would be to exclude all other file types.
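To illustrate the include approach, here is a rough Python sketch of how a rule list such as +\.pdf$ followed by a catch-all reject rule (-.) would sort URLs. It assumes the filter applies rules top to bottom, lets the first matching pattern decide, and rejects URLs that match no rule. The names RULES and accepts and the example.com URLs are made up for illustration and are not part of Nutch.

```python
import re

# Hypothetical rule list mirroring a crawl-urlfilter.txt that keeps only PDFs.
# Each entry is (sign, pattern): '+' accepts the URL, '-' rejects it.
RULES = [
    ("+", r"\.pdf$"),  # accept URLs ending in .pdf (case-sensitive)
    ("-", r"."),       # reject everything else
]


def accepts(url, rules=RULES):
    """Return True if the first rule whose pattern is found in url is a '+' rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    # Assumption: a URL that matches no rule at all is rejected.
    return False


if __name__ == "__main__":
    for url in ("http://example.com/paper.pdf", "http://example.com/index.html"):
        print(url, accepts(url))
```

Because the pattern is case-sensitive, a URL ending in .PDF would fall through to the reject rule in this sketch, so the real filter file may need an uppercase variant as well.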
Try expanding the line below with html, doc, xls, ppt, etc.:

-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|js|JS|dojo|DOJO|jsp|JSP)$

Or try including

+\.pdf$
# -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|js|JS|dojo|DOJO|jsp|JSP)$

followed by

-.

I haven't tried it myself, but experiment a bit and I guess you'll figure it out pretty soon.

Regards,
Ronny

-----Original Message-----
From: Martin Kammerlander [mailto:[EMAIL PROTECTED]
Sent: 6 June 2007 20:30
To: [EMAIL PROTECTED]
Subject: indexing only special documents

hi!

I have a question. Say I have some seed URLs and do a crawl based on those seeds. If I then want to index only pages that contain, for example, PDF documents, how can I do that?

cheers
martin

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
