Ronny, your way is probably better. I was only dealing with the
fetched properties, but in your case you don't fetch the file at all,
which gets rid of all that wasted bandwidth.
For types that can be identified by their file extension, this
would probably work better.
On 6/7/07, Naess, Ronny <[EMAIL PROTECTED]> wrote:
Hi.
Configure crawl-urlfilter.txt
So you want to add something like +\.pdf$ — and I guess another way
would be to exclude all the others.
Try expanding the line below with html, doc, xls, ppt, etc.:
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|js|JS|dojo|DOJO|jsp|JSP)$
Or try including
+\.pdf$
#
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|js|JS|dojo|DOJO|jsp|JSP)$
Followed by
-.
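
Putting the "include, then exclude everything else" idea together, a minimal crawl-urlfilter.txt sketch might look like the following (untested — this assumes the standard Nutch regex-urlfilter syntax, where rules are tried top to bottom and the first matching pattern decides):

```
# Accept PDF URLs
+\.pdf$
# Reject everything that did not match above
-.
```

One caveat worth checking for your setup: a filter this strict may also block the HTML pages the crawler needs to traverse to discover PDF links in the first place, so in practice you may need additional + rules for the hosts or page types you crawl through.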
I haven't tried it myself, but experiment a little and I guess you'll
figure it out pretty soon.
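
To see how such rules behave before running a crawl, here is a minimal Python sketch of the first-match-wins semantics (an assumption based on how Nutch's regex-urlfilter applies its patterns: each rule is tried in order, and the sign of the first pattern found in the URL decides acceptance; the rule list and URLs below are illustrative):

```python
import re

# Rules mirroring the suggested crawl-urlfilter.txt: a '+' rule accepts,
# a '-' rule rejects, and the first pattern that matches wins.
# IGNORECASE stands in for the original file's explicit pdf/PDF variants.
RULES = [
    ("+", re.compile(r"\.pdf$", re.IGNORECASE)),  # accept PDFs
    ("-", re.compile(r".")),                      # reject everything else
]

def accept(url):
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == "+"
    return False  # no rule matched: reject by default

print(accept("http://example.com/report.pdf"))   # True
print(accept("http://example.com/index.html"))   # False
```

The catch-all "-." is what makes the exclude approach work: any URL not already accepted by an earlier + rule matches it and is dropped.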
Regards,
Ronny
-----Original message-----
From: Martin Kammerlander [mailto:[EMAIL PROTECTED]
Sent: 6 June 2007 20:30
To: [EMAIL PROTECTED]
Subject: indexing only special documents
hi!
I have a question. Suppose I have some seed URLs and do a crawl
based on those seeds. If I then want to index only pages that contain,
for example, PDF documents, how can I do that?
cheers
martin
--
"Conscious decisions by conscious minds are what make reality real"
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general