Re: [Nutch-general] indexing only special documents

Naess, Ronny Thu, 07 Jun 2007 01:19:08 -0700

 
Hi.

Configure crawl-urlfilter.txt.
To keep only PDF documents you want to add something like +\.pdf$
I guess another way would be to exclude all the other file types.


Try expanding the line below with html, doc, xls, ppt, etc
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|js|JS|dojo|DOJO|jsp|JSP)$

Or try including
+\.pdf$
#
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|js|JS|dojo|DOJO|jsp|JSP)$
followed by
-.

Haven't tried it myself, but experiment a bit and I guess you will
figure it out pretty soon.
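
If you want to experiment without running a full crawl, the rule order
can be tried out in a small script. This is only a sketch in Python, not
Nutch code. It assumes that the rules are tried top to bottom, that the
first rule whose pattern is found in the URL decides (+ keeps, - drops),
and that a URL matching no rule is dropped. The names RULES and accepts
and the example.com URLs are made up for illustration, and the exclude
pattern is shortened.

```python
import re

# Rules in file order as (sign, pattern) pairs: "+" keeps, "-" drops.
# The exclude pattern is a shortened stand-in for the full suffix list.
RULES = [
    ("+", r"\.pdf$"),
    ("-", r"\.(gif|GIF|jpg|JPG|png|PNG|css|zip|exe|js|JS)$"),
    ("-", r"."),  # catch-all: drop everything not accepted above
]


def accepts(url, rules=RULES):
    """Return True if the first matching rule is a '+' rule."""
    for sign, pattern in rules:
        # Unanchored search, so "\.pdf$" only has to match the URL's end.
        if re.search(pattern, url):
            return sign == "+"
    # Assumption: a URL that matches no rule is dropped.
    return False


if __name__ == "__main__":
    for url in (
        "http://example.com/docs/manual.pdf",
        "http://example.com/index.html",
        "http://example.com/logo.gif",
    ):
        print(url, "->", "keep" if accepts(url) else "drop")
```

Swapping the first and last rule in RULES shows why the order matters:
with "-." on top, nothing gets through at all.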

Regards,
Ronny 

-----Original Message-----
From: Martin Kammerlander [mailto:[EMAIL PROTECTED]]

Sent: 6 June 2007 20:30
To: [EMAIL PROTECTED]
Subject: indexing only special documents



hi!

I have a question. Suppose I have some seed URLs and do a crawl based
on those seeds. If I then want to index only pages that contain, for
example, PDF documents, how can I do that?

cheers
martin


