Does version 0.9 not have this property?

-----Original Message-----
From: Neera Sharma [mailto:[email protected]] 
Sent: Friday, August 21, 2009 03:32 p.m.
To: [email protected]
Subject: Re: urlFilter

If you are using Nutch 1.0, you could try the db.ignore.external.links
property in nutch-default.xml.


Neera



On Fri, Aug 21, 2009 at 5:48 AM, Jair Piedrahita
Vargas<[email protected]> wrote:
> Hello,
>
> I have a problem crawling pages from an intranet. I would like to crawl only
> the pages in the intranet *.intranet.bancolombia.com.co, but when I watch
> the crawl process I see other pages that are linked from mine.
> The final "-." line is supposed to make the crawl skip everything
> else, but it is not doing that.
> This is my crawl-urlfilter file. What could be the problem?
>
> # The url filter file used by the crawl command.
>
> # Better for intranet crawling.
> # Be sure to change MY.DOMAIN.NAME to your domain name.
>
> # Each non-comment, non-blank line contains a regular expression
> # prefixed by '+' or '-'.  The first matching pattern in the file
> # determines whether a URL is included or ignored.  If no pattern
> # matches, the URL is ignored.
>
> # skip file:, ftp:, & mailto: urls
> -^(file|ftp|mailto):
>
> # skip image and other suffixes we can't yet parse
> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
>
> # skip URLs containing certain characters as probable queries, etc.
> -[...@=]
>
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> -.*(/.+?)/.*?\1/.*?\1/
>
> # accept hosts in MY.DOMAIN.NAME
> +^http://([a-z0-9]*\.)intranet.bancolombia.com.co/
>
> # skip everything else
> -.
>
> Regards,
>
> Jair Piedrahíta Vargas
> Intern - Research and New Technologies Management
> Strategy and Architecture Directorate
> Vice Presidency of Information Technology
> BANCOLOMBIA S.A.
> www.bancolombia.com<http://www.bancolombia.com>
> Tel: (++ 57) (4) 40 41 632
> Fax: (++ 57) (4) 40 40 197 - (++ 57) (4) 40 40 198
> E-mail: [email protected]<mailto:[email protected]>
> Cra. 48 # 26 - 85 Av. Los Industriales
> Torre Norte Piso 6B -  120 (Medellín, Colombia)
> ____________________________________________________
> Flexible hours: 7:00 - 12:00 and 1:30 - 4:30 GMT (-05:00)
>
>
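
The comments in the quoted filter file say that the first matching pattern decides whether a URL is included, and that a URL matching nothing is ignored. As a rough way to check the rules outside Nutch, that first-match logic can be sketched in Python. This is only an approximation, not Nutch's own code: it assumes each pattern is applied as an unanchored search, the `accepts` helper and the sample URLs are hypothetical, and the character-class rule is left out because its pattern is garbled in the archived message.

```python
import re

# Rules copied from the quoted filter file, in file order.
# The "probable queries" character-class rule is omitted because its
# pattern is garbled in the archived message.
RULES = [
    ("-", r"^(file|ftp|mailto):"),
    ("-", r"\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg"
          r"|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$"),
    ("-", r".*(/.+?)/.*?\1/.*?\1/"),
    ("+", r"^http://([a-z0-9]*\.)intranet.bancolombia.com.co/"),
    ("-", r"."),
]

COMPILED = [(sign, re.compile(pattern)) for sign, pattern in RULES]


def accepts(url):
    """Return True if the first rule that matches `url` is a '+' rule."""
    for sign, regex in COMPILED:
        # Assumption: a rule applies if it matches anywhere in the URL.
        if regex.search(url):
            return sign == "+"
    # No pattern matched: the URL is ignored.
    return False


if __name__ == "__main__":
    # Hypothetical sample URLs, for illustration only.
    samples = [
        "http://portal.intranet.bancolombia.com.co/index.html",
        "http://portal.intranet.bancolombia.com.co/logo.png",
        "http://intranet.bancolombia.com.co/",
        "http://www.example.com/",
    ]
    for url in samples:
        print(url, "->", "accept" if accepts(url) else "reject")
```

Under these assumptions the rules reject an external host such as www.example.com, which suggests the rule order itself is not what lets external pages through. The sketch also shows that the accept pattern requires exactly one subdomain label, so the bare host intranet.bancolombia.com.co is rejected as well.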
