Yes, it does. Check nutch-default.xml.
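
For readers unsure what ignoring external links means in practice, here is a minimal Python sketch of the idea. It is not Nutch code. The function names are hypothetical, and the host name "portal" is made up for illustration. I recall the property appearing in nutch-default.xml under the full name db.ignore.external.links, but treat that name as an assumption and verify it against your own copy.

```python
from urllib.parse import urlparse


def host_of(url):
    """Return the lowercased host name of a URL, or '' if it has none."""
    return (urlparse(url).hostname or "").lower()


def internal_outlinks(page_url, outlinks):
    """Keep only the outlinks whose host equals the host of the page."""
    page_host = host_of(page_url)
    return [link for link in outlinks if host_of(link) == page_host]


if __name__ == "__main__":
    # "portal" is a hypothetical intranet host used only for this example.
    page = "http://portal.intranet.bancolombia.com.co/index.html"
    links = [
        "http://portal.intranet.bancolombia.com.co/news.html",
        "http://www.bancolombia.com/",
    ]
    print(internal_outlinks(page, links))
```

In this sketch the comparison is a strict host match, so a link from one intranet subdomain to another would also count as external. Whether that is acceptable depends on how the intranet hosts are laid out.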

On Mon, Aug 24, 2009 at 5:41 PM, Jair Piedrahita Vargas <
[email protected]> wrote:

> Doesn't version 0.9 have this property?
>
> -----Original Message-----
> From: Neera Sharma [mailto:[email protected]]
> Sent: Friday, August 21, 2009 03:32 p.m.
> To: [email protected]
> Subject: Re: urlFilter
>
> If you are using Nutch-1.0, you could try the ignore.external.links
> parameter in nutch-default.xml.
>
>
> Neera
>
>
>
> On Fri, Aug 21, 2009 at 5:48 AM, Jair Piedrahita
> Vargas<[email protected]> wrote:
> > Hello,
> >
> > I have a problem crawling pages from an intranet. I would like to crawl
> > only the pages that are in the intranet *.intranet.bancolombia.com.co, but
> > when I watch the crawl process I see other pages that are linked from mine.
> > The line "-." is supposed to make the crawl skip everything else, but it
> > is not doing that.
> > This is my crawl-urlfilter file. What could be the problem?
> >
> > # The url filter file used by the crawl command.
> >
> > # Better for intranet crawling.
> > # Be sure to change MY.DOMAIN.NAME to your domain name.
> >
> > # Each non-comment, non-blank line contains a regular expression
> > # prefixed by '+' or '-'.  The first matching pattern in the file
> > # determines whether a URL is included or ignored.  If no pattern
> > # matches, the URL is ignored.
> >
> > # skip file:, ftp:, & mailto: urls
> > -^(file|ftp|mailto):
> >
> > # skip image and other suffixes we can't yet parse
> > -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
> >
> > # skip URLs containing certain characters as probable queries, etc.
> > -[...@=]
> >
> > # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> > -.*(/.+?)/.*?\1/.*?\1/
> >
> > # accept hosts in MY.DOMAIN.NAME
> > +^http://([a-z0-9]*\.)intranet.bancolombia.com.co/
> >
> > # skip everything else
> > -.
> >
> > Regards,
> >
> > Jair Piedrahíta Vargas
> > Intern - Research and New Technologies Management
> > Strategy and Architecture Directorate
> > Vice Presidency of Information Technology
> > BANCOLOMBIA S.A.
> > www.bancolombia.com<http://www.bancolombia.com>
> > Tel: (++ 57) (4) 40 41 632
> > Fax: (++ 57) (4) 40 40 197 - (++ 57) (4) 40 40 198
> > E-mail: [email protected]<mailto:[email protected]>
> > Cra. 48 # 26 - 85 Av. Los Industriales
> > Torre Norte Piso 6B -  120 (Medellín, Colombia)
> > ____________________________________________________
> > Flexible hours: 7:00 - 12:00 and 1:30 - 4:30 GMT (-05:00)
> >
>
>
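
The first-match rule described in the comments of the quoted crawl-urlfilter file can be sketched in Python as follows. This is only an illustration, not the Nutch implementation: the names RULES and accepts are hypothetical, the host "portal" is made up, and it assumes each expression is applied with search (find) semantics rather than a full-string match. The "-[...@=]" rule is left out because its character class appears to have been altered by the mail archive.

```python
import re

# Rules copied from the quoted crawl-urlfilter file, in file order.
# The "-[...@=]" rule is omitted: its character class looks altered
# by the archive, so its original content is not known here.
RULES = [
    ("-", r"^(file|ftp|mailto):"),
    ("-", r"\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg"
          r"|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$"),
    ("-", r".*(/.+?)/.*?\1/.*?\1/"),
    ("+", r"^http://([a-z0-9]*\.)intranet.bancolombia.com.co/"),
    ("-", r"."),
]
COMPILED = [(sign, re.compile(pattern)) for sign, pattern in RULES]


def accepts(url):
    """Return True if the first rule that matches the URL is a '+' rule."""
    for sign, pattern in COMPILED:
        # Assumption: search() mirrors a find-style match anywhere in the URL.
        if pattern.search(url):
            return sign == "+"
    # No rule matched: the file's comments say the URL is ignored.
    return False


if __name__ == "__main__":
    for url in (
        "http://portal.intranet.bancolombia.com.co/index.html",
        "http://www.bancolombia.com/",
        "http://intranet.bancolombia.com.co/",
    ):
        print(url, accepts(url))
```

Under these assumptions, the accept rule as written requires exactly one host label in front of "intranet", so the bare host intranet.bancolombia.com.co falls through to the final "-." rule. Running candidate URLs through a sketch like this can help check which rule decides each one.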


-- 
Thanks and Regards,
Vishal Vachhani
