I did it... it works! Thanks, Neera and Vishal.

Jair

-----Original Message-----
From: vishal vachhani [mailto:[email protected]]
Sent: Monday, August 24, 2009 09:20 a.m.
To: [email protected]
Subject: Re: urlFilter

Yes, it does! Check nutch-default.xml.

On Mon, Aug 24, 2009 at 5:41 PM, Jair Piedrahita Vargas <
[email protected]> wrote:

> Doesn't version 0.9 have this property?
>
> -----Original Message-----
> From: Neera Sharma [mailto:[email protected]]
> Sent: Friday, August 21, 2009 03:32 p.m.
> To: [email protected]
> Subject: Re: urlFilter
>
> If you are using Nutch-1.0, you could try the db.ignore.external.links
> property listed in nutch-default.xml.
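
A rough sketch of what this setting is meant to do, for readers who want to see the effect rather than the configuration. In Nutch the property is named db.ignore.external.links and, as far as I know, when it is true an outlink is dropped unless its host equals the host of the page it was found on. The function name keep_outlink and the example.com URL below are hypothetical, and the Python is only an illustration of that comparison, not Nutch code.

```python
from urllib.parse import urlsplit


def keep_outlink(from_url, to_url):
    """Keep an outlink only if it points at the same host as its source page.

    Assumption: the comparison is an exact, case-insensitive host match.
    """
    from_host = (urlsplit(from_url).hostname or "").lower()
    to_host = (urlsplit(to_url).hostname or "").lower()
    # An outlink without a host is dropped as well.
    return bool(to_host) and to_host == from_host


page = "http://www.intranet.bancolombia.com.co/index.html"
same_host = keep_outlink(page, "http://www.intranet.bancolombia.com.co/news.html")
external = keep_outlink(page, "http://www.example.com/")
other_subdomain = keep_outlink(page, "http://docs.intranet.bancolombia.com.co/")
```

If the comparison really is an exact host match, a link from one intranet host to another intranet subdomain would be dropped too, so the seed list may need one URL per host.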
>
>
> Neera
>
>
>
> On Fri, Aug 21, 2009 at 5:48 AM, Jair Piedrahita
> Vargas<[email protected]> wrote:
> > Hello,
> >
> > I have a problem crawling pages from an intranet. I would like to crawl only
> > the pages in the intranet *.intranet.bancolombia.com.co, but when
> > I watch the crawl process I see other pages that are linked from mine.
> > The line "-." is supposed to make the crawl skip
> > everything else, but it is not doing that.
> > This is my crawl-urlfilter file. What could be the problem?
> >
> > # The url filter file used by the crawl command.
> >
> > # Better for intranet crawling.
> > # Be sure to change MY.DOMAIN.NAME to your domain name.
> >
> > # Each non-comment, non-blank line contains a regular expression
> > # prefixed by '+' or '-'.  The first matching pattern in the file
> > # determines whether a URL is included or ignored.  If no pattern
> > # matches, the URL is ignored.
> >
> > # skip file:, ftp:, & mailto: urls
> > -^(file|ftp|mailto):
> >
> > # skip image and other suffixes we can't yet parse
> > -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
> >
> > # skip URLs containing certain characters as probable queries, etc.
> > -[...@=]
> >
> > # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> > -.*(/.+?)/.*?\1/.*?\1/
> >
> > # accept hosts in MY.DOMAIN.NAME
> > +^http://([a-z0-9]*\.)intranet.bancolombia.com.co/
> >
> > # skip everything else
> > -.
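
To make the first-match-wins rule concrete, here is a minimal Python sketch of how such a file is evaluated, using the rules quoted above. It assumes patterns are searched anywhere in the URL (which is how I understand Nutch's regex filter to behave) and that Python's re module agrees with Java's regex engine for these particular patterns. The names RULES, accepts and STOCK_STYLE are made up for illustration, the character-class rule is omitted because it is garbled in this archive copy, and example.com is a placeholder.

```python
import re

# Rules in file order: (sign, pattern). First matching pattern decides.
RULES = [
    ("-", r"^(file|ftp|mailto):"),
    ("-", r"\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg"
          r"|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$"),
    ("-", r".*(/.+?)/.*?\1/.*?\1/"),
    ("+", r"^http://([a-z0-9]*\.)intranet.bancolombia.com.co/"),  # as posted
    ("-", r"."),
]

# Hypothetical variant: group repeated with '*' and literal dots escaped.
STOCK_STYLE = r"^http://([a-z0-9]*\.)*intranet\.bancolombia\.com\.co/"


def accepts(url, rules=RULES):
    """Return True if the first rule found in url is a '+' rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False  # no pattern matched: the URL is ignored


for url in [
    "http://www.intranet.bancolombia.com.co/index.html",
    "http://intranet.bancolombia.com.co/",
    "http://a.b.intranet.bancolombia.com.co/",
    "http://www.example.com/",
]:
    print(url, accepts(url), bool(re.search(STOCK_STYLE, url)))
```

With the accept line as posted, the bare host and multi-level subdomains are rejected, because the group `([a-z0-9]*\.)` must match exactly once. The stock template has a `*` after that group, as far as I recall. This makes the posted filter stricter, not looser, so it would not by itself explain external pages showing up in the crawl.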
> >
> > Regards,
> >
> > Jair Piedrahíta Vargas
> > Intern - Research and New Technologies Management
> > Strategy and Architecture Directorate
> > Vice Presidency of Information Technology
> > BANCOLOMBIA S.A.
> > www.bancolombia.com<http://www.bancolombia.com>
> > Tel: (++ 57) (4) 40 41 632
> > Fax: (++ 57) (4) 40 40 197 - (++ 57) (4) 40 40 198
> > E-mail: [email protected]<mailto:[email protected]>
> > Cra. 48 # 26 - 85 Av. Los Industriales
> > Torre Norte Piso 6B -  120 (Medellín, Colombia)
> > ____________________________________________________
> > Flexible hours: 7:00 - 12:00 and 1:30 - 4:30 GMT (-05:00)
> >
> >
> > ________________________________
> > This communication (including all attachments) may contain information
> > that is private, confidential and privileged. If you have received this
> > communication in error; please notify the sender immediately, delete this
> > communication from all data storage devices and destroy all hard copies. Any
> > use, dissemination, distribution, copying or disclosure of this message and
> > any attachments, in whole or in part, by anyone other than the intended
> > recipient(s) is strictly prohibited. This message has been checked with an
> > antivirus software; accordingly, the sender is not liable for the presence
> > of any virus in attachments that causes or may cause damage to the
> > recipient's equipment or software.
> >
>
>


--
Thanks and Regards,
Vishal Vachhani

