Yes, it does. Check nutch-default.xml.

On Mon, Aug 24, 2009 at 5:41 PM, Jair Piedrahita Vargas <[email protected]> wrote:
> Doesn't version 0.9 have this property?
>
> -----Original Message-----
> From: Neera Sharma [mailto:[email protected]]
> Sent: Friday, August 21, 2009 03:32 p.m.
> To: [email protected]
> Subject: Re: urlFilter
>
> If you are using Nutch-1.0 you could try the ignore.external.links
> parameter in nutch-default.xml.
>
> Neera
>
> On Fri, Aug 21, 2009 at 5:48 AM, Jair Piedrahita Vargas<[email protected]> wrote:
> > Hello,
> >
> > I have a problem crawling pages from an intranet. I would like to crawl only
> > the pages that are in the intranet *.intranet.bancolombia.com.co, but when
> > I watch the crawl process I see other pages that are linked from mine.
> > The line "-." is supposed to make the crawl skip everything else,
> > but it is not doing that.
> > This is my crawl-urlfilter file. What could be the problem?
> >
> > # The url filter file used by the crawl command.
> >
> > # Better for intranet crawling.
> > # Be sure to change MY.DOMAIN.NAME to your domain name.
> >
> > # Each non-comment, non-blank line contains a regular expression
> > # prefixed by '+' or '-'. The first matching pattern in the file
> > # determines whether a URL is included or ignored. If no pattern
> > # matches, the URL is ignored.
> >
> > # skip file:, ftp:, & mailto: urls
> > -^(file|ftp|mailto):
> >
> > # skip image and other suffixes we can't yet parse
> > -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
> >
> > # skip URLs containing certain characters as probable queries, etc.
> > -[...@=]
> >
> > # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> > -.*(/.+?)/.*?\1/.*?\1/
> >
> > # accept hosts in MY.DOMAIN.NAME
> > +^http://([a-z0-9]*\.)intranet.bancolombia.com.co/
> >
> > # skip everything else
> > -.
> >
> > Regards,
> >
> > Jair Piedrahíta Vargas

-- 
Thanks and Regards,
Vishal Vachhani
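
For anyone following the thread, here is a rough Python sketch of the "first matching pattern wins" rule that the comments in the quoted crawl-urlfilter file describe. It is not Nutch code. The names `parse_rules` and `accepts` are made up for illustration, and the sketch assumes a pattern counts as matching when it is found anywhere in the URL (hence `re.search`). Only a subset of the quoted rules is used: the character-class rule is left out because it appears garbled in the archived message, and the repeated-segment rule is left out for brevity.

```python
import re

# Subset of the rules quoted in the message, kept in file order.
RULES_TEXT = r"""
# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
# accept hosts in the intranet domain (pattern copied as it appears in the message)
+^http://([a-z0-9]*\.)intranet.bancolombia.com.co/
# skip everything else
-.
"""


def parse_rules(text):
    """Turn '+regex' / '-regex' lines into (include, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no rule
        sign, pattern = line[0], line[1:]
        if sign not in ("+", "-"):
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(url, rules):
    """Return the verdict of the first rule whose pattern is found in the URL."""
    for include, pattern in rules:
        if pattern.search(url):
            return include
    return False  # no pattern matched: the URL is ignored


if __name__ == "__main__":
    rules = parse_rules(RULES_TEXT)
    for url in (
        "http://www.intranet.bancolombia.com.co/page.html",
        "http://www.intranet.bancolombia.com.co/logo.gif",
        "http://www.bancolombia.com/",
        "http://intranet.bancolombia.com.co/",
    ):
        print(url, accepts(url, rules))
```

Under these assumptions the external host falls through to "-." and is rejected, which is what the file's comments promise. The last URL is rejected too: with the pattern exactly as it appears in the message, the group `([a-z0-9]*\.)` has to match once, so a host with no extra label in front of `intranet` does not match the accept rule.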
