Doesn't version 0.9 have this property?

-----Original Message-----
From: Neera Sharma [mailto:[email protected]]
Sent: Friday, August 21, 2009 03:32 p.m.
To: [email protected]
Subject: Re: urlFilter

If you are using Nutch 1.0, you could try the ignore.external.links parameter in nutch-default.xml.

Neera

On Fri, Aug 21, 2009 at 5:48 AM, Jair Piedrahita Vargas<[email protected]> wrote:
> Hello,
>
> I have a problem crawling pages from an intranet. I would like to crawl just
> the pages that are in the intranet *.intranet.bancolombia.com.co, but when I
> watch the crawl process I see other pages that are linked from mine.
> The line "-." is supposed to make the crawl skip everything else, but it is
> not doing that.
> This is my crawl-urlfilter file. What could be the problem?
>
> # The url filter file used by the crawl command.
>
> # Better for intranet crawling.
> # Be sure to change MY.DOMAIN.NAME to your domain name.
>
> # Each non-comment, non-blank line contains a regular expression
> # prefixed by '+' or '-'. The first matching pattern in the file
> # determines whether a URL is included or ignored. If no pattern
> # matches, the URL is ignored.
>
> # skip file:, ftp:, & mailto: urls
> -^(file|ftp|mailto):
>
> # skip image and other suffixes we can't yet parse
> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
>
> # skip URLs containing certain characters as probable queries, etc.
> -[...@=]
>
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> -.*(/.+?)/.*?\1/.*?\1/
>
> # accept hosts in MY.DOMAIN.NAME
> +^http://([a-z0-9]*\.)intranet.bancolombia.com.co/
>
> # skip everything else
> -.
>
> Regards,
>
> Jair Piedrahíta Vargas
> Intern - Research and New Technologies Management
> Strategy and Architecture Department
> Information Technology Vice Presidency
> BANCOLOMBIA S.A.
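
The comments in the quoted filter file describe a first-match rule: each line is a regular expression prefixed by '+' or '-', the first pattern that matches decides, and a URL that matches nothing is ignored. A minimal Python sketch of that rule, for checking a rule set by hand, might look like the following. It is not Nutch code. The names `parse_rules` and `accepts` are hypothetical, the patterns are assumed to be applied as an unanchored search, and only three of the quoted rules are copied (the "certain characters" rule is left out because it appears garbled in the archived message).

```python
import re

# A subset of the rules from the quoted crawl-urlfilter file.
RULES = r"""
# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# accept hosts in the intranet domain (pattern copied as posted)
+^http://([a-z0-9]*\.)intranet.bancolombia.com.co/

# skip everything else
-.
"""


def parse_rules(text):
    """Turn filter-file text into a list of (include, compiled_pattern)."""
    rules = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no rule
        sign, pattern = line[0], line[1:]
        if sign not in ("+", "-"):
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Apply the first-match rule described in the file's comments."""
    for include, pattern in rules:
        # Assumption: a pattern matches if it is found anywhere in the URL.
        if pattern.search(url):
            return include
    return False  # no pattern matched, so the URL is ignored


rules = parse_rules(RULES)

for url in (
    "http://www.intranet.bancolombia.com.co/index.html",
    "http://intranet.bancolombia.com.co/",
    "http://www.example.com/",
    "ftp://www.intranet.bancolombia.com.co/file.txt",
):
    print(url, "->", "accepted" if accepts(rules, url) else "rejected")
```

Under these assumptions the external URL is rejected by the final "-." rule, which suggests the posted file on its own would not let outside links through. The sketch also shows that the accept pattern as posted requires exactly one host label before "intranet", so the bare host is rejected; the MY.DOMAIN.NAME comment hints at a template pattern that may have been written differently.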
