Hello, I have a problem crawling pages from an intranet. I would like to crawl only the pages that are in the intranet *.intranet.bancolombia.com.co, but when I watch the crawl process I see other pages that are linked from mine. The final line, "-.", is supposed to make the crawl skip everything else, but it is not doing that. This is my crawl-urlfilter file. What could be the problem?
# The url filter file used by the crawl command.

# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[...@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)intranet.bancolombia.com.co/

# skip everything else
-.

Regards,
Jair Piedrahíta Vargas
BANCOLOMBIA S.A.
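To double-check my reading of the rules (the first matching pattern wins, and a URL that matches nothing is ignored), here is a minimal sketch in Python. It is not the crawler's own code: `RULES` and `accepts` are names made up for illustration, Python's `re.search` stands in for the crawler's regex matching and may differ from it in details, the suffix list is shortened, and the character-class rule is left out because it appears garbled in this copy of the file.

```python
import re

# (sign, pattern) pairs in file order, copied from the filter file.
# Assumptions: shortened suffix list; the character-class rule is omitted
# because it is garbled in this copy of the file.
RULES = [
    ("-", r"^(file|ftp|mailto):"),
    ("-", r"\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|zip|exe)$"),
    ("-", r".*(/.+?)/.*?\1/.*?\1/"),
    ("+", r"^http://([a-z0-9]*\.)intranet.bancolombia.com.co/"),
    ("-", r"."),
]


def accepts(url, rules=RULES):
    """Return True if the first rule that matches `url` is a '+' rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    # No rule matched: the URL is ignored.
    return False


if __name__ == "__main__":
    for url in (
        "http://www.intranet.bancolombia.com.co/index.html",
        "http://www.google.com/",
        "mailto:someone@example.com",
    ):
        print(url, accepts(url))
```

Under these assumptions, only the first of the three sample URLs is accepted and an outside host falls through to the final "-." rule, which is the behaviour I expected from the real crawl.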
