[ https://issues.apache.org/jira/browse/CONNECTORS-1392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Schuch updated CONNECTORS-1392:
--------------------------------------
    Description:
The Web connector already has an option to ignore robots.txt.
This ticket adds another option that lets the connector ignore robots instructions in {{<meta name="robots ...}} tags and {{<a ... rel="nofollow" ...}} attributes.

*First proposal (to be discussed)*
Reuse the existing "Robots.txt usage" option in the "Robots" tab. Rename the existing options to:
# Don't look at robots.txt, meta robots and rel attributes
# Obey robots.txt, meta robots tags and rel attributes for data fetches only
# Obey robots.txt, meta robots tags and rel attributes _(the default)_

The end user doc needs to be updated.

Google resources on robot instructions in HTML pages:
[0] https://support.google.com/webmasters/answer/79812?hl=en&ctx=cb&src=cb&cbid=tnnsjq5jcodt&cbrank=4
[1] https://support.google.com/webmasters/answer/96569?hl=en&ctx=cb&src=cb&cbid=-5rmggrfsp2rq&cbrank=3

  was:
The Web connector already has an option to ignore robots.txt.
This ticket adds another option that lets the connector ignore robots instructions in {{<meta name="robots ...}} tags and {{<a ... rel="nofollow" ...}} attributes.

*First proposal*
Reuse the existing "Robots.txt usage" option in the "Robots" tab. Rename the existing options to:
# Don't look at robots.txt, meta robots and rel attributes
# Obey robots.txt, meta robots tags and rel attributes for data fetches only
# Obey robots.txt, meta robots tags and rel attributes _(the default)_

The end user doc needs to be updated.
Google resources on robot instructions in HTML pages:
[0] https://support.google.com/webmasters/answer/79812?hl=en&ctx=cb&src=cb&cbid=tnnsjq5jcodt&cbrank=4
[1] https://support.google.com/webmasters/answer/96569?hl=en&ctx=cb&src=cb&cbid=-5rmggrfsp2rq&cbrank=3


> Add option for Web connector to ignore robots instructions in meta tags and rel attributes
> ------------------------------------------------------------------------------------------
>
>                 Key: CONNECTORS-1392
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1392
>             Project: ManifoldCF
>          Issue Type: New Feature
>          Components: Web connector
>            Reporter: Markus Schuch
>
> The Web connector already has an option to ignore robots.txt.
> This ticket adds another option that lets the connector ignore robots instructions in {{<meta name="robots ...}} tags and {{<a ... rel="nofollow" ...}} attributes.
> *First proposal (to be discussed)*
> Reuse the existing "Robots.txt usage" option in the "Robots" tab. Rename the existing options to:
> # Don't look at robots.txt, meta robots and rel attributes
> # Obey robots.txt, meta robots tags and rel attributes for data fetches only
> # Obey robots.txt, meta robots tags and rel attributes _(the default)_
> The end user doc needs to be updated.
> Google resources on robot instructions in HTML pages:
> [0] https://support.google.com/webmasters/answer/79812?hl=en&ctx=cb&src=cb&cbid=tnnsjq5jcodt&cbrank=4
> [1] https://support.google.com/webmasters/answer/96569?hl=en&ctx=cb&src=cb&cbid=-5rmggrfsp2rq&cbrank=3

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)