Hi, indeed, SgmlLinkExtractor chokes on these comments.
Alternative link extractors (below) will parse links from your example page, but they don't have "allow" or "deny" parameters:

>>> from scrapy.contrib.linkextractors.htmlparser import HtmlParserLinkExtractor
>>> len(HtmlParserLinkExtractor().extract_links(response))
102
>>> from scrapy.contrib.linkextractors.lxmlhtml import LxmlParserLinkExtractor
>>> len(LxmlParserLinkExtractor().extract_links(response))
102

One solution is to do something similar to what SgmlLinkExtractor does when it subclasses BaseSgmlLinkExtractor (see https://github.com/scrapy/scrapy/blob/master/scrapy/contrib/linkextractors/sgml.py#L94 ): wrap HtmlParserLinkExtractor or LxmlParserLinkExtractor so that it accepts "allow" and "deny", and use that in your Rule. A rough sketch is at the bottom of this message, after your quoted question.

Hope this helps.
/Paul.

On Tuesday, January 7, 2014 10:15:21 AM UTC+1, Kimble wrote:
>
> I have a problem with Scrapy choking on some HTML comments in pages. The
> encoding on the page itself is not great: the comment it is choking on
> starts with <!--, but one of the dashes is probably encoded using some sort
> of Word encoding, and when running the page through the CrawlSpider, Scrapy
> throws an exception like:
>
>   File "/usr/lib/pymodules/python2.7/scrapy/contrib/linkextractors/sgml.py", line 29, in _extract_links
>     self.feed(response_text)
>   File "/usr/lib/python2.7/sgmllib.py", line 104, in feed
>     self.goahead(0)
>   File "/usr/lib/python2.7/sgmllib.py", line 174, in goahead
>     k = self.parse_declaration(i)
>   File "/usr/lib/python2.7/markupbase.py", line 98, in parse_declaration
>     decltype, j = self._scan_name(j, i)
>   File "/usr/lib/python2.7/markupbase.py", line 392, in _scan_name
>     % rawdata[declstartpos:declstartpos+20])
>   File "/usr/lib/python2.7/sgmllib.py", line 111, in error
>     raise SGMLParseError(message)
> sgmllib.SGMLParseError: expected name token at '<!\xe2\x80\x94-0QzVNFtk[88X5m'
>
> The page being crawled is:
>
> http://www.jbhifi.com.au/pro-dj/samson/studio-gt-pro-pack-sku-86905/
>
> and the SgmlLinkExtractor rule being used is:
>
> Rule(SgmlLinkExtractor(allow=(r'.*'),
>      deny=(r'\/corporate\/', r'\/stores\/', r'\/jobs\/', r'\/factory-scoop\/')),
>      callback='parse_item', follow=True)
>
> Is there any way to prevent Scrapy from skipping these pages entirely and
> to continue on tag errors? It's not as if the whole page fails to parse;
> there are end tags. However, it seems to be treated as a fatal error and
> the page is skipped.
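P.S. Here is the rough, untested sketch I mentioned above, for illustration only: the class name FilteringLxmlLinkExtractor and its _url_allowed helper are made up here, not part of Scrapy, and it assumes LxmlParserLinkExtractor.extract_links(response) returns Link objects with a .url attribute (as in the shell session above).

import re

from scrapy.contrib.linkextractors.lxmlhtml import LxmlParserLinkExtractor


class FilteringLxmlLinkExtractor(LxmlParserLinkExtractor):
    """LxmlParserLinkExtractor plus simple allow/deny URL filtering.

    Illustrative sketch only; not part of Scrapy.
    """

    def __init__(self, allow=(), deny=(), **kwargs):
        LxmlParserLinkExtractor.__init__(self, **kwargs)
        # Accept a single pattern or an iterable of patterns,
        # mirroring how SgmlLinkExtractor takes allow/deny.
        if isinstance(allow, basestring):
            allow = (allow,)
        if isinstance(deny, basestring):
            deny = (deny,)
        self.allow_res = [re.compile(p) for p in allow]
        self.deny_res = [re.compile(p) for p in deny]

    def _url_allowed(self, url):
        # Keep a URL only if it matches some "allow" pattern (when any are
        # given) and matches no "deny" pattern.
        if self.allow_res and not any(r.search(url) for r in self.allow_res):
            return False
        if self.deny_res and any(r.search(url) for r in self.deny_res):
            return False
        return True

    def extract_links(self, response):
        links = LxmlParserLinkExtractor.extract_links(self, response)
        return [link for link in links if self._url_allowed(link.url)]

You could then drop it into your rule in place of SgmlLinkExtractor, e.g.:

Rule(FilteringLxmlLinkExtractor(allow=(r'.*',),
     deny=(r'\/corporate\/', r'\/stores\/', r'\/jobs\/', r'\/factory-scoop\/')),
     callback='parse_item', follow=True)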
