Hi,

Indeed, SgmlLinkExtractor chokes on these comments.

The alternative link extractors below do parse the links from your example page,
but they don't take allow or deny parameters:

>>> from scrapy.contrib.linkextractors.htmlparser import HtmlParserLinkExtractor
>>> len(HtmlParserLinkExtractor().extract_links(response))
102

>>> from scrapy.contrib.linkextractors.lxmlhtml import LxmlParserLinkExtractor
>>> len(LxmlParserLinkExtractor().extract_links(response))
102
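
If you don't need the URL filtering, either of these should work directly in a
Rule as-is, since CrawlSpider only calls extract_links() on the extractor:

Rule(LxmlParserLinkExtractor(), callback='parse_item', follow=True)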


One solution is to subclass or wrap HtmlParserLinkExtractor or
LxmlParserLinkExtractor and add "allow"/"deny" filtering yourself, similar to
what SgmlLinkExtractor does on top of BaseSgmlLinkExtractor
(see
https://github.com/scrapy/scrapy/blob/master/scrapy/contrib/linkextractors/sgml.py#L94
), so you can keep using it in your Rule.
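
Something like the minimal sketch below (the FilteringLxmlLinkExtractor name is
mine; it only wraps LxmlParserLinkExtractor and filters the extracted URLs with
regexes, it doesn't reproduce everything SgmlLinkExtractor does):

import re
from scrapy.contrib.linkextractors.lxmlhtml import LxmlParserLinkExtractor

class FilteringLxmlLinkExtractor(object):
    """Wrap LxmlParserLinkExtractor and filter its links with
    allow/deny regexes, roughly like SgmlLinkExtractor does."""

    def __init__(self, allow=(), deny=()):
        self.allow_res = [re.compile(p) for p in self._as_list(allow)]
        self.deny_res = [re.compile(p) for p in self._as_list(deny)]
        self._extractor = LxmlParserLinkExtractor()

    @staticmethod
    def _as_list(value):
        # accept a single pattern or an iterable of patterns
        return [value] if isinstance(value, basestring) else list(value)

    def extract_links(self, response):
        links = self._extractor.extract_links(response)
        if self.allow_res:
            links = [l for l in links
                     if any(r.search(l.url) for r in self.allow_res)]
        if self.deny_res:
            links = [l for l in links
                     if not any(r.search(l.url) for r in self.deny_res)]
        return links

and then in your spider:

Rule(FilteringLxmlLinkExtractor(allow=(r'.*',),
     deny=(r'\/corporate\/', r'\/stores\/', r'\/jobs\/', r'\/factory-scoop\/')),
     callback='parse_item', follow=True)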

Hope this helps.

/Paul.

On Tuesday, January 7, 2014 10:15:21 AM UTC+1, Kimble wrote:
>
> I have a problem with Scrapy choking on some HTML comments in pages. The 
> encoding on the page itself is not great: the comment it is choking on 
> starts with <!--, but one of the dashes is apparently encoded with some sort 
> of Word encoding. When running the page through the CrawlSpider, Scrapy 
> throws an exception like:
>
>   File "/usr/lib/pymodules/python2.7/scrapy/contrib/linkextractors/sgml.py", line 29, in _extract_links
>     self.feed(response_text)
>   File "/usr/lib/python2.7/sgmllib.py", line 104, in feed
>     self.goahead(0)
>   File "/usr/lib/python2.7/sgmllib.py", line 174, in goahead
>     k = self.parse_declaration(i)
>   File "/usr/lib/python2.7/markupbase.py", line 98, in parse_declaration
>     decltype, j = self._scan_name(j, i)
>   File "/usr/lib/python2.7/markupbase.py", line 392, in _scan_name
>     % rawdata[declstartpos:declstartpos+20])
>   File "/usr/lib/python2.7/sgmllib.py", line 111, in error
>     raise SGMLParseError(message)
> sgmllib.SGMLParseError: expected name token at '<!\xe2\x80\x94-0QzVNFtk[88X5m'
>
>
> The page being crawled is:
>
> http://www.jbhifi.com.au/pro-dj/samson/studio-gt-pro-pack-sku-86905/
>
> and the SgmlLinkExtractor rule being used is:
>
> Rule(SgmlLinkExtractor(allow=(r'.*'),
>      deny=(r'\/corporate\/', r'\/stores\/', r'\/jobs\/', r'\/factory-scoop\/')),
>      callback='parse_item', follow=True)
>
> Is there any way to prevent Scrapy from skipping these pages entirely, and 
> to have it continue on tag errors? It's not as if the whole page fails to 
> parse; there are end tags. However, the error seems to be treated as fatal 
> and the page is skipped.
>
