I would recommend not using the CrawlSpider class if you're not using the Rule functionality. Just use a normal scrapy.Spider and override the parse() method like you said (then, obviously, you have to build the logic to identify the links to follow yourself).
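Since "build the logic to identify the links to follow" is the part you'd have to write yourself, here is a minimal stand-alone sketch of that logic using only the Python standard library (the LinkCollector and extract_links names are made up for illustration; they are not Scrapy API — Scrapy's own LinkExtractor does this for you when it works):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects href attributes from <a> tags, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


# Inside an overridden parse() on a plain scrapy.Spider, the same idea
# would look roughly like this (untested sketch, not the Scrapy-blessed way):
#
#     def parse(self, response):
#         yield {"url": response.url}  # an item -> goes to the item pipeline
#         for url in extract_links(response.text, response.url):
#             yield scrapy.Request(url, callback=self.parse)  # -> request queue
```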
When you're writing your parse() function, one neat thing: you can yield items, which get processed through the item pipeline, or you can yield requests, which get added to the request queue.

On Wed, Nov 5, 2014 at 7:02 AM, Michele Coscia <[email protected]> wrote:
> Ok, but where? In the CrawlSpider? Should I basically override the
> parse() function? Can I still use my rule in there, and if so, how?
> Thanks!
> Michele C
>
> On Wednesday, November 5, 2014 at 09:55:28 UTC-5, Aru Sahni wrote:
>>
>> You can just invoke BeautifulSoup as one normally would, and not use
>> Scrapy's built-in functionality.
>>
>> ~A
>>
>> On Wed, Nov 5, 2014 at 9:51 AM, Michele Coscia <[email protected]> wrote:
>>
>>> Bingo, that's it, you are great.
>>> So it is what comes out of Selector(response) that is the problem,
>>> because response contains the entire malformed HTML (as it should).
>>>
>>> I tried a little test, feeding the malformed HTML to BeautifulSoup:
>>> the lxml parser still fails, but html5lib parses it correctly. So the
>>> question is: how do I use html5lib's parser instead of lxml in Scrapy?
>>> The documentation
>>> <http://doc.scrapy.org/en/latest/faq.html#how-does-scrapy-compare-to-beautifulsoup-or-lxml>
>>> tells me that "you can easily use BeautifulSoup
>>> <http://www.crummy.com/software/BeautifulSoup/> (or lxml
>>> <http://lxml.de/>) instead", but it doesn't say how :-)
>>>
>>> Finally, I'd dare to say that this is a bug and it should be reported
>>> as such: if any browser and html5lib can parse the page, then so
>>> should Scrapy. Do you think I should submit it on the GitHub page?
>>>
>>> Thanks, you have already been very helpful!
>>> Michele C
>>>
>>> On Wednesday, November 5, 2014 at 06:20:26 UTC-5, Rocío Aramberri wrote:
>>>>
>>>> Hi Michele,
>>>>
>>>> I've been investigating your problem further, and it looks like the
>>>> HTML at http://www.mass.gov/eea/agencies/dfg/der/ is malformed.
>>>> You can see here what part of the HTML is actually reaching
>>>> extract_links: http://pastebin.com/6kTT5Amt (there is an </html> at
>>>> the end of it). This page has 4 html definitions.
>>>>
>>>> Hope this helps,
>>>> Kind regards,
>>>> Rocio
>>>>
>>>> On Tue, Nov 4, 2014 at 8:53:36 PM, Michele Coscia <[email protected]> wrote:
>>>>
>>>>> By doing some debugging in ipdb I found out that the extract_links
>>>>> function in the LxmlLinkExtractor class is not getting the same data
>>>>> I see in the scrapy shell. While in the scrapy shell I see the
>>>>> correct data inside the <body> tag, when I look at the html variable
>>>>> in extract_links I see:
>>>>>
>>>>> \r\n\t\t<a id="top"></a>\r\n\t\t\t<!-- alert content here -->\t\t\t
>>>>>
>>>>> I *know* that both the scrapy shell and my script are getting the
>>>>> very same data from the server (I checked with Wireshark). So
>>>>> somewhere between the fetching of the data and the extract_links
>>>>> function, the content of the body disappears.
>>>>>
>>>>> Can someone with knowledge of the source code tell me which function
>>>>> calls LxmlLinkExtractor's extract_links?
>>>>>
>>>>> Thanks!
>>>>> Michele C
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "scrapy-users" group.
>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>> send an email to [email protected].
>>>>> To post to this group, send email to [email protected].
>>>>> Visit this group at http://groups.google.com/group/scrapy-users.
>>>>> For more options, visit https://groups.google.com/d/optout.
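As a footnote on the diagnosis in the thread: the strict-versus-tolerant parser difference can be reproduced with nothing but the standard library. The stdlib html.parser tokenizer below is used purely as a stand-in for html5lib's recovery behaviour, and the malformed snippet is invented for illustration — a tolerant tokenizer keeps emitting tags after a premature </html>, so links that a strict tree-builder might drop are still visible:

```python
from html.parser import HTMLParser


class AnchorCounter(HTMLParser):
    """Counts <a> tags seen before and after the first </html> end tag."""

    def __init__(self):
        super().__init__()
        self.before = 0
        self.after = 0
        self._html_closed = False

    def handle_endtag(self, tag):
        if tag == "html":
            self._html_closed = True

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            if self._html_closed:
                self.after += 1
            else:
                self.before += 1


# A tiny invented stand-in for the mass.gov page: a stray </html>
# appears long before the real end of the document.
malformed = (
    '<html><body><a href="/top">top</a></body></html>'
    '<html><body><a href="/one">1</a><a href="/two">2</a></body></html>'
)

counter = AnchorCounter()
counter.feed(malformed)
# The tokenizer still sees the two links after the bogus </html>.

# With BeautifulSoup and html5lib installed, the tolerant parse asked
# about in the thread would be: BeautifulSoup(page_text, "html5lib")
```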
