Alternatively, how do I control a normal Spider from a Python script? In the docs there is only a script for controlling a CrawlSpider. If I try spider.start_requests(), nothing happens.

Thanks!
Michele C
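A minimal sketch of driving any spider (plain Spider or CrawlSpider) from a script, assuming a Scrapy version that exposes CrawlerProcess; the project module and spider class below are placeholders, not names from this thread. Calling spider.start_requests() by hand only builds Request objects, so nothing is downloaded unless the crawler engine is actually running, which is what CrawlerProcess sets up:

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    # Placeholder import: replace with your own scrapy.Spider subclass.
    from myproject.spiders.myspider import MySpider

    process = CrawlerProcess(get_project_settings())
    # Extra keyword arguments are forwarded to the spider's __init__.
    process.crawl(MySpider, category="der")
    process.start()  # starts the reactor and blocks until the crawl finishes

Because the crawl() keyword arguments go straight to __init__, arbitrary Python objects can be handed to the spider, not just the strings that "scrapy crawl xxx -a name=value" allows.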
On Wednesday, November 5, 2014 at 10:48:31 UTC-5, Michele Coscia wrote:

What if I do need to use CrawlSpider? After all, I *was* using the Rule functionality, plus several other things that are needed in my script and that I cannot pass as simple arguments to "scrapy crawl xxx".
Thanks!
Michele C

On Wednesday, November 5, 2014 at 10:15:31 UTC-5, Travis Leleu wrote:

I would recommend not using the CrawlSpider class if you're not using the Rule functionality. Just use a normal scrapy.Spider and override the parse() method like you said (then obviously you have to build the logic to identify the links to follow).

When you're writing your parse function, one neat thing: you can yield items, and they get processed through the item pipeline, or you can yield requests, and they get added to the request queue.

On Wed, Nov 5, 2014 at 7:02 AM, Michele Coscia wrote:

Ok, but where? In the CrawlSpider? Should I basically override the parse() function? Can I still use my rule in there, and if so, how?
Thanks!
Michele C

On Wednesday, November 5, 2014 at 9:55:28 UTC-5, Aru Sahni wrote:

You can just invoke BeautifulSoup as one normally would and not use Scrapy's built-in functionality.

~A

On Wed, Nov 5, 2014 at 9:51 AM, Michele Coscia wrote:

Bingo, that's it, you are great. So it is what comes out of Selector(response) that is the problem, because response contains the entire malformed html (as it should).

I tried a little test, feeding the malformed html to BeautifulSoup: the lxml parser still fails, while html5lib parses it correctly. So the question is: how do I use html5lib's parser instead of lxml in Scrapy? The documentation (http://doc.scrapy.org/en/latest/faq.html#how-does-scrapy-compare-to-beautifulsoup-or-lxml) tells me that "you can easily use BeautifulSoup (http://www.crummy.com/software/BeautifulSoup/) or lxml (http://lxml.de/) instead", but it doesn't say how :-)

Finally, I'd dare to say that this is a bug and it should be reported as such. If any browser and html5lib can parse the page, then so should Scrapy. Do you think I should submit it on the GitHub page?

Thanks, you have been already very helpful!
Michele C
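A minimal sketch of how Travis's and Aru's suggestions could fit together: a plain scrapy.Spider whose parse() hands the raw body to BeautifulSoup's html5lib tree builder instead of Scrapy's Selector, then yields both items and follow-up requests. It assumes beautifulsoup4 and html5lib are installed and a reasonably recent Scrapy (plain dicts accepted as items, response.urljoin available); the spider name and item fields are illustrative, not taken from the thread:

    import scrapy
    from bs4 import BeautifulSoup  # pip install beautifulsoup4 html5lib

    class DerSpider(scrapy.Spider):
        name = "der"
        allowed_domains = ["mass.gov"]
        start_urls = ["http://www.mass.gov/eea/agencies/dfg/der/"]

        def parse(self, response):
            # html5lib tolerates the malformed markup that trips up lxml.
            soup = BeautifulSoup(response.body, "html5lib")
            for a in soup.find_all("a", href=True):
                url = response.urljoin(a["href"])
                # Yielded items go through the item pipeline...
                yield {"url": url, "text": a.get_text(strip=True)}
                # ...and yielded Requests are added to the request queue.
                yield scrapy.Request(url, callback=self.parse)

On older Scrapy versions the same idea applies, but you would yield an Item subclass instead of a dict and build absolute URLs with urlparse.urljoin(response.url, href).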
On Wednesday, November 5, 2014 at 06:20:26 UTC-5, Rocío Aramberri wrote:

Hi Michele,

I've been investigating your problem further and it looks like the html in http://www.mass.gov/eea/agencies/dfg/der/ is malformed. You can see here what part of the html is actually reaching extract_links: http://pastebin.com/6kTT5Amt (there is an </html> at the end of it). This page has 4 html definitions.

Hope this helps,
Kind Regards,
Rocio

On Tue, Nov 4, 2014 at 8:53:36 PM, Michele Coscia wrote:

By doing some debugging in ipdb I found out that the extract_links function in the class LxmlLinkExtractor is not getting the same data I see in the scrapy shell. While in the scrapy shell I see the correct data inside the <body> tag, when I look at the html variable in extract_links I see:

\r\n\t\t<a id="top"></a>\r\n\t\t\t<!-- alert content here -->\t\t\t

I *know* that both the scrapy shell and my script are getting the very same data from the server (checked with Wireshark). So somewhere between the fetching of the data and the extract_links function, the content of the body disappears.

Can someone with knowledge of the source code tell me which function calls LxmlLinkExtractor's extract_links?

Thanks!
Michele C
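As far as I can tell from the Scrapy source, the caller is CrawlSpider._requests_to_follow(), which CrawlSpider.parse() reaches via _parse_response() and which runs each Rule's link_extractor.extract_links(response). A small debugging sketch for seeing exactly what the extractor receives, assuming the Scrapy 0.24-era import path scrapy.contrib.linkextractors.lxmlhtml (newer releases move it to scrapy.linkextractors.lxmlhtml):

    from scrapy.contrib.linkextractors.lxmlhtml import LxmlLinkExtractor

    class LoggingLinkExtractor(LxmlLinkExtractor):
        def extract_links(self, response):
            # Dump what the extractor actually receives so it can be compared
            # with the body shown by `scrapy shell` for the same URL.
            print("extract_links got %d bytes for %s" % (len(response.body), response.url))
            links = super(LoggingLinkExtractor, self).extract_links(response)
            print("extracted %d links" % len(links))
            return links

Using it is just a matter of swapping it into the spider's rule, e.g. Rule(LoggingLinkExtractor(), follow=True), and comparing the dumped body size with what the shell reports for the same page.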
