Alternatively, how do I control a normal Spider from a Python script? In the docs there is only a script for controlling a CrawlSpider. If I try spider.start_requests(), nothing happens.

Thanks!
Michele C
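A minimal sketch of driving any spider (plain Spider or CrawlSpider) from a script, assuming a Scrapy version that exposes CrawlerProcess; the project module and spider class below are placeholders, not names from this thread. Calling spider.start_requests() by hand only builds Request objects, so nothing is downloaded unless the crawler engine is actually running, which is what CrawlerProcess sets up:

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    # Placeholder import: replace with your own scrapy.Spider subclass.
    from myproject.spiders.myspider import MySpider

    process = CrawlerProcess(get_project_settings())
    # Extra keyword arguments are forwarded to the spider's __init__.
    process.crawl(MySpider, category="der")
    process.start()  # starts the reactor and blocks until the crawl finishes

Because the crawl() keyword arguments go straight to __init__, arbitrary Python objects can be handed to the spider, not just the strings that "scrapy crawl xxx -a name=value" allows.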
On Wednesday, November 5, 2014 at 10:48:31 UTC-5, Michele Coscia wrote:

What if I do need to use CrawlSpider? After all, I *was* using the Rule functionality, plus several other things that are needed in my script and that I cannot pass as simple arguments to "scrapy crawl xxx".
Thanks!
Michele C

On Wednesday, November 5, 2014 at 10:15:31 UTC-5, Travis Leleu wrote:

I would recommend not using the CrawlSpider class if you're not using the Rule functionality. Just use a normal scrapy.Spider and override the parse() method like you said (then obviously you have to build the logic to identify the links to follow).

When you're writing your parse function, one neat thing: you can yield items, and they get processed through the item pipeline, or you can yield requests, and they get added to the request queue.

On Wed, Nov 5, 2014 at 7:02 AM, Michele Coscia wrote:

Ok, but where? In the CrawlSpider? Should I basically override the parse() function? Can I still use my rule in there, and if so, how?
Thanks!
Michele C

On Wednesday, November 5, 2014 at 9:55:28 UTC-5, Aru Sahni wrote:

You can just invoke BeautifulSoup as one normally would and not use Scrapy's built-in functionality.

~A

On Wed, Nov 5, 2014 at 9:51 AM, Michele Coscia wrote:

Bingo, that's it, you are great. So it is what comes out of Selector(response) that is the problem, because response contains the entire malformed html (as it should).

I tried a little test, feeding the malformed html to BeautifulSoup: the lxml parser still fails, while html5lib parses it correctly. So the question is: how do I use html5lib's parser instead of lxml in Scrapy? The documentation (http://doc.scrapy.org/en/latest/faq.html#how-does-scrapy-compare-to-beautifulsoup-or-lxml) tells me that "you can easily use BeautifulSoup (http://www.crummy.com/software/BeautifulSoup/) or lxml (http://lxml.de/) instead", but it doesn't say how :-)

Finally, I'd dare to say that this is a bug and it should be reported as such. If any browser and html5lib can parse the page, then so should Scrapy. Do you think I should submit it on the GitHub page?

Thanks, you have been already very helpful!
Michele C
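A minimal sketch of how Travis's and Aru's suggestions could fit together: a plain scrapy.Spider whose parse() hands the raw body to BeautifulSoup's html5lib tree builder instead of Scrapy's Selector, then yields both items and follow-up requests. It assumes beautifulsoup4 and html5lib are installed and a reasonably recent Scrapy (plain dicts accepted as items, response.urljoin available); the spider name and item fields are illustrative, not taken from the thread:

    import scrapy
    from bs4 import BeautifulSoup  # pip install beautifulsoup4 html5lib

    class DerSpider(scrapy.Spider):
        name = "der"
        allowed_domains = ["mass.gov"]
        start_urls = ["http://www.mass.gov/eea/agencies/dfg/der/"]

        def parse(self, response):
            # html5lib tolerates the malformed markup that trips up lxml.
            soup = BeautifulSoup(response.body, "html5lib")
            for a in soup.find_all("a", href=True):
                url = response.urljoin(a["href"])
                # Yielded items go through the item pipeline...
                yield {"url": url, "text": a.get_text(strip=True)}
                # ...and yielded Requests are added to the request queue.
                yield scrapy.Request(url, callback=self.parse)

On older Scrapy versions the same idea applies, but you would yield an Item subclass instead of a dict and build absolute URLs with urlparse.urljoin(response.url, href).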
On Wednesday, November 5, 2014 at 06:20:26 UTC-5, Rocío Aramberri wrote:

Hi Michele,

I've been investigating your problem further and it looks like the html in http://www.mass.gov/eea/agencies/dfg/der/ is malformed. You can see here what part of the html is actually reaching extract_links: http://pastebin.com/6kTT5Amt (there is an </html> at the end of it). This page has 4 html definitions.

Hope this helps,
Kind Regards,
Rocio

On Tue, Nov 4, 2014 at 8:53:36 PM, Michele Coscia wrote:

By doing some debugging in ipdb I found out that the extract_links function in the class LxmlLinkExtractor is not getting the same data I see in the scrapy shell. While in the scrapy shell I see the correct data inside the <body> tag, when I look at the html variable in extract_links I see:

\r\n\t\t<a id="top"></a>\r\n\t\t\t<!-- alert content here -->\t\t\t

I *know* that both the scrapy shell and my script are getting the very same data from the server (checked with Wireshark). So somewhere between the fetching of the data and the extract_links function, the content of the body disappears.

Can someone with knowledge of the source code tell me which function calls LxmlLinkExtractor's extract_links?

Thanks!
Michele C
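As far as I can tell from the Scrapy source, the caller is CrawlSpider._requests_to_follow(), which CrawlSpider.parse() reaches via _parse_response() and which runs each Rule's link_extractor.extract_links(response). A small debugging sketch for seeing exactly what the extractor receives, assuming the Scrapy 0.24-era import path scrapy.contrib.linkextractors.lxmlhtml (newer releases move it to scrapy.linkextractors.lxmlhtml):

    from scrapy.contrib.linkextractors.lxmlhtml import LxmlLinkExtractor

    class LoggingLinkExtractor(LxmlLinkExtractor):
        def extract_links(self, response):
            # Dump what the extractor actually receives so it can be compared
            # with the body shown by `scrapy shell` for the same URL.
            print("extract_links got %d bytes for %s" % (len(response.body), response.url))
            links = super(LoggingLinkExtractor, self).extract_links(response)
            print("extracted %d links" % len(links))
            return links

Using it is just a matter of swapping it into the spider's rule, e.g. Rule(LoggingLinkExtractor(), follow=True), and comparing the dumped body size with what the shell reports for the same page.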
