All,
Just writing a note in case this helps anyone else: I managed to get it
working with the following code. The problem must have been my Rules with an
empty allow(), which were not working:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.item import Item


class MySpider(CrawlSpider):
    name = 'linux.com'
    allowed_domains = ['linux.com']
    start_urls = ['http://www.linux.com']

    rules = [
        Rule(SgmlLinkExtractor(allow=('.+',)), follow=True,
             callback='parse_item', process_links='process_links'),
        Rule(SgmlLinkExtractor(allow=('.+',)),
             callback='parse_item', process_links='process_links'),
    ]

    def process_links(self, links):
        spiderList = []
        for link in links:
            print 'Testing link: ', link.url
            # modify the link however you like here...
            spiderList.append(link)
        return spiderList

    def parse_item(self, response):
        # minimal stub so the 'parse_item' callbacks above resolve;
        # replace with your own item extraction
        pass
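
For anyone who finds this later: the loop above only prints and re-appends the
links, but the same hook is where you would actually rewrite the URLs. Here is
a minimal sketch, assuming the goal is to force plain http links to https (the
rewrite is only an illustration, substitute whatever modification you need).
It also stores the URLs on the spider so the rest of your Python code can pick
them up after the crawl; collected_urls is just a name made up for this sketch:

    def process_links(self, links):
        # Link objects keep their URL in a plain .url attribute, so it can
        # be rewritten in place before the list goes back to the crawler.
        if not hasattr(self, 'collected_urls'):
            self.collected_urls = []
        for link in links:
            if link.url.startswith('http://'):
                # illustrative rewrite only; swap in your own logic
                link.url = 'https://' + link.url[len('http://'):]
            self.collected_urls.append(link.url)
        return links

After the crawl, spider.collected_urls would hold every URL the extractor
produced, which seems like the easiest way to hand them to the rest of your
analysis code.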
On Sunday, 16 March 2014 17:54:53 UTC+11, Paul P wrote:
>
> Hello All,
> I have been reading the scrapy documentation and mailing lists but
> cannot find an example that works. I don't find the documentation very
> helpful on using process_links().
>
> All I need to do is analyse each URL as it is processed and make a
> modification to it (in certain circumstances) before passing it back to
> scrapy for spidering.
>
> As a test, I would just like to print out each URL as it is being
> processed, but I cannot even get that to work. Example code is below, which
> I am calling with "scrapy runspider test.py" (or should I be calling it
> differently?). My goal is to create a list of URLs which can be passed to
> the rest of my python code for analysis.
>
> from scrapy.item import Item
> from scrapy.contrib.spiders import CrawlSpider, Rule
> from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
> from scrapy.selector import Selector
>
> class Demo(CrawlSpider):
>     name = ['www.linux.com']
>     allowed_domains = 'www.linux.com'
>     start_urls = ['http://www.linux.com']
>
>     rules = (
>         Rule(SgmlLinkExtractor(allow=('')),
>              process_links='process_links', follow=True),
>     )
>
>     def process_links(self, links):
>         for link in links:
>             # I just want to print out each URL as it is processed for now
>             print 'link: ', link
>         return links
>
>
> Thank you!
> Paul.
>