Hi.
I'm trying to run what I believe is a basic scraping task, but it's not working out: no items are loaded or saved.
My start_url is a search results page with the following structure:
# ------------------------------------------ start_url (page_1) structure ---------------
product_1
product_2
...
product_n
link_to_page_1
link_to_page_2
# ------------------------------------------ start_url (page_1) structure ---------------
This particular search has one additional page (two in total), but in general the pagination links look like this:
link_to_page_1
link_to_page_2
...
link_to_page_5
link_to_next_set_of_5_or_less
link_to_last_set_of_5_or_less
Each product has its own URL, and I'm interested in the details found on each of those product pages.
I created the Scrapy project and wrote the items.py file:
# --------------------------------------- Begin items.py ----------------------
from scrapy.item import Item, Field


class MedicinesSearchItem(Item):
    # Primary fields
    name = Field()
    name_url = Field()
    image_url = Field()
    availability = Field()
    # Calculated fields
    images = Field()
    # Housekeeping fields
    url = Field()
    project = Field()
    spider = Field()
    server = Field()
    date = Field()


class MedicinesItem(Item):
    # Primary fields
    name = Field()
    image_url = Field()
    stores = Field()
    availabilities = Field()
    update_times = Field()
    update_dates = Field()
    presentation_and_component = Field()
    #active_component = Field()
    manufacturer = Field()
    # Calculated fields
    images = Field()
    # Housekeeping fields
    url = Field()
    project = Field()
    spider = Field()
    server = Field()
    date = Field()
# --------------------------------------- End items.py ----------------------
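Just to rule out the item definitions themselves, a quick Python REPL check behaves as I'd expect (the sample value here is made up; output paraphrased):

$ python
>>> from medicines.items import MedicinesItem
>>> item = MedicinesItem()
>>> item['name'] = [u'Ranitidina']  # hypothetical sample value
>>> item
{'name': [u'Ranitidina']}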
and the spider file:
# --------------------------------------- Begin farmaplus3.py ----------------------
from scrapy.loader import ItemLoader
from medicines.items import MedicinesItem
import scrapy


class Farmaplus3Spider(scrapy.Spider):
    name = "farmaplus3"
    allowed_domains = ["web"]
    # Start on a medicine search page
    start_urls = (
        'http://farmaplus.com.ve/catalog.do?page=1&offSet=0&comboCategorySelected=1&op=requestSearch&searchBox=ranitidina&go=',
    )

    def parse(self, response):
        # Follow each pagination link on the search results page
        for next_page in response.xpath(
                '//*[@class="pageBarTableStyle"]//a/@href'):
            url = response.urljoin(next_page.extract())
            yield scrapy.Request(url, self.parse_medicines_results)

    def parse_medicines_results(self, response):
        # Follow each product link on a result page
        for next_medicine in response.xpath(
                '//*[@class="productNameThumbnail"]/@href'):
            url = response.urljoin(next_medicine.extract())
            yield scrapy.Request(url, self.parse_medicines)

    def parse_medicines(self, response):
        # Create the medicine item
        item = MedicinesItem()
        # Load fields using XPath expressions
        item['name'] = response.xpath(
            '//*[@class="productTitleDetail"]/text()').extract()
        item['image_url'] = response.xpath(
            '//*[@class="productImageDetail"]/@src').extract()
        item['stores'] = response.xpath(
            '//*[@id="branchDetailTable"]//td[@class="middleFirst"]/a/text()').extract()
        yield item
# --------------------------------------- End farmaplus3.py ----------------------
However, running
$ scrapy crawl farmaplus3 -o items.json
produces the following output:
2016-08-18 21:03:11 [scrapy] INFO: Scrapy 1.1.1 started (bot: medicines)
2016-08-18 21:03:11 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'medicines.spiders', 'FEED_URI': 'items.json', 'SPIDER_MODULES': ['medicines.spiders'], 'BOT_NAME': 'medicines', 'ROBOTSTXT_OBEY': True, 'FEED_FORMAT': 'json'}
2016-08-18 21:03:11 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2016-08-18 21:03:11 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-08-18 21:03:11 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-08-18 21:03:11 [scrapy] INFO: Enabled item pipelines:
[]
2016-08-18 21:03:11 [scrapy] INFO: Spider opened
2016-08-18 21:03:11 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-08-18 21:03:11 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-08-18 21:03:19 [scrapy] DEBUG: Redirecting (302) to <GET http://farmaplus.com.ve/users/farmaplus.com.ve/staticResources/robots.txt> from <GET http://farmaplus.com.ve/robots.txt>
2016-08-18 21:03:20 [scrapy] DEBUG: Crawled (404) <GET http://farmaplus.com.ve/users/farmaplus.com.ve/staticResources/robots.txt> (referer: None)
2016-08-18 21:03:23 [scrapy] DEBUG: Crawled (200) <GET http://farmaplus.com.ve/catalog.do?page=1&offSet=0&comboCategorySelected=1&op=requestSearch&searchBox=ranitidina&go=> (referer: None)
2016-08-18 21:03:23 [scrapy] DEBUG: Filtered offsite request to 'farmaplus.com.ve': <GET http://farmaplus.com.ve/catalog.do?op=requestPage&selectedPage=1&category=1&offSet=0&page=1&searchBox=ranitidina>
2016-08-18 21:03:23 [scrapy] INFO: Closing spider (finished)
2016-08-18 21:03:23 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 793,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 3,
'downloader/response_bytes': 21242,
'downloader/response_count': 3,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/302': 1,
'downloader/response_status_count/404': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 8, 19, 1, 3, 23, 498723),
'log_count/DEBUG': 5,
'log_count/INFO': 7,
'offsite/domains': 1,
'offsite/filtered': 2,
'request_depth_max': 1,
'response_received_count': 2,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2016, 8, 19, 1, 3, 11, 510607)}
2016-08-18 21:03:23 [scrapy] INFO: Spider closed (finished)
My project tree is:
.
├── medicines
│   ├── __init__.py
│   ├── __init__.pyc
│   ├── items.py
│   ├── items.pyc
│   ├── pipelines.py
│   ├── settings.py
│   ├── settings.pyc
│   └── spiders
│       ├── basic2.py
│       ├── basic2.pyc
│       ├── basic.py
│       ├── basic.pyc
│       ├── farmaplus2.py
│       ├── farmaplus2.pyc
│       ├── farmaplus3.py
│       ├── farmaplus3.pyc
│       ├── farmaplus.py
│       ├── farmaplus.pyc
│       ├── __init__.py
│       └── __init__.pyc
└── scrapy.cfg
The strategy I'm trying to implement is:
i) collect the next-page URLs (parse),
ii) from each of those pages, collect the product URLs (parse_medicines_results), and
iii) from each product URL, extract the details to fill in the item fields (parse_medicines).
When I do this "manually" with scrapy shell, everything seems to work: the selectors match, the URLs look right, and I can load the item fields.
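For reference, the shell session goes roughly like this (paraphrased from memory; the hrefs and extracted values are elided here, not copied verbatim):

$ scrapy shell 'http://farmaplus.com.ve/catalog.do?page=1&offSet=0&comboCategorySelected=1&op=requestSearch&searchBox=ranitidina&go='
...
>>> # pagination links on the search results page -- returns a non-empty list
>>> response.xpath('//*[@class="pageBarTableStyle"]//a/@href').extract()
[u'catalog.do?op=requestPage&selectedPage=1&...', u'catalog.do?op=requestPage&selectedPage=2&...']
>>> # product links on the same page -- also non-empty
>>> response.xpath('//*[@class="productNameThumbnail"]/@href').extract()
[u'...', u'...']
>>> # fetch the first product page and try one of the detail selectors
>>> fetch(response.urljoin(response.xpath('//*[@class="productNameThumbnail"]/@href').extract()[0]))
>>> response.xpath('//*[@class="productTitleDetail"]/text()').extract()
[u'...']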
Running the spider, however, doesn't give the expected results: no items at all. The only line in the log that stands out to me is the "Filtered offsite request" DEBUG message (and 'offsite/filtered': 2 in the stats), but I don't see why a request to farmaplus.com.ve would count as offsite.
What am I missing? Why do I end up without any items?