Hi.
I'm trying to run what I believe is a basic scraping task, but it's not working out: no items are loaded or saved.
My start_url is a search results page with the following structure:
# ------------------------------------------ start_url (page_1) structure ---------------
product_1
product_2
...
product_n
link_to_page_1
link_to_page_2
# ------------------------------------------ start_url (page_1) structure ---------------
This particular search has one additional page (two in total), but in general the pagination links look like this:
link_to_page_1
link_to_page_2
...
link_to_page_5
link_to_next_set_of_5_or_less
link_to_last_set_of_5_or_less
Each product has its own URL, and I'm interested in the details found on each of those product pages.
I created the Scrapy project and wrote the items.py file:
# --------------------------------------- Begin items.py ----------------------
from scrapy.item import Item, Field


class MedicinesSearchItem(Item):
    # Primary fields
    name = Field()
    name_url = Field()
    image_url = Field()
    availability = Field()
    # Calculated fields
    images = Field()
    # Housekeeping fields
    url = Field()
    project = Field()
    spider = Field()
    server = Field()
    date = Field()


class MedicinesItem(Item):
    # Primary fields
    name = Field()
    image_url = Field()
    stores = Field()
    availabilities = Field()
    update_times = Field()
    update_dates = Field()
    presentation_and_component = Field()
    #active_component = Field()
    manufacturer = Field()
    # Calculated fields
    images = Field()
    # Housekeeping fields
    url = Field()
    project = Field()
    spider = Field()
    server = Field()
    date = Field()
# --------------------------------------- End items.py ----------------------
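Just to rule out the item definitions themselves, a quick Python REPL check behaves as I'd expect (the sample value here is made up; output paraphrased):

$ python
>>> from medicines.items import MedicinesItem
>>> item = MedicinesItem()
>>> item['name'] = [u'Ranitidina']  # hypothetical sample value
>>> item
{'name': [u'Ranitidina']}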
and the spider file:
# --------------------------------------- Begin farmaplus3.py ----------------------
from scrapy.loader import ItemLoader
from medicines.items import MedicinesItem
import scrapy


class Farmaplus3Spider(scrapy.Spider):
    name = "farmaplus3"
    allowed_domains = ["web"]
    # Start on a medicine search page
    start_urls = (
        'http://farmaplus.com.ve/catalog.do?page=1&offSet=0&comboCategorySelected=1&op=requestSearch&searchBox=ranitidina&go=',
    )

    def parse(self, response):
        # Follow each pagination link on the search results page
        for next_page in response.xpath(
                '//*[@class="pageBarTableStyle"]//a/@href'):
            url = response.urljoin(next_page.extract())
            yield scrapy.Request(url, self.parse_medicines_results)

    def parse_medicines_results(self, response):
        # Follow each product link on a result page
        for next_medicine in response.xpath(
                '//*[@class="productNameThumbnail"]/@href'):
            url = response.urljoin(next_medicine.extract())
            yield scrapy.Request(url, self.parse_medicines)

    def parse_medicines(self, response):
        # Create the medicine item
        item = MedicinesItem()
        # Load fields using XPath expressions
        item['name'] = response.xpath(
            '//*[@class="productTitleDetail"]/text()').extract()
        item['image_url'] = response.xpath(
            '//*[@class="productImageDetail"]/@src').extract()
        item['stores'] = response.xpath(
            '//*[@id="branchDetailTable"]//td[@class="middleFirst"]/a/text()').extract()
        yield item
# --------------------------------------- End farmaplus3.py ----------------------
However, running
$ scrapy crawl farmaplus3 -o items.json
produces the following output:
2016-08-18 21:03:11 [scrapy] INFO: Scrapy 1.1.1 started (bot: medicines)
2016-08-18 21:03:11 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'medicines.spiders', 'FEED_URI': 'items.json', 'SPIDER_MODULES': ['medicines.spiders'], 'BOT_NAME': 'medicines', 'ROBOTSTXT_OBEY': True, 'FEED_FORMAT': 'json'}
2016-08-18 21:03:11 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2016-08-18 21:03:11 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-08-18 21:03:11 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-08-18 21:03:11 [scrapy] INFO: Enabled item pipelines:
[]
2016-08-18 21:03:11 [scrapy] INFO: Spider opened
2016-08-18 21:03:11 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-08-18 21:03:11 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-08-18 21:03:19 [scrapy] DEBUG: Redirecting (302) to <GET http://farmaplus.com.ve/users/farmaplus.com.ve/staticResources/robots.txt> from <GET http://farmaplus.com.ve/robots.txt>
2016-08-18 21:03:20 [scrapy] DEBUG: Crawled (404) <GET http://farmaplus.com.ve/users/farmaplus.com.ve/staticResources/robots.txt> (referer: None)
2016-08-18 21:03:23 [scrapy] DEBUG: Crawled (200) <GET http://farmaplus.com.ve/catalog.do?page=1&offSet=0&comboCategorySelected=1&op=requestSearch&searchBox=ranitidina&go=> (referer: None)
2016-08-18 21:03:23 [scrapy] DEBUG: Filtered offsite request to 'farmaplus.com.ve': <GET http://farmaplus.com.ve/catalog.do?op=requestPage&selectedPage=1&category=1&offSet=0&page=1&searchBox=ranitidina>
2016-08-18 21:03:23 [scrapy] INFO: Closing spider (finished)
2016-08-18 21:03:23 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 793,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 3,
'downloader/response_bytes': 21242,
'downloader/response_count': 3,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/302': 1,
'downloader/response_status_count/404': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 8, 19, 1, 3, 23, 498723),
'log_count/DEBUG': 5,
'log_count/INFO': 7,
'offsite/domains': 1,
'offsite/filtered': 2,
'request_depth_max': 1,
'response_received_count': 2,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2016, 8, 19, 1, 3, 11, 510607)}
2016-08-18 21:03:23 [scrapy] INFO: Spider closed (finished)
My project tree is:
.
├── medicines
│   ├── __init__.py
│   ├── __init__.pyc
│   ├── items.py
│   ├── items.pyc
│   ├── pipelines.py
│   ├── settings.py
│   ├── settings.pyc
│   └── spiders
│       ├── basic2.py
│       ├── basic2.pyc
│       ├── basic.py
│       ├── basic.pyc
│       ├── farmaplus2.py
│       ├── farmaplus2.pyc
│       ├── farmaplus3.py
│       ├── farmaplus3.pyc
│       ├── farmaplus.py
│       ├── farmaplus.pyc
│       ├── __init__.py
│       └── __init__.pyc
└── scrapy.cfg
The strategy I'm trying to implement is:
i) collect the next-page URLs (parse),
ii) from each of those pages, collect the product URLs (parse_medicines_results), and
iii) from each product URL, extract the details to fill in the item fields (parse_medicines).
When I do this "manually" with scrapy shell, everything seems to work: the selectors match, the URLs look right, and I can load the item fields.
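For reference, the shell session goes roughly like this (paraphrased from memory; the hrefs and extracted values are elided here, not copied verbatim):

$ scrapy shell 'http://farmaplus.com.ve/catalog.do?page=1&offSet=0&comboCategorySelected=1&op=requestSearch&searchBox=ranitidina&go='
...
>>> # pagination links on the search results page -- returns a non-empty list
>>> response.xpath('//*[@class="pageBarTableStyle"]//a/@href').extract()
[u'catalog.do?op=requestPage&selectedPage=1&...', u'catalog.do?op=requestPage&selectedPage=2&...']
>>> # product links on the same page -- also non-empty
>>> response.xpath('//*[@class="productNameThumbnail"]/@href').extract()
[u'...', u'...']
>>> # fetch the first product page and try one of the detail selectors
>>> fetch(response.urljoin(response.xpath('//*[@class="productNameThumbnail"]/@href').extract()[0]))
>>> response.xpath('//*[@class="productTitleDetail"]/text()').extract()
[u'...']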
Running the spider, however, doesn't give the expected results: no items at all. The only line in the log that stands out to me is the "Filtered offsite request" DEBUG message (and 'offsite/filtered': 2 in the stats), but I don't see why a request to farmaplus.com.ve would count as offsite.
What am I missing? Why do I end up without any items?