Re: Items are not saved

Paul Tremberth Fri, 19 Aug 2016 02:02:30 -0700

Hello,

you have information in the stats at the end, and in the logs directly:


 'offsite/domains': 1,
 'offsite/filtered': 2,

and

DEBUG: Crawled (200) <GET http://farmaplus.com.ve/
catalog.do?page=1&offSet=0&comboCategorySelected=1&op=
requestSearch&searchBox=ranitidina&go=>
 (referer: None)
DEBUG: Filtered offsite request to 'farmaplus.com.ve': <GET http://
farmaplus.com.ve/catalog.do?op=requestPage&selectedPage=1&category=1&
offSet=0&page=1&searchBox=ranitidina>


Your spider is filtering requests to farmaplus.com.ve, which is the domain 
you're interested in.
This is because you haven't set your spider's allowed_domains attribute 
correctly.
It should be a list of domain names, eg.
allowed_domains = ["farmaplus.com.ve"]

This should unblock fetching pages from that website.
I haven't checked your callbacks so you may need to fix those to get items 
out.

Hope this helps.

Regards,
Paul.

On Friday, August 19, 2016 at 5:34:36 AM UTC+2, [email protected] wrote:
>
> Hi.
>
> I'm trying to execute, what I believe is, a basic scrape task. However, 
> it's not working out. No items are loaded/saved.
>
> My start_url is the result of a search and has the following structure:
>
> # ------------------------------------------ start_url (page_1) structure 
> ---------------
> product_1
> product_2
> ...
> product_n
>
> link_to_page_1
> link_to_page_2
> # ------------------------------------------ start_url (page_1) structure 
> ---------------
>
>
>
> This particular search result has one additional page (a total of 2), but 
> in general, the links to next pages are:
>
> link_to_page_1
> link_to_page_2
> ...
> link_to_page_5
> link_to_next_set_of_5_or_less
> link_to_last_set_of_5_or_less
>
>
>
> Each product has its url and I'm interested in the details of each product 
> found in those urls.
>
> I created the Scrapy project, created the item.py file:
>
> # --------------------------------------- Begin item.py 
> ----------------------
> from scrapy.item import Item, Field
>
>
> class MedicinesSearchItem(Item):
>     # Primary fields
>     name = Field()
>     name_url = Field()
>     image_url = Field()
>     availability = Field()
>
>     # Calculated fields
>     images = Field()
>
>     # Housekeeping fields
>     url = Field()
>     project = Field()
>     spider = Field()
>     server = Field()
>     date = Field()
>
>
> class MedicinesItem(Item):
>     # Primary fields
>     name = Field()
>     image_url = Field()
>     stores = Field()
>     availabilities = Field()
>     update_times = Field()
>     update_dates = Field()
>     presentation_and_component = Field()
>     #active_component = Field()
>     manufacturer = Field()
>
>     # Calculated fields
>     images = Field()
>
>     # Housekeeping fields
>     url = Field()
>     project = Field()
>     spider = Field()
>     server = Field()
>     date = Field()
> # --------------------------------------- End item.py 
> ----------------------
>
>
>
> and the spider file:
>
> # --------------------------------------- Begin farmaplus3.py 
> ----------------------
> from scrapy.loader import ItemLoader
> from medicines.items import MedicinesItem
> import scrapy
>
>
> class Farmaplus3Spider(scrapy.Spider):
>     name = "farmaplus3"
>     allowed_domains = ["web"]
>     # Start on a medicine search page
>     start_urls = (
>         '
> http://farmaplus.com.ve/catalog.do?page=1&offSet=0&comboCategorySelected=1&op=requestSearch&searchBox=ranitidina&go=
> ',
>     )
>
>     def parse(self, response):
>         for next_page in response.xpath(
> '//*[@class="pageBarTableStyle"]//a/@href'):
>             url = response.urljoin(next_page.extract())
>             yield scrapy.Request(url, self.parse_medicines_results)
>
>     def parse_medicines_results(self, response):
>         for next_medicine in response.xpath(
> '//*[@class="productNameThumbnail"]/@href'):
>             url = response.urljoin(next_medicine.extract())
>             yield scrapy.Request(url, self.parse_medicines)
>
>     def parse_medicines(self, response):    
>             # Create the medicine item
>             item = MedicinesItem()
>             # Load fields using XPath expressions
>             item['name'] = response.xpath(
> '//*[@class="productTitleDetail"]/text()').extract()
>             item['image_url'] = response.xpath(
> '//*[@class="productImageDetail"]/@src').extract()
>             item['stores'] = response.xpath(
> '//*[@id="branchDetailTable"]//td[@class="middleFirst"]/a/text()').extract
> ()
>             yield item
> # --------------------------------------- End farmaplus3.py 
> ----------------------
>
>
>
> However, running
>
> $ scrapy crawl farmaplus3 -o items.json
>
>
>
> produces the following output:
>
> 2016-08-18 21:03:11 [scrapy] INFO: Scrapy 1.1.1 started (bot: medicines)
> 2016-08-18 21:03:11 [scrapy] INFO: Overridden settings: {
> 'NEWSPIDER_MODULE': 'medicines.spiders', 'FEED_URI': 'items.json', 
> 'SPIDER_MODULES': ['medicines.spiders'], 'BOT_NAME': 'medicines', 
> 'ROBOTSTXT_OBEY': True, 'FEED_FORMAT': 'json'}
> 2016-08-18 21:03:11 [scrapy] INFO: Enabled extensions:
> ['scrapy.extensions.feedexport.FeedExporter',
>  'scrapy.extensions.logstats.LogStats',
>  'scrapy.extensions.telnet.TelnetConsole',
>  'scrapy.extensions.corestats.CoreStats']
> 2016-08-18 21:03:11 [scrapy] INFO: Enabled downloader middlewares:
> ['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
>  'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
>  'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
>  'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
>  'scrapy.downloadermiddlewares.retry.RetryMiddleware',
>  'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
>  'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
>  'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
>  'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
>  'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
>  'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
>  'scrapy.downloadermiddlewares.stats.DownloaderStats']
> 2016-08-18 21:03:11 [scrapy] INFO: Enabled spider middlewares:
> ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
>  'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
>  'scrapy.spidermiddlewares.referer.RefererMiddleware',
>  'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
>  'scrapy.spidermiddlewares.depth.DepthMiddleware']
> 2016-08-18 21:03:11 [scrapy] INFO: Enabled item pipelines:
> []
> 2016-08-18 21:03:11 [scrapy] INFO: Spider opened
> 2016-08-18 21:03:11 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), 
> scraped 0 items (at 0 items/min)
> 2016-08-18 21:03:11 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:
> 6023
> 2016-08-18 21:03:19 [scrapy] DEBUG: Redirecting (302) to <GET http://
> farmaplus.com.ve/users/farmaplus.com.ve/staticResources/robots.txt> from 
> <GET http://farmaplus.com.ve/robots.txt>
> 2016-08-18 21:03:20 [scrapy] DEBUG: Crawled (404) <GET http://
> farmaplus.com.ve/users/farmaplus.com.ve/staticResources/robots.txt> 
> (referer: None)
> 2016-08-18 21:03:23 [scrapy] DEBUG: Crawled (200) <GET http://
> farmaplus.com.ve/catalog.do?page=1&offSet=0&comboCategorySelected=1&op=requestSearch&searchBox=ranitidina&go=
> >
>  (referer: None)
> 2016-08-18 21:03:23 [scrapy] DEBUG: Filtered offsite request to '
> farmaplus.com.ve': <GET http://
> farmaplus.com.ve/catalog.do?op=requestPage&selectedPage=1&category=1&offSet=0&page=1&searchBox=ranitidina
> >
> 2016-08-18 21:03:23 [scrapy] INFO: Closing spider (finished)
> 2016-08-18 21:03:23 [scrapy] INFO: Dumping Scrapy stats:
> {'downloader/request_bytes': 793,
>  'downloader/request_count': 3,
>  'downloader/request_method_count/GET': 3,
>  'downloader/response_bytes': 21242,
>  'downloader/response_count': 3,
>  'downloader/response_status_count/200': 1,
>  'downloader/response_status_count/302': 1,
>  'downloader/response_status_count/404': 1,
>  'finish_reason': 'finished',
>  'finish_time': datetime.datetime(2016, 8, 19, 1, 3, 23, 498723),
>  'log_count/DEBUG': 5,
>  'log_count/INFO': 7,
>  'offsite/domains': 1,
>  'offsite/filtered': 2,
>  'request_depth_max': 1,
>  'response_received_count': 2,
>  'scheduler/dequeued': 1,
>  'scheduler/dequeued/memory': 1,
>  'scheduler/enqueued': 1,
>  'scheduler/enqueued/memory': 1,
>  'start_time': datetime.datetime(2016, 8, 19, 1, 3, 11, 510607)}
> 2016-08-18 21:03:23 [scrapy] INFO: Spider closed (finished)
>
>
>
> My project tree is:
> .
> ├── medicines
> │   ├── __init__.py
> │   ├── __init__.pyc
> │   ├── items.py
> │   ├── items.pyc
> │   ├── pipelines.py
> │   ├── settings.py
> │   ├── settings.pyc
> │   └── spiders
> │       ├── basic2.py
> │       ├── basic2.pyc
> │       ├── basic.py
> │       ├── basic.pyc
> │       ├── farmaplus2.py
> │       ├── farmaplus2.pyc
> │       ├── farmaplus3.py
> │       ├── farmaplus3.pyc
> │       ├── farmaplus.py
> │       ├── farmaplus.pyc
> │       ├── __init__.py
> │       └── __init__.pyc
> └── scrapy.cfg
>
> The strategy I'm trying to implement is: 
>
> i) collect the next page urls
> ii) from each of the next pages, collect the product urls
> iii) from each of the product urls, extract the details to fill in the 
> item fields
>
> When I do this "manually" with scrapy shell it seems to work. That is, 
> the selectors are OK. The urls seem OK and I'm able to load item fields. 
> Running the file, however, doesn't give the expected results.
>
> What am I missing here? Why do I end up without any items? 
>

-- 
You received this message because you are subscribed to the Google Groups 
"scrapy-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To post to this group, send email to [email protected].
Visit this group at https://groups.google.com/group/scrapy-users.
For more options, visit https://groups.google.com/d/optout.

Re: Items are not saved

Reply via email to