Hi Paul,

I feel silly. For some reason I thought 'web' was a keyword that meant "any domain". Not knowing how to read the output didn't help at all.

Your advice worked. The items are out. Thanks.
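For reference, in case someone else trips over the same thing: the fix was essentially just pointing allowed_domains at the real domain (it has to be a list of actual domain names, not a keyword). The top of the spider now looks roughly like this, with the rest of farmaplus3.py as quoted below:

    import scrapy

    class Farmaplus3Spider(scrapy.Spider):
        name = "farmaplus3"
        # List the real domain(s) to crawl; requests to anything else are
        # dropped by OffsiteMiddleware ('offsite/filtered' in the stats).
        allowed_domains = ["farmaplus.com.ve"]
        # Start on a medicine search page
        start_urls = (
            'http://farmaplus.com.ve/catalog.do?page=1&offSet=0&comboCategorySelected=1&op=requestSearch&searchBox=ranitidina&go=',
        )

With that one change the page and product requests go through and the items end up in items.json.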
On Friday, August 19, 2016 at 5:01:46 AM UTC-4, Paul Tremberth wrote:
>
> Hello,
>
> you have information in the stats at the end, and in the logs directly:
>
> 'offsite/domains': 1,
> 'offsite/filtered': 2,
>
> and
>
> DEBUG: Crawled (200) <GET http://farmaplus.com.ve/catalog.do?page=1&offSet=0&comboCategorySelected=1&op=requestSearch&searchBox=ranitidina&go=> (referer: None)
> DEBUG: Filtered offsite request to 'farmaplus.com.ve': <GET http://farmaplus.com.ve/catalog.do?op=requestPage&selectedPage=1&category=1&offSet=0&page=1&searchBox=ranitidina>
>
> Your spider is filtering requests to farmaplus.com.ve, which is the domain you're interested in.
> This is because you haven't set your spider's allowed_domains attribute correctly.
> It should be a list of domain names, e.g.
> allowed_domains = ["farmaplus.com.ve"]
>
> This should unblock fetching pages from that website.
> I haven't checked your callbacks so you may need to fix those to get items out.
>
> Hope this helps.
>
> Regards,
> Paul.
>
> On Friday, August 19, 2016 at 5:34:36 AM UTC+2, [email protected] wrote:
>>
>> Hi.
>>
>> I'm trying to execute what I believe is a basic scrape task. However, it's not working out. No items are loaded/saved.
>>
>> My start_url is the result of a search and has the following structure:
>>
>> # ------------------------------------------ start_url (page_1) structure ---------------
>> product_1
>> product_2
>> ...
>> product_n
>>
>> link_to_page_1
>> link_to_page_2
>> # ------------------------------------------ start_url (page_1) structure ---------------
>>
>> This particular search result has one additional page (a total of 2), but in general, the links to next pages are:
>>
>> link_to_page_1
>> link_to_page_2
>> ...
>> link_to_page_5
>> link_to_next_set_of_5_or_less
>> link_to_last_set_of_5_or_less
>>
>> Each product has its url and I'm interested in the details of each product found in those urls.
>>
>> I created the Scrapy project and the items.py file:
>>
>> # --------------------------------------- Begin items.py ----------------------
>> from scrapy.item import Item, Field
>>
>>
>> class MedicinesSearchItem(Item):
>>     # Primary fields
>>     name = Field()
>>     name_url = Field()
>>     image_url = Field()
>>     availability = Field()
>>
>>     # Calculated fields
>>     images = Field()
>>
>>     # Housekeeping fields
>>     url = Field()
>>     project = Field()
>>     spider = Field()
>>     server = Field()
>>     date = Field()
>>
>>
>> class MedicinesItem(Item):
>>     # Primary fields
>>     name = Field()
>>     image_url = Field()
>>     stores = Field()
>>     availabilities = Field()
>>     update_times = Field()
>>     update_dates = Field()
>>     presentation_and_component = Field()
>>     #active_component = Field()
>>     manufacturer = Field()
>>
>>     # Calculated fields
>>     images = Field()
>>
>>     # Housekeeping fields
>>     url = Field()
>>     project = Field()
>>     spider = Field()
>>     server = Field()
>>     date = Field()
>> # --------------------------------------- End items.py ----------------------
>>
>> and the spider file:
>>
>> # --------------------------------------- Begin farmaplus3.py ----------------------
>> from scrapy.loader import ItemLoader
>> from medicines.items import MedicinesItem
>> import scrapy
>>
>>
>> class Farmaplus3Spider(scrapy.Spider):
>>     name = "farmaplus3"
>>     allowed_domains = ["web"]
>>     # Start on a medicine search page
>>     start_urls = (
>>         'http://farmaplus.com.ve/catalog.do?page=1&offSet=0&comboCategorySelected=1&op=requestSearch&searchBox=ranitidina&go=',
>>     )
>>
>>     def parse(self, response):
>>         for next_page in response.xpath(
>>                 '//*[@class="pageBarTableStyle"]//a/@href'):
>>             url = response.urljoin(next_page.extract())
>>             yield scrapy.Request(url, self.parse_medicines_results)
>>
>>     def parse_medicines_results(self, response):
>>         for next_medicine in response.xpath(
>>                 '//*[@class="productNameThumbnail"]/@href'):
>>             url = response.urljoin(next_medicine.extract())
>>             yield scrapy.Request(url, self.parse_medicines)
>>
>>     def parse_medicines(self, response):
>>         # Create the medicine item
>>         item = MedicinesItem()
>>         # Load fields using XPath expressions
>>         item['name'] = response.xpath(
>>             '//*[@class="productTitleDetail"]/text()').extract()
>>         item['image_url'] = response.xpath(
>>             '//*[@class="productImageDetail"]/@src').extract()
>>         item['stores'] = response.xpath(
>>             '//*[@id="branchDetailTable"]//td[@class="middleFirst"]/a/text()').extract()
>>         yield item
>> # --------------------------------------- End farmaplus3.py ----------------------
>>
>> However, running
>>
>> $ scrapy crawl farmaplus3 -o items.json
>>
>> produces the following output:
>>
>> 2016-08-18 21:03:11 [scrapy] INFO: Scrapy 1.1.1 started (bot: medicines)
>> 2016-08-18 21:03:11 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'medicines.spiders', 'FEED_URI': 'items.json', 'SPIDER_MODULES': ['medicines.spiders'], 'BOT_NAME': 'medicines', 'ROBOTSTXT_OBEY': True, 'FEED_FORMAT': 'json'}
>> 2016-08-18 21:03:11 [scrapy] INFO: Enabled extensions:
>> ['scrapy.extensions.feedexport.FeedExporter',
>>  'scrapy.extensions.logstats.LogStats',
>>  'scrapy.extensions.telnet.TelnetConsole',
>>  'scrapy.extensions.corestats.CoreStats']
>> 2016-08-18 21:03:11 [scrapy] INFO: Enabled downloader middlewares:
>> ['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
>>  'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
>>  'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
>>  'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
>>  'scrapy.downloadermiddlewares.retry.RetryMiddleware',
>>  'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
>>  'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
>>  'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
>>  'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
>>  'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
>>  'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
>>  'scrapy.downloadermiddlewares.stats.DownloaderStats']
>> 2016-08-18 21:03:11 [scrapy] INFO: Enabled spider middlewares:
>> ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
>>  'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
>>  'scrapy.spidermiddlewares.referer.RefererMiddleware',
>>  'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
>>  'scrapy.spidermiddlewares.depth.DepthMiddleware']
>> 2016-08-18 21:03:11 [scrapy] INFO: Enabled item pipelines:
>> []
>> 2016-08-18 21:03:11 [scrapy] INFO: Spider opened
>> 2016-08-18 21:03:11 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
>> 2016-08-18 21:03:11 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
>> 2016-08-18 21:03:19 [scrapy] DEBUG: Redirecting (302) to <GET http://farmaplus.com.ve/users/farmaplus.com.ve/staticResources/robots.txt> from <GET http://farmaplus.com.ve/robots.txt>
>> 2016-08-18 21:03:20 [scrapy] DEBUG: Crawled (404) <GET http://farmaplus.com.ve/users/farmaplus.com.ve/staticResources/robots.txt> (referer: None)
>> 2016-08-18 21:03:23 [scrapy] DEBUG: Crawled (200) <GET http://farmaplus.com.ve/catalog.do?page=1&offSet=0&comboCategorySelected=1&op=requestSearch&searchBox=ranitidina&go=> (referer: None)
>> 2016-08-18 21:03:23 [scrapy] DEBUG: Filtered offsite request to 'farmaplus.com.ve': <GET http://farmaplus.com.ve/catalog.do?op=requestPage&selectedPage=1&category=1&offSet=0&page=1&searchBox=ranitidina>
>> 2016-08-18 21:03:23 [scrapy] INFO: Closing spider (finished)
>> 2016-08-18 21:03:23 [scrapy] INFO: Dumping Scrapy stats:
>> {'downloader/request_bytes': 793,
>>  'downloader/request_count': 3,
>>  'downloader/request_method_count/GET': 3,
>>  'downloader/response_bytes': 21242,
>>  'downloader/response_count': 3,
>>  'downloader/response_status_count/200': 1,
>>  'downloader/response_status_count/302': 1,
>>  'downloader/response_status_count/404': 1,
>>  'finish_reason': 'finished',
>>  'finish_time': datetime.datetime(2016, 8, 19, 1, 3, 23, 498723),
>>  'log_count/DEBUG': 5,
>>  'log_count/INFO': 7,
>>  'offsite/domains': 1,
>>  'offsite/filtered': 2,
>>  'request_depth_max': 1,
>>  'response_received_count': 2,
>>  'scheduler/dequeued': 1,
>>  'scheduler/dequeued/memory': 1,
>>  'scheduler/enqueued': 1,
>>  'scheduler/enqueued/memory': 1,
>>  'start_time': datetime.datetime(2016, 8, 19, 1, 3, 11, 510607)}
>> 2016-08-18 21:03:23 [scrapy] INFO: Spider closed (finished)
>>
>> My project tree is:
>>
>> .
>> ├── medicines
>> │   ├── __init__.py
>> │   ├── __init__.pyc
>> │   ├── items.py
>> │   ├── items.pyc
>> │   ├── pipelines.py
>> │   ├── settings.py
>> │   ├── settings.pyc
>> │   └── spiders
>> │       ├── basic2.py
>> │       ├── basic2.pyc
>> │       ├── basic.py
>> │       ├── basic.pyc
>> │       ├── farmaplus2.py
>> │       ├── farmaplus2.pyc
>> │       ├── farmaplus3.py
>> │       ├── farmaplus3.pyc
>> │       ├── farmaplus.py
>> │       ├── farmaplus.pyc
>> │       ├── __init__.py
>> │       └── __init__.pyc
>> └── scrapy.cfg
>>
>> The strategy I'm trying to implement is:
>>
>> i) collect the next page urls
>> ii) from each of the next pages, collect the product urls
>> iii) from each of the product urls, extract the details to fill in the item fields
>>
>> When I do this "manually" with scrapy shell it seems to work. That is, the selectors are OK. The urls seem OK and I'm able to load item fields. Running the file, however, doesn't give the expected results.
>>
>> What am I missing here? Why do I end up without any items?
