I'm trying to page through Kickstarter and collect every project link from these discover pages. 
Here is the crawler code:


from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from kickstarter_test.items import KickstarterSampleItem




class MySpider(BaseSpider):
    name = "kicks2"
    allowed_domains = ["www.kickstarter.com"]
    start_urls = [
        "https://www.kickstarter.com/discover/advanced?category_id=1&woe_id=2347563&sort=most_funded&page="
        + str(page) for page in range(156)]
    custom_settings = {
        "DOWNLOAD_DELAY": 0.1
    }
    
    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        
        titles = hxs.xpath(
            '//div[@class="project-profile-title text-truncate-xs"]'
            '[ancestor::ul[@id="projects_list"]]')
        items = []
        # one item per project tile
        for title in titles:
            item = KickstarterSampleItem()
            item["title"] = title.xpath("a/text()").extract()
            item["link"] = title.xpath("a/@href").extract()
            item["source_url"] = response.url
            items.append(item)
        
        titles2 = hxs.xpath(
            '//h6[@class="project-title"][ancestor::ul[@id="projects_list"]]')
        for title in titles2:
            item = KickstarterSampleItem()
            item["title"] = title.xpath("a/text()").extract()
            item["link"] = title.xpath("a/@href").extract()
            item["source_url"] = response.url
            items.append(item)
        
        return items

I just ran this scraper and it returned 3120 items. However, after removing 
duplicate links I'm left with only 3007. If I run the same code multiple 
times I always get some duplicates, but not the same ones each run. Because 
of this I can never collect the complete set of links: every scrape is 
missing a few projects while containing duplicates of others. 

I'm not completely sure why this is happening. My theory is that, since the 
page sorts projects by most funded, the ordering changes while the scrape is 
running, so some projects shift across page boundaries and get scraped twice 
or not at all?
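
If that theory is right, the duplicated links should mostly come from adjacent 
page numbers, which should be checkable with something like this (just a 
sketch, assuming the items are exported to a CSV with the link and source_url 
fields the spider fills in):

import pandas as pd

# Sketch only: assumes the crawl output is a CSV ("links.csv") containing
# the "link" and "source_url" fields the spider fills in.
dt = pd.read_csv("links.csv", dtype=str)

# keep every row whose link appears more than once, then look at which
# page URLs those duplicates came from
dupes = dt[dt.duplicated(subset=["link"], keep=False)]
print(dupes.sort_values("link")[["link", "source_url"]])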

Also, to remove the duplicates, I used this code:

import pandas as pd

# load the scraped links and drop rows with a repeated "link" value
dt = pd.read_csv("links.csv", dtype=str)
print(len(dt))
dt_deduped = dt.drop_duplicates(subset=["link"])
print(len(dt_deduped))
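
I also wondered whether I should drop repeats during the crawl instead of 
afterwards, e.g. with an item pipeline along these lines (just a sketch; the 
class name is my own and it would need to be registered in ITEM_PIPELINES), 
though that still wouldn't recover the links that slip between pages:

from scrapy.exceptions import DropItem

class DropDuplicateLinks(object):
    """Sketch of a pipeline that drops items whose link was already seen."""

    def __init__(self):
        self.seen_links = set()

    def process_item(self, item, spider):
        # .extract() returns a list, so take the first href (if any)
        link = item["link"][0] if item["link"] else None
        if link in self.seen_links:
            raise DropItem("duplicate link: %s" % link)
        self.seen_links.add(link)
        return item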

Any suggestions are welcome! Thank you!
