Re: [Tutor] Recursion depth exceeded in python web crawler

2018-06-14 Thread Mark Lawrence

On 14/06/18 19:32, Daniel Bosah wrote:

I am trying to modify code from a web crawler to scrape for keywords from
certain websites. However, I'm trying to run the web crawler before I
modify it, and I'm running into issues.

When I ran this code -




import threading
from Queue import Queue
from spider import Spider
from domain import get_domain_name
from general import file_to_set


PROJECT_NAME = "SPIDER"
HOME_PAGE = "https://www.cracked.com/"
DOMAIN_NAME = get_domain_name(HOME_PAGE)
QUEUE_FILE = '/home/me/research/queue.txt'
CRAWLED_FILE = '/home/me/research/crawled.txt'
NUMBER_OF_THREADS = 1
#Captialize variables and make them class variables to make them const variables

threadqueue = Queue()

Spider(PROJECT_NAME,HOME_PAGE,DOMAIN_NAME)

def crawl():
    change = file_to_set(QUEUE_FILE)
    if len(change) > 0:
        print str(len(change)) + 'links in the queue'
        create_jobs()

def create_jobs():
    for link in file_to_set(QUEUE_FILE):
        threadqueue.put(link) #.put = put item into the queue
    threadqueue.join()
    crawl()

def create_spiders():
    for _ in range(NUMBER_OF_THREADS): #_ basically if you dont want to act on the iterable
        vari = threading.Thread(target = work)
        vari.daemon = True #makes sure that it dies when main exits
        vari.start()

#def regex():
#    for i in files_to_set(CRAWLED_FILE):
#        reg(i,LISTS) #MAKE FUNCTION FOR REGEX# i is url's, LISTs is list or set of keywords

def work():
    while True:
        url = threadqueue.get() # pops item off queue
        Spider.crawl_pages(threading.current_thread().name,url)
        threadqueue.task_done()

create_spiders()

crawl()


That used this class:

from HTMLParser import HTMLParser
from urlparse import urlparse

class LinkFinder(HTMLParser):
    def _init_(self, base_url,page_url):
        super()._init_()
        self.base_url= base_url
        self.page_url = page_url
        self.links = set() #stores the links
    def error(self,message):
        pass
    def handle_starttag(self,tag,attrs):
        if tag == 'a': # means a link
            for (attribute,value) in attrs:
                if attribute == 'href':  #href relative url i.e not having www
                    url = urlparse.urljoin(self.base_url,value)
                    self.links.add(url)
    def return_links(self):
        return self.links()


It's very unpythonic to define getters like return_links; just access
self.links directly.
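
For example, a quick sketch of the same idea (illustrative only, not your
actual class -- it drops the HTMLParser plumbing):

class LinkFinder(object):
    def __init__(self, base_url, page_url):
        self.base_url = base_url
        self.page_url = page_url
        self.links = set()      # public attribute, read it directly

finder = LinkFinder('https://example.com/', 'https://example.com/page')
finder.links.add('https://example.com/other')
print(sorted(finder.links))     # no return_links() getter needed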





And this spider class:



from urllib import urlopen #connects to webpages from python
from link_finder import LinkFinder
from general import directory, text_maker, file_to_set, conversion_to_set


class Spider():
    project_name = 'Reader'
    base_url = ''
    Queue_file = ''
    crawled_file = ''
    queue = set()
    crawled = set()

    def __init__(self,project_name, base_url,domain_name):
        Spider.project_name = project_name
        Spider.base_url = base_url
        Spider.domain_name = domain_name
        Spider.Queue_file = '/home/me/research/queue.txt'
        Spider.crawled_file = '/home/me/research/crawled.txt'
        self.boot()
        self.crawl_pages('Spider 1 ', base_url)


It strikes me as completely pointless to define this class when every 
variable is at the class level and every method is defined as a static 
method.  Python isn't Java :)
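
A hedged sketch of how the same state could live on the instance instead
(method bodies trimmed; the names and paths are just the ones from your code):

class Spider(object):
    def __init__(self, project_name, base_url, domain_name):
        # per-instance state rather than rebinding Spider.* class attributes
        self.project_name = project_name
        self.base_url = base_url
        self.domain_name = domain_name
        self.queue_file = '/home/me/research/queue.txt'
        self.crawled_file = '/home/me/research/crawled.txt'
        self.queue = set()
        self.crawled = set()

    def crawl_page(self, thread_name, page_url):
        # an ordinary method operating on self, no @staticmethod required
        if page_url not in self.crawled:
            self.crawled.add(page_url)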


[code snipped]



and these functions:



from urlparse import urlparse

#get subdomain name (name.example.com)

def subdomain_name(url):
    try:
        return urlparse(url).netloc
    except:
        return ''


It's very bad practice to use a bare except like this as it hides any 
errors and prevents you from using CTRL-C to break out of your code.
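
A minimal sketch of the same function with a narrower handler (assuming a
malformed URL is the only failure you actually want to swallow):

from urlparse import urlparse   # Python 2, as in the quoted code

def subdomain_name(url):
    try:
        return urlparse(url).netloc
    except ValueError:          # assumption: malformed URLs only; at the very
        return ''               # least prefer "except Exception:" to a bare except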




def get_domain_name(url):
    try:
        variable = subdomain_name.split(',')
        return variable[-2] + ',' + variable[-1] #returns 2nd to last and last instances of variable
    except:
        return '''


The above line is a syntax error.
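
Here is a hedged guess at what get_domain_name() was meant to do -- keep the
last two dot-separated labels of the host, so name.example.com becomes
example.com (the split on ',' looks like a typo for '.'):

def get_domain_name(url):
    try:
        parts = subdomain_name(url).split('.')   # call the function, split on dots
        return parts[-2] + '.' + parts[-1]       # 'name.example.com' -> 'example.com'
    except IndexError:                           # host had fewer than two labels
        return ''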




(there are more functions, but those are housekeeping functions)


The interpreter returned this error:

RuntimeError: maximum recursion depth exceeded while calling a Python object


After calling crawl() and create_jobs() a bunch of times?

How can I resolve this?

Thanks


Just a quick glance but crawl calls create_jobs which calls crawl...
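
One way out, sketched against the names in your module (untested, illustrative
only), is to turn that mutual recursion into a loop so the queue is refilled
repeatedly without growing the call stack:

def crawl():
    # loop instead of crawl() -> create_jobs() -> crawl() recursion
    while True:
        links = file_to_set(QUEUE_FILE)
        if not links:
            break
        print str(len(links)) + ' links in the queue'
        for link in links:
            threadqueue.put(link)
        threadqueue.join()      # wait for the worker threads to drain the queue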

--
My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.

Mark Lawrence



Re: [Tutor] Recursion depth exceeded in python web crawler

2018-06-14 Thread Steven D'Aprano
On Thu, Jun 14, 2018 at 02:32:46PM -0400, Daniel Bosah wrote:

> I am trying to modify code from a web crawler to scrape for keywords from
> certain websites. However, I'm trying to run the web crawler before I
> modify it, and I'm running into issues.
> 
> When I ran this code -

[snip enormous code-dump]

> The interpreter returned this error:
> 
> RuntimeError: maximum recursion depth exceeded while calling a Python object

Since this is not your code, you should report it as a bug to the 
maintainers of the web crawler software. They wrote it, and it sounds 
like it is buggy.

Quoting the final error message on its own is typically useless, because 
we have no context as to where it came from. We don't know and cannot 
guess what object was called. Without that information, we're blind and 
cannot do more than guess or offer the screamingly obvious advice "find 
and fix the recursion error".

When an error does occur, Python provides you with a lot of useful 
information about the context of the error: the traceback. As a general 
rule, you should ALWAYS quote the entire traceback, starting from the 
line beginning "Traceback: ..." not just the final error message.

Unfortunately, in the case of RecursionError, that information can be a 
firehose of hundreds of identical lines, which is less useful than it 
sounds. The most recent versions of Python redact that and show 
something similar to this:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 2, in f
  [ previous line repeats 998 times ]
RecursionError: maximum recursion depth exceeded

but in older versions you should manually cut out the enormous flood of 
lines (sorry). If the lines are NOT identical, then don't delete them!
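
For what it's worth, a traceback like the one above can be reproduced with
something as small as this (Python 3, illustrative only):

def f():
    return f()      # unbounded recursion

f()                 # raises RecursionError after roughly 1000 calls by default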

The bottom line is, without some context, it is difficult for us to tell 
where the bug is.

Another point: whatever you are using to post your messages (Gmail?) is 
annoyingly adding asterisks to the start and end of each line. I see 
your quoted code like this:

[direct quote]
*import threading*
*from Queue import Queue*
*from spider import Spider*
*from domain import get_domain_name*
*from general import file_to_set*

Notice the * at the start and end of each line? That makes the code 
invalid Python. You should check how you are posting to the list, and if 
you have "Rich Text" or some other formatting turned on, turn it off.

(My guess is that you posted the code in BOLD or perhaps some colour 
other than black, and your email program "helpfully" added asterisks to 
it to make it stand out.)

Unfortunately modern email programs, especially web-based ones like 
Gmail and Outlook.com, make it *really difficult* for technical forums 
like this. They are so intent on making email "pretty" (generally pretty 
ugly) for regular users, they punish technically minded users who need
to focus on the text not the presentation.



-- 
Steve


[Tutor] Recursion depth exceeded in python web crawler

2018-06-14 Thread Daniel Bosah
I am trying to modify code from a web crawler to scrape for keywords from
certain websites. However, I'm trying to run the web crawler before I
modify it, and I'm running into issues.

When I ran this code -




import threading
from Queue import Queue
from spider import Spider
from domain import get_domain_name
from general import file_to_set


PROJECT_NAME = "SPIDER"
HOME_PAGE = "https://www.cracked.com/"
DOMAIN_NAME = get_domain_name(HOME_PAGE)
QUEUE_FILE = '/home/me/research/queue.txt'
CRAWLED_FILE = '/home/me/research/crawled.txt'
NUMBER_OF_THREADS = 1
#Captialize variables and make them class variables to make them const variables

threadqueue = Queue()

Spider(PROJECT_NAME,HOME_PAGE,DOMAIN_NAME)

def crawl():
    change = file_to_set(QUEUE_FILE)
    if len(change) > 0:
        print str(len(change)) + 'links in the queue'
        create_jobs()

def create_jobs():
    for link in file_to_set(QUEUE_FILE):
        threadqueue.put(link) #.put = put item into the queue
    threadqueue.join()
    crawl()

def create_spiders():
    for _ in range(NUMBER_OF_THREADS): #_ basically if you dont want to act on the iterable
        vari = threading.Thread(target = work)
        vari.daemon = True #makes sure that it dies when main exits
        vari.start()

#def regex():
#    for i in files_to_set(CRAWLED_FILE):
#        reg(i,LISTS) #MAKE FUNCTION FOR REGEX# i is url's, LISTs is list or set of keywords

def work():
    while True:
        url = threadqueue.get() # pops item off queue
        Spider.crawl_pages(threading.current_thread().name,url)
        threadqueue.task_done()

create_spiders()

crawl()


That used this class:

from HTMLParser import HTMLParser
from urlparse import urlparse

class LinkFinder(HTMLParser):
    def _init_(self, base_url,page_url):
        super()._init_()
        self.base_url= base_url
        self.page_url = page_url
        self.links = set() #stores the links
    def error(self,message):
        pass
    def handle_starttag(self,tag,attrs):
        if tag == 'a': # means a link
            for (attribute,value) in attrs:
                if attribute == 'href':  #href relative url i.e not having www
                    url = urlparse.urljoin(self.base_url,value)
                    self.links.add(url)
    def return_links(self):
        return self.links()


And this spider class:



from urllib import urlopen #connects to webpages from python
from link_finder import LinkFinder
from general import directory, text_maker, file_to_set, conversion_to_set


class Spider():
    project_name = 'Reader'
    base_url = ''
    Queue_file = ''
    crawled_file = ''
    queue = set()
    crawled = set()

    def __init__(self,project_name, base_url,domain_name):
        Spider.project_name = project_name
        Spider.base_url = base_url
        Spider.domain_name = domain_name
        Spider.Queue_file = '/home/me/research/queue.txt'
        Spider.crawled_file = '/home/me/research/crawled.txt'
        self.boot()
        self.crawl_pages('Spider 1 ', base_url)

    @staticmethod
    def boot():
        directory(Spider.project_name)
        text_maker(Spider.project_name,Spider.base_url)
        Spider.queue = file_to_set(Spider.Queue_file)
        Spider.crawled = file_to_set(Spider.crawled_file)

    @staticmethod
    def crawl_pages(thread_name, page_url):
        if page_url not in Spider.crawled:
            print thread_name + 'crawling ' + page_url
            print 'queue' + str(len(Spider.queue)) + '|crawled' + str(len(Spider.crawled))
            Spider.add_links_to_queue(Spider.gather_links(page_url))
            Spider.crawled.add(page_url)
            Spider.update_files()

    @staticmethod
    def gather_links(page_url):
        html_string = ''
        try:
            response = urlopen(page_url)
            if 'text/html' in response.getheader('Content Type'):
                read = response.read()
                html_string = read.decode('utf-8')
            finder = LinkFinder(Spider.base_url,page_url)
            finder.feed(html_string)
        except:
            print 'Error: cannot crawl page'
            return set()
        return finder.return_links()

    @staticmethod
    def add_links_to_queue(links):
        for i in links:
            if i in Spider.queue:
                continue
            if i in Spider.crawled:
                continue
            # if Spider.domain_name != get_domain_name(url):
            #     continue
            Spider.queue.add()

    @staticmethod
    def update_files():
        conversion_to_set(Spider.queue,Spider.Queue_file)
        conversion_to_set(Spider.crawled,Spider.crawled_file)