Re: [Tutor] Recursion depth exceeded in python web crawler

2018-06-14 Thread Mark Lawrence

On 14/06/18 19:32, Daniel Bosah wrote:

I am trying to modify code from a web crawler to scrape for keywords from
certain websites. However, Im trying to run the web crawler before  I
modify it, and I'm running into issues.

When I ran this code -




*import threading*
*from Queue import Queue*
*from spider import Spider*
*from domain import get_domain_name*
*from general import file_to_set*


*PROJECT_NAME = "SPIDER"*
*HOME_PAGE = "https://www.cracked.com/ "*
*DOMAIN_NAME = get_domain_name(HOME_PAGE)*
*QUEUE_FILE = '/home/me/research/queue.txt'*
*CRAWLED_FILE = '/home/me/research/crawled.txt'*
*NUMBER_OF_THREADS = 1*
*#Captialize variables and make them class variables to make them const
variables*

*threadqueue = Queue()*

*Spider(PROJECT_NAME,HOME_PAGE,DOMAIN_NAME)*

*def crawl():*
*change = file_to_set(QUEUE_FILE)*
*if len(change) > 0:*
*print str(len(change)) + 'links in the queue'*
*create_jobs()*

*def create_jobs():*
*for link in file_to_set(QUEUE_FILE):*
*threadqueue.put(link) #.put = put item into the queue*
*threadqueue.join()*
*crawl()*
*def create_spiders():*
*for _ in range(NUMBER_OF_THREADS): #_ basically if you dont want to
act on the iterable*
*vari = threading.Thread(target = work)*
*vari.daemon = True #makes sure that it dies when main exits*
*vari.start()*

*#def regex():*
*#for i in files_to_set(CRAWLED_FILE):*
*  #reg(i,LISTS) #MAKE FUNCTION FOR REGEX# i is url's, LISTs is
list or set of keywords*
*def work():*
*while True:*
*url = threadqueue.get()# pops item off queue*
*Spider.crawl_pages(threading.current_thread().name,url)*
*threadqueue.task_done()*

*create_spiders()*

*crawl()*


That used this class:

*from HTMLParser import HTMLParser*
*from urlparse import urlparse*

*class LinkFinder(HTMLParser):*
*def _init_(self, base_url,page_url):*
*super()._init_()*
*self.base_url= base_url*
*self.page_url = page_url*
*self.links = set() #stores the links*
*def error(self,message):*
*pass*
*def handle_starttag(self,tag,attrs):*
*if tag == 'a': # means a link*
*for (attribute,value) in attrs:*
*if attribute  == 'href':  #href relative url i.e not
having www*
*url = urlparse.urljoin(self.base_url,value)*
*self.links.add(url)*
*def return_links(self):*
*return self.links()*


It's very unpythonic to define getters like return_links, just access 
self.links directly.





And this spider class:



*from urllib import urlopen #connects to webpages from python*
*from link_finder import LinkFinder*
*from general import directory, text_maker, file_to_set, conversion_to_set*


*class Spider():*
* project_name = 'Reader'*
* base_url = ''*
* Queue_file = ''*
* crawled_file = ''*
* queue = set()*
* crawled = set()*


* def __init__(self,project_name, base_url,domain_name):*
* Spider.project_name = project_name*
* Spider.base_url = base_url*
* Spider.domain_name = domain_name*
* Spider.Queue_file =  '/home/me/research/queue.txt'*
* Spider.crawled_file =  '/home/me/research/crawled.txt'*
* self.boot()*
* self.crawl_pages('Spider 1 ', base_url)*


It strikes me as completely pointless to define this class when every 
variable is at the class level and every method is defined as a static 
method.  Python isn't Java :)


[code snipped]



and these functions:



*from urlparse import urlparse*

*#get subdomain name (name.example.com )*

*def subdomain_name(url):*
*try:*
*return urlparse(url).netloc*
*except:*
*return ''*


It's very bad practice to use a bare except like this as it hides any 
errors and prevents you from using CTRL-C to break out of your code.




*def get_domain_name(url):*
*try:*
*variable = subdomain_name.split(',')*
*return variable[-2] + ',' + variable[-1] #returns 2nd to last and
last instances of variable*
*except:*
*return '''*


The above line is a syntax error.
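
A minimal sketch of how the two helpers above might look once both remarks are applied: the bare except becomes a narrow `except ValueError`, and the unterminated string literal is gone. It is written for Python 3 (`urllib.parse` rather than the Python 2 `urlparse` module), and it assumes the intent was to call `subdomain_name(url)` and to split the host name on dots rather than commas. Neither assumption is confirmed by the original post.

```python
from urllib.parse import urlparse


def subdomain_name(url):
    """Return the network location, e.g. 'name.example.com'."""
    try:
        return urlparse(url).netloc
    except ValueError:
        # urlparse raises ValueError for some malformed URLs, such as
        # a '[' that is never closed. Catching only this exception
        # leaves KeyboardInterrupt (CTRL-C) and genuine bugs visible.
        return ''


def get_domain_name(url):
    """Return the last two dot-separated labels, e.g. 'example.com'."""
    # Assumption: the original meant to split the host name on dots.
    parts = subdomain_name(url).split('.')
    if len(parts) < 2:
        return ''
    return parts[-2] + '.' + parts[-1]


print(get_domain_name('https://www.cracked.com/'))
```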




(there are more functions, but those are housekeeping functions)


The interpreter returned this error:

*RuntimeError: maximum recursion depth exceeded while calling a Python
object*


After calling crawl() and create_jobs() a bunch of times?

How can I resolve this?

Thanks


Just a quick glance but crawl calls create_jobs which calls crawl...

--
My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.

Mark Lawrence

___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor
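
The cycle Mark points out (crawl calls create_jobs, which calls crawl again) adds two stack frames per round and never returns while links remain, so a large site eventually exhausts the recursion limit. A minimal sketch of one way to break the cycle is to turn the two mutually recursive functions into a single loop. This is Python 3, and the in-memory sets below are hypothetical stand-ins for the queue and crawled files in the original post; the worker does no real crawling.

```python
import queue
import threading

# Hypothetical in-memory stand-ins for queue.txt and crawled.txt.
pending_links = {"https://example.com/a", "https://example.com/b"}
crawled = set()
lock = threading.Lock()
thread_queue = queue.Queue()


def work():
    """Worker: take a URL off the queue and mark it as crawled."""
    while True:
        url = thread_queue.get()
        with lock:
            pending_links.discard(url)
            crawled.add(url)
        thread_queue.task_done()


def crawl():
    """Dispatch pending links in rounds, using a loop instead of recursion."""
    rounds = 0
    while True:
        with lock:
            batch = set(pending_links)
        if not batch:
            return rounds  # nothing left, so the loop simply ends
        for link in batch:
            thread_queue.put(link)
        thread_queue.join()  # wait until this round has been processed
        rounds += 1


threading.Thread(target=work, daemon=True).start()
rounds = crawl()
print(rounds, "round(s),", len(crawled), "link(s) crawled")
```

Because the loop reuses one stack frame, the number of rounds no longer has any bearing on the recursion depth.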


Re: [Tutor] Recursion depth exceeded in python web crawler

2018-06-14 Thread Steven D'Aprano
On Thu, Jun 14, 2018 at 02:32:46PM -0400, Daniel Bosah wrote:

> I am trying to modify code from a web crawler to scrape for keywords from
> certain websites. However, Im trying to run the web crawler before  I
> modify it, and I'm running into issues.
> 
> When I ran this code -

[snip enormous code-dump]

> The interpreter returned this error:
> 
> *RuntimeError: maximum recursion depth exceeded while calling a Python
> object*

Since this is not your code, you should report it as a bug to the 
maintainers of the web crawler software. They wrote it, and it sounds 
like it is buggy.

Quoting the final error message on its own is typically useless, because 
we have no context as to where it came from. We don't know and cannot 
guess what object was called. Without that information, we're blind and 
cannot do more than guess or offer the screamingly obvious advice "find 
and fix the recursion error".

When an error does occur, Python provides you with a lot of useful 
information about the context of the error: the traceback. As a general 
rule, you should ALWAYS quote the entire traceback, starting from the 
line beginning "Traceback (most recent call last):", not just the final 
error message.

Unfortunately, in the case of RecursionError, that information can be a 
firehose of hundreds of identical lines, which is less useful than it 
sounds. The most recent versions of Python redact that and show 
something similar to this:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 2, in f
  [ previous line repeats 998 times ]
RecursionError: maximum recursion depth exceeded

but in older versions you should manually cut out the enormous flood of 
lines (sorry). If the lines are NOT identical, then don't delete them!

The bottom line is, without some context, it is difficult for us to tell 
where the bug is.
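
One way to get hold of the whole traceback as text, so it can be pasted into a message, is the standard library's traceback module. The sketch below is Python 3 and uses a deliberately broken, hypothetical function `f` purely to trigger the error; it is not taken from the crawler.

```python
import traceback


def f(n):
    # Deliberately unbounded recursion, used only to trigger the error.
    return f(n + 1)


def capture_traceback():
    """Run f and return the full traceback text as one string."""
    try:
        f(0)
    except RecursionError:
        return traceback.format_exc()
    return ""


tb_text = capture_traceback()
lines = tb_text.splitlines()
first_line = lines[0]
last_line = lines[-1]

# Show the first and last lines plus a count, rather than the full flood.
print(first_line)
print("...", len(lines), "lines in total ...")
print(last_line)
```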

Another point: whatever you are using to post your messages (Gmail?) is 
annoyingly adding asterisks to the start and end of each line. I see 
your quoted code like this:

[direct quote]
*import threading*
*from Queue import Queue*
*from spider import Spider*
*from domain import get_domain_name*
*from general import file_to_set*

Notice the * at the start and end of each line? That makes the code 
invalid Python. You should check how you are posting to the list, and if 
you have "Rich Text" or some other formatting turned on, turn it off.

(My guess is that you posted the code in BOLD or perhaps some colour 
other than black, and your email program "helpfully" added asterisks to 
it to make it stand out.)

Unfortunately modern email programs, especially web-based ones like 
Gmail and Outlook.com, make it *really difficult* for technical forums 
like this. They are so intent on making email "pretty" (generally pretty 
ugly) for regular users, they punish technically minded users who need
to focus on the text not the presentation.



-- 
Steve


Re: [Tutor] recursion depth

2014-01-09 Thread Steven D'Aprano
On Wed, Jan 08, 2014 at 06:16:03PM -0500, Dave Angel wrote:
> On Wed, 8 Jan 2014 16:23:06 -0500, eryksun eryk...@gmail.com wrote:
> > On Wed, Jan 8, 2014 at 3:25 PM, Keith Winston keithw...@gmail.com wrote:
> > > I've been playing with recursion, it's very satisfying.
> > >
> > > However, it appears that even if I sys.setrecursionlimit(10), it blows
> > > up at about 24,000 (appears to reset IDLE). I guess there must be a lot of
> > > overhead with recursion, if only 24k times are killing my memory?
>
> I can't see the bodies of any of your messages (are you perchance
> posting in html? ),

I presume that your question is aimed at Keith.

Yes, Keith's emails have a HTML part and a text part. A half-decent mail 
client should be able to read the text part even if the HTML part 
exists. But I believe you're reading this from gmane's Usenet mirror, is 
that correct? Perhaps there's a problem with gmane, or your news client, 
or both.

Since this is officially a mailing list, HTML mail is discouraged but 
not strongly discouraged (that ship has sailed a long time ago, more's 
the pity...) so long as the sender includes a plain text part too. Which 
Keith does.

Keith, if you are able, and would be so kind, you'll help solve this 
issue for Dave if you configure your mail client to turn so-called rich 
text or formatted text off, at least for this mailing list.


-- 
Steven


Re: [Tutor] recursion depth

2014-01-09 Thread Dave Angel
On Thu, 9 Jan 2014 21:41:41 +1100, Steven D'Aprano st...@pearwood.info wrote:
> I presume that your question is aimed at Keith.
>
> Yes, Keith's emails have a HTML part and a text part. A half-decent mail
> client should be able to read the text part even if the HTML part
> exists. But I believe you're reading this from gmane's Usenet mirror, is
> that correct? Perhaps there's a problem with gmane, or your news client,
> or both.
>
> Since this is officially a mailing list, HTML mail is discouraged but
> not strongly discouraged (that ship has sailed a long time ago, more's
> the pity...) so long as the sender includes a plain text part too. Which
> Keith does.


Yes, I'm pretty sure it's Groundhog's fault. In the tutor list, all I see 
of Keith's messages is the 3-line footer. And in python.general I see 
nothing for such messages. 

I've used Outlook Express and Thunderbird and xpn for many years 
here. But a couple of months ago I switched to an Android tablet, 
and the Groundhog newsreader and Android Usenet have this problem 
with html here. I am using gmane, but the other gmane sites don't 
have this problem. Instead they show uninterpreted html on 
Groundhog. Those sites all happen to be googlegroups, so that's 
another variable. 


Anybody know of an Android solution?

--
DaveA



Re: [Tutor] recursion depth

2014-01-09 Thread Keith Winston
On Thu, Jan 9, 2014 at 5:41 AM, Steven D'Aprano st...@pearwood.info wrote:

> Keith, if you are able, and would be so kind, you'll help solve this
> issue for Dave if you configure your mail client to turn so-called rich
> text or formatted text off, at least for this mailing list.

Well, hopefully this is plain text. It all looks the same to me, so if
gmail switches back, it might go unnoticed for a while. Sorry for the
incessant hassle.


-- 
Keith


Re: [Tutor] recursion depth

2014-01-09 Thread Dave Angel
On Thu, 9 Jan 2014 13:02:30 -0500, Keith Winston keithw...@gmail.com wrote:
> Well, hopefully this is plain text. It all looks the same to me, so if
> gmail switches back, it might go unnoticed for a while. Sorry for the
> incessant hassle.

That looks great, thanks.

--
DaveA



Re: [Tutor] recursion depth

2014-01-08 Thread Emile van Sebille

On 1/8/2014 12:25 PM, Keith Winston wrote:

> I've been playing with recursion, it's very satisfying.
>
> However, it appears that even if I sys.setrecursionlimit(10), it
> blows up at about 24,000 (appears to reset IDLE). I guess there must be
> a lot of overhead with recursion, if only 24k times are killing my memory?


Yes -- the docs warn specifically about that:

sys.setrecursionlimit(limit)
Set the maximum depth of the Python interpreter stack to limit. This 
limit prevents infinite recursion from causing an overflow of the C 
stack and crashing Python.


The highest possible limit is platform-dependent. A user may need to set 
the limit higher when she has a program that requires deep recursion and 
a platform that supports a higher limit. This should be done with care, 
because a too-high limit can lead to a crash.
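
A small sketch of what the documented behaviour looks like in practice, assuming Python 3 (where the error is RecursionError). The function `count_down` and the chosen depths are hypothetical; the point is only that a modest increase lets a slightly deeper recursion succeed, and that the limit is restored afterwards rather than left high.

```python
import sys


def count_down(n):
    """Recurse n levels deep and return the depth reached."""
    if n == 0:
        return 0
    return 1 + count_down(n - 1)


def try_depth(n):
    """Return the depth reached, or None if the limit was hit."""
    try:
        return count_down(n)
    except RecursionError:
        return None


original_limit = sys.getrecursionlimit()
target = original_limit + 500  # just beyond what the current limit allows

before = try_depth(target)  # fails: deeper than the current limit

sys.setrecursionlimit(original_limit + 2000)  # a modest, cautious increase
try:
    after = try_depth(target)  # succeeds under the raised limit
finally:
    sys.setrecursionlimit(original_limit)  # always put the limit back

print(before, after)
```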




> I'm playing with a challenge a friend gave me: add each number, up to
> 1000, with its reverse, continuing the process until you've got a
> palindrome number. Then report back the number of iterations it takes.
> There's one number, 196, that never returns, so I skip it. It's a
> perfect place to practice recursion (I do it in the adding part, and the
> palindrome checking part), but apparently I can't help but blow up my
> machine...


Without seeing your code it's hard to be specific, but it's obvious 
you'll need to rethink your approach.  :)


Emile





Re: [Tutor] recursion depth

2014-01-08 Thread Keith Winston
On Wed, Jan 8, 2014 at 3:42 PM, Emile van Sebille em...@fenx.com wrote:


> Without seeing your code it's hard to be specific, but it's obvious you'll
> need to rethink your approach.  :)



Yes, it's clear I need to do the bulk of it without recursion; I haven't
really thought about how to do that. I may or may not ever get around to
doing it, since this was primarily an exercise in recursion, for me...
Thanks for your thoughts.
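
For reference, a minimal iterative sketch of the reverse-and-add exercise described in this thread. Keith's own code was never posted, so the function names and the `max_steps` cut-off below are assumptions. A plain loop replaces the recursive adding step, which means the number of iterations is no longer tied to the recursion limit.

```python
def reverse_number(n):
    """Return n with its decimal digits reversed, e.g. 532 -> 235."""
    return int(str(n)[::-1])


def is_palindrome(n):
    """True if n reads the same forwards and backwards."""
    digits = str(n)
    return digits == digits[::-1]


def steps_to_palindrome(n, max_steps=500):
    """Count reverse-and-add steps until a palindrome appears.

    The starting number gets no credit for already being a palindrome.
    Returns None if no palindrome turns up within max_steps steps.
    """
    for step in range(1, max_steps + 1):
        n += reverse_number(n)
        if is_palindrome(n):
            return step
    return None


print(steps_to_palindrome(532))  # 532 + 235 = 767, a palindrome
print(steps_to_palindrome(196))  # gives up after max_steps
```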

-- 
Keith


Re: [Tutor] recursion depth

2014-01-08 Thread eryksun
On Wed, Jan 8, 2014 at 3:25 PM, Keith Winston keithw...@gmail.com wrote:
> I've been playing with recursion, it's very satisfying.
>
> However, it appears that even if I sys.setrecursionlimit(10), it blows
> up at about 24,000 (appears to reset IDLE). I guess there must be a lot of
> overhead with recursion, if only 24k times are killing my memory?

CPython recursion is limited by the thread's stack size, since
evaluating a Python frame requires calling PyEval_EvalFrameEx. The
default stack size on Windows is 1 MiB, and on Linux RLIMIT_STACK is
typically set at 8 MiB (inspect this w/ the stdlib's resource module).

You can create a worker thread with a larger stack using the threading
module. On Windows the upper limit is 256 MiB, so give this a try:

import sys
import threading

MiB = 2 ** 20
threading.stack_size(256 * MiB)

sys.setrecursionlimit(10)

t = threading.Thread(target=your_function)
t.start()

I'm not saying this is a good solution in general. It's just something
to play around with, and may help in a pinch.
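
A self-contained variant of the same idea, as a sketch only: it runs a hypothetical recursive function (`count_down`) in a worker thread, collects the result, and restores both the stack size and the recursion limit afterwards. The 64 MiB stack and the depth of 20,000 are arbitrary values chosen for illustration, not recommendations, and how much stack a given depth needs varies by platform and Python version.

```python
import sys
import threading

MiB = 2 ** 20
DEPTH = 20_000  # hypothetical depth, chosen only for this demonstration


def count_down(n):
    """Recurse n levels deep and return the depth reached."""
    if n == 0:
        return 0
    return 1 + count_down(n - 1)


def run_deep(depth, stack_bytes=64 * MiB):
    """Run count_down(depth) in a worker thread with a larger stack."""
    result = {}

    def target():
        old_limit = sys.getrecursionlimit()
        sys.setrecursionlimit(depth + 1000)  # leave some headroom
        try:
            result["value"] = count_down(depth)
        finally:
            sys.setrecursionlimit(old_limit)

    # stack_size() returns the previous setting, so it can be restored.
    old_size = threading.stack_size(stack_bytes)
    try:
        worker = threading.Thread(target=target)
        worker.start()
        worker.join()
    finally:
        threading.stack_size(old_size)
    return result.get("value")


deep_result = run_deep(DEPTH)
print(deep_result)
```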


Re: [Tutor] recursion depth

2014-01-08 Thread spir

On 01/08/2014 10:11 PM, Keith Winston wrote:

> On Wed, Jan 8, 2014 at 3:42 PM, Emile van Sebille em...@fenx.com wrote:
>
> > Without seeing your code it's hard to be specific, but it's obvious you'll
> > need to rethink your approach.  :)
>
> Yes, it's clear I need to do the bulk of it without recursion, I haven't
> really thought about how to do that. I may or may not ever get around to
> doing it, since this was primarily an exercise in recursion, for me...
> Thanks for your thoughts.


Funny and useful exercise in recursion: write a func that builds str and repr 
expressions of any object, whatever its attributes, inductively. Eg with


obj.__repr__() = Type(attr1, attr2...)  # as in code
obj.__str__()  = {id1:attr1 id2:attr2...}   # nicer

Denis

PS: Don't know why it's not builtin, would be very useful for debugging, 
testing, any kind of programmer feedback. Guess it has to do with cycles, but 
there are ways to do that; and python manages cycles in list expressions:


spir@ospir:~$ python3
Python 3.3.1 (default, Sep 25 2013, 19:29:01)
[GCC 4.7.3] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> l1 = [1]
>>> l2 = [1, l1]
>>> l1.extend([l2,l1])
>>> l1
[1, [1, [...]], [...]]
>>> l2
[1, [1, [...], [...]]]


Re: [Tutor] recursion depth

2014-01-08 Thread Keith Winston
On Wed, Jan 8, 2014 at 4:23 PM, eryksun eryk...@gmail.com wrote:

> You can create a worker thread with a larger stack using the threading
> module. On Windows the upper limit is 256 MiB, so give this a try:



quite excellent, mwahaha... another shovel to help me excavate out the
bottom of my hole... I'll play with  this someday, but maybe not today. I
seem to be pushing some dangerous limits. Which does happen to be a hobby
of mine.


-- 
Keith


Re: [Tutor] recursion depth

2014-01-08 Thread Keith Winston
On Wed, Jan 8, 2014 at 5:15 PM, spir denis.s...@gmail.com wrote:

> Funny and useful exercise in recursion: write a func that builds str and
> repr expressions of any object, whatever its attributes, inductively. Eg
> with


Hmm, can't say I get the joke. I haven't really played with repr, though I
think I understand its use. Could you give an example? I'm not sure I
understand the goal.
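
To make the goal of that exercise concrete, here is one possible sketch, not Denis's own code: a recursive function that builds a `Type(name=value, ...)` string from an object's attributes and prints `...` when it meets an object it is already in the middle of describing, much as the built-in list repr does for cycles. The name `describe` and the example classes are hypothetical, and only lists and objects with a `__dict__` are handled.

```python
def describe(obj, _active=None):
    """Return a Type(name=value, ...) string built recursively from obj."""
    if _active is None:
        _active = set()  # ids of objects currently being described
    if id(obj) in _active:
        return "..."  # cycle: this object is already higher up the stack
    if isinstance(obj, list):
        _active.add(id(obj))
        try:
            return "[" + ", ".join(describe(x, _active) for x in obj) + "]"
        finally:
            _active.discard(id(obj))
    attrs = getattr(obj, "__dict__", None)
    if attrs is None:
        return repr(obj)  # leaf value such as an int or a str
    _active.add(id(obj))
    try:
        fields = ", ".join(
            f"{name}={describe(value, _active)}"
            for name, value in attrs.items()
        )
        return f"{type(obj).__name__}({fields})"
    finally:
        _active.discard(id(obj))


class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y


class Node:
    def __init__(self, name):
        self.name = name
        self.next = None


loop = Node("a")
loop.next = loop  # a node that refers to itself

print(describe(Point(1, 2)))
print(describe(loop))
```

Here the recursion depth follows the nesting depth of the object, not the number of attributes, which is why it suits recursion well.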


-- 
Keith


Re: [Tutor] recursion depth

2014-01-08 Thread Dave Angel

On Wed, 8 Jan 2014 16:23:06 -0500, eryksun eryk...@gmail.com wrote:
> On Wed, Jan 8, 2014 at 3:25 PM, Keith Winston keithw...@gmail.com wrote:
> > I've been playing with recursion, it's very satisfying.
> >
> > However, it appears that even if I sys.setrecursionlimit(10), it blows
> > up at about 24,000 (appears to reset IDLE). I guess there must be a lot of
> > overhead with recursion, if only 24k times are killing my memory?


I can't see the bodies of any of your messages (are you perchance 
posting in html?), but I think there's a good chance you're abusing 
recursion and therefore hitting the limit much sooner than necessary. 
I've seen some code samples here using recursion to fake a goto, 
for example. One question to ask is whether, each time you recurse, 
you are now solving a simpler problem. 

For example, when iterating over a tree you should only recurse when 
processing a SHALLOWER subtree.
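
A short sketch of the tree case, assuming a hypothetical representation in which each node is a `(value, children)` tuple. Every recursive call handles a strictly smaller subtree, so the recursion depth is bounded by the height of the tree rather than by the number of nodes.

```python
def tree_sum(node):
    """Sum all values in a tree of (value, children) tuples."""
    value, children = node
    # Each call receives a child, which is a strictly smaller subtree.
    return value + sum(tree_sum(child) for child in children)


def tree_height(node):
    """Number of levels in the tree; a lone node has height 1."""
    _, children = node
    return 1 + max((tree_height(child) for child in children), default=0)


small = (1, [(2, [(4, [])]), (3, [])])

# A very wide but shallow tree: 10,000 leaves under one root.
wide = (0, [(1, []) for _ in range(10_000)])

print(tree_sum(small), tree_height(small))
print(tree_sum(wide), tree_height(wide))
```

The wide tree has over ten thousand nodes, yet it recurses only two levels deep, whereas using recursion to step through those nodes one after another would need a depth of ten thousand.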


--
DaveA



Re: [Tutor] recursion depth

2014-01-08 Thread Keith Winston
On Wed, Jan 8, 2014 at 6:16 PM, Dave Angel da...@davea.name wrote:

> I can't see the bodies of any of your messages (are you perchance posting
> in html? ),  but I think there's a good chance you're abusing recursion and
> therefore hitting the limit much sooner than necessary. I've seen some code
> samples here using recursion to fake a goto,  for example.  One question to
> ask is whether each time you recurse, are you now solving a simpler
> problem.
> For example,  when iterating over a tree you should only recurse when
> processing a SHALLOWER subtree.


Hi Dave: I've been taken to task so often here about having unnecessary
chaff in my email replies, that I started getting in the habit of deleting
everything (since gmail by (unadjustable) default quotes the entire message
string unless you highlight/reply). Since I look at messages in a threaded
manner, I wasn't really realizing how much of a pain that was for others.
I'm trying to re-establish a highlight/reply habit, like this.

I don't THINK I'm misusing recursion, I think I'm just recursing ridiculous
things. The problem came in creating palindrome numbers. Apparently, if you
add a number to its reversal (532 + 235), it will be a palindrome, or do
it again (with the first result)... with the only mysterious exception of
196, as I understand it. Interestingly, most numbers reach this palindrome
state rather quickly: in the first 1000 numbers, here are the number of
iterations it takes (numbers don't get credit for being palindromes before
you start):

{0: 13, 1: 291, 2: 339, 3: 158, 4: 84, 5: 33, 6: 15, 7: 18, 8: 10, 10: 2,
11: 8, 14: 2, 15: 8, 16: 1, 17: 5, 19: 1, 22: 2, 23: 8, 24: 2}

Zero stands for where I ran out of recursion depth, set at the time at
9900. Except for the first zero, which is set at 196. It's sort of
fascinating: those two 24's both occur in the first 100.

So it hardly ever takes any iterations to palindromize a number, except
when it takes massive numbers. Except in the single case of 196, where it
never, ever happens apparently (though I understand this to not be proven,
merely tested out to a few million places).

-- 
Keith