On Sun, Jun 27, 2010 at 12:15 PM, Khawla Al-Wehaibi <kweha...@yahoo.com> wrote:
> Hi,
>
> I’m new to programming. I’m currently learning python to write a web
> crawler to extract all text from a web page, in addition to, crawling to
> further URLs and collecting the text there. The idea is to place all the
> extracted text in a .txt file with each word in a single line. So the text
> has to be tokenized. All punctuation marks, duplicate words and non-stop
> words have to be removed.

Welcome to Python! What you are doing is best done as a multi-step process, so that you can understand each step as you go. To really leverage Python, there are a couple of things you should read right off the bat:

http://docs.python.org/library/stdtypes.html (stuff about strings)

In Python, everything is an object, so everything has methods related to it. For instance, string objects have a find method that returns the index of the first occurrence of a substring, or -1 if the substring is not found. Pretty handy if you ask me. I would also read up on sets in Python; they will reduce the size of your code significantly.

> The program should crawl the web to a certain depth and collect the URLs
> and text from each depth (level). I decided to choose a depth of 3. I
> divided the code to two parts. Part one to collect the URLs and part two to
> extract the text. Here is my problem:
>
> 1. The program is extremely slow.

The best way to go about this is to use a profiler, so you can see where the time is actually spent:
http://docs.python.org/library/profile.html

> 2. I'm not sure if it functions properly.

To debug your code, you may want to read up on the Python debugger:
http://docs.python.org/library/pdb.html

> 3. Is there a better way to extract text?

See the documentation on strings and lists. I think you will be pleasantly surprised.

> 4. Are there any available modules to help clean the text i.e. removing
> duplicates, non-stop words ...

Read up on sets and the string methods. They are your friends.

> 5. Any suggestions or feedback is appreciated.
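For what it's worth, here is a minimal sketch of how string methods and a set could handle the cleanup step. The names tokenize and unique_words and the tiny STOP_WORDS set are made up for illustration, and I am assuming the goal is to drop common stop words along with punctuation and duplicates. Treat it as a starting point, not a finished solution.

```python
import string

# Hypothetical stop-word list for illustration; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}


def tokenize(text):
    """Split text on whitespace, strip surrounding punctuation, lowercase."""
    words = []
    for raw in text.split():
        word = raw.strip(string.punctuation).lower()
        if word:  # skip tokens that were nothing but punctuation
            words.append(word)
    return words


def unique_words(words, stop_words=STOP_WORDS):
    """Drop stop words and duplicates, keeping first-seen order."""
    seen = set()  # set membership tests are fast, which helps on large pages
    result = []
    for word in words:
        if word in stop_words or word in seen:
            continue
        seen.add(word)
        result.append(word)
    return result


sample = "The crawler reads the page, and the page links to another page."
cleaned = unique_words(tokenize(sample))
print("\n".join(cleaned))
```

To get the one-word-per-line .txt file, write "\n".join(cleaned) to a file instead of printing it. Note that strip only removes punctuation at the ends of a token, so a word like "don't" keeps its apostrophe.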
-Tino

PS: Please don't send HTML-laden emails; they are harder to work with. Thanks.
_______________________________________________
Tutor maillist - Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor