Re: Taking data from a text file to parse html page
No, I am not running Linux to any extent. But I am very strict about case. There is not a single instance of se.py or sel.py anywhere on my system. You'll have to find out where lower case sneaks in on yours. The zip file preserves case and in the zip file the names are upper case. I am baffled. But I believe that an import tripping up on the wrong case can't be a hard nut to crack.

Frederic

----- Original Message -----
From: DH [EMAIL PROTECTED]
Newsgroups: comp.lang.python
To: python-list@python.org
Sent: Saturday, August 26, 2006 5:47 AM
Subject: Re: Taking data from a text file to parse html page

Yes I know how to import modules... I think I found the problem, Linux handles upper and lower case differently, so for some reason you can't import SE but if you rename it to se it gives you the error that it can't find SEL which if you rename it will complain that SEL isn't defined... Are you running Linux? Have you tested it with Linux?

--
http://mail.python.org/mailman/listinfo/python-list
Re: Taking data from a text file to parse html page
Anthra Norell wrote:
[...] I believe that an import tripping up on the wrong case can't be a hard nut to crack.

The problem is the extension: SE.py is acceptable, while SE.PY is not.

Georg
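Georg's point can be checked mechanically. What follows is only a minimal sketch, written for Python 3 rather than the Python 2.4 used in this thread; the helper name `find_miscased_sources` is made up for illustration and is not part of SE or the standard library. It lists files whose extension is `.py` in some other capitalization, which is the kind of file a case-sensitive file system will not offer to `import`.

```python
import os


def find_miscased_sources(directory):
    """Return file names whose extension is '.py' in a non-lowercase spelling."""
    miscased = []
    for name in sorted(os.listdir(directory)):
        _stem, ext = os.path.splitext(name)
        # '.PY', '.Py' and '.pY' all lowercase to '.py' but are not equal to it.
        if ext.lower() == ".py" and ext != ".py":
            miscased.append(name)
    return miscased
```

Run against the unpacked SE-2.2 directory, a result such as `['SE.PY', 'SEL.PY']` would point at the files to rename, for example with `os.rename`.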
Re: Taking data from a text file to parse html page
Yes! It just occurred to me that this could be the problem. I have to change that. Thanks for the hint.

Frederic

----- Original Message -----
From: Georg Brandl [EMAIL PROTECTED]
Newsgroups: comp.lang.python
To: python-list@python.org
Sent: Saturday, August 26, 2006 1:59 PM
Subject: Re: Taking data from a text file to parse html page

[...] The problem is the extension: SE.py is acceptable, while SE.PY is not.

Georg
Re: Taking data from a text file to parse html page
Surely you write your own programs (program_name.py). You import and run them. You may put SE.PY and SEL.PY into the same directory. That's all. Or if you prefer to keep other people's stuff in a different directory, just make sure that directory is in sys.path, because that is where import looks. Check for that directory's presence in the sys.path list:

    sys.path
    ['C:\\Python24\\Lib\\idlelib', 'C:\\', 'C:\\PYTHON24\\DLLs', 'C:\\PYTHON24\\lib', 'C:\\PYTHON24\\lib\\plat-win', 'C:\\PYTHON24\\lib\\lib-tk' (... etc)]

Supposing it isn't there, add it:

    sys.path.append ('/python/code/other_peoples_stuff')
    import SE

That should do it. Let me know if it works. Else just keep asking.

Frederic

----- Original Message -----
From: DH [EMAIL PROTECTED]
Newsgroups: comp.lang.python
To: python-list@python.org
Sent: Friday, August 25, 2006 4:40 AM
Subject: Re: Taking data from a text file to parse html page

SE looks very helpful... I'm having a hell of a time installing it though: [...]
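The sys.path advice above can be wrapped in a small helper. This is a sketch only, in Python 3 syntax; the function name `ensure_on_path` is hypothetical, and the directory is the example path from the message, not a real location.

```python
import sys


def ensure_on_path(directory):
    """Append directory to sys.path unless it is already listed there."""
    if directory not in sys.path:
        sys.path.append(directory)
    return sys.path


# Example path taken from the message above; adjust it to where SE.py lives.
ensure_on_path('/python/code/other_peoples_stuff')
```

Calling the helper twice leaves a single entry, so it is safe to run at the top of every script that needs the extra directory before its `import SE` line.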
Re: Taking data from a text file to parse html page
Yes I know how to import modules... I think I found the problem, Linux handles upper and lower case differently, so for some reason you can't import SE but if you rename it to se it gives you the error that it can't find SEL which if you rename it will complain that SEL isn't defined... Are you running Linux? Have you tested it with Linux?
Re: Taking data from a text file to parse html page
DH,

Could you be more specific describing what you have and what you want? You are addressing people, many of whom are good at stripping useless junk once you tell them what 'useless junk' is. Also it helps to post some of your data that you need to process and a sample of the same data as it should look once it is processed.

Frederic

----- Original Message -----
From: DH [EMAIL PROTECTED]
Newsgroups: comp.lang.python
To: python-list@python.org
Sent: Thursday, August 24, 2006 2:11 AM
Subject: Taking data from a text file to parse html page

Hi, I'm trying to strip the html and other useless junk from a html page.. I'd like to create something like an automated text editor, where it takes the keywords from a txt file and removes them from the html page (replace the words in the html page with blank space) I'm new to python and could use a little push in the right direction, any ideas on how to implement this? Thanks!
Re: Taking data from a text file to parse html page
DH wrote:
Hi, I'm trying to strip the html and other useless junk from a html page.. [...] I'm new to python and could use a little push in the right direction, any ideas on how to implement this?

See Beautiful Soup:

http://www.crummy.com/software/BeautifulSoup/

It will parse even badly formed HTML and allow you to extract/change information as you wish.

-Larry Bates
Re: Taking data from a text file to parse html page
Frederic,

Good points... I have a plain text file containing the html and words that I want removed (keywords) from the html file. After processing the html file it would save it as a plain text file. So the program would import the keywords, remove them from the html file and save the html file as something.txt. I would post the data but it's secret. I can post an example:

index.html (html page)

    <div><p><em>&quot;Python has been an important part of Google since the beginning, and remains so as the system grows and evolves.
    &quot;</em></p>
    <p>-- Peter Norvig, <a class=reference

replace.txt (keywords)

    <div id=quote class=homepage-box>
    <div><p><em>&quot;
    &quot;</em></p>
    <p>-- Peter Norvig, <a class=reference

something.txt (file after editing)

    Python has been an important part of Google since the beginning, and remains so as the system grows and evolves.

Larry,

I've looked into using BeautifulSoup but came to the conclusion that my idea would work better in the end. Thanks for the help.
Re: Taking data from a text file to parse html page
DH wrote:
I'm trying to strip the html and other useless junk from a html page.. I'd like to create something like an automated text editor, where it takes the keywords from a txt file and removes them from the html page (replace the words in the html page with blank space) [...] I've looked into using BeautifulSoup but came to the conclusion that my idea would work better in the end.

You could use BeautifulSoup anyway for the junk-removal part and then do your magic. Even if it is not exactly what you want, it is a good idea to try to reuse modules that are good at what they do.

--
Roberto Bonvallet
Re: Taking data from a text file to parse html page
DH wrote:
I have a plain text file containing the html and words that I want removed(keywords) from the html file, after processing the html file it would save it as a plain text file. [...]

reading and writing files is described in the tutorial; see

http://pytut.infogami.com/node9.html

(scroll down to Reading and Writing Files)

to do the replacement, you can use repeated calls to the replace method

http://pyref.infogami.com/str.replace

but that may cause problems if the replacement text contains things that should be replaced. for an efficient way to do a parallel replace, see:

http://effbot.org/zone/python-replace.htm#multiple

/F
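The difference between repeated and parallel replacement can be shown in a few lines. The sketch below is in Python 3 and is not the code from the effbot page linked above; it is one common way to do a single-pass replace with the standard `re` module, and the names `replace_sequential`, `replace_parallel` and `swap` are made up for this illustration.

```python
import re


def replace_sequential(text, mapping):
    """Apply each replacement in turn; later ones also see earlier results."""
    for old, new in mapping.items():
        text = text.replace(old, new)
    return text


def replace_parallel(text, mapping):
    """Replace all keys in one pass, so replaced text is never rescanned."""
    if not mapping:
        return text
    # Longest keys first, so a key that is a prefix of another cannot win early.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    return pattern.sub(lambda match: mapping[match.group(0)], text)


swap = {"cat": "dog", "dog": "cat"}
print(replace_sequential("cat chases dog", swap))  # cat chases cat
print(replace_parallel("cat chases dog", swap))    # dog chases cat
```

For pure deletions (every replacement is the empty string) the two approaches usually agree, which is why repeated `replace` calls are often good enough for a keyword list like replace.txt.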
Re: Taking data from a text file to parse html page
I found this
http://groups.google.com/group/comp.lang.python/browse_thread/thread/d1bda6ebcfb060f9/ad0ac6b1ac8cff51?lnk=gst&q=replace+text+file&rnum=8#ad0ac6b1ac8cff51

Credit Jeremy Moles
---
finds = ("{", "}", "(", ")")
lines = file("foo.txt", "r").readlines()

for line in lines:
    for find in finds:
        if find in line:
            line.replace(find, "")

print lines
---

I want something like
---
finds = file("replace.txt")
lines = file("foo.txt", "r").readlines()

for line in lines:
    for find in finds:
        if find in line:
            line.replace(find, "")

print lines
---
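One thing to watch in the snippets above: `line.replace(find, "")` returns a new string and leaves `line` untouched, so `lines` is printed unchanged. Lines read from replace.txt also keep their trailing newline, which prevents them from matching. Below is a minimal Python 3 sketch of the "something like" version under those assumptions; the function names are hypothetical, and the file names are the ones used in this thread.

```python
def load_keywords(path):
    """Read one keyword per line, dropping blank lines and surrounding whitespace."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def remove_keywords(text, keywords):
    """Delete every keyword from text, keeping the result of each replace."""
    # Longest first, so a short keyword cannot break up a longer one.
    for keyword in sorted(keywords, key=len, reverse=True):
        text = text.replace(keyword, "")
    return text


def strip_file(html_path, keyword_path, out_path):
    """Write html_path to out_path with the keywords in keyword_path removed."""
    with open(html_path, encoding="utf-8") as source:
        cleaned = remove_keywords(source.read(), load_keywords(keyword_path))
    with open(out_path, "w", encoding="utf-8") as target:
        target.write(cleaned)
```

A call such as `strip_file("index.html", "replace.txt", "something.txt")` would then produce the output file described earlier in the thread. Because keywords are stripped of whitespace, a keyword that must include leading or trailing spaces would need a different file format.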
Re: Taking data from a text file to parse html page
You may also want to look at this stream editor:

http://cheeseshop.python.org/pypi/SE/2.2%20beta

It allows multiple replacements in a definition format of utmost simplicity:

    your_example = '''
    <div><p><em>&quot;Python has been an important part of Google since the beginning, and remains so as the system grows and evolves.
    &quot;</em></p>
    <p>-- Peter Norvig, <a class=reference
    '''

    import SE
    Tag_Stripper = SE.SE ('''
       ~<(.|\n)*?>~=        # This pattern finds all tags and deletes them (replaces with nothing)
       ~<!--(.|\n)*?-->~=   # This pattern deletes comments entirely even if they nest tags
    ''')

    print Tag_Stripper (your_example)

    &quot;Python has been an important part of Google since the beginning, and remains so as the system grows and evolves.
    &quot;
    -- Peter Norvig, <a class=reference

Now you see a tag fragment. So you add another deletion to the Tag_Stripper (***):

    Tag_Stripper = SE.SE ('''
       ~<(.|\n)*?>~=        # This pattern finds all tags and deletes them (replaces with nothing)
       ~<!--(.|\n)*?-->~=   # This pattern deletes comments entirely even if they nest tags
       <a class\=reference=   # *** This deletes the fragment
       # -- Peter Norvig, <a class\=reference=   # Or like this if Peter Norvig has to go too
    ''')

    print Tag_Stripper (your_example)

    &quot;Python has been an important part of Google since the beginning, and remains so as the system grows and evolves.
    &quot;
    -- Peter Norvig,

&quot; you can either translate or delete:

    Tag_Stripper = SE.SE ('''
       ~<(.|\n)*?>~=        # This pattern finds all tags and deletes them (replaces with nothing)
       ~<!--(.|\n)*?-->~=   # This pattern deletes comments entirely even if they nest tags
       <a class\=reference=   # This deletes the fragment
       # -- Peter Norvig, <a class=\\reference\\=   # Or like this if Peter Norvig has to go too
       htm2iso.se   # This is a file (contained in the SE package) that translates all ampersand codes.
                    # Naming the file is all you need to do to include the replacements which it defines.
    ''')

    print Tag_Stripper (your_example)

    'Python has been an important part of Google since the beginning, and remains so as the system grows and evolves.
    '
    -- Peter Norvig,

If instead of htm2iso.se you write &quot;= you delete it and your output will be:

    Python has been an important part of Google since the beginning, and remains so as the system grows and evolves.

    -- Peter Norvig,

Your Tag_Stripper also does files:

    print Tag_Stripper ('my_file.htm', 'my_file_without_tags')
    'my_file_without_tags'

A stream editor is not a substitute for a parser. It does, however, handle simple translation jobs like this one more economically, where a parser does a lot of work which you don't need.

Regards

Frederic

----- Original Message -----
From: DH [EMAIL PROTECTED]
Newsgroups: comp.lang.python
To: python-list@python.org
Sent: Thursday, August 24, 2006 7:41 PM
Subject: Re: Taking data from a text file to parse html page

I found this [...]
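For readers without SE installed, the same two deletions (comments, then tags) can be approximated with the standard library. This is a sketch in Python 3 using `re` and `html.unescape`; it is not SE and does not reproduce SE's definition syntax, and the names `COMMENT`, `TAG` and `strip_tags` are made up for this example.

```python
import html
import re

# Comments are removed first, so tags nested inside a comment go with it.
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
# A tag is '<', anything except '>', then '>' (newlines inside are allowed).
TAG = re.compile(r"<[^>]*>")


def strip_tags(markup):
    """Delete HTML comments and tags, then translate character references."""
    without_comments = COMMENT.sub("", markup)
    without_tags = TAG.sub("", without_comments)
    return html.unescape(without_tags)  # e.g. &quot; becomes "
```

As in the SE session, an unterminated fragment such as `<a class=reference` has no closing `>` and is therefore left in place; it would need its own explicit deletion. Like the stream editor, this is pattern matching rather than parsing, so it suits simple, well-understood input.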
Re: Taking data from a text file to parse html page
SE looks very helpful... I'm having a hell of a time installing it though:

-
[EMAIL PROTECTED]:~/Desktop/SE-2.2$ sudo python SETUP.PY install
running install
running build
running build_py
file SEL.py (for module SEL) not found
file SE.py (for module SE) not found
file SEL.py (for module SEL) not found
file SE.py (for module SE) not found
--

Anthra Norell wrote:
You may also want to look at this stream editor: http://cheeseshop.python.org/pypi/SE/2.2%20beta [...]