Re: Is secretly downloading to your computer ?!
On 03/12/15 00:53, Laura Creighton wrote:
> This is one of my favourite quotes of all time. Unfortunately, you
> have it slightly wrong. The quote is:
> Something must be done. This is something. Therefore we must do it.

I wish people would check their email subjects before replying to this thread. I suspect part of the OP's intent is to have his assertion generate lots of traffic, all repeating the assertion via their subject heading... Unless of course you actually agree with his assertion!

Cheers, -- Rob Hills Waikiki, Western Australia -- https://mail.python.org/mailman/listinfo/python-list
Re: Generate config file from template using Python search and replace.
A program I am writing at present does exactly this and I simply do multiple calls to string.replace (see below).

On 30/11/15 10:31, Mr Zaug wrote:
> I seem to be heading in this direction.
>
> #!/usr/bin/env python
> import re
> from os.path import exists
>
> script, template_file = argv
> print "Opening the template file..."
>
> with open (template_file, "r") as a_string:
>     data=a_string.read().replace('BRAND', 'Fluxotine')
>     data=data.replace('STRING_2', 'New String 2')
>     data=data.replace('STRING_3', 'New String 3')
> print(data)
>
> So now the challenge is to use the read().replace magic for multiple
> values.

It's crude, but it works well for me! -- Rob Hills Waikiki, Western Australia -- https://mail.python.org/mailman/listinfo/python-list
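The chained-replace pattern quoted above can be driven from a dict instead of repeated statements; a minimal sketch (the function name `render` is mine, and the placeholder names are just the ones from the quoted script):

```python
def render(template_text, values):
    # Apply each placeholder -> value substitution in turn.
    # Replacements run in dict insertion order, so a placeholder whose
    # name is a prefix of another (e.g. STRING_2 vs STRING_22) needs care.
    for placeholder, value in values.items():
        template_text = template_text.replace(placeholder, value)
    return template_text
```

For example, `render("BRAND 100mg", {"BRAND": "Fluxotine"})` gives `"Fluxotine 100mg"`.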
Re: Find relative url in mixed text/html
Hi Paul,

On 28/11/15 13:11, Paul Rubin wrote:
> Rob Hills <rhi...@medimorphosis.com.au> writes:
>> Note, in the beginning of this project, I looked at using "Beautiful
>> Soup" but my reading and limited testing led me to believe that it is
>> designed for well-formed HTML/XML and therefore was unsuitable for the
>> text/html soup I have. If that belief is incorrect, I'd be grateful for
>> general tips about using Beautiful Soup in this scenario...
> Beautiful Soup can deal with badly formed HTML pretty well, or at least
> it could in earlier versions. It gives you several different parsing
> options to choose from now. I think the default is lxml which is fast
> but maybe more strict. Check what the others are and see if a loose
> slow one is still there. It really is pretty slow so plan on a big
> computation task if you're converting a large forum.

I've had another look at Beautiful Soup and while it doesn't really help me much with urls (relative or absolute) embedded within text, it seems to do a good job of separating out links from the rest, so that could be useful in itself.

WRT time, I'm converting about 65MB of data which currently takes 14 seconds (on a 3yo laptop with an SSD running Ubuntu), which I reckon is pretty amazing performance for Python3, especially given my relatively crude coding skills. It'll be interesting to see if using Beautiful Soup adds significantly to that.

> phpBB gets a bad rap that's maybe well-deserved but I don't know what to
> suggest instead.

I did start to investigate Python-based alternatives; I've not heard much good said about php, but I probably move in the wrong circles. However, our hosting service doesn't support Python so I stopped hunting. Plus there is a significant group of forum members who hold very strong opinions about the functionality they want and it took a lot of work to get them to agree on something!

All that said, I'd be interested to see specific (and hopefully unbiased) info about phpBB's failings...
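For the narrower job of separating anchors out of tag soup, the stdlib's html.parser (which Beautiful Soup can also use as its backend) is similarly tolerant of malformed input. A minimal sketch, not using Beautiful Soup itself, with a class name I made up:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href values from <a> tags; tolerant of tag soup."""
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        # for bare attributes, hence the truthiness check.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(soup_text):
    parser = LinkCollector()
    parser.feed(soup_text)
    return parser.links
```

Unlike a strict XML parser, this doesn't raise on unclosed or mis-nested tags, which is the property that matters for forum-post soup.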
Cheers, -- Rob Hills Waikiki, Western Australia -- https://mail.python.org/mailman/listinfo/python-list
Re: Find relative url in mixed text/html
Hi Laura,

On 29/11/15 01:04, Laura Creighton wrote:
> In a message of Sun, 29 Nov 2015 00:25:07 +0800, Rob Hills writes:
>> All that said, I'd be interested to see specific (and hopefully
>> unbiased) info about phpBB's failings...
> People I know of who run different bb software say that the spammers
> really prefer phpBB. So keeping it spam free is about 4 times the
> work as for, for instance, IPB.
>
> Hackers seem to like it too -- possibly due to this:
> http://defensivedepth.com/2009/03/03/anatomy-of-a-hack-the-phpbbcom-attack/
>
> make sure you aren't vulnerable.

Thanks for the link and the advice. Personally, I'd rather go with something based on a language I am reasonably familiar with (eg Python or Java); however, it seems the vast bulk of forum software is based on PHP :-(

Cheers, -- Rob Hills Waikiki, Western Australia -- https://mail.python.org/mailman/listinfo/python-list
Re: Find relative url in mixed text/html
Hi Grobu,

On 28/11/15 15:07, Grobu wrote:
> Is it safe to assume that all the relative (cross) links take one of
> the following forms? :
>
> http://www.aeva.asn.au/forums/forum_posts.asp
> www.aeva.asn.au/forums/forum_posts.asp
> /forums/forum_posts.asp
> /forum_posts.asp (are you really sure about this one?)
>
> If so, and if your goal boils down to converting all instances of old
> style URLs to new style ones regardless of the context where they
> appear, why would a regex fail to meet your needs?

I'm actually not discounting anything and, as I mentioned, I've already used some regex to extract the properly-formed URLs (those starting with http://). I was fortunately able to find some example regex that I could figure out enough to tweak for my purpose. Unfortunately, my small brain hurts whenever I try to understand what a piece of regex is doing, and I don't like having bits in my code that hurt my brain.

BTW, that's not meant to be an invitation for someone to produce some regex for me; if I can't find any other way of doing it, I'll try to create my own regex and come back here if I can't get that working.

Cheers, -- Rob Hills Waikiki, Western Australia -- https://mail.python.org/mailman/listinfo/python-list
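For the record, the sort of regex already in use for the properly-formed URLs can stay quite small; a sketch (deliberately naive, not from the thread: it just grabs http(s):// up to the next whitespace, quote or angle bracket, nothing like a full URL grammar):

```python
import re

# Match http:// or https:// followed by characters that commonly
# appear in URLs; stops at whitespace, quotes and angle brackets.
URL_RE = re.compile(r'https?://[^\s"\'<>]+')

def absolute_urls(text):
    return URL_RE.findall(text)
```

This only finds the absolute links; the relative ones in Grobu's list (the forms without a scheme) need a different approach.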
Find relative url in mixed text/html
Hi,

For my sins I am migrating a volunteer association forum from one platform (WebWiz) to another (phpBB). I am (I hope) 95% of the way through the process.

Posts to our original forum comprise a soup of plain text, HTML and BBCodes. A post *may* include links done as either standard HTML links ( http://blah.blah.com.au or even just www.blah.blah.com.au ).

In my conversion process, I am trying to identify cross-links (links from one post on the forum to another) so I can convert them to links that will work in the new forum. My current code uses a Regular Expression (yes, I read the recent posts on this forum about regex and HTML!) to pull out "absolute" links ( starting with http:// ) and then I use Python to identify and convert the specific links I am interested in.

However, the forum also contains "cross-links" done using relative links and I'm unsure how best to proceed with that one. Googling so far has not been helpful, but that might be me using the wrong search terms.

Some examples of what I am talking about are:

Post fragment containing an "Absolute" cross-link:

  ive made a new thread: http://www.aeva.asn.au/forums/forum_posts.asp?TID=316=1958#1958

converts to:

  ive made a new thread: /viewtopic.php?t=316=1958#1958

Post fragment containing a "Relative" cross-link:

  Battery Management SystemVeroboard prototype

Needs converting to:

  Battery Management SystemVeroboard prototype

So, my question is: What is the best way to extract a list of "relative links" from mixed text/html that I can then walk through to identify the specific ones I want to convert?

Note, in the beginning of this project, I looked at using "Beautiful Soup" but my reading and limited testing led me to believe that it is designed for well-formed HTML/XML and therefore was unsuitable for the text/html soup I have. If that belief is incorrect, I'd be grateful for general tips about using Beautiful Soup in this scenario...
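One outline answer to the question above: once the hrefs are pulled out (by regex or a parser), urllib.parse can classify them, since a relative link has no scheme. A sketch with a made-up helper name; note that a bare www.example.com link also has no scheme, so it lands in the "relative" bucket and needs its own handling:

```python
from urllib.parse import urlparse

def split_links(hrefs):
    # Partition hrefs into absolute (has a scheme such as http) and
    # relative (no scheme: site-relative paths, plus bare www links).
    absolute, relative = [], []
    for href in hrefs:
        (absolute if urlparse(href).scheme else relative).append(href)
    return absolute, relative
```

The relative list can then be walked to rewrite the forum_posts.asp cross-links into their phpBB equivalents.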
TIA, -- Rob Hills Waikiki, Western Australia -- https://mail.python.org/mailman/listinfo/python-list
Re: What meaning is of '#!python'?
On 15/11/15 10:18, Chris Angelico wrote:
> On Sun, Nov 15, 2015 at 1:13 PM, fl <rxjw...@gmail.com> wrote:
>> Excuse me. Below is copied from the .py file:
>>
>> #!python
>> from numpy import *
>> from numpy.random import *
>>
> Then someone doesn't know how to use a shebang (or is deliberately
> abusing it), and you can ignore it. It starts with a hash, ergo it's a
> comment.
>
> ChrisA

Looks like the author of the script file has tried to create a Python shell script. This link describes them in detail:
http://www.dreamsyssoft.com/python-scripting-tutorial/intro-tutorial.php

I'm not sure whether the example originally quoted would work; I imagine it might on some 'nix operating systems. The more common first line is:

#!/usr/bin/env python

If you start a script file with this line and make the file executable, you can then run the script from the command line without having to preface it with a reference to your Python executable, eg:

my-script.py

versus

python my-script.py

HTH, -- Rob Hills Waikiki, Western Australia -- https://mail.python.org/mailman/listinfo/python-list
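A minimal script illustrating the pattern described above (the filename is just an example, not from the thread):

```python
#!/usr/bin/env python3
# my-script.py -- after `chmod +x my-script.py` this can be run as
# ./my-script.py: the kernel reads the shebang line and launches
# whichever python3 `env` finds on PATH.
import sys

def interpreter_path():
    # Which Python actually ran us -- handy for sanity-checking
    # that the shebang picked the interpreter you expected.
    return sys.executable

if __name__ == "__main__":
    print("Running under:", interpreter_path())
```

Run without the executable bit, `python my-script.py` still works; the shebang is just a comment to Python itself.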
Re: Need Help w. PIP!
On 05/09/15 01:47, Cody Piersall wrote:
> On Fri, Sep 4, 2015 at 12:22 PM, Steve Burrus
> <steveburru...@gmail.com <mailto:steveburru...@gmail.com>> wrote:
> <..>
>> "echo %path%
>>
>> C:\Python34;C:\Python34\python.exe;C:\Python34\Scripts;

It's a long time since I last used Windoze in anger, but that second path entry (C:\Python34\python.exe;) looks wrong to me. Unless Windoze has changed recently, you shouldn't have a program name in your path. IIRC, that's going to break all path entries that follow it, so it could be the cause of your problem (ie the "C:\Python34\Scripts;" part won't be accessible).

Perhaps try deleting the "C:\Python34\python.exe;" entry from your PATH environment variable and see what happens.

HTH, -- Rob Hills Waikiki, Western Australia -- https://mail.python.org/mailman/listinfo/python-list
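The eyeball check described above can be automated; a sketch (the function name and the .exe heuristic are mine, not anything from the thread):

```python
def suspicious_entries(path_value, sep=";"):
    # PATH should be a list of folders; flag entries that look like
    # programs instead. The .exe check is a crude, Windows-oriented
    # heuristic and won't catch extensionless executables.
    return [entry for entry in path_value.split(sep)
            if entry.lower().endswith(".exe")]
```

On the PATH quoted above it flags exactly the python.exe entry.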
Re: Need Help w. PIP!
On 05/09/15 08:55, MRAB wrote:
> On 2015-09-05 01:35, Rob Hills wrote:
>> On 05/09/15 01:47, Cody Piersall wrote:
>>> On Fri, Sep 4, 2015 at 12:22 PM, Steve Burrus
>>> <steveburru...@gmail.com <mailto:steveburru...@gmail.com>> wrote:
>>> <..>
>>> "echo %path%
>>>
>>> C:\Python34;C:\Python34\python.exe;C:\Python34\Scripts;
>>
>> It's a long time since I last used Windoze in anger, but that second
>> path entry (C:\Python34\python.exe;) looks wrong to me. Unless Windoze
>> has changed recently, you shouldn't have a program name in your path.
>> IIRC, that's going to break all path entries that follow it, so it could
>> be the cause of your problem (ie the "C:\Python34\Scripts;" part won't
>> be accessible.
>>
>> Perhaps try deleting the "C:\Python34\python.exe;" entry from your PATH
>> environment variable and see what happens.
>>
> It should be a list of folder paths. Including a file path doesn't
> appear to break it, and, in fact, I'd be surprised if it did; it should
> just keep searching, much like it should if the folder were missing.

You're probably right, but my recollection of Windoze is that it was very easily broken, hence my migration to Linux many moons ago. I reckon it wouldn't hurt to try getting rid of the invalid path entry anyway.

Cheers, -- Rob Hills Waikiki, Western Australia -- https://mail.python.org/mailman/listinfo/python-list
Re: Reading \n unescaped from a file
Hi Chris,

On 03/09/15 06:10, Chris Angelico wrote:
> On Wed, Sep 2, 2015 at 12:03 PM, Rob Hills <rhi...@medimorphosis.com.au>
> wrote:
>> My mapping file contents look like this:
>>
>> \r = \\n
>> “ =
> Oh, lovely. Code page 1252 when you're expecting UTF-8. Sadly, you're
> likely to have to cope with a whole pile of other mojibake if that
> happens :(

Yeah, tell me about it!!!

> Technically, what's happening is that your "\r" is literally a
> backslash followed by the letter r; the transformation of backslash
> sequences into single characters is part of Python source code
> parsing. (Incidentally, why do you want to change a carriage return
> into backslash-n? Seems odd.)
>
> Probably the easiest solution would be a simple and naive replace(),
> looking for some very specific strings and ignoring everything else.
> Easy to do, but potentially confusing down the track if someone tries
> something fancy :)
>
> line = line.split('#')[:1][0].strip()  # trim any trailing comments
> line = line.replace(r"\r", "\r")  # repeat this for as many backslash
> escapes as you want to handle
>
> Be aware that this, while simple, is NOT capable of handling escaped
> backslashes. In Python, "\\r" comes out the same as r"\r", but with
> this parser, it would come out the same as "\\\r". But it might be
> sufficient for you.

Thanks for the explanation which has helped me understand the problem. I also tried your approach but wound up with output data that somehow had every single character escaped :-(

I've since decided I was being too obsessive trying to load *everything* from my mapping file and have simply hard-coded my two escaped character replacements for now and moved on to more important problems (ie the Windoze character soup that comprises my data and which I have to clean up!).

Thanks again, -- Rob Hills Waikiki, Western Australia -- https://mail.python.org/mailman/listinfo/python-list
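One stdlib way to apply that transformation wholesale, instead of one replace() per escape sequence, is the unicode_escape codec; a sketch (caveat, and this is my note rather than anything from the thread: the codec treats the input as latin-1, so it can itself create mojibake if applied to whole lines of non-ASCII UTF-8 data — apply it only to the escape tokens):

```python
import codecs

def unescape(token):
    # Turn the two-character sequence backslash-r read from a file
    # into a real '\r', and '\\n' into a literal backslash + n,
    # exactly as the Python source parser would.
    return codecs.decode(token, "unicode_escape")
```

Unlike the naive replace() approach, this does handle escaped backslashes correctly.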
Re: Reading \n unescaped from a file
Hi,

On 03/09/15 06:31, MRAB wrote:
> On 2015-09-02 03:03, Rob Hills wrote:
>> I am developing code (Python 3.4) that transforms text data from one
>> format to another.
>>
>> As part of the process, I had a set of hard-coded str.replace(...)
>> functions that I used to clean up the incoming text into the desired
>> output format, something like this:
>>
>> dataIn = dataIn.replace('\r', '\\n')  # Tidy up linefeeds
>> dataIn = dataIn.replace('','<')  # Tidy up < character
>> dataIn = dataIn.replace('','>')  # Tidy up > character
>> dataIn = dataIn.replace('','o')  # No idea why but lots of
>> these: convert to 'o' character
>> dataIn = dataIn.replace('','f')  # .. and these: convert to
>> 'f' character
>> dataIn = dataIn.replace('','e')  # .. 'e'
>> dataIn = dataIn.replace('','O')  # .. 'O'
>>
> The problem with this approach is that the order of the replacements
> matters. For example, changing '' to '<' and then '' to '&'
> can give a different result to changing '' to '&' and then ''
> to '<'. If you started with the string 'lt;', then the first order
> would go 'lt;' => 'lt;' => '', whereas the second order
> would go 'lt;' => '' => '<'.

Ah yes, thanks for reminding me about that. I've since modified my code to use a collections.OrderedDict to store my mappings.

...

>> This all works "as advertised" *except* for the '\r' => '\\n'
>> replacement. Debugging the code, I see that my '\r' character is
>> "escaped" to '\\r' and the '\\n' to 'n' when they are read in from
>> the file.
>>
>> I've been googling hard and reading the Python docs, trying to get my
>> head around character encoding, but I just can't figure out how to get
>> these bits of code to do what I want.
>>
>> It seems to me that I need to either:
>>
>>   * change the way I represent '\r' and '\\n' in my mapping file; or
>>   * transform them somehow when I read them in
>>
>> However, I haven't figured out how to do either of these.
>>
> Try ast.literal_eval, although you'd need to make it look like a string
> literal first:

Thanks for the suggestion. For now, I've decided I was being too pedantic trying to load my two escaped strings from a file and I've simply hard coded them and moved on to other issues. I'll try this idea later on though.

Cheers, -- Rob Hills Waikiki, Western Australia -- https://mail.python.org/mailman/listinfo/python-list
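MRAB's ast.literal_eval suggestion can be sketched like this (the wrapping shown is my own naive version: it breaks if the raw token itself contains a double quote):

```python
import ast

def parse_escaped(token):
    # Wrap the raw token in quotes so literal_eval sees a string
    # literal and decodes its backslash escapes. Naive: a token
    # containing '"' or ending in a lone backslash would break this.
    return ast.literal_eval('"' + token + '"')
```

So the two-character token \r from the mapping file comes back as a real carriage return, while \\n comes back as a literal backslash followed by n.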
Re: Reading \n unescaped from a file
Hi Friedrich,

On 03/09/15 16:40, Friedrich Rentsch wrote:
>
> On 09/02/2015 04:03 AM, Rob Hills wrote:
>> Hi,
>>
>> I am developing code (Python 3.4) that transforms text data from one
>> format to another.
>>
>> As part of the process, I had a set of hard-coded str.replace(...)
>> functions that I used to clean up the incoming text into the desired
>> output format, something like this:
>>
>> dataIn = dataIn.replace('\r', '\\n')  # Tidy up linefeeds
>> dataIn = dataIn.replace('','<')  # Tidy up < character
>> dataIn = dataIn.replace('','>')  # Tidy up > character
>> dataIn = dataIn.replace('','o')  # No idea why but lots of
>> these: convert to 'o' character
>> dataIn = dataIn.replace('','f')  # .. and these: convert to
>> 'f' character
>> dataIn = dataIn.replace('','e')  # .. 'e'
>> dataIn = dataIn.replace('','O')  # .. 'O'
>>
>> These statements transform my data correctly, but the list of statements
>> grows as I test the data so I thought it made sense to store the
>> replacement mappings in a file, read them into a dict and loop through
>> that to do the cleaning up, like this:
>>
>> with open(fileName, 'r+t', encoding='utf-8') as mapFile:
>>     for line in mapFile:
>>         line = line.strip()
>>         try:
>>             if (line) and not line.startswith('#'):
>>                 line = line.split('#')[:1][0].strip()  # trim
>> any trailing comments
>>                 name, value = line.split('=')
>>                 name = name.strip()
>>                 self.filterMap[name]=value.strip()
>>         except:
>>             self.logger.error('exception occurred parsing
>> line [{0}] in file [{1}]'.format(line, fileName))
>>             raise
>>
>> Elsewhere, I use the following code to do the actual cleaning up:
>>
>> def filter(self, dataIn):
>>     if dataIn:
>>         for token, replacement in self.filterMap.items():
>>             dataIn = dataIn.replace(token, replacement)
>>     return dataIn
>>
>> My mapping file contents look like this:
>>
>> \r = \\n
>> â =
>> = <
>> = >
>> =
>> = F
>> = o
>> = f
>> = e
>> = O
>>
>> This all works "as advertised" *except* for the '\r' => '\\n'
>> replacement.
>> Debugging the code, I see that my '\r' character is
>> "escaped" to '\\r' and the '\\n' to 'n' when they are read in from
>> the file.
>>
>> I've been googling hard and reading the Python docs, trying to get my
>> head around character encoding, but I just can't figure out how to get
>> these bits of code to do what I want.
>>
>> It seems to me that I need to either:
>>
>>   * change the way I represent '\r' and '\\n' in my mapping file; or
>>   * transform them somehow when I read them in
>>
>> However, I haven't figured out how to do either of these.
>>
>> TIA,
>>
>
> I have had this problem too and can propose a solution ready to run
> out of my toolbox:
>
> class editor:
>
>     def compile (self, replacements):
>         targets, substitutes = zip (*replacements)
>         re_targets = [re.escape (item) for item in targets]
>         re_targets.sort (reverse = True)
>         self.targets_set = set (targets)
>         self.table = dict (replacements)
>         regex_string = '|'.join (re_targets)
>         self.regex = re.compile (regex_string, re.DOTALL)
>
>     def edit (self, text, eat = False):
>         hits = self.regex.findall (text)
>         nohits = self.regex.split (text)
>         valid_hits = set (hits) & self.targets_set  # Ignore targets with illegal re modifiers.
>         if valid_hits:
>             substitutes = [self.table [item] for item in hits if item in valid_hits] + []  # Make lengths equal for zip to work right
>             if eat:
>                 output = ''.join (substitutes)
>             else:
>                 zipped = zip (nohits, substitutes)
>                 output = ''.join (list (reduce (lambda a, b: a + b, [zipped][0]))) + nohits [-1]
>         else:
>             if eat:
>                 output = ''
>             else:
>                 output = input
>         return output
>
> >>> substitutions = (
>     ('\r', '\n'),
>     ('', '<'),
>     ('', '>'),
>     ('', 'o'),
>     ('', 'f'),
>
Reading \n unescaped from a file
Hi,

I am developing code (Python 3.4) that transforms text data from one format to another.

As part of the process, I had a set of hard-coded str.replace(...) functions that I used to clean up the incoming text into the desired output format, something like this:

dataIn = dataIn.replace('\r', '\\n')  # Tidy up linefeeds
dataIn = dataIn.replace('','<')  # Tidy up < character
dataIn = dataIn.replace('','>')  # Tidy up > character
dataIn = dataIn.replace('','o')  # No idea why but lots of these: convert to 'o' character
dataIn = dataIn.replace('','f')  # .. and these: convert to 'f' character
dataIn = dataIn.replace('','e')  # .. 'e'
dataIn = dataIn.replace('','O')  # .. 'O'

These statements transform my data correctly, but the list of statements grows as I test the data so I thought it made sense to store the replacement mappings in a file, read them into a dict and loop through that to do the cleaning up, like this:

with open(fileName, 'r+t', encoding='utf-8') as mapFile:
    for line in mapFile:
        line = line.strip()
        try:
            if (line) and not line.startswith('#'):
                line = line.split('#')[:1][0].strip()  # trim any trailing comments
                name, value = line.split('=')
                name = name.strip()
                self.filterMap[name] = value.strip()
        except:
            self.logger.error('exception occurred parsing line [{0}] in file [{1}]'.format(line, fileName))
            raise

Elsewhere, I use the following code to do the actual cleaning up:

def filter(self, dataIn):
    if dataIn:
        for token, replacement in self.filterMap.items():
            dataIn = dataIn.replace(token, replacement)
    return dataIn

My mapping file contents look like this:

\r = \\n
â =
= <
= >
=
= F
= o
= f
= e
= O

This all works "as advertised" *except* for the '\r' => '\\n' replacement. Debugging the code, I see that my '\r' character is "escaped" to '\\r' and the '\\n' to 'n' when they are read in from the file.

I've been googling hard and reading the Python docs, trying to get my head around character encoding, but I just can't figure out how to get these bits of code to do what I want.

It seems to me that I need to either:

  * change the way I represent '\r' and '\\n' in my mapping file; or
  * transform them somehow when I read them in

However, I haven't figured out how to do either of these.

TIA, -- Rob Hills Waikiki, Western Australia -- https://mail.python.org/mailman/listinfo/python-list
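One subtlety in the filter loop above, separate from the escaping problem: it applies the filterMap replacements in whatever order items() yields them, and the result can depend on that order (one reason an OrderedDict or a list of pairs is safer than a plain pre-3.7 dict). A small sketch demonstrating the order dependence, using HTML-entity tokens I made up for illustration:

```python
def apply_replacements(text, pairs):
    # Apply (old, new) pairs in sequence; later replacements see the
    # output of earlier ones, so ordering changes the result.
    for old, new in pairs:
        text = text.replace(old, new)
    return text
```

For example, on the input "&amp;lt;", running ("&lt;" -> "<") before ("&amp;" -> "&") yields "&lt;", while the reverse order double-unescapes it all the way to "<".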