Re: Is secretly downloading to your computer ?!

2015-12-02 Thread Rob Hills
On 03/12/15 00:53, Laura Creighton wrote:

> This is one of my favourite quotes of all time.  Unfortunately, you
> have it slightly wrong.  The quote is:
> Something must be done.  This is something.  Therefore we must do it.

I wish people would check their email subjects before replying to this
thread.  I suspect part of the OP's intent is to have his assertion
generate lots of traffic, all repeating the assertion via their subject
heading...

Unless of course you actually agree with his assertion!

Cheers,

-- 
Rob Hills
Waikiki, Western Australia

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Generate config file from template using Python search and replace.

2015-11-29 Thread Rob Hills
A program I am writing at present does exactly this and I simply do
multiple calls to string.replace (see below)

On 30/11/15 10:31, Mr Zaug wrote:
> I seem to be heading in this direction.
>
> #!/usr/bin/env python
> import re
> from os.path import exists
> from sys import argv
>
> script, template_file = argv
> print "Opening the template file..."
>
> with open (template_file, "r") as a_string:
>     data=a_string.read().replace('BRAND', 'Fluxotine')

data=data.replace('STRING_2', 'New String 2')
data=data.replace('STRING_3', 'New String 3')

> print(data)
>
> So now the challenge is to use the read().replace magic for multiple values.

It's crude, but it works well for me!
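
The chained replace() calls can also be driven from a dictionary.  A
minimal sketch, assuming a hypothetical helper name (fill_template) and
a made-up template text; only the placeholder names come from this
thread:

```python
def fill_template(template, values):
    """Replace each placeholder in `template` with its value (plain str.replace)."""
    for placeholder, value in values.items():
        template = template.replace(placeholder, value)
    return template


# Hypothetical template text and values, for illustration only.
template_text = "brand=BRAND\nsecond=STRING_2\nthird=STRING_3\n"
values = {
    "BRAND": "Fluxotine",
    "STRING_2": "New String 2",
    "STRING_3": "New String 3",
}
data = fill_template(template_text, values)
print(data)
```

This stays as crude as the original: if one placeholder is a substring
of another, or a replacement value itself contains a placeholder, the
result depends on the order of the replacements.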

-- 
Rob Hills
Waikiki, Western Australia



Re: Find relative url in mixed text/html

2015-11-28 Thread Rob Hills
Hi Paul,

On 28/11/15 13:11, Paul Rubin wrote:
> Rob Hills <rhi...@medimorphosis.com.au> writes:
>> Note, in the beginning of this project, I looked at using "Beautiful
>> Soup" but my reading and limited testing led me to believe that it is
>> designed for well-formed HTML/XML and therefore was unsuitable for the
>> text/html soup I have.  If that belief is incorrect, I'd be grateful for
>> general tips about using Beautiful Soup in this scenario...
> Beautiful Soup can deal with badly formed HTML pretty well, or at least
> it could in earlier versions.  It gives you several different parsing
> options to choose from now.  I think the default is lxml which is fast
> but maybe more strict.  Check what the others are and see if a loose
> slow one is still there.  It really is pretty slow so plan on a big
> computation task if you're converting a large forum.

I've had another look at Beautiful Soup and while it doesn't really help
me much with urls (relative or absolute) embedded within text, it seems
to do a good job of separating out links from the rest, so that could be
useful in itself.

WRT time, I'm converting about 65MB of data which currently takes 14
seconds (on a 3yo laptop with a SSD running Ubuntu), which I reckon is
pretty amazing performance for Python3, especially given my relatively
crude coding skills.  It'll be interesting to see if using Beautiful
Soup adds significantly to that.

> phpBB gets a bad rap that's maybe well-deserved but I don't know what to
> suggest instead.

I did start to investigate Python-based alternatives; I've not heard
much good said about php, but I probably move in the wrong circles. 
However, our hosting service doesn't support Python so I stopped
hunting.  Plus there is a significant group of forum members who hold
very strong opinions about the functionality they want and it took a lot
of work to get them to agree on something!

All that said, I'd be interested to see specific (and hopefully
unbiased) info about phpBB's failings...

Cheers,

-- 
Rob Hills
Waikiki, Western Australia



Re: Find relative url in mixed text/html

2015-11-28 Thread Rob Hills
Hi Laura,

On 29/11/15 01:04, Laura Creighton wrote:
> In a message of Sun, 29 Nov 2015 00:25:07 +0800, Rob Hills writes:
>> All that said, I'd be interested to see specific (and hopefully
>> unbiased) info about phpBB's failings...
> People I know of who run different bb software say that the spammers
> really prefer phpBB.  So keeping it spam free is about 4 times the
> work as for, for instance, IPB.
>
> Hackers seem to like it too -- possibly due to this:
> http://defensivedepth.com/2009/03/03/anatomy-of-a-hack-the-phpbbcom-attack/
>
> make sure you aren't vulnerable.

Thanks for the link and the advice.

Personally, I'd rather go with something based on a language I am
reasonably familiar with (eg Python or Java); however, it seems the vast
bulk of forum software is based on PHP :-(

Cheers,

-- 
Rob Hills
Waikiki, Western Australia



Re: Find relative url in mixed text/html

2015-11-28 Thread Rob Hills
Hi Grobu,

On 28/11/15 15:07, Grobu wrote:
> Is it safe to assume that all the relative (cross) links take one of
> the following forms? :
>
> http://www.aeva.asn.au/forums/forum_posts.asp
> www.aeva.asn.au/forums/forum_posts.asp
> /forums/forum_posts.asp
> /forum_posts.asp (are you really sure about this one?)
>
> If so, and if your goal boils down to converting all instances of old
> style URLs to new style ones regardless of the context where they
> appear, why would a regex fail to meet your needs?

I'm actually not discounting anything and as I mentioned, I've already
used some regex to extract the properly-formed URLs (those starting with
http://).  I was fortunately able to find some example regex that I
could figure out enough to tweak for my purpose.  Unfortunately, my
small brain hurts whenever I try and understand what a piece of regex is
doing and I don't like having bits in my code that hurt my brain. 

BTW, that's not meant to be an invitation to someone to produce some
regex for me, if I can't find any other way of doing it, I'll try and
create my own regex and come back here if I can't get that working.
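
For what it's worth, the link forms listed above can also be recognised
without any regex.  A minimal sketch using urllib.parse from the
standard library; the host and page names are the ones quoted in this
thread, while the function name, the assumption that the old TID value
maps straight onto the new t parameter, and the decision to drop
everything else in the link are illustrative guesses:

```python
from urllib.parse import urlsplit, parse_qs

OLD_HOST = "www.aeva.asn.au"        # old forum host, as quoted in this thread
OLD_PAGE = "forum_posts.asp"        # old topic page, as quoted in this thread


def convert_link(link):
    """Return a new-style topic link, or the link unchanged if it is not a cross-link."""
    # A bare "www..." link has no scheme, so urlsplit would treat it as a path.
    candidate = "http://" + link if link.lower().startswith("www.") else link
    parts = urlsplit(candidate)
    if parts.netloc and parts.netloc.lower() != OLD_HOST:
        return link                      # points at some other site
    if parts.path.rsplit("/", 1)[-1].lower() != OLD_PAGE:
        return link                      # not a topic page
    query = {key.lower(): vals for key, vals in parse_qs(parts.query).items()}
    topic_ids = query.get("tid")
    if not topic_ids:
        return link                      # no topic id to carry over
    # Assumption: the old TID maps directly onto the new forum's t parameter.
    return "/viewtopic.php?t=" + topic_ids[0]
```

This only rewrites one link at a time, so the links still have to be
found in the post text first.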

Cheers,

-- 
Rob Hills
Waikiki, Western Australia



Find relative url in mixed text/html

2015-11-27 Thread Rob Hills
Hi,

For my sins I am migrating a volunteer association forum from one
platform (WebWiz) to another (phpBB).  I am (I hope) 95% of the way
through the process.

Posts to our original forum comprise a soup of plain text, HTML and
BBCodes.  A post */may/* include links done as either standard HTML
links ( http://blah.blah.com.au or even just
www.blah.blah.com.au ).

In my conversion process, I am trying to identify cross-links (links
from one post on the forum to another) so I can convert them to links
that will work in the new forum.

My current code uses a Regular Expression (yes, I read the recent posts
on this forum about regex and HTML!) to pull out "absolute" links (
starting with http:// ) and then I use Python to identify and convert
the specific links I am interested in.  However, the forum also contains
"cross-links" done using relative links and I'm unsure how best to
proceed with that one.  Googling so far has not been helpful, but that
might be me using the wrong search terms. 

Some examples of what I am talking about are:

Post fragment containing an "Absolute" cross-link:

ive made a new thread:
http://www.aeva.asn.au/forums/forum_posts.asp?TID=316=1958#1958


converts to:


ive made a new thread:
/viewtopic.php?t=316=1958#1958

Post fragment containing a "Relative" cross-link:

Battery Management SystemVeroboard prototype

Needs converting to:

Battery Management SystemVeroboard prototype

So, my question is:  What is the best way to extract a list of "relative
links" from mixed text/html that I can then walk through to identify the
specific ones I want to convert?

Note, in the beginning of this project, I looked at using "Beautiful
Soup" but my reading and limited testing led me to believe that it is
designed for well-formed HTML/XML and therefore was unsuitable for the
text/html soup I have.  If that belief is incorrect, I'd be grateful for
general tips about using Beautiful Soup in this scenario...
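
One way to pull candidate links out of a post is a lenient HTML parser.
A minimal sketch, using the standard library's html.parser rather than
Beautiful Soup (a swapped-in technique, not what the conversion code
currently does); the class and function names are made up for
illustration:

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def relative_links(post):
    """Return the hrefs in `post` that carry no host part."""
    collector = LinkCollector()
    collector.feed(post)
    collector.close()
    # Note: a bare "www.example.com" href has no scheme, so it also counts
    # as having no host here and would need a separate check.
    return [link for link in collector.links if not urlsplit(link).netloc]
```

The parser ignores BBCode and plain text and does not insist on closed
tags, so it only sees links written as HTML anchors; bare URLs in the
text would still need other handling.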

TIA,

-- 
Rob Hills
Waikiki, Western Australia



Re: What meaning is of '#!python'?

2015-11-14 Thread Rob Hills
On 15/11/15 10:18, Chris Angelico wrote:
> On Sun, Nov 15, 2015 at 1:13 PM, fl <rxjw...@gmail.com> wrote:
>> Excuse me. Below is copied from the .py file:
>>
>> #!python
>> from numpy import *
>> from numpy.random import *
>>
> Then someone doesn't know how to use a shebang (or is deliberately
> abusing it), and you can ignore it. It starts with a hash, ergo it's a
> comment.
>
> ChrisA

Looks like the author of the script file has tried to create a Python
Shell script.  This link describes them in detail:

http://www.dreamsyssoft.com/python-scripting-tutorial/intro-tutorial.php

I'm not sure whether the example originally quoted would work; I imagine
it might on some *nix operating systems.

The more common first line is:

#!/usr/bin/env python

If you start a script file with this line and make the file executable,
you can then run the script from the command line without having to
preface it with a reference to your Python executable.  Eg:

    ./my-script.py

versus

    python my-script.py
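
As an illustration, a minimal sketch of such a script; the file name
my-script.py is the one used above, while the greeting logic and the use
of python3 in the first line are assumptions made up for the example:

```python
#!/usr/bin/env python3
"""Minimal executable script: the first line tells a Unix-like shell
which interpreter to run the file with."""
import sys


def main(argv):
    # argv[0] is the script name; an optional first argument is the name to greet.
    name = argv[1] if len(argv) > 1 else "world"
    return "Hello, {}!".format(name)


if __name__ == "__main__":
    print(main(sys.argv))
```

After "chmod +x my-script.py" the file can be started directly on
Unix-like systems.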


HTH,

-- 
Rob Hills
Waikiki, Western Australia



Re: Need Help w. PIP!

2015-09-04 Thread Rob Hills
On 05/09/15 01:47, Cody Piersall wrote:
> > On Fri, Sep 4, 2015 at 12:22 PM, Steve Burrus
> <steveburru...@gmail.com <mailto:steveburru...@gmail.com>> wrote:
>
<..>
> >> "echo %path%
> >>
> >> C:\Python34;C:\Python34\python.exe;C:\Python34\Scripts;

It's a long time since I last used Windoze in anger, but that second
path entry (C:\Python34\python.exe;) looks wrong to me.  Unless Windoze
has changed recently, you shouldn't have a program name in your path. 
IIRC, that's going to break all path entries that follow it, so it could
be the cause of your problem (ie the "C:\Python34\Scripts;" part won't
be accessible).

Perhaps try deleting the "C:\Python34\python.exe;" entry from your PATH
environment variable and see what happens.
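
To spot entries like that one, the PATH value can be checked from
Python itself.  A minimal sketch, with a made-up function name, that
separates the entries that are existing folders from everything else:

```python
import os


def split_path_entries(path_value, separator=os.pathsep):
    """Split a PATH-style string into (directories, other_entries)."""
    directories, others = [], []
    for entry in path_value.split(separator):
        entry = entry.strip()
        if not entry:
            continue                     # skip empty entries such as a trailing ';'
        if os.path.isdir(entry):
            directories.append(entry)
        else:
            others.append(entry)         # a file, or a folder that does not exist
    return directories, others


if __name__ == "__main__":
    dirs, suspects = split_path_entries(os.environ.get("PATH", ""))
    for entry in suspects:
        print("not a directory:", entry)
```

Anything it reports is either a file path or a folder that is missing.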

HTH,

-- 
Rob Hills
Waikiki, Western Australia



Re: Need Help w. PIP!

2015-09-04 Thread Rob Hills
On 05/09/15 08:55, MRAB wrote:
> On 2015-09-05 01:35, Rob Hills wrote:
>> On 05/09/15 01:47, Cody Piersall wrote:
>>> > On Fri, Sep 4, 2015 at 12:22 PM, Steve Burrus
>>> <steveburru...@gmail.com <mailto:steveburru...@gmail.com>> wrote:
>>>
>> <..>
>>> >> "echo %path%
>>> >>
>>> >> C:\Python34;C:\Python34\python.exe;C:\Python34\Scripts;
>>
>> It's a long time since I last used Windoze in anger, but that second
>> path entry (C:\Python34\python.exe;) looks wrong to me.  Unless Windoze
>> has changed recently, you shouldn't have a program name in your path.
>> IIRC, that's going to break all path entries that follow it, so it could
>> be the cause of your problem (ie the "C:\Python34\Scripts;" part won't
>> be accessible).
>>
>> Perhaps try deleting the "C:\Python34\python.exe;" entry from your PATH
>> environment variable and see what happens.
>>
> It should be a list of folder paths. Including a file path doesn't
> appear to break it, and, in fact, I'd be surprised if it did; it should
> just keep searching, much like it should if the folder were missing.

You're probably right, but my recollection of Windoze is that it was
very easily broken, hence my migration to Linux many moons ago.  I
reckon it wouldn't hurt to try getting rid of the invalid path entry anyway.

Cheers,

-- 
Rob Hills
Waikiki, Western Australia



Re: Reading \n unescaped from a file

2015-09-03 Thread Rob Hills
Hi Chris,

On 03/09/15 06:10, Chris Angelico wrote:
> On Wed, Sep 2, 2015 at 12:03 PM, Rob Hills <rhi...@medimorphosis.com.au> 
> wrote:
>> My mapping file contents look like this:
>>
>> \r = \\n
>> “ = 
> Oh, lovely. Code page 1252 when you're expecting UTF-8. Sadly, you're
> likely to have to cope with a whole pile of other mojibake if that
> happens :(

Yeah, tell me about it!!!

> Technically, what's happening is that your "\r" is literally a
> backslash followed by the letter r; the transformation of backslash
> sequences into single characters is part of Python source code
> parsing. (Incidentally, why do you want to change a carriage return
> into backslash-n? Seems odd.)
>
> Probably the easiest solution would be a simple and naive replace(),
> looking for some very specific strings and ignoring everything else.
> Easy to do, but potentially confusing down the track if someone tries
> something fancy :)
>
> line = line.split('#')[:1][0].strip() # trim any trailing comments
> # repeat this for as many backslash escapes as you want to handle
> line = line.replace(r"\r", "\r")
>
> Be aware that this, while simple, is NOT capable of handling escaped
> backslashes. In Python, "\\r" comes out the same as r"\r", but with
> this parser, it would come out the same as "\\\r". But it might be
> sufficient for you.

Thanks for the explanation which has helped me understand the problem. 
I also tried your approach but wound up with output data that somehow
had every single character escaped :-(

I've since decided I was being too obsessive trying to load *everything*
from my mapping file and have simply hard-coded my two escaped character
replacements for now and moved on to more important problems (ie the
Windoze Character soup that comprises my data and which I have to clean
up!).
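
For reference, the naive replace() approach quoted above and its
limitation with escaped backslashes can be sketched as follows; the
function name is made up, and handling \n as well as \r is an
assumption:

```python
def naive_unescape(text):
    """Turn the two-character sequences backslash-r and backslash-n
    into real carriage-return and newline characters."""
    text = text.replace(r"\r", "\r")
    text = text.replace(r"\n", "\n")
    return text


# A single backslash followed by r becomes a carriage return.
print(repr(naive_unescape(r"\r")))
# An escaped backslash followed by r is NOT left alone: the second
# backslash and the r are still treated as an escape.
print(repr(naive_unescape(r"\\r")))
```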

Thanks again,

-- 
Rob Hills
Waikiki, Western Australia



Re: Reading \n unescaped from a file

2015-09-03 Thread Rob Hills
Hi,

On 03/09/15 06:31, MRAB wrote:
> On 2015-09-02 03:03, Rob Hills wrote:
>> I am developing code (Python 3.4) that transforms text data from one
>> format to another.
>>
>> As part of the process, I had a set of hard-coded str.replace(...)
>> functions that I used to clean up the incoming text into the desired
>> output format, something like this:
>>
>>  dataIn = dataIn.replace('\r', '\\n') # Tidy up linefeeds
>>  dataIn = dataIn.replace('&lt;','<') # Tidy up < character
>>  dataIn = dataIn.replace('&gt;','>') # Tidy up > character
>>  # No idea why but lots of these: convert to 'o' character
>>  dataIn = dataIn.replace('','o')
>>  # .. and these: convert to 'f' character
>>  dataIn = dataIn.replace('','f')
>>  dataIn = dataIn.replace('','e') # ..  'e'
>>  dataIn = dataIn.replace('','O') # ..  'O'
>>
> The problem with this approach is that the order of the replacements
> matters. For example, changing '&lt;' to '<' and then '&amp;' to '&'
> can give a different result to changing '&amp;' to '&' and then '&lt;'
> to '<'. If you started with the string '&amp;lt;', then the first order
> would go '&amp;lt;' => '&amp;lt;' => '&lt;', whereas the second order
> would go '&amp;lt;' => '&lt;' => '<'.

Ah yes, thanks for reminding me about that.  I've since modified my code
to use a collections.OrderedDict to store my mappings.
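
The ordering issue can be seen in a few lines.  A minimal sketch,
assuming the tokens involved are HTML entities such as &lt; and &amp;
(the helper name apply_replacements is made up for illustration):

```python
from collections import OrderedDict


def apply_replacements(text, mapping):
    """Apply each token -> replacement pair in the mapping's own order."""
    for token, replacement in mapping.items():
        text = text.replace(token, replacement)
    return text


sample = "&amp;lt;"
lt_first = OrderedDict([("&lt;", "<"), ("&amp;", "&")])
amp_first = OrderedDict([("&amp;", "&"), ("&lt;", "<")])

# The same two replacements give different results in different orders.
print(apply_replacements(sample, lt_first))
print(apply_replacements(sample, amp_first))
```

An OrderedDict keeps the replacements in the order they were added, so
the result is at least predictable from one run to the next.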

...

>> This all works "as advertised" */except/* for the '\r' => '\\n'
>> replacement. Debugging the code, I see that my '\r' character is
>> "escaped" to '\\r' and the '\\n' to 'n' when they are read in from
>> the file.
>>
>> I've been googling hard and reading the Python docs, trying to get my
>> head around character encoding, but I just can't figure out how to get
>> these bits of code to do what I want.
>>
>> It seems to me that I need to either:
>>
>>   * change the way I represent '\r' and '\\n' in my mapping file; or
>>   * transform them somehow when I read them in
>>
>> However, I haven't figured out how to do either of these.
>>
> Try ast.literal_eval, although you'd need to make it look like a string
> literal first:

Thanks for the suggestion.  For now, I've decided I was being too
pedantic trying to load my two escaped strings from a file and I've
simply hard coded them and moved on to other issues.  I'll try this idea
later on though.
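
For the record, a minimal sketch of what the ast.literal_eval idea could
look like; the function name is made up, and the text read from the
mapping file is assumed to contain no double quotes and not to end in an
unpaired backslash:

```python
import ast


def unescape_literal(text):
    """Interpret backslash escapes in `text` as Python would in a string literal."""
    # Assumption: text has no double quotes and does not end in an unpaired
    # backslash; either would make the wrapped text an invalid literal.
    return ast.literal_eval('"' + text + '"')


token = unescape_literal(r"\r")          # a real carriage return
replacement = unescape_literal(r"\\n")   # a backslash followed by the letter n
```

Applied to both sides of a mapping line such as "\r = \\n", this yields
the same pair of strings as the hard-coded replace('\r', '\\n') call.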

Cheers,

-- 
Rob Hills
Waikiki, Western Australia



Re: Reading \n unescaped from a file

2015-09-03 Thread Rob Hills
Hi Friedrich,

On 03/09/15 16:40, Friedrich Rentsch wrote:
>
> On 09/02/2015 04:03 AM, Rob Hills wrote:
>> Hi,
>>
>> I am developing code (Python 3.4) that transforms text data from one
>> format to another.
>>
>> As part of the process, I had a set of hard-coded str.replace(...)
>> functions that I used to clean up the incoming text into the desired
>> output format, something like this:
>>
>>  dataIn = dataIn.replace('\r', '\\n') # Tidy up linefeeds
>>  dataIn = dataIn.replace('&lt;','<') # Tidy up < character
>>  dataIn = dataIn.replace('&gt;','>') # Tidy up > character
>>  # No idea why but lots of these: convert to 'o' character
>>  dataIn = dataIn.replace('','o')
>>  # .. and these: convert to 'f' character
>>  dataIn = dataIn.replace('','f')
>>  dataIn = dataIn.replace('','e') # ..  'e'
>>  dataIn = dataIn.replace('','O') # ..  'O'
>>
>> These statements transform my data correctly, but the list of statements
>> grows as I test the data so I thought it made sense to store the
>> replacement mappings in a file, read them into a dict and loop through
>> that to do the cleaning up, like this:
>>
>>  with open(fileName, 'r+t', encoding='utf-8') as mapFile:
>>      for line in mapFile:
>>          line = line.strip()
>>          try:
>>              if (line) and not line.startswith('#'):
>>                  line = line.split('#')[:1][0].strip() # trim any trailing comments
>>                  name, value = line.split('=')
>>                  name = name.strip()
>>                  self.filterMap[name]=value.strip()
>>          except:
>>              self.logger.error('exception occurred parsing line [{0}] in file [{1}]'.format(line, fileName))
>>              raise
>>
>> Elsewhere, I use the following code to do the actual cleaning up:
>>
>>  def filter(self, dataIn):
>>      if dataIn:
>>          for token, replacement in self.filterMap.items():
>>              dataIn = dataIn.replace(token, replacement)
>>      return dataIn
>>
>>
>> My mapping file contents look like this:
>>
>> \r = \\n
>> “ = 
>> &lt; = <
>> &gt; = >
>>  = 
>>  = F
>>  = o
>>  = f
>>  = e
>>  = O
>>
>> This all works "as advertised" */except/* for the '\r' => '\\n'
>> replacement. Debugging the code, I see that my '\r' character is
>> "escaped" to '\\r' and the '\\n' to 'n' when they are read in from
>> the file.
>>
>> I've been googling hard and reading the Python docs, trying to get my
>> head around character encoding, but I just can't figure out how to get
>> these bits of code to do what I want.
>>
>> It seems to me that I need to either:
>>
>>   * change the way I represent '\r' and '\\n' in my mapping file; or
>>   * transform them somehow when I read them in
>>
>> However, I haven't figured out how to do either of these.
>>
>> TIA,
>>
>>
>
> I have had this problem too and can propose a solution ready to run
> out of my toolbox:
>
>
> import re
> from functools import reduce
>
> class editor:
>
>     def compile (self, replacements):
>         targets, substitutes = zip (*replacements)
>         re_targets = [re.escape (item) for item in targets]
>         re_targets.sort (reverse = True)
>         self.targets_set = set (targets)
>         self.table = dict (replacements)
>         regex_string = '|'.join (re_targets)
>         self.regex = re.compile (regex_string, re.DOTALL)
>
>     def edit (self, text, eat = False):
>         hits = self.regex.findall (text)
>         nohits = self.regex.split (text)
>         # Ignore targets with illegal re modifiers.
>         valid_hits = set (hits) & self.targets_set
>         if valid_hits:
>             # Make lengths equal for zip to work right
>             substitutes = [self.table [item] for item in hits if item in valid_hits] + []
>             if eat:
>                 output = ''.join (substitutes)
>             else:
>                 zipped = zip (nohits, substitutes)
>                 output = ''.join (list (reduce (lambda a, b: a + b, [zipped][0]))) + nohits [-1]
>         else:
>             if eat:
>                 output = ''
>             else:
>                 output = text
>         return output
>
> >>> substitutions = (
> ('\r', '\n'),
> ('&lt;', '<'),
> ('&gt;', '>'),
> ('', 'o'),
> ('', 'f'),
>

Reading \n unescaped from a file

2015-09-02 Thread Rob Hills
Hi,

I am developing code (Python 3.4) that transforms text data from one
format to another.

As part of the process, I had a set of hard-coded str.replace(...)
functions that I used to clean up the incoming text into the desired
output format, something like this:

dataIn = dataIn.replace('\r', '\\n') # Tidy up linefeeds
dataIn = dataIn.replace('&lt;','<') # Tidy up < character
dataIn = dataIn.replace('&gt;','>') # Tidy up > character
# No idea why but lots of these: convert to 'o' character
dataIn = dataIn.replace('','o')
# .. and these: convert to 'f' character
dataIn = dataIn.replace('','f')
dataIn = dataIn.replace('','e') # ..  'e'
dataIn = dataIn.replace('','O') # ..  'O'

These statements transform my data correctly, but the list of statements
grows as I test the data so I thought it made sense to store the
replacement mappings in a file, read them into a dict and loop through
that to do the cleaning up, like this:

with open(fileName, 'r', encoding='utf-8') as mapFile:
    for line in mapFile:
        line = line.strip()
        try:
            if line and not line.startswith('#'):
                line = line.split('#', 1)[0].strip() # trim any trailing comments
                name, value = line.split('=', 1)
                name = name.strip()
                self.filterMap[name] = value.strip()
        except Exception:
            self.logger.error('exception occurred parsing line [{0}] in file [{1}]'.format(line, fileName))
            raise

Elsewhere, I use the following code to do the actual cleaning up:

def filter(self, dataIn):
    if dataIn:
        for token, replacement in self.filterMap.items():
            dataIn = dataIn.replace(token, replacement)
    return dataIn
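
Pulled out of the class, the line-parsing step can be tried on its own.
A minimal standalone sketch; the function name is made up, and the
entity tokens in the usage are assumptions for illustration:

```python
from collections import OrderedDict


def parse_mapping_lines(lines):
    """Build an ordered token -> replacement map from 'token = replacement' lines."""
    mapping = OrderedDict()
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue                          # blank line or whole-line comment
        line = line.split("#", 1)[0].strip()  # trim any trailing comment
        token, _, replacement = line.partition("=")
        mapping[token.strip()] = replacement.strip()
    return mapping


example = parse_mapping_lines(["# entities", "", "&lt; = <  # less-than", "&gt; = >"])
```

Note that a token containing '#' or '=' cannot be expressed in this
format without some escaping convention.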


My mapping file contents look like this:

\r = \\n
“ = 
&lt; = <
&gt; = >
 = 
 = F
 = o
 = f
 = e
 = O

This all works "as advertised" */except/* for the '\r' => '\\n'
replacement. Debugging the code, I see that my '\r' character is
"escaped" to '\\r' and the '\\n' to 'n' when they are read in from
the file.

I've been googling hard and reading the Python docs, trying to get my
head around character encoding, but I just can't figure out how to get
these bits of code to do what I want.

It seems to me that I need to either:

  * change the way I represent '\r' and '\\n' in my mapping file; or
  * transform them somehow when I read them in

However, I haven't figured out how to do either of these.
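
One possible shape for the second option, offered as a minimal sketch
rather than a known fix: decode a small, explicit set of backslash
escapes after each value is read from the file.  The function name and
the set of supported escapes are assumptions:

```python
import re

# Escapes recognised by this sketch; anything else is left untouched.
ESCAPES = {"n": "\n", "r": "\r", "t": "\t", "\\": "\\"}
_ESCAPE_RE = re.compile(r"\\(.)")


def decode_escapes(text):
    """Replace backslash-r, backslash-n, backslash-t and doubled backslashes."""
    def substitute(match):
        char = match.group(1)
        # Unknown escapes are returned exactly as they were written.
        return ESCAPES.get(char, match.group(0))
    return _ESCAPE_RE.sub(substitute, text)


token = decode_escapes(r"\r")          # a real carriage return
replacement = decode_escapes(r"\\n")   # a backslash followed by the letter n
```

Because the text is scanned once from left to right, a doubled backslash
is consumed as a unit and cannot be mistaken for the start of another
escape.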

TIA,

-- 
Rob Hills
Waikiki, Western Australia