Re: Downloading package dependencies from a locked-down machine

2020-07-27 Thread Andrew McLean

On 26/07/2020 14:07, Mike Dewhirst wrote:
I think your best bet is to make a formal business case to your IT 
people and explain what's in it for them. If they hold all the cards 
you defeat them at your peril.


The issue is that the IT department thinks that installing the full 
power of Python scripting on an Internet-facing machine is inconsistent 
with the "Cyber Essentials Plus" accreditation that they need to win 
Government contracts. I'm trying to come up with an alternative that 
would be acceptable to them; I'm not going behind their backs.


I wonder, would it be possible to create a standalone executable version 
of pip with py2exe or similar?


- Andrew


--
https://mail.python.org/mailman/listinfo/python-list


Downloading package dependencies from a locked-down machine

2020-07-26 Thread Andrew McLean
At work my only Internet access is via a locked-down PC. The IT 
department are not willing to install Python on it [1]. Ideally I would 
download packages and their dependencies from PyPI using "pip download" 
at the command line. Are there any better solutions than downloading the 
package in a browser, finding its dependencies manually, downloading 
these and so on recursively?
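
For concreteness, on a machine with unrestricted access the workflow I 
have in mind is roughly (package name illustrative):

    pip download -d wheelhouse somepackage

which fetches the package and all of its dependencies into a local 
directory, followed on the target machine by:

    pip install --no-index --find-links=wheelhouse somepackage

It's the first of those commands that I can't run on the locked-down PC.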


My dream solution would be for PyPI to provide a link to a zip file that 
bundled up a package and its dependencies, but I realise that this is 
probably a very niche requirement.


- Andrew

[1] Apparently they think installing Python would be incompatible with 
their Cyber Essentials Plus security accreditation, although it's 
apparently fine to have Microsoft Office 365 with a built-in VBA 
interpreter!



--
https://mail.python.org/mailman/listinfo/python-list


Nesting concurrent.futures.ThreadPoolExecutor

2017-07-20 Thread Andrew McLean
I have a program where I am currently using a 
concurrent.futures.ThreadPoolExecutor to run multiple tasks 
concurrently. These tasks are typically I/O bound, involving access to 
local databases and remote REST APIs. However, these tasks could 
themselves be split into subtasks, which would also benefit from 
concurrency.


What I am hoping is that it is safe to use a 
concurrent.futures.ThreadPoolExecutor within the tasks. I have coded up 
a toy example, which seems to work. However, I'd like some confidence 
that this is intentional. Concurrency is notoriously tricky.


I very much hope this is safe, because otherwise it would not be safe to 
use a ThreadPoolExecutor to execute arbitrary code, in case it also used 
concurrent.futures to exploit concurrency.


Here is the toy example:

import concurrent.futures

def inner(i, j):
    return i, j, i**j

def outer(i):
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        futures = {executor.submit(inner, i, j): j for j in range(5)}
        results = []
        for future in concurrent.futures.as_completed(futures):
            results.append(future.result())
    return results

def main():
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        futures = {executor.submit(outer, i): i for i in range(10)}
        results = []
        for future in concurrent.futures.as_completed(futures):
            results.extend(future.result())
    print(results)

if __name__ == "__main__":
    main()
I have previously posted this on Stack Overflow, but didn't get any 
replies. Apologies if you are seeing this twice.


https://stackoverflow.com/questions/44989473/nesting-concurrent-futures-threadpoolexecutor


--
https://mail.python.org/mailman/listinfo/python-list


Re: Real-world use of concurrent.futures

2014-05-13 Thread Andrew McLean
On 08/05/2014 21:44, Ian Kelly wrote:
 On May 8, 2014 12:57 PM, Andrew McLean li...@andros.org.uk wrote:
  So far so good. However, I thought this would be an opportunity to
  explore concurrent.futures and to see whether it offered any benefits
  over the more explicit approach discussed above. The problem I am having
  is that all the discussions I can find of the use of concurrent.futures
  show use with toy problems involving just a few tasks. The URL
  downloader in the documentation is typical; it proceeds as follows:
 
  1. Get an instance of concurrent.futures.ThreadPoolExecutor
  2. Submit a few tasks to the executor
  3. Iterate over the results using concurrent.futures.as_completed
 
  That's fine, but I suspect that isn't a helpful pattern if I have a very
  large number of tasks. In my case I could run out of memory if I tried
  submitting all of the tasks to the executor before processing any of the
  results.

 I thought that ThreadPoolExecutor.map would handle this transparently
 if you passed it a lazy iterable such as a generator.  From my testing
 though, that seems not to be the case; with a generator of 10
 items and a pool of 2 workers, the entire generator was consumed
 before any results were returned.

  I'm guessing what I want to do is submit tasks in batches of perhaps a
  few hundred, iterate over the results until most are complete, then
  submit some more tasks and so on. I'm struggling to see how to do this
  elegantly without a lot of messy code just there to do bookkeeping.
  This can't be an uncommon scenario. Am I missing something, or is this
  just not a job suitable for futures?

 I don't think it needs to be messy. Something like this should do
 the trick, I think:

 from concurrent.futures import *
 from itertools import islice

 def batched_pool_runner(f, iterable, pool, batch_size):
     it = iter(iterable)
     # Submit the first batch of tasks.
     futures = set(pool.submit(f, x) for x in islice(it, batch_size))
     while futures:
         done, futures = wait(futures, return_when=FIRST_COMPLETED)
         # Replenish submitted tasks up to the number that completed.
         futures.update(pool.submit(f, x) for x in islice(it, len(done)))
         yield from done

That worked very nicely, thank you.  I think that would make a good
recipe, whether for the documentation or elsewhere. I suspect I'm not
the only person that would benefit from something to bridge the gap
between a toy example and something practical.
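
In case it helps anyone else reading the archive, it drops in along 
these lines (a sketch; check_domain and the domain generator are 
illustrative stand-ins for my real code, and batched_pool_runner is 
Ian's function quoted above):

from concurrent.futures import ThreadPoolExecutor

def check_domain(domain):
    # placeholder for the real MX-record validation
    return domain, True

# a lazy source of work; anything iterable will do
domains = ('example%d.com' % i for i in range(10000))

with ThreadPoolExecutor(max_workers=10) as pool:
    for future in batched_pool_runner(check_domain, domains, pool,
                                      batch_size=200):
        domain, ok = future.result()
        # ... write the record out if ok ...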

Andrew



-- 
https://mail.python.org/mailman/listinfo/python-list


Real-world use of concurrent.futures

2014-05-08 Thread Andrew McLean
I have a problem that would benefit from a multithreaded implementation
and am having trouble understanding how to approach it using
concurrent.futures.

The details don't really matter, but it will probably help to be
explicit. I have a large CSV file that contains a lot of fields, amongst
them one containing email addresses. I want to write a program that
validates the email addresses by checking that the domain names have a
valid MX record. The output will be a copy of the file with any invalid
email addresses removed. Because of latency in the DNS lookup this could
benefit from multithreading.

I have written similar code in the past using explicit threads
communicating via queues. For this example, I could have a thread that
read the file using csv.DictReader, putting dicts containing records
from the input file into a (finite length) queue. Then I would have a
number of worker threads reading the queue, performing the validation
and putting validated results in a second queue. A final thread would
read from the second queue writing the results to the output file.
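
In outline, that explicit version looks something like the sketch below 
(not the real code; Python 3 spellings, and is_valid stands in for the 
MX-record check):

import csv
import threading
from queue import Queue

NUM_WORKERS = 10
SENTINEL = object()    # marks the end of the stream

def is_valid(email):
    # placeholder for the real MX-record check
    return '@' in email

def read_records(path, in_q):
    with open(path, newline='') as f:
        for record in csv.DictReader(f):
            in_q.put(record)          # blocks when the queue is full
    for _ in range(NUM_WORKERS):
        in_q.put(SENTINEL)            # one stop marker per worker

def validate_records(in_q, out_q):
    while True:
        record = in_q.get()
        if record is SENTINEL:
            out_q.put(SENTINEL)       # pass the stop marker along
            return
        if is_valid(record['email']):
            out_q.put(record)

def write_records(path, out_q, fieldnames):
    stopped = 0
    with open(path, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        while stopped < NUM_WORKERS:  # run until every worker has finished
            record = out_q.get()
            if record is SENTINEL:
                stopped += 1
            else:
                writer.writerow(record)

in_q, out_q = Queue(maxsize=1000), Queue()
threading.Thread(target=read_records, args=('in.csv', in_q)).start()
for _ in range(NUM_WORKERS):
    threading.Thread(target=validate_records, args=(in_q, out_q)).start()
write_records('out.csv', out_q, fieldnames=['email'])  # field list illustrative

Note that the output rows can emerge in a different order from the input.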

So far so good. However, I thought this would be an opportunity to
explore concurrent.futures and to see whether it offered any benefits
over the more explicit approach discussed above. The problem I am having
is that all the discussions I can find of the use of concurrent.futures
show use with toy problems involving just a few tasks. The URL
downloader in the documentation is typical; it proceeds as follows:

1. Get an instance of concurrent.futures.ThreadPoolExecutor
2. Submit a few tasks to the executor
3. Iterate over the results using concurrent.futures.as_completed

That's fine, but I suspect that isn't a helpful pattern if I have a very
large number of tasks. In my case I could run out of memory if I tried
submitting all of the tasks to the executor before processing any of the
results.

I'm guessing what I want to do is submit tasks in batches of perhaps a
few hundred, iterate over the results until most are complete, then
submit some more tasks and so on. I'm struggling to see how to do this
elegantly without a lot of messy code just there to do bookkeeping.
This can't be an uncommon scenario. Am I missing something, or is this
just not a job suitable for futures?

Regards,

Andrew


-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Real-world use of concurrent.futures

2014-05-08 Thread Andrew McLean
On 08/05/2014 20:06, Chris Angelico wrote:
 On Fri, May 9, 2014 at 4:55 AM, Andrew McLean li...@andros.org.uk wrote:
 Because of latency in the DNS lookup this could
 benefit from multithreading.
 Before you go too far down roads that are starting to look
 problematic: A DNS lookup is a UDP packet out and a UDP packet in
 (ignoring the possibility of TCP queries, which you probably won't be
 doing here). Maybe it would be easier to implement it as asynchronous
 networking? I don't know that Python makes it easy for you to
 construct DNS requests and parse DNS responses; that's something more
 in Pike's line of work. But it may be more possible to outright do the
 DNS query asynchronously. TBH I haven't looked into it; but it's
 another option to consider.

 Separately from your programming model, though, how are you handling
 timeouts? Any form of DNS error (NXDOMAIN being the most likely), and
 the sort-of-error-but-not-error state of getting a response with no
 answer, indicates that the address is invalid; but what if you just
 don't hear back from the server? Will that mark addresses off as dead?

 ChrisA

I've done this on a very small scale in the past. I used

http://www.dnspython.org/

to do the heavy lifting. The relevant bits of code look like:

 # Set up the default dns resolver and add a cache
 dns.resolver.default_resolver = dns.resolver.Resolver()
 dns.resolver.default_resolver.cache = dns.resolver.Cache()
and
 try:
     result = dns.resolver.query(domain, 'MX')
     return True
 except dns.resolver.NXDOMAIN:
     return False
 except dns.resolver.NoAnswer:
     return False
 except dns.resolver.Timeout:
     print "*** timeout looking for the MX record for the domain: %s" % domain
     return False

You are right, I'll need to do something more sophisticated when I
encounter a timeout, but I think that's a matter of detail.
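
Pulling the pieces above together with a thread pool (a sketch; the 
pool size and the domain list are illustrative, and the timeout 
handling is still the crude version):

import concurrent.futures
import dns.resolver

def has_mx_record(domain):
    # True if the domain has at least one MX record
    try:
        dns.resolver.query(domain, 'MX')
        return True
    except (dns.resolver.NXDOMAIN, dns.resolver.NoAnswer):
        return False
    except dns.resolver.Timeout:
        return False    # crude - this is where a retry would go

domains = ['python.org', 'example.invalid']    # illustrative
with concurrent.futures.ThreadPoolExecutor(max_workers=20) as pool:
    results = dict(zip(domains, pool.map(has_mx_record, domains)))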

Andrew

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Real-world use of concurrent.futures

2014-05-08 Thread Andrew McLean
On 08/05/2014 21:44, Ian Kelly wrote:
 I don't think it needs to be messy. Something like this should do
 the trick, I think:

 from concurrent.futures import *
 from itertools import islice

 def batched_pool_runner(f, iterable, pool, batch_size):
   it = iter(iterable)
   # Submit the first batch of tasks.
   futures = set(pool.submit(f, x) for x in islice(it, batch_size))
   while futures:
 done, futures = wait(futures, return_when=FIRST_COMPLETED)
 # Replenish submitted tasks up to the number that completed.
 futures.update(pool.submit(f, x) for x in islice(it, len(done)))
 yield from done


Thank you, that's very neat. It's just the sort of thing I was looking
for. Nice use of itertools.islice and yield from.

I'll try this out in the next few days and report back.

- Andrew

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: CSV writer question

2011-10-24 Thread Andrew McLean

On 24/10/2011 08:03, Chris Angelico wrote:

On Mon, Oct 24, 2011 at 4:18 PM, Jason Swails jason.swa...@gmail.com wrote:

my_csv = csv.writer(open('temp.1.csv', 'wb'))


Have you confirmed, or can you confirm, whether or not the file gets
closed automatically when the writer gets destructed? If so, all you
need to do is:

my_csv = something_else
# or:
del my_csv

to unbind what I assume is the only reference to the csv.writer, upon
which it should promptly clean itself up.
My understanding is that in CPython the file does get closed when the 
writer is deleted; however, that's not guaranteed to happen in other 
Python implementations (e.g. IronPython, PyPy and Jython).
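
The portable fix is to keep a reference to the file object and close it 
explicitly, most simply with a context manager, e.g.

import csv

with open('temp.1.csv', 'wb') as f:    # 'wb' as in the Python 2 snippet above
    my_csv = csv.writer(f)
    my_csv.writerow(['spam', 'eggs'])
# the file is guaranteed closed here, whatever the implementation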


Andrew

--
http://mail.python.org/mailman/listinfo/python-list


Re: Python Tools for Visual Studio - anyone using it?

2011-08-31 Thread Andrew McLean
I understand that Python Tools for Visual Studio doesn't work with VS 
Express, but does work with the (free) VS 2010 Shell. Does anyone know 
if you can install VS Express and VS Shell on the same machine?

--
http://mail.python.org/mailman/listinfo/python-list


Re: Instrumented web proxy

2008-03-28 Thread Andrew McLean
Paul Rubin wrote:
 Andrew McLean [EMAIL PROTECTED] writes:
 I would like to write a web (http) proxy which I can instrument to
 automatically extract information from certain web sites as I browse
 them. Specifically, I would want to process URLs that match a
 particular regexp. For those URLs I would have code that parsed the
 content and logged some of it.

 Think of it as web scraping under manual control.
 
 I've used Proxy 3 for this, a very cool program with powerful
 capabilities for on the fly html rewriting.
 
 http://theory.stanford.edu/~amitp/proxy.html

This looks very useful. Unfortunately I can't seem to get it to run 
under Windows (specifically Vista) using Python 1.5.2, 2.2.3 or 2.5.2. 
I'll try Linux if I get a chance.

-- 
http://mail.python.org/mailman/listinfo/python-list


Instrumented web proxy

2008-03-27 Thread Andrew McLean
I would like to write a web (http) proxy which I can instrument to 
automatically extract information from certain web sites as I browse 
them. Specifically, I would want to process URLs that match a particular 
regexp. For those URLs I would have code that parsed the content and 
logged some of it.

Think of it as web scraping under manual control.

I found this list of Python web proxies

http://www.xhaus.com/alan/python/proxies.html

Tiny HTTP Proxy in Python looks promising as it's nominally simple (not 
many lines of code)

http://www.okisoft.co.jp/esc/python/proxy/

It does what it's supposed to, but I'm a bit at a loss as to where to 
intercept the traffic. I suspect it should be quite straightforward, but 
I'm finding the code a bit opaque.

Any suggestions?

Andrew
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Getting some element from sets.Set

2007-05-07 Thread Andrew McLean
[EMAIL PROTECTED] wrote:
 In the particular case, I have to read an attribute from any one of
 the elements, which one doesn't matter because this attribute value is
 same across all elements in the set.

Someone else pointed out that there might be better data structures. If 
performance is not an issue, one approach is illustrated by the 
following:

>>> Q = set(['A', 'a'])
>>> list(set(x.upper() for x in Q))
['A']

This has the benefit that it does not assume all the elements of the set 
have the same value of the given attribute.

Again not very efficient:

>>> list(Q)[0]
'A'

I'm guessing this would be quicker

>>> iter(Q).next()
'A'
-- 
http://mail.python.org/mailman/listinfo/python-list


Measureing memory used by a subprocess

2007-04-01 Thread Andrew McLean
I want to script the benchmarking of some compression algorithms on a 
Windows box. The algorithms are all embodied in command line 
executables, such as gzip and bzip2. I would like to measure three things:

1. size of compressed file
2. elapsed time (clock or preferably CPU)
3. memory used

The first is straightforward, as is measuring elapsed clock time. But 
how would I get the CPU time used by a sub-process or the memory used?

I'm guessing that the Windows Performance Counters may be relevant; see 
the recipe

http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/303339

But I don't see any obvious way to get the process id of the spawned 
subprocess.
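
One thought (a sketch, untested): if the compressor is launched with 
subprocess rather than os.system, the Popen object exposes the process 
id directly, which could then be fed to the performance-counter recipe:

import subprocess

p = subprocess.Popen(['gzip', '-9', 'testfile.dat'])    # command illustrative
pid = p.pid    # the process id to hand to the performance counters
p.wait()       # block until the compressor finishes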

- Andrew
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Does Python have equivalent to MATLAB varargin, varargout, nargin, nargout?

2007-02-18 Thread Andrew McLean
Where you would use varargin and nargin in Matlab, you would use the 
*args mechanism in Python.

Try calling

def t1(*args):
    print args
    print len(args)

with different argument lists

Where you would use varargout and nargout in Matlab, you would use tuple 
unpacking in Python.

Play with this

def t2(n):
    return tuple(range(n))

a, b = t2(2)

x = t2(3)

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: ZODB and Python 2.5

2006-10-21 Thread Andrew McLean
Robert Kern wrote:
 I would suggest, in order:
 
 1) Look on the relevant mailing list for people talking about using ZODB 
 with Python 2.5.

Been there, didn't find anything. Except that recently released versions 
of Zope (2.9.5 and 2.10.0) aren't compatible with Python 2.5. [Being 
pedantic: 2.9.5 doesn't work under Python 2.5; 2.10.0 is merely 
unsupported.]

 2) Just try it. Install Python 2.5 alongside 2.4, install ZODB, run the 
 test suite.

Now if ZODB had been pure Python, or I were using a Unix(ish) platform, 
I would have tried that. Getting set up to compile C extensions under 
Windows is a bit too much hassle. I can wait ;-).
-- 
http://mail.python.org/mailman/listinfo/python-list


ZODB and Python 2.5

2006-10-20 Thread Andrew McLean
I'm going to have to delay upgrading to Python 2.5 until all the 
libraries I use support it. One key library for me is ZODB. I've Googled 
and can't find any information on the developers' plans. Does anyone 
have any information that might help?

- Andrew
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Converting MSWord Docs to PDF

2006-10-10 Thread Andrew McLean
Steve Holden wrote:
 If that *isn't* satisfactory then a modest investment in Adobe 
 Acrobat/Distiller plus the use of Python's scripting facilities to 
 direct the conversion would be preferable to spending a huge amount of 
 time writing a hand-crafted solution.

An alternative to Adobe Distiller (part of Acrobat) is PDFCreator

http://sourceforge.net/projects/pdfcreator/

which is free. This installs as a Windows printer (using Ghostscript as 
a backend to generate PDFs). It should be relatively straightforward to 
use Python scripting to drive Word to print documents to the 
PDFCreator pseudo-printer.
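
A sketch of that last step using the pywin32 COM bindings (untested; 
the printer and file names are illustrative):

import win32com.client

word = win32com.client.Dispatch('Word.Application')
word.ActivePrinter = 'PDFCreator'                    # the pseudo-printer
doc = word.Documents.Open(r'C:\docs\example.doc')    # illustrative path
doc.PrintOut()          # prints to PDFCreator, which produces the PDF
doc.Close(False)        # close without saving changes
word.Quit()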
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Unexpected behaviour of csv module

2006-09-26 Thread Andrew McLean
John Machin wrote:
 You can fix that. The beauty of open source is that you can grab it
 (Windows: c:\python2?\lib\csv.py (typically)) and hack it about till it
 suits your needs. Go fer it!

Unfortunately the bits I should change are in _csv.c and, as I'm not 
very proficient at C, that wouldn't be a good idea. Anyway, for the 
specific brokenness of my CSV file, the simple workaround from my 
original post is fine.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Unexpected behaviour of csv module

2006-09-25 Thread Andrew McLean
John Machin wrote:
 A better workaround IMHO is to strip each *field* after it is received
 from the csv reader. In fact, it is very rare that leading or trailing
 space in CSV fields is of any significance at all. Multiple spaces
 ditto. Just do this all the time:
 
 row = [' '.join(x.split()) for x in row]

The problem with removing the spaces after they are received from the 
csv reader arises if you want to use DictReader. I like to use DictReader 
without passing it the field list. The module then reads the field list 
from the first line, and in this situation you don't get an opportunity 
to strip the spaces from it.
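
One workaround (a sketch, untested) is to read the header row with a 
plain reader, strip the names yourself, and then pass them to 
DictReader explicitly:

import csv

f = open('data.csv', 'rb')    # illustrative filename
reader = csv.reader(f)
fieldnames = [' '.join(name.split()) for name in reader.next()]
for row in csv.DictReader(f, fieldnames=fieldnames):
    # the keys are now clean; values can be stripped as John suggests
    row = dict((k, ' '.join(v.split())) for k, v in row.items())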
-- 
http://mail.python.org/mailman/listinfo/python-list


Unexpected behaviour of csv module

2006-09-24 Thread Andrew McLean
I have a bunch of csv files that have the following characteristics:

- field delimiter is a comma
- all fields quoted with double quotes
- lines terminated by a *space* followed by a newline

What surprised me was that the csv reader included the trailing space in 
the final field value returned, even though it is outside of the quotes. 


I've produced a test program (see below) that demonstrates this. There 
is a workaround, which is to not pass the csv reader the file iterator, 
but rather a generator that returns lines from the file with the 
trailing space stripped.

Interestingly, the same behaviour is seen if there are spaces before the 
field separator. They are also included in the preceding field value, 
even if they are outside the quotations. My workaround wouldn't help here.

Anyway, is this a bug or a feature? If it is a feature then I'm curious 
as to why it is considered desirable behaviour.

- Andrew



import csv
filename = "test_data.csv"

# Generate a test file - note the spaces before the newlines
fout = open(filename, "wb")
fout.write('Field1,Field2,Field3 \n')
fout.write('a,b,c \n')
fout.write('d ,e,f \n')
fout.close()

# Function to test a reader
def read_and_print(reader):
    for line in reader:
        print ",".join(['"%s"' % field for field in line])

# Read the test file - and print the output
reader = csv.reader(open("test_data.csv", "rb"))
read_and_print(reader)

# Now the workaround: a generator to strip the strings before the reader
# decodes them
def stripped(input):
    for line in input:
        yield line.strip()

reader = csv.reader(stripped(open("test_data.csv", "rb")))
read_and_print(reader)

# Try using lineterminator instead - it doesn't work
reader = csv.reader(open("test_data.csv", "rb"), lineterminator=" \r\n")
read_and_print(reader)
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Algorithm Question

2006-09-14 Thread Andrew McLean
John Machin wrote:
 A quick silly question: what is the problem that you are trying to
 solve?

A fair question :-)

The problem may seem a bit strange, but here it is:

I have the ability to query a database in a legacy system and extract 
records which match a particular pattern. Specifically, I can perform 
queries for records that contain a given search term as a sub-string of 
a particular column. The specific column contains an address. This 
database can only be accessed through this particular interface (don't 
ask why, it's one of the reasons it's a *legacy* system).

I also have access to a list that contains the vast majority (possibly 
all) of the addresses which are stored in the database.

Now I want to issue a series of queries, such that when I combine all 
the data returned I have accessed all the records in the database. 
However, I want to minimise the total number of queries and also want to 
keep the number of records returned by more than one query small.

Now the current approach I use is to divide the addresses I have into 
tokens and take the last token in the address (excluding the postal 
code). The union of these last tokens forms my set of queries. The 
last token in the address is typically a county or a town in a UK address.
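
In code, the current heuristic amounts to something like this (a 
sketch; it assumes the postcode occupies the last two tokens of each 
address):

queries = set()
for address in addresses:
    tokens = address.split()
    if len(tokens) >= 3:
        queries.add(tokens[-3])    # the last token before the postcode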

This works, but I was wondering if I could do something more efficient. 
The problem is that while the search term "London" matches all the 
addresses in London, it also returns all the addresses containing "London 
Road", and a lot of towns have a London Road. Perhaps I would be better 
off searching for "Road", "Street", "Avenue" ...

It occurred to me that this may be isomorphic to a known problem. However, 
given that I want to keep two things small, the problem isn't very well 
defined.

The current approach works; I was just musing whether there was a faster 
approach, so don't think about it too hard.

- Andrew
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Are Python's reserved words reserved in places they dont need to be?

2006-09-14 Thread Andrew McLean
Roy Smith wrote:
 
 As I remember, you didn't need the whitespace either. IIRC, your example 
 above could have been written as:
 
   PROGRAMKWDS
   REALREAL,WRITE
   WRITE=1.0
   REAL=2.0
   WRITE(*,*)WRITE,REAL
   END
 

It's stranger than that. FORTRAN 77 is insensitive to white space (other 
than inside character literals).

So you could write the code like:

P  RO  G  RAM KW D  S
RE  ALRE  AL, WRITE
  WRITE = 1  .  0
RE  AL=2.0
WRI TE(*  ,  *)WRI  TE, REAL
E N   D

if you wanted to ;-)

When people complain that Python is sensitive to white space, remember 
this as the opposite extreme!

[Just for completeness I will add that there are rules about what 
columns the code has to be in, but that is separate from the white 
space issue.]
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Algorithm Question

2006-09-11 Thread Andrew McLean
Carl Banks wrote:
 Andrew McLean wrote:
 I have a list of strings, A. I want to find a set of strings B such that
 for any a in A there exists b in B such that b is a sub-string of a.
 
 B=A?
 
 But I also want to minimise T = sum_j t_j where
 t_j = count of the number of elements in A which have b[j] as a sub-string
 
 If there are no elements in A that are substrings of any other element in
 A, and if B=A, then t_j would be 1 for all j.  Which means B=A would be
 optimal (since elements of B have to be substrings of at least one
 element in A).  It looks like B={set of all elements in A that are
 not a substring of any other element in A} is the generally optimal
 solution.
 
 I suspect you mistyped or omitted something--problem is underspecified
 at best.

You are quite right. I was trying to generalise my real problem and 
missed out a constraint. I also want to keep length(B) small. 
Unfortunately, I'm a bit unsure about the relative importance of T and 
length(B), which makes the problem rather ill defined. I'll have to give 
this a bit more thought.

-- 
http://mail.python.org/mailman/listinfo/python-list


Algorithm Question

2006-09-10 Thread Andrew McLean
This is really an algorithm question more than a Python question, but it 
would be implemented in Python.

I have a list of strings, A. I want to find a set of strings B such that 
for any a in A there exists b in B such that b is a sub-string of a. 
But I also want to minimise T = sum_j t_j where

t_j = count of the number of elements in A which have b[j] as a sub-string

My guess is that finding the smallest possible T satisfying the 
constraint would be hard. However, for my application just keeping it 
reasonably small would help.

In my case the list A contains over two million addresses.

The (top down) heuristic approach I am tempted to employ is to start by 
dividing the entries in A into sets of tokens, then take the union of 
all these sets as a starting point for B. Then I would try to trim B by

1. looking for elements that I could remove while still satisfying the 
constraint

2. replacing two elements by a common sub-string if that reduced T

Anyway. It occurred to me that this might be a known problem. Any 
pointers gratefully received.

- Andrew

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Linear regression in 3 dimensions

2006-09-08 Thread Andrew McLean
Bernhard,

Levenberg-Marquardt is a good solution when you want to solve a general 
non-linear least-squares problem. As Robert said, the OP's problem is 
linear and Robert's solution exploits that. Using LM here is unnecessary 
and I suspect a fair bit less efficient (i.e. slower).

- Andrew


[EMAIL PROTECTED] wrote:
 Hi Robert,
 
 I'm using the scipy package for such problems. In the submodule
 scipy.optimize there is an implementation of a least-squares fitting
 algorithm (Levenberg-Marquardt) called leastsq.
 
 You have to define a function that computes the residuals between your
 model and the data points:
 
 import scipy.optimize
 
 def model(parameter, x, y):
     a, b, c = parameter
     return a*x + b*y + c
 
 def residual(parameter, data, x, y):
     res = []
     for _x in x:
         for _y in y:
             res.append(data - model(parameter, _x, _y))
     return res
 
 params0 = [1., 1., 1.]
 result = scipy.optimize.leastsq(residual, params0, (data, x, y))
 fittedParams = result[0]
 
 If you haven't used numeric, numpy or scipy before, you should take a
 look at an introduction. It uses some nice extended array objects,
 where you can use some neat index tricks and compute values of array
 items without looping through it.
 
 Cheers! Bernhard
 
 
 
 Robert Kern wrote:
 [EMAIL PROTECTED] wrote:
 Hi all,

 I am seeking a module that will do the equivalent of linear regression in
 3D to yield a best-fit plane through a set of points (X1, Y1, Z1), (X2,
 Y2, Z2), ... (Xn, Yn, Zn).

 The resulting equation to be of the form:

  Z = aX + bY + c

 The function I need would take the set of points and return a, b and c. Any
 pointers to existing code / modules would be very helpful.
 Well, that's a very unspecified problem. You haven't defined "best".

 But if we make the assumption that you want to minimize the squared error
 in Z, that is minimize

Sum((Z[i] - (a*X[i] + b*Y[i] + c)) ** 2)

 then this is a standard linear algebra problem.

 In [1]: import numpy as np

 In [2]: a = 1.0

 In [3]: b = 2.0

 In [4]: c = 3.0

 In [5]: rs = np.random.RandomState(1234567890)  # Specify a seed for 
 repeatability

 In [6]: x = rs.uniform(size=100)

 In [7]: y = rs.uniform(size=100)

 In [8]: e = rs.standard_normal(size=100)

 In [9]: z = a*x + b*y + c + e

 In [10]: A = np.column_stack([x, y, np.ones_like(x)])

 In [11]: np.linalg.lstsq?
 Type:           function
 Base Class:     <type 'function'>
 String Form:    <function lstsq at 0x6df070>
 Namespace:      Interactive
 File:           /Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/site-packages/numpy-1.0b2.dev3002-py2.4-macosx-10.4-ppc.egg/numpy/linalg/linalg.py
 Definition:     np.linalg.lstsq(a, b, rcond=1e-10)
 Docstring:
  returns x,resids,rank,s
  where x minimizes 2-norm(|b - Ax|)
resids is the sum square residuals
rank is the rank of A
s is the rank of the singular values of A in descending order

  If b is a matrix then x is also a matrix with corresponding columns.
  If the rank of A is less than the number of columns of A or greater than
  the number of rows, then residuals will be returned as an empty array
  otherwise resids = sum((b-dot(A,x))**2).
  Singular values less than s[0]*rcond are treated as zero.


 In [12]: abc, residuals, rank, s = np.linalg.lstsq(A, z)

 In [13]: abc
 Out[13]: array([ 0.93104714,  1.96780364,  3.15185125])

 --
 Robert Kern

 I have come to believe that the whole world is an enigma, a harmless enigma
   that is made terrible by our own mad attempt to interpret it as though it had
   an underlying truth.
-- Umberto Eco
 
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Looking for Python code to obfuscate mailto links on web site

2006-06-27 Thread Andrew McLean
Dan Sommers wrote:
 On Sun, 25 Jun 2006 21:10:31 +0100,
 Andrew McLean [EMAIL PROTECTED] wrote:
 
 I'm looking at putting some e-mail contact addresses on a web site,
 and wanted to make it difficult for spammers to harvest them.
 
 [ ... ]
 
 Searching the web it looks like the best solution for me might be to
 embed JavaScript in the web page that dynamically generates the e-mail
 address in the browser client.
 
 [ ... ]
 
 Now I could write suitable code myself, but would be surprised if it
 wasn't already available. Any pointers?
 
 Pointers?  What do you think this is, C?  ;-)  Try this:
 
 def spam_averse_email_address( email_address, text ):
     """return HTML-embedded javascript to create a spam-averse mailto link"""
 
     def char_codes( a_string ):
         return ",".join(str(ord(a_char)) for a_char in a_string)
 
     return """<script type="text/javascript">
 <!--
 document.write(
 '<a href="mailto:'
 + String.fromCharCode(%s)
 + '">'
 + String.fromCharCode(%s)
 + '<\/A>');
 // -->
 </script>""" % (char_codes(email_address), char_codes(text))
 
 The newlines within the triple quoted string are important; use that
 function something like this:
 
 print "<html>"
 print "<head><title>Title</title></head>"
 print "<body>"
 print "<P>%s</P>" % spam_averse_email_address( '[EMAIL PROTECTED]',
    'click here to email me' )
 print "</body>"
 print "</html>"
 
 You mentioned accessibility; make sure that your HTML does something
 sensible if the user's browser doesn't do javascript.
 
 HTH,
 Dan
 

That's great. Just what I was looking for.

-- 
http://mail.python.org/mailman/listinfo/python-list


Looking for Python code to obfuscate mailto links on web site

2006-06-25 Thread Andrew McLean
I'm looking at putting some e-mail contact addresses on a web site, and 
wanted to make it difficult for spammers to harvest them.

I found some Python code that I can call within my application.

http://www.zapyon.de/spam-me-not/

It works exactly as expected. However, I am concerned that the technique 
used for obfuscating the e-mail address may be a bit weak.

Searching the web it looks like the best solution for me might be to 
embed JavaScript in the web page that dynamically generates the e-mail 
address in the browser client.

I've found on-line tools that will generate suitable JavaScript, but 
need to automate the encoding process in Python.

Now I could write suitable code myself, but would be surprised if it 
wasn't already available. Any pointers?

To head off a few comments I'm anticipating ;-)
- no I don't want to use a contact form
- accessibility is an issue, but I'm also including postal addresses and 
phone numbers giving alternatives to e-mail. Also the main enquiry 
address won't be obfuscated.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Python's CSV reader

2005-08-05 Thread Andrew McLean
In article [EMAIL PROTECTED], 
Stephan [EMAIL PROTECTED] writes
Thank you all for these interesting examples and methods!

You are welcome. One point: I think there have been at least two 
different interpretations of precisely what your task is.

I had assumed that all the different header lines contained data for 
the same fields in the same order, and similarly that all the detail 
lines contained data for the same fields in the same order.

However, I think Peter has answered on the basis that you have records 
consisting of pairs of lines, the first line being a header containing 
field names specific to that record with the second line containing the 
corresponding data.

It would help if you let us know which (if any) was correct.

-- 
Andrew McLean
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Python's CSV reader

2005-08-04 Thread Andrew McLean
In article [EMAIL PROTECTED],
Stephan [EMAIL PROTECTED] writes
I'm fairly new to python and am working on parsing some delimited text
files.  I noticed that there's a nice CSV reading/writing module
included in the libraries.

My data files however, are odd in that they are composed of lines with
alternating formats. (Essentially the rows are a header record and a
corresponding detail record on the next line.  Each line type has a
different number of fields.)

Can the CSV module be coerced to read two line formats at once or am I
better off using read and split?

Thanks for your insight,
Stephan


The csv module should be suitable. The reader just takes each line,
parses it, then returns a list of strings. It doesn't matter if
different lines have different numbers of fields.

To get an idea of what I mean, try something like the following
(untested):

import csv

reader = csv.reader(open(filename))

while True:

    # Read the next header line; if there isn't one then exit the loop
    try:
        header = reader.next()
    except StopIteration:
        break

    # Assume that there is a detail line if the preceding
    # header line exists
    detail = reader.next()

    # Print the parsed data
    print '-' * 40
    print "Header (%d fields): %s" % (len(header), header)
    print "Detail (%d fields): %s" % (len(detail), detail)

You could wrap this up into a class which returns (header, detail) pairs
and does better error handling, but the above code should illustrate the
basics.

-- 
Andrew McLean
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Coding style article with interesting section on white space

2005-01-30 Thread Andrew McLean
In article [EMAIL PROTECTED], Alex Martelli 
[EMAIL PROTECTED] writes
You're saying that using a different and better compiler cannot speed
the execution of your Fortran program by 25% when you move it from one
platform to another...?!  This seems totally absurd to me, and yet I see
no other way to interpret this assertion about Fortran programs not
suffering -- you're looking at it as a performance _hit_ but of course
it might just as well be construed as a performance _boost_ depending on
the direction you're moving your programs.
I think that upon mature consideration you will want to retract this
assertion, and admit that it IS perfectly possible for the same Fortran
program on the same hardware to have performance that differs by 25% or
more depending on how good the optimizers of different compilers happen
to be for that particular code, and therefore that, whatever point you
thought you were making here, it's in fact totally worthless.
Look at the Fortran compiler benchmarks here:
http://www.polyhedron.co.uk/compare/win32/f77bench_p4.html
for some concrete evidence to support Alex's point.
You will see that the average performance across different benchmarks of 
different Fortran compilers on the same platform can vary by as much as a 
factor of two, and individual benchmarks by as much as a factor of three.

Some of you might be surprised at how many different Fortran compilers 
are available!

--
Andrew McLean
--
http://mail.python.org/mailman/listinfo/python-list


Fuzzy matching of postal addresses [1/1]

2005-01-23 Thread Andrew McLean
In case anyone is interested, here is the latest.
I implemented an edit distance technique based on tokens. This 
incorporated a number of the ideas discussed in the thread.

It works pretty well on my data. I'm getting about 95% matching now, 
compared with 90% for the simple technique I originally tried. So I have 
matched half the outstanding cases.

I have spotted very few false positives, and very few cases where I 
could make a match manually. Although I suspect the code could still be 
improved.

It took a bit of head scratching to work out how to incorporate 
concatenation of tokens into the dynamic programming method, but I think 
I got there! At least my test cases seem to work!

#
# First attempt at a fuzzy compare of two addresses using a form of Edit
# Distance algorithm on tokens
# v0.5
# Andrew McLean, 23 January 2005
#
# The main routine editDistance takes two lists of tokens and returns a
# distance measure.
# Allowed edits are replace, insert, delete and concatenate a pair of tokens.
# The cost of these operations depends on the value of the tokens and their
# position within the sequence.
#
# The tokens consist of a tuple containing a string representation and its
# soundex encoding.
# The program assumes that some normalisation has already been carried out,
# for instance converting all text to lowercase.
#
# The routine has undergone limited testing, but it appeared to work quite
# well for my application, with a reasonably low level of false positives.
#
# I'm not convinced that I have got the logic quite right in the dynamic
# programming; dealing correctly with token pair concatenation is
# non-trivial.
#
# It would be neater to have an out-of-band flag for impossible/infinite
# cost. Could abstract this into a Cost class. But I am a bit concerned
# about efficiency. Could use a negative number for infinite cost and use a
# modified min function to reflect this. The approach I am using with a very
# big number for INFINITY will be fine for any sensible tokens relating to
# addresses.
#
# The code could probably do with more test cases.
#
# Also, if I was going to refactor the code I would either
# 1. Make this a bit more object oriented by introducing a Token class.
# 2. Not precompute the soundex encodings. It is probably sufficient to use
#    a memoized soundex routine.
#

# Standard library module imports
import re, sys, os

# Kludge!
sys.path.append(os.path.abspath('../ZODB'))

# Paul Moore's Memoize class from Python cookbook
# http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/52201
from memoize import Memoize

# Public domain soundex implementation, by Skip Montanaro, 21 December 2000
# http://manatee.mojam.com/~skip/python/soundex.py
import soundex

# Memoize soundex for speed
get_soundex = Memoize(soundex.get_soundex)

# List of numbers spelt out
numbers_spelt = ['one', 'two', 'three', 'four', 'five', 'six', 'seven',
                 'eight', 'nine', 'ten', 'eleven', 'twelve', 'thirteen',
                 'fourteen', 'fifteen', 'sixteen', 'seventeen', 'eighteen',
                 'nineteen', 'twenty']
digitsToTextMap = dict([(str(i+1), numbers_spelt[i])
                        for i in range(len(numbers_spelt))])

# Set up a dictionary mapping abbreviations to the full text version.
# Each abbreviation can map to a single string expansion or a list of
# possible expansions
abbrev = {'cott': 'cottage', 'rd': 'road', 'fm': 'farm',
          'st': ['street', 'saint']}
# ... include number to text mapping
abbrev.update(digitsToTextMap)

# Regular expression to find tokens containing a number
contains_number = re.compile('\d')
# ... could include numbers spelt in words, but it causes more problems than
# it solves, as lots of words include character sequences like "one"
##contains_number = re.compile('\d|'+'|'.join(numbers_spelt))

# List of minor tokens
minorTokenList = ['the', 'at']

# Various weights
POSITION_WEIGHT = 1.2
REPLACEMENT_COST = 50
SOUNDEX_MATCH_COST = 0.95
PLURAL_COST = 0.25
ABBREV_COST = 0.25
INSERT_TOKEN_WITH_NUMBER = 5
INSERT_TOKEN_AFTER_NUMBER = 10
INSERT_TOKEN = 2
INSERT_MINOR_TOKEN = 0.5
CONCAT_COST = 0.2
INFINITY = 1000

def containsNumber(token):
    """Does the token contain any digits?"""
    return contains_number.match(token[0])
# Memoize it
containsNumber = Memoize(containsNumber)

def replaceCost(token1, token2, pos, allowSoundex=True):
    """Cost of replacing token1 with token2 (or vice versa)
    at a specific normalised position within the sequence"""

    # Make sure token1 is shortest
    m, n = len(token1[0]), len(token2[0])
    if m > n:
        token1, token2 = token2, token1
        m, n = n, m

    # Look for exact matches
    if token1[0] == token2[0]:
        return 0

    # Look for plurals
    if (n - m == 1) and (token2[0] == token1[0] + 's'):
        return PLURAL_COST

    # Look for abbreviations
    try:
        expansion = abbrev[token1[0]]
    except KeyError:
        pass
    else:
        if type(expansion) == list and token2[0] in expansion:
            return ABBREV_COST

Re: Fuzzy matching of postal addresses [1/1]

2005-01-23 Thread Andrew McLean
In article [EMAIL PROTECTED], John 
Machin [EMAIL PROTECTED] writes
Andrew McLean wrote:
In case anyone is interested, here is the latest.

def insCost(tokenList, indx, pos):
    """The cost of inserting a specific token at a specific
    normalised position along the sequence."""
    if containsNumber(tokenList[indx]):
        return INSERT_TOKEN_WITH_NUMBER + POSITION_WEIGHT * (1 - pos)
    elif indx > 0 and containsNumber(tokenList[indx-1]):
        return INSERT_TOKEN_AFTER_NUMBER + POSITION_WEIGHT * (1 - pos)
    elif tokenList[indx][0] in minorTokenList:
        return INSERT_MINOR_TOKEN
    else:
        return INSERT_TOKEN + POSITION_WEIGHT * (1 - pos)

def delCost(tokenList, indx, pos):
    """The cost of deleting a specific token at a specific normalised
    position along the sequence.
    This is exactly the same cost as inserting a token."""
    return insCost(tokenList, indx, pos)
Functions are first-class citizens of Pythonia -- so just do this:
delCost = insCost
Actually, the code used to look like that. I think I changed it so that 
it would look clearer. But perhaps that was a bad idea.

Re speed generally: (1) How many addresses in each list and how long is
it taking? On what sort of configuration? (2) Have you considered using
psyco -- if not running on x86 architecture, consider exporting your
files to a grunty PC and doing the match there. (3) Have you considered
some relatively fast filter to pre-qualify pairs of addresses before
you pass the pair to your relatively slow routine?
There are approx. 50,000 addresses in each list.
At the moment the processing assumes all the postcodes are correct, and 
only compares addresses with matching postcodes. This makes it a lot 
faster, but may miss some cases of mismatched postcodes.

Also it does two passes. One looking for exact matches of token 
sequences. This deals with about half the cases. Only then do I employ 
the, more expensive, edit distance technique.

Overall, the program runs in less than half an hour. Specifically it 
takes about 60s per thousand addresses, which requires an average of 
about 8 calls to editDistance per address. Psyco.full() reduced the 60s 
to 45s.

I'll only try optimisation if I need to use it much more.
Soundex?? To put it bluntly, the _only_ problem to which soundex is the
preferred solution is genealogy searching in the US census records, and
even then one needs to know what varieties of the algorithm were in use
at what times. I thought you said your addresses came from
authoritative sources. You have phonetic errors? Can you give some
examples of pairs of tokens that illustrate the problem you are trying
to overcome with soundex?
I'm sure that in retrospect Soundex might not be a good choice. The 
misspellings tend to be minor, e.g.
Kitwhistle and KITTWHISTLE
Tythe and TITHE

I was tempted by an edit distance technique on the tokens, but would 
prefer a hash based method for efficiency reasons.

Back to speed again: When you look carefully at the dynamic programming
algorithm for edit distance, you will note that it is _not_ necessary
to instantiate the whole NxM matrix -- it only ever refers to the
current row and the previous row. What does space saving have to do
with speed, you ask? Well, Python is not FORTRAN; it takes considerable
effort to evaluate d[i][j]. A relatively simple trick is to keep 2 rows
and swap (the pointers to) them each time around the outer loop. At the
expense of a little more complexity, one can reduce this to one row and
3 variables (north, northwest, and west) corresponding to d[i-1][j],
d[i-1][j-1], and d[i][j-1] -- but I'd suggest the simple way first.
Hope some of this helps,
Thanks for that. The first edit distance algorithm I looked at did it 
that way. But I based my code on a version of the algorithm I could 
actually follow ;-). Getting the concatenation bit right was non-trivial 
and it was useful to store all of d for debugging purposes.

As to Python not being Fortran. You've found me out. The three languages 
I am most comfortable with are Fortran, Matlab and Python. It did occur 
to me that numarray might be a more efficient way of dealing with a 
4-dimensional array, but the arrays aren't very big, so the overhead in 
setting them up might be significant.

The simplest optimisation would be to replace the two indices used to 
deal with concatenation by four explicit variables. And then, as you 
said, I could just store the three last rows, and avoid any multiple 
indexing.

As with all these potential optimisations, you don't know until you try 
them.
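
For reference, the two-row version John describes, applied to plain 
strings rather than my tokens, would look something like this (a 
sketch, untested):

def edit_distance(s, t):
    """Levenshtein distance keeping only two rows of the DP matrix."""
    previous = range(len(t) + 1)      # row for the empty prefix of s
    for i, cs in enumerate(s):
        current = [i + 1] + [0] * len(t)
        for j, ct in enumerate(t):
            cost = int(cs != ct)      # 0 for a match, 1 for a substitution
            current[j + 1] = min(previous[j + 1] + 1,   # deletion
                                 current[j] + 1,        # insertion
                                 previous[j] + cost)    # substitution
        previous = current
    return previous[len(t)]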

--
Andrew McLean
--
http://mail.python.org/mailman/listinfo/python-list


Re: Driving win32 GUIs with Python

2004-12-19 Thread Andrew McLean
In article [EMAIL PROTECTED], 
Fredrik Lundh [EMAIL PROTECTED] writes
Andrew McLean wrote:
I have a requirement to drive a Windows GUI program from a Python 
script. The program was originally a DOS program written in Turbo 
Pascal, and was recently translated to Delphi. I don't think it 
exposes an OLE or other automation interface. I don't have access to 
the source.

A bit of Googling turned up some blog entries, which look useful:
http://www.brunningonline.net/simon/blog/archives/000652.html
Before ploughing ahead I wanted to check whether any useful Python 
tools are available now, which  weren't when the articles above were 
written.
watsup is winGuiAuto plus lots of other stuff (focused on testing):
   http://www.tizmoi.net/watsup/intro.html
/F
Excellent. That looks like just the sort of thing I was looking for.
--
Andrew McLean
--
http://mail.python.org/mailman/listinfo/python-list