Re: Downloading package dependencies from locked down machine
On 26/07/2020 14:07, Mike Dewhirst wrote:
> I think your best bet is to make a formal business case to your IT people and explain what's in it for them. If they hold all the cards you defeat them at your peril.

The issue is that the IT department thinks that installing the full power of Python scripting on an Internet-facing machine is inconsistent with the "Cyber Essentials Plus" accreditation that they need to win Government contracts. I'm trying to come up with an alternative that would be acceptable to them; I'm not going behind their backs.

I wonder, would it be possible to create a standalone executable version of pip with py2exe or similar?

- Andrew
Downloading package dependencies from locked down machine
At work my only Internet access is via a locked-down PC. The IT department are not willing to install Python on it [1].

Ideally I would download packages and their dependencies from PyPI using "pip download" at the command line. Are there any better solutions than downloading the package in a browser, finding its dependencies manually, downloading those, and so on recursively?

My dream solution would be for PyPI to provide a link to a zip file that bundled up a package and its dependencies, but I realise that this is probably a very niche requirement.

- Andrew

[1] Apparently they think installing Python would be incompatible with their Cyber Essentials Plus security accreditation, although it's apparently fine to have Microsoft Office 365 with a built-in VBA interpreter!
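P.S. For concreteness, this is the workflow I have in mind if pip were available on some machine with Internet access (the package name is purely illustrative):

    pip download -d packages somepackage

    # then, after transferring the "packages" directory to the target machine:
    pip install --no-index --find-links packages somepackage

The first command fetches the package and all of its dependencies into one directory; the second installs from that directory without touching the network.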
Nesting concurrent.futures.ThreadPoolExecutor
I have a program where I am currently using a concurrent.futures.ThreadPoolExecutor to run multiple tasks concurrently. These tasks are typically I/O bound, involving access to local databases and remote REST APIs. However, these tasks could themselves be split into subtasks, which would also benefit from concurrency.

What I am hoping is that it is safe to use a concurrent.futures.ThreadPoolExecutor within the tasks. I have coded up a toy example, which seems to work. However, I'd like some confidence that this is intentional. Concurrency is notoriously tricky. I very much hope this is safe, because otherwise it would not be safe to use a ThreadPoolExecutor to execute arbitrary code, in case it also used concurrent.futures to exploit concurrency.

Here is the toy example:

import concurrent.futures

def inner(i, j):
    return i, j, i**j

def outer(i):
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        futures = {executor.submit(inner, i, j): j for j in range(5)}
        results = []
        for future in concurrent.futures.as_completed(futures):
            results.append(future.result())
    return results

def main():
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        futures = {executor.submit(outer, i): i for i in range(10)}
        results = []
        for future in concurrent.futures.as_completed(futures):
            results.extend(future.result())
    print(results)

if __name__ == "__main__":
    main()

I have previously posted this on Stack Overflow, but didn't get any replies. Apologies if you are seeing this twice.

https://stackoverflow.com/questions/44989473/nesting-concurrent-futures-threadpoolexecutor
Re: Real-world use of concurrent.futures
On 08/05/2014 21:44, Ian Kelly wrote:
> On May 8, 2014 12:57 PM, Andrew McLean <li...@andros.org.uk> wrote:
>> So far so good. However, I thought this would be an opportunity to explore concurrent.futures and to see whether it offered any benefits over the more explicit approach discussed above.
>>
>> The problem I am having is that all the discussions I can find of the use of concurrent.futures show use with toy problems involving just a few tasks. The url downloader in the documentation is typical; it proceeds as follows:
>>
>> 1. Get an instance of concurrent.futures.ThreadPoolExecutor
>> 2. Submit a few tasks to the executor
>> 3. Iterate over the results using concurrent.futures.as_completed
>>
>> That's fine, but I suspect that isn't a helpful pattern if I have a very large number of tasks. In my case I could run out of memory if I tried submitting all of the tasks to the executor before processing any of the results.
>
> I thought that ThreadPoolExecutor.map would handle this transparently if you passed it a lazy iterable such as a generator. From my testing though, that seems not to be the case; with a generator of 10 items and a pool of 2 workers, the entire generator was consumed before any results were returned.
>
>> I'm guessing what I want to do is, submit tasks in batches of perhaps a few hundred, iterate over the results until most are complete, then submit some more tasks and so on. I'm struggling to see how to do this elegantly without a lot of messy code just there to do bookkeeping. This can't be an uncommon scenario. Am I missing something, or is this just not a job suitable for futures?
>
> I don't think it needs to be messy. Something like this should do the trick, I think:
>
> from concurrent.futures import *
> from itertools import islice
>
> def batched_pool_runner(f, iterable, pool, batch_size):
>     it = iter(iterable)
>     # Submit the first batch of tasks.
>     futures = set(pool.submit(f, x) for x in islice(it, batch_size))
>
>     while futures:
>         done, futures = wait(futures, return_when=FIRST_COMPLETED)
>
>         # Replenish submitted tasks up to the number that completed.
>         futures.update(pool.submit(f, x) for x in islice(it, len(done)))
>
>         yield from done

That worked very nicely, thank you.

I think that would make a good recipe, whether for the documentation or elsewhere. I suspect I'm not the only person that would benefit from something to bridge the gap between a toy example and something practical.

Andrew
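P.S. For anyone finding this in the archives, this is roughly how I am driving the recipe. check_email and records are my own names for illustration, not part of Ian's code:

import concurrent.futures
from concurrent.futures import ThreadPoolExecutor

def check_email(record):
    # Stand-in for the real validation (the MX lookup in my case)
    return record

records = [{"email": "a@example.com"}, {"email": "b@example.org"}]

with ThreadPoolExecutor(max_workers=20) as pool:
    # batched_pool_runner yields completed futures as the batches drain
    for future in batched_pool_runner(check_email, records, pool, batch_size=200):
        print(future.result())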
Real-world use of concurrent.futures
I have a problem that would benefit from a multithreaded implementation and am having trouble understanding how to approach it using concurrent.futures. The details don't really matter, but it will probably help to be explicit.

I have a large CSV file that contains a lot of fields, amongst them one containing email addresses. I want to write a program that validates the email addresses by checking that the domain names have a valid MX record. The output will be a copy of the file with any invalid email addresses removed. Because of latency in the DNS lookup this could benefit from multithreading.

I have written similar code in the past using explicit threads communicating via queues. For this example, I could have a thread that read the file using csv.DictReader, putting dicts containing records from the input file into a (finite length) queue. Then I would have a number of worker threads reading the queue, performing the validation and putting validated results in a second queue. A final thread would read from the second queue, writing the results to the output file.

So far so good. However, I thought this would be an opportunity to explore concurrent.futures and to see whether it offered any benefits over the more explicit approach discussed above.

The problem I am having is that all the discussions I can find of the use of concurrent.futures show use with toy problems involving just a few tasks. The url downloader in the documentation is typical; it proceeds as follows:

1. Get an instance of concurrent.futures.ThreadPoolExecutor
2. Submit a few tasks to the executor
3. Iterate over the results using concurrent.futures.as_completed

That's fine, but I suspect that isn't a helpful pattern if I have a very large number of tasks. In my case I could run out of memory if I tried submitting all of the tasks to the executor before processing any of the results.

I'm guessing what I want to do is: submit tasks in batches of perhaps a few hundred, iterate over the results until most are complete, then submit some more tasks and so on. I'm struggling to see how to do this elegantly without a lot of messy code just there to do bookkeeping. This can't be an uncommon scenario. Am I missing something, or is this just not a job suitable for futures?

Regards,
Andrew
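P.S. To make the "more explicit approach" concrete, here is a rough sketch of the queue-based pipeline I described. domain_has_mx is a placeholder for the real DNS check, and the field names are invented:

import csv
import queue
import threading

SENTINEL = object()  # marks the end of the stream

def domain_has_mx(email):
    # Placeholder for the real MX lookup
    return True

def reader(path, in_q, n_workers):
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            in_q.put(row)
    for _ in range(n_workers):      # one sentinel per worker
        in_q.put(SENTINEL)

def worker(in_q, out_q):
    while True:
        row = in_q.get()
        if row is SENTINEL:
            out_q.put(SENTINEL)
            return
        if domain_has_mx(row["email"]):
            out_q.put(row)

def writer(path, out_q, fieldnames, n_workers):
    finished = 0
    with open(path, "w", newline="") as f:
        w = csv.DictWriter(f, fieldnames=fieldnames)
        w.writeheader()
        while finished < n_workers:
            row = out_q.get()
            if row is SENTINEL:
                finished += 1
            else:
                w.writerow(row)

def main(src, dst, fieldnames, n_workers=20):
    in_q = queue.Queue(maxsize=1000)    # finite length, to bound memory
    out_q = queue.Queue(maxsize=1000)
    threads = [threading.Thread(target=reader, args=(src, in_q, n_workers)),
               threading.Thread(target=writer, args=(dst, out_q, fieldnames, n_workers))]
    threads += [threading.Thread(target=worker, args=(in_q, out_q))
                for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()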
Re: Real-world use of concurrent.futures
On 08/05/2014 20:06, Chris Angelico wrote:
> On Fri, May 9, 2014 at 4:55 AM, Andrew McLean <li...@andros.org.uk> wrote:
>> Because of latency in the DNS lookup this could benefit from multithreading.
>
> Before you go too far down roads that are starting to look problematic: A DNS lookup is a UDP packet out and a UDP packet in (ignoring the possibility of TCP queries, which you probably won't be doing here). Maybe it would be easier to implement it as asynchronous networking? I don't know that Python makes it easy for you to construct DNS requests and parse DNS responses; that's something more in Pike's line of work. But it may be more possible to outright do the DNS query asynchronously. TBH I haven't looked into it; but it's another option to consider.
>
> Separately from your programming model, though, how are you handling timeouts? Any form of DNS error (NXDOMAIN being the most likely), and the sort-of-error-but-not-error state of getting a response with no answer, indicates that the address is invalid; but what if you just don't hear back from the server? Will that mark addresses off as dead?
>
> ChrisA

I've done this on a very small scale in the past. I used http://www.dnspython.org/ to do the heavy lifting. The relevant bits of code look like:

# Set up the default dns resolver and add a cache
dns.resolver.default_resolver = dns.resolver.Resolver()
dns.resolver.default_resolver.cache = dns.resolver.Cache()

and

try:
    result = dns.resolver.query(domain, 'MX')
    return True
except dns.resolver.NXDOMAIN:
    return False
except dns.resolver.NoAnswer:
    return False
except dns.resolver.Timeout:
    print("*** timeout looking for the MX record for the domain: %s" % domain)
    return False

You are right, I'll need to do something more sophisticated when I encounter a timeout, but I think that's a matter of detail.

Andrew
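P.S. For completeness, dnspython also lets you tighten the timeout behaviour on the resolver. Something along these lines (a sketch, not tested against my real data):

import dns.resolver

resolver = dns.resolver.Resolver()
resolver.timeout = 2.0    # seconds to wait for each nameserver
resolver.lifetime = 5.0   # total seconds allowed for the whole query
resolver.cache = dns.resolver.Cache()

def has_mx(domain, retries=2):
    """Return True/False, or None if we still can't tell after retrying."""
    for attempt in range(retries + 1):
        try:
            resolver.query(domain, 'MX')
            return True
        except (dns.resolver.NXDOMAIN, dns.resolver.NoAnswer):
            return False
        except dns.resolver.Timeout:
            continue          # retry rather than condemning the address
    return None               # unknown - don't treat as invalid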
Re: Real-world use of concurrent.futures
On 08/05/2014 21:44, Ian Kelly wrote:
> I don't think it needs to be messy. Something like this should do the trick, I think:
>
> from concurrent.futures import *
> from itertools import islice
>
> def batched_pool_runner(f, iterable, pool, batch_size):
>     it = iter(iterable)
>     # Submit the first batch of tasks.
>     futures = set(pool.submit(f, x) for x in islice(it, batch_size))
>
>     while futures:
>         done, futures = wait(futures, return_when=FIRST_COMPLETED)
>
>         # Replenish submitted tasks up to the number that completed.
>         futures.update(pool.submit(f, x) for x in islice(it, len(done)))
>
>         yield from done

Thank you, that's very neat. It's just the sort of thing I was looking for. Nice use of itertools.islice and yield from. I'll try this out in the next few days and report back.

- Andrew
Re: CSV writer question
On 24/10/2011 08:03, Chris Angelico wrote:
> On Mon, Oct 24, 2011 at 4:18 PM, Jason Swails <jason.swa...@gmail.com> wrote:
>> my_csv = csv.writer(open('temp.1.csv', 'wb'))
>
> Have you confirmed, or can you confirm, whether or not the file gets closed automatically when the writer gets destructed? If so, all you need to do is:
>
> my_csv = something_else # or: del my_csv
>
> to unbind what I assume is the only reference to the csv.writer, upon which it should promptly clean itself up.

My understanding is that in CPython the file does get closed when the writer is deleted; however, that's not guaranteed to happen in other Python implementations (e.g. IronPython, PyPy and Jython).

Andrew
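P.S. The portable fix, of course, is not to rely on the destructor at all and to manage the file explicitly with a context manager:

import csv

with open('temp.1.csv', 'wb') as f:   # 'wb' as in the original (Python 2)
    my_csv = csv.writer(f)
    my_csv.writerow([1, 2, 3])
# the file is guaranteed to be closed here, on any implementation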
Re: Python Tools for Visual Studio - anyone using it?
I understand that Python Tools for Visual Studio doesn't work with VS Express, but does work with the (free) VS 2010 Shell. Does anyone know if you can install VS Express and VS Shell on the same machine?
Re: Instrumented web proxy
Paul Rubin wrote:
> Andrew McLean [EMAIL PROTECTED] writes:
>> I would like to write a web (http) proxy which I can instrument to automatically extract information from certain web sites as I browse them. Specifically, I would want to process URLs that match a particular regexp. For those URLs I would have code that parsed the content and logged some of it. Think of it as web scraping under manual control.
>
> I've used Proxy 3 for this, a very cool program with powerful capabilities for on the fly html rewriting.
> http://theory.stanford.edu/~amitp/proxy.html

This looks very useful. Unfortunately I can't seem to get it to run under Windows (specifically Vista) using Python 1.5.2, 2.2.3 or 2.5.2. I'll try Linux if I get a chance.
Instrumented web proxy
I would like to write a web (http) proxy which I can instrument to automatically extract information from certain web sites as I browse them. Specifically, I would want to process URLs that match a particular regexp. For those URLs I would have code that parsed the content and logged some of it. Think of it as web scraping under manual control.

I found this list of Python web proxies:

http://www.xhaus.com/alan/python/proxies.html

"Tiny HTTP Proxy in Python" looks promising as it's nominally simple (not many lines of code):

http://www.okisoft.co.jp/esc/python/proxy/

It does what it's supposed to, but I'm a bit at a loss as to where to intercept the traffic. I suspect it should be quite straightforward, but I'm finding the code a bit opaque. Any suggestions?

Andrew
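P.S. In case it helps anyone searching the archives later, this is the sort of minimal intercepting proxy I have in mind. It is only a sketch: plain HTTP (no HTTPS), single-threaded, and the URL pattern and log file name are purely illustrative:

import re
import urllib2
from BaseHTTPServer import HTTPServer, BaseHTTPRequestHandler

PATTERN = re.compile(r"example\.com/data")   # URLs to instrument (illustrative)

class InterceptingProxy(BaseHTTPRequestHandler):
    def do_GET(self):
        # In proxy mode the browser sends the absolute URL as the request path
        resp = urllib2.urlopen(self.path)
        body = resp.read()
        self.send_response(200)
        for name, value in resp.info().items():
            if name.lower() not in ('transfer-encoding', 'connection'):
                self.send_header(name, value)
        self.end_headers()
        self.wfile.write(body)
        # The instrumentation: log content fetched from matching URLs
        if PATTERN.search(self.path):
            open('scraped.log', 'ab').write(body)

HTTPServer(('localhost', 8080), InterceptingProxy).serve_forever()

Point the browser at localhost:8080 as its HTTP proxy and replace the logging line with whatever parsing is needed.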
Re: Getting some element from sets.Set
[EMAIL PROTECTED] wrote:
> In the particular case, I have to read an attribute from any one of the elements, which one doesn't matter because this attribute value is same across all elements in the set.

Someone else pointed out that there might be better data structures. If performance were not an issue, one approach would be illustrated by the following:

>>> Q = set(['A', 'a'])
>>> list(set(x.upper() for x in Q))
['A']

This has the benefit that it does not assume all the elements of the set have the same value of the given attribute.

Again, not very efficient:

>>> list(Q)[0]
'A'

I'm guessing this would be quicker:

>>> iter(Q).next()
'A'
Measuring memory used by a subprocess
I want to script the benchmarking of some compression algorithms on a Windows box. The algorithms are all embodied in command line executables, such as gzip and bzip2. I would like to measure three things:

1. size of compressed file
2. elapsed time (clock or preferably CPU)
3. memory used

The first is straightforward, as is measuring elapsed clock time. But how would I get the CPU time used by a sub-process, or the memory used?

I'm guessing that the Windows Performance Counters may be relevant; see the recipe

http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/303339

But I don't see any obvious way to get the process id of the spawned subprocess.

- Andrew
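P.S. One option I'm considering: subprocess.Popen exposes the child's process id directly as .pid, and the third-party psutil package can then poll it. A rough sketch (polling, so the figures are approximate, and the gzip command line is just an example):

import subprocess
import time

import psutil   # third-party package, not part of the standard library

proc = subprocess.Popen(["gzip", "-9", "bigfile"])
ps = psutil.Process(proc.pid)   # .pid gives the spawned process's id

peak_rss = 0
cpu = None
while proc.poll() is None:      # child still running
    try:
        peak_rss = max(peak_rss, ps.memory_info().rss)
        cpu = ps.cpu_times()
    except psutil.NoSuchProcess:
        break                   # process exited between poll() and the query
    time.sleep(0.05)

print "peak RSS: %d bytes" % peak_rss
if cpu is not None:
    print "user CPU: %.2f s, system CPU: %.2f s" % (cpu.user, cpu.system)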
Re: Does Python have equivalent to MATLAB varargin, varargout, nargin, nargout?
Where you would use varargin and nargin in Matlab, you would use the *args mechanism in Python. Try calling

def t1(*args):
    print args
    print len(args)

with different argument lists.

Where you would use varargout and nargout in Matlab, you would use tuple unpacking in Python. Play with this:

def t2(n):
    return tuple(range(n))

a, b = t2(2)
x = t2(3)
Re: ZODB and Python 2.5
Robert Kern wrote:
> I would suggest, in order:
>
> 1) Look on the relevant mailing list for people talking about using ZODB with Python 2.5.

Been there, didn't find anything, except that recently released versions of Zope (2.9.5 and 2.10.0) aren't compatible with Python 2.5. [Being pedantic: 2.9.5 doesn't work under Python 2.5; 2.10.0 is merely unsupported.]

> 2) Just try it. Install Python 2.5 alongside 2.4, install ZODB, run the test suite.

Now if ZODB had been pure Python, or I was using a Unix(ish) platform, I would have tried that. Getting set up to compile C extensions under Windows is a bit too much hassle. I can wait ;-).
ZODB and Python 2.5
I'm going to have to delay upgrading to Python 2.5 until all the libraries I use support it. One key library for me is ZODB. I've Googled and can't find any information on the developers' plans. Does anyone have any information that might help?

- Andrew
Re: Converting MSWord Docs to PDF
Steve Holden wrote:
> If that *isn't* satisfactory then a modest investment in Adobe Acrobat/Distiller plus the use of Python's scripting facilities to direct the conversion would be preferable to spending a huge amount of time writing a hand-crafted solution.

An alternative to Adobe Distiller (part of Acrobat) is PDFCreator, which is free:

http://sourceforge.net/projects/pdfcreator/

This installs as a Windows printer (using Ghostscript as a backend to generate PDFs). It should be relatively straightforward to use Python scripting to drive Word to print documents to the PDFCreator pseudo-printer.
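A sketch of the Word side of this, using the pywin32 COM bindings. The printer name, document path and exact PrintOut arguments are assumptions to be checked against your own setup:

import win32com.client

word = win32com.client.Dispatch("Word.Application")
word.Visible = False
old_printer = word.ActivePrinter
word.ActivePrinter = "PDFCreator"            # name of the pseudo-printer (assumed)
doc = word.Documents.Open(r"C:\docs\report.doc")
doc.PrintOut(Background=False)               # block until the job is spooled
doc.Close(SaveChanges=False)
word.ActivePrinter = old_printer
word.Quit()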
Re: Unexpected behaviour of csv module
John Machin wrote:
> You can fix that. The beauty of open source is that you can grab it (Windows: c:\python2?\lib\csv.py (typically)) and hack it about till it suits your needs. Go fer it!

Unfortunately the bits I would need to change are in _csv.c and, as I'm not very proficient at C, that wouldn't be a good idea. Anyway, for the specific brokenness of my CSV file, the simple workaround from my original post is fine.
Re: Unexpected behaviour of csv module
John Machin wrote:
> A better workaround IMHO is to strip each *field* after it is received from the csv reader. In fact, it is very rare that leading or trailing space in CSV fields is of any significance at all. Multiple spaces ditto. Just do this all the time:
>
> row = [' '.join(x.split()) for x in row]

The problem with removing the spaces after they are received from the csv reader arises if you want to use DictReader. I like to use DictReader without passing it the field list. The module then reads the field list from the first line, and in this situation you don't get an opportunity to strip the spaces from it.
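P.S. One possible workaround (an untested sketch): read and strip the header row yourself, then hand the cleaned-up names to DictReader explicitly:

import csv

f = open("data.csv", "rb")
header = csv.reader(f).next()                    # parse just the header row
fieldnames = [name.strip() for name in header]
reader = csv.DictReader(f, fieldnames=fieldnames)
for row in reader:
    # the field *values* still need stripping, as John suggests
    row = dict((key, ' '.join(value.split())) for key, value in row.items())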
Unexpected behaviour of csv module
I have a bunch of csv files that have the following characteristics:

- field delimiter is a comma
- all fields quoted with double quotes
- lines terminated by a *space* followed by a newline

What surprised me was that the csv reader included the trailing space in the final field value returned, even though it is outside of the quotes. I've produced a test program (see below) that demonstrates this.

There is a workaround, which is to not pass the csv reader the file iterator, but rather a generator that returns lines from the file with the trailing space stripped.

Interestingly, the same behaviour is seen if there are spaces before the field separator. They are also included in the preceding field value, even if they are outside the quotations. My workaround wouldn't help here.

Anyway, is this a bug or a feature? If it is a feature then I'm curious as to why it is considered desirable behaviour.

- Andrew

import csv

filename = "test_data.csv"

# Generate a test file - note the spaces before the newlines
fout = open(filename, "wb")
fout.write('"Field1","Field2","Field3" \n')
fout.write('"a","b","c" \n')
fout.write('"d" ,"e","f" \n')
fout.close()

# Function to test a reader
def read_and_print(reader):
    for line in reader:
        print ",".join(['<%s>' % field for field in line])

# Read the test file - and print the output
reader = csv.reader(open("test_data.csv", "rb"))
read_and_print(reader)

# Now the workaround: a generator to strip the strings before the reader decodes them
def stripped(input):
    for line in input:
        yield line.strip()

reader = csv.reader(stripped(open("test_data.csv", "rb")))
read_and_print(reader)

# Try using lineterminator instead - it doesn't work
reader = csv.reader(open("test_data.csv", "rb"), lineterminator=" \r\n")
read_and_print(reader)
Re: Algorithm Question
John Machin wrote:
> A quick silly question: what is the problem that you are trying to solve?

A fair question :-) The problem may seem a bit strange, but here it is.

I have the ability to query a database in a legacy system and extract records which match a particular pattern. Specifically, I can perform queries for records that contain a given search term as a sub-string of a particular column. The specific column contains an address. This database can only be accessed through this particular interface (don't ask why; it's one of the reasons it's a *legacy* system). I also have access to a list that contains the vast majority (possibly all) of the addresses which are stored in the database.

Now I want to issue a series of queries, such that when I combine all the data returned I have accessed all the records in the database. However, I want to minimise the total number of queries and also to keep the number of records returned by more than one query small.

The current approach I use is to divide the addresses I have into tokens and take the last token in the address (excluding the postal code). The union of these last tokens forms my set of queries. The last token in the address is typically a county or a town in a UK address. This works, but I was wondering if I could do something more efficient. The problem is that while the search term "London" matches all the addresses in London, it also returns all the addresses containing "London Road", and a lot of towns have a London Road. Perhaps I would be better off searching for "Road", "Street", "Avenue"...

It occurred to me that this may be isomorphic to a known problem; however, given that I want to keep two things small, the problem isn't very well defined. The current approach works; I was just musing whether there was a faster approach, so don't think about it too hard.

- Andrew
Re: Are Python's reserved words reserved in places they don't need to be?
Roy Smith wrote:
> As I remember, you didn't need the whitespace either. IIRC, your example above could have been written as:
>
> PROGRAMKWDS
> REALREAL,WRITE
> WRITE=1.0
> REAL=2.0
> WRITE(*,*)WRITE,REAL
> END

It's stranger than that. FORTRAN 77 is insensitive to white space (other than inside character literals). So you could write the code like:

P RO G RAM KW D S
RE ALRE AL, WRITE
WRITE = 1 . 0
RE AL=2.0
WRI TE(* , *)WRI TE, REAL
E N D

if you wanted to ;-)

When people complain that Python is sensitive to white space, remember this as the opposite extreme!

[Just for completeness I will add that there are rules about what columns the code has to be in, but that is separate from the white space issue.]
Re: Algorithm Question
Carl Banks wrote:
> Andrew McLean wrote:
>> I have a list of strings, A. I want to find a set of strings B such that for any a in A there exists b in B such that b is a sub-string of a.
>
> B=A?
>
>> But I also want to minimise T = sum_j t_j where t_j = count of the number of elements in A which have b[j] as a sub-string
>
> If there are no elements in A that are substrings of any other element in A, and if B=A, then t_j would be 1 for all j. Which means B=A would be optimal (since elements of B have to be substrings of at least one element in A). It looks like B = {set of all elements in A that are not a substring of any other element in A} is the generally optimal solution. I suspect you mistyped or omitted something -- the problem is underspecified at best.

You are quite right. I was trying to generalise my real problem and missed out a constraint: I also want to keep length(B) small. Unfortunately, I'm a bit unsure about the relative importance of T and length(B), which makes the problem rather ill defined. I'll have to give this a bit more thought.
Algorithm Question
This is really an algorithm question more than a Python question, but it would be implemented in Python.

I have a list of strings, A. I want to find a set of strings B such that for any a in A there exists b in B such that b is a sub-string of a. But I also want to minimise

T = sum_j t_j

where t_j = the number of elements in A which have b[j] as a sub-string.

My guess is that finding the smallest possible T satisfying the constraint would be hard. However, for my application just keeping it reasonably small would help. In my case the list A contains over two million addresses.

The (top down) heuristic approach I am tempted to employ is to start by dividing the entries in A into sets of tokens, then take the union of all these sets as a starting point for B. Then I would try to trim B by:

1. looking for elements that I could remove while still satisfying the constraint
2. replacing two elements by a common sub-string if that reduced T

Anyway, it occurred to me that this might be a known problem. Any pointers gratefully received.

- Andrew
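P.S. To make the heuristic concrete, the greedy set-cover flavour of this would look something like the sketch below. It minimises the number of search terms rather than T directly, but it gives the flavour of the approach:

def greedy_cover(addresses, candidates):
    """Greedily pick substrings until every address contains at least one.

    Classic greedy set-cover heuristic: repeatedly take the candidate
    that covers the most not-yet-covered addresses.
    """
    uncovered = set(addresses)
    chosen = []
    while uncovered:
        best = None
        best_count = 0
        for c in candidates:
            count = sum(1 for a in uncovered if c in a)
            if count > best_count:
                best, best_count = c, count
        if best is None:
            break                    # nothing left can be covered
        chosen.append(best)
        uncovered -= set(a for a in uncovered if best in a)
    return chosen

# Candidate substrings from the tokenisation described above:
# candidates = set(token for a in addresses for token in a.split())
# B = greedy_cover(addresses, candidates)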
Re: Linear regression in 3 dimensions
Bernhard,

Levenberg-Marquardt is a good solution when you want to solve a general non-linear least-squares problem. As Robert said, the OP's problem is linear, and Robert's solution exploits that. Using LM here is unnecessary and, I suspect, a fair bit less efficient (i.e. slower).

- Andrew

[EMAIL PROTECTED] wrote:
> Hi Robert,
>
> I'm using the scipy package for such problems. In the submodule scipy.optimize there is an implementation of a least-squares fitting algorithm (Levenberg-Marquardt) called leastsq. You have to define a function that computes the residuals between your model and the data points:
>
> import scipy.optimize
>
> def model(parameter, x, y):
>     a, b, c = parameter
>     return a*x + b*y + c
>
> def residual(parameter, data, x, y):
>     res = []
>     for _x in x:
>         for _y in y:
>             res.append(data - model(parameter, _x, _y))
>     return res
>
> params0 = [1., 1., 1.]
> result = scipy.optimize.leastsq(residual, params0, (data, x, y))
> fittedParams = result[0]
>
> If you haven't used numeric, numpy or scipy before, you should take a look at an introduction. It uses some nice extended array objects, where you can use some neat index tricks and compute values of array items without looping through it.
>
> Cheers! Bernhard
>
> Robert Kern wrote:
>> [EMAIL PROTECTED] wrote:
>>> Hi all,
>>> I am seeking a module that will do the equivalent of linear regression in 3D to yield a best-fit plane through a set of points (X1, Y1, Z1), (X2, Y2, Z2), ..., (Xn, Yn, Zn). The resulting equation to be of the form:
>>>
>>> Z = aX + bY + c
>>>
>>> The function I need would take the set of points and return a, b, c. Any pointers to existing code / modules would be very helpful.
>>
>> Well, that's a very underspecified problem. You haven't defined "best". But if we make the assumption that you want to minimize the squared error in Z, that is, minimize
>>
>> Sum((Z[i] - (a*X[i] + b*Y[i] + c)) ** 2)
>>
>> then this is a standard linear algebra problem.
>>
>> In [1]: import numpy as np
>> In [2]: a = 1.0
>> In [3]: b = 2.0
>> In [4]: c = 3.0
>> In [5]: rs = np.random.RandomState(1234567890)  # Specify a seed for repeatability
>> In [6]: x = rs.uniform(size=100)
>> In [7]: y = rs.uniform(size=100)
>> In [8]: e = rs.standard_normal(size=100)
>> In [9]: z = a*x + b*y + c + e
>> In [10]: A = np.column_stack([x, y, np.ones_like(x)])
>> In [11]: np.linalg.lstsq?
>> Type:           function
>> Base Class:     <type 'function'>
>> String Form:    <function lstsq at 0x6df070>
>> Namespace:      Interactive
>> File:           /Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/site-packages/numpy-1.0b2.dev3002-py2.4-macosx-10.4-ppc.egg/numpy/linalg/linalg.py
>> Definition:     np.linalg.lstsq(a, b, rcond=1e-10)
>> Docstring:
>>     returns x,resids,rank,s where x minimizes 2-norm(|b - Ax|)
>>
>>     resids is the sum square residuals
>>     rank is the rank of A
>>     s is the rank of the singular values of A in descending order
>>
>>     If b is a matrix then x is also a matrix with corresponding columns.
>>     If the rank of A is less than the number of columns of A or greater than
>>     the number of rows, then residuals will be returned as an empty array,
>>     otherwise resids = sum((b-dot(A,x))**2).
>>     Singular values less than s[0]*rcond are treated as zero.
>>
>> In [12]: abc, residuals, rank, s = np.linalg.lstsq(A, z)
>> In [13]: abc
>> Out[13]: array([ 0.93104714,  1.96780364,  3.15185125])
>>
>> --
>> Robert Kern
>>
>> "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
Re: Looking for Python code to obfuscate mailto links on web site
Dan Sommers wrote:
> On Sun, 25 Jun 2006 21:10:31 +0100, Andrew McLean [EMAIL PROTECTED] wrote:
>> I'm looking at putting some e-mail contact addresses on a web site, and wanted to make it difficult for spammers to harvest them. [ ... ] Searching the web it looks like the best solution for me might be to embed JavaScript in the web page that dynamically generates the e-mail address in the browser client. [ ... ] Now I could write suitable code myself, but would be surprised if it wasn't already available. Any pointers?
>
> Pointers? What do you think this is, C? ;-)
>
> Try this:
>
> def spam_averse_email_address(email_address, text):
>     """return HTML-embedded javascript to create a spam-averse mailto link"""
>     def char_codes(a_string):
>         return ",".join(str(ord(a_char)) for a_char in a_string)
>     return """<script type="text/javascript">
> <!--
> document.write('<a href="mailto:' + String.fromCharCode(%s)
>     + '">' + String.fromCharCode(%s) + '<\/A>');
> // -->
> </script>""" % (char_codes(email_address), char_codes(text))
>
> The newlines within the triple quoted string are important; use that function something like this:
>
> print "<html>"
> print "<head><title>Title</title></head>"
> print "<body>"
> print "<P>%s</P>" % spam_averse_email_address('[EMAIL PROTECTED]', 'click here to email me')
> print "</body>"
> print "</html>"
>
> You mentioned accessibility; make sure that your HTML does something sensible if the user's browser doesn't do javascript.
>
> HTH,
> Dan

That's great. Just what I was looking for.
Looking for Python code to obfuscate mailto links on web site
I'm looking at putting some e-mail contact addresses on a web site, and wanted to make it difficult for spammers to harvest them.

I found some Python code that I can call within my application:

http://www.zapyon.de/spam-me-not/

It works exactly as expected. However, I am concerned that the technique used for obfuscating the e-mail address may be a bit weak.

Searching the web, it looks like the best solution for me might be to embed JavaScript in the web page that dynamically generates the e-mail address in the browser client. I've found on-line tools that will generate suitable JavaScript, but I need to automate the encoding process in Python. Now I could write suitable code myself, but would be surprised if it wasn't already available. Any pointers?

To head off a few comments I'm anticipating ;-)

- no, I don't want to use a contact form
- accessibility is an issue, but I'm also including postal addresses and phone numbers giving alternatives to e-mail. Also, the main enquiry address won't be obfuscated.
Re: Python's CSV reader
In article [EMAIL PROTECTED], Stephan [EMAIL PROTECTED] writes:
> Thank you all for these interesting examples and methods!

You are welcome.

One point: I think there have been at least two different interpretations of precisely what your task is.

I had assumed that all the different header lines contained data for the same fields in the same order, and similarly that all the detail lines contained data for the same fields in the same order. However, I think Peter has answered on the basis that you have records consisting of pairs of lines, the first line being a header containing field names specific to that record, with the second line containing the corresponding data.

It would help if you let us know which (if any) was correct.

--
Andrew McLean
Re: Python's CSV reader
In article [EMAIL PROTECTED], Stephan [EMAIL PROTECTED] writes:
> I'm fairly new to python and am working on parsing some delimited text files. I noticed that there's a nice CSV reading/writing module included in the libraries. My data files however, are odd in that they are composed of lines with alternating formats. (Essentially the rows are a header record and a corresponding detail record on the next line. Each line type has a different number of fields.) Can the CSV module be coerced to read two line formats at once or am I better off using read and split?
>
> Thanks for your insight,
> Stephan

The csv module should be suitable. The reader just takes each line, parses it, then returns a list of strings. It doesn't matter if different lines have different numbers of fields.

To get an idea of what I mean, try something like the following (untested):

import csv

reader = csv.reader(open(filename))
while True:
    # Read the next header line; if there isn't one, exit the loop
    try:
        header = reader.next()
    except StopIteration:
        break
    # Assume that there is a detail line if the preceding header line exists
    detail = reader.next()
    # Print the parsed data
    print '-' * 40
    print "Header (%d fields): %s" % (len(header), header)
    print "Detail (%d fields): %s" % (len(detail), detail)

You could wrap this up into a class which returns (header, detail) pairs and does better error handling, but the above code should illustrate the basics.

--
Andrew McLean
Re: Coding style article with interesting section on white space
In article [EMAIL PROTECTED], Alex Martelli [EMAIL PROTECTED] writes:
> You're saying that using a different and better compiler cannot speed the execution of your Fortran program by 25% when you move it from one platform to another...?! This seems totally absurd to me, and yet I see no other way to interpret this assertion about Fortran programs not suffering -- you're looking at it as a performance _hit_ but of course it might just as well be construed as a performance _boost_ depending on the direction you're moving your programs. I think that upon mature consideration you will want to retract this assertion, and admit that it IS perfectly possible for the same Fortran program on the same hardware to have performance that differs by 25% or more depending on how good the optimizers of different compilers happen to be for that particular code, and therefore that, whatever point you thought you were making here, it's in fact totally worthless.

Look at the Fortran compiler benchmarks here:

http://www.polyhedron.co.uk/compare/win32/f77bench_p4.html

for some concrete evidence to support Alex's point. You will see that the average performance across different benchmarks of different Fortran compilers on the same platform can differ by as much as a factor of two, and individual benchmarks by as much as a factor of three. Some of you might be surprised at how many different Fortran compilers are available!

--
Andrew McLean
Fuzzy matching of postal addresses [1/1]
In case anyone is interested, here is the latest. I implemented an edit distance technique based on tokens. This incorporated a number of the ideas discussed in the thread. It works pretty well on my data. I'm getting about 95% matching now, compared with 90% for the simple technique I originally tried, so I have matched half the outstanding cases. I have spotted very few false positives, and very few cases where I could make a match manually, although I suspect the code could still be improved. It took a bit of head scratching to work out how to incorporate concatenation of tokens into the dynamic programming method, but I think I got there! At least my test cases seem to work!

#
# First attempt at a fuzzy compare of two addresses using a form of Edit Distance algorithm on tokens
# v0.5
# Andrew McLean, 23 January 2005
#
# The main routine editDistance takes two lists of tokens and returns a distance measure.
# Allowed edits are replace, insert, delete and concatenate a pair of tokens.
# The cost of these operations depends on the value of the tokens and their position within the sequence.
#
# The tokens consist of a tuple containing a string representation and its soundex encoding.
# The program assumes that some normalisation has already been carried out, for instance converting
# all text to lowercase.
#
# The routine has undergone limited testing, but it appeared to work quite well for my application,
# with a reasonably low level of false positives.
#
# I'm not convinced that I have got the logic quite right in the dynamic programming; dealing
# correctly with token pair concatenation is non-trivial.
#
# It would be neater to have an out-of-band flag for impossible/infinite cost. Could abstract this
# into a Cost class, but I am a bit concerned about efficiency. Could use a negative number for
# infinite cost and use a modified min function to reflect this. The approach I am using, with a
# very big number for INFINITY, will be fine for any sensible tokens relating to addresses.
#
# The code could probably do with more test cases.
#
# Also, if I was going to refactor the code I would either
# 1. Make this a bit more object oriented by introducing a Token class.
# 2. Not precompute the soundex encodings. It is probably sufficient to use a memoized soundex routine.
#

# Standard library module imports
import re, sys, os

# Kludge!
sys.path.append(os.path.abspath('../ZODB'))

# Paul Moore's Memoize class from the Python Cookbook
# http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/52201
from memoize import Memoize

# Public domain soundex implementation, by Skip Montanaro, 21 December 2000
# http://manatee.mojam.com/~skip/python/soundex.py
import soundex

# Memoize soundex for speed
get_soundex = Memoize(soundex.get_soundex)

# List of numbers spelt out
numbers_spelt = ['one', 'two', 'three', 'four', 'five', 'six', 'seven',
                 'eight', 'nine', 'ten', 'eleven', 'twelve', 'thirteen',
                 'fourteen', 'fifteen', 'sixteen', 'seventeen', 'eighteen',
                 'nineteen', 'twenty']
digitsToTextMap = dict([(str(i+1), numbers_spelt[i]) for i in range(len(numbers_spelt))])

# Set up a dictionary mapping abbreviations to the full text version.
# Each abbreviation can map to a single string expansion or a list of possible expansions
abbrev = {'cott': 'cottage', 'rd': 'road', 'fm': 'farm', 'st': ['street', 'saint']}
# ... include number to text mapping
abbrev.update(digitsToTextMap)

# Regular expression to find tokens containing a number
contains_number = re.compile('\d')
# ... could include numbers spelt in words, but it causes more problems than it solves,
# ... as lots of words include character sequences like "one"
##contains_number = re.compile('\d|' + '|'.join(numbers_spelt))

# List of minor tokens
minorTokenList = ['the', 'at']

# Various weights
POSITION_WEIGHT = 1.2
REPLACEMENT_COST = 50
SOUNDEX_MATCH_COST = 0.95
PLURAL_COST = 0.25
ABBREV_COST = 0.25
INSERT_TOKEN_WITH_NUMBER = 5
INSERT_TOKEN_AFTER_NUMBER = 10
INSERT_TOKEN = 2
INSERT_MINOR_TOKEN = 0.5
CONCAT_COST = 0.2
INFINITY = 1000

def containsNumber(token):
    """Does the token contain any digits"""
    return contains_number.match(token[0])

# Memoize it
containsNumber = Memoize(containsNumber)

def replaceCost(token1, token2, pos, allowSoundex=True):
    """Cost of replacing token1 with token2 (or vice versa) at a specific
    normalised position within the sequence"""
    # Make sure token1 is shortest
    m, n = len(token1[0]), len(token2[0])
    if m > n:
        token1, token2 = token2, token1
        m, n = n, m
    # Look for exact matches
    if token1[0] == token2[0]:
        return 0
    # Look for plurals
    if (n - m == 1) and (token2[0] == token1[0] + 's'):
        return PLURAL_COST
    # Look for abbreviations
    try:
        expansion = abbrev[token1[0]]
    except KeyError:
        pass
    else:
        if type(expansion) == list and token2[0] in expansion:
            return ABBREV_COST
Re: Fuzzy matching of postal addresses [1/1]
In article [EMAIL PROTECTED], John Machin [EMAIL PROTECTED] writes:
> Andrew McLean wrote:
>> In case anyone is interested, here is the latest.
>>
>> def insCost(tokenList, indx, pos):
>>     """The cost of inserting a specific token at a specific normalised
>>     position along the sequence."""
>>     if containsNumber(tokenList[indx]):
>>         return INSERT_TOKEN_WITH_NUMBER + POSITION_WEIGHT * (1 - pos)
>>     elif indx > 0 and containsNumber(tokenList[indx-1]):
>>         return INSERT_TOKEN_AFTER_NUMBER + POSITION_WEIGHT * (1 - pos)
>>     elif tokenList[indx][0] in minorTokenList:
>>         return INSERT_MINOR_TOKEN
>>     else:
>>         return INSERT_TOKEN + POSITION_WEIGHT * (1 - pos)
>>
>> def delCost(tokenList, indx, pos):
>>     """The cost of deleting a specific token at a specific normalised
>>     position along the sequence. This is exactly the same cost as
>>     inserting a token."""
>>     return insCost(tokenList, indx, pos)
>
> Functions are first-class citizens of Pythonia -- so just do this:
>
> delCost = insCost

Actually, the code used to look like that. I think I changed it so that it would look clearer. But perhaps that was a bad idea.

> Re speed generally:
> (1) How many addresses in each list and how long is it taking? On what sort of configuration?
> (2) Have you considered using psyco -- if not running on x86 architecture, consider exporting your files to a grunty PC and doing the match there.
> (3) Have you considered some relatively fast filter to pre-qualify pairs of addresses before you pass the pair to your relatively slow routine?

There are approx. 50,000 addresses in each list. At the moment the processing assumes all the postcodes are correct, and only compares addresses with matching postcodes. This makes it a lot faster, but may miss some cases of mismatched postcodes. Also, it does two passes. The first looks for exact matches of token sequences, which deals with about half the cases. Only then do I employ the more expensive edit distance technique.

Overall, the program runs in less than half an hour. Specifically, it takes about 60s per thousand addresses, which requires an average of about 8 calls to editDistance per address. psyco.full() reduced the 60s to 45s. I'll only try optimisation if I need to use it much more.

> Soundex?? To put it bluntly, the _only_ problem to which soundex is the preferred solution is genealogy searching in the US census records, and even then one needs to know what varieties of the algorithm were in use at what times. I thought you said your addresses came from authoritative sources. You have phonetic errors? Can you give some examples of pairs of tokens that illustrate the problem you are trying to overcome with soundex?

I'm sure that in retrospect Soundex might not be a good choice. The misspellings tend to be minor, e.g.

Kitwhistle and KITTWHISTLE
Tythe and TITHE

I was tempted by an edit distance technique on the tokens, but would prefer a hash-based method for efficiency reasons.

> Back to speed again: When you look carefully at the dynamic programming algorithm for edit distance, you will note that it is _not_ necessary to instantiate the whole NxM matrix -- it only ever refers to the current row and the previous row. What does space saving have to do with speed, you ask? Well, Python is not FORTRAN; it takes considerable effort to evaluate d[i][j]. A relatively simple trick is to keep 2 rows and swap (the pointers to) them each time around the outer loop. At the expense of a little more complexity, one can reduce this to one row and 3 variables (north, northwest, and west) corresponding to d[i-1][j], d[i-1][j-1], and d[i][j-1] -- but I'd suggest the simple way first.
>
> Hope some of this helps,

Thanks for that. The first edit distance algorithm I looked at did it that way, but I based my code on a version of the algorithm I could actually follow ;-). Getting the concatenation bit right was non-trivial, and it was useful to store all of d for debugging purposes.

As to Python not being Fortran: you've found me out. The three languages I am most comfortable with are Fortran, Matlab and Python. It did occur to me that numarray might be a more efficient way of dealing with a 4-dimensional array, but the arrays aren't very big, so the overhead in setting them up might be significant. The simplest optimisation would be to replace the two indices used to deal with concatenation by four explicit variables. And then, as you said, I could just store the last three rows and avoid any multiple indexing. As with all these potential optimisations, you don't know until you try them.

--
Andrew McLean
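P.S. For the record, the two-row version of the basic token edit distance (without my concatenation logic, which is what forces the extra rows) would look something like this:

def edit_distance(seq1, seq2, ins_cost=1, del_cost=1, sub_cost=1):
    """Edit distance between two token sequences, keeping only two rows
    of the dynamic programming matrix instead of the full NxM array."""
    previous = [j * ins_cost for j in range(len(seq2) + 1)]
    for i in range(1, len(seq1) + 1):
        current = [i * del_cost] + [0] * len(seq2)
        for j in range(1, len(seq2) + 1):
            north = previous[j] + del_cost          # delete seq1[i-1]
            west = current[j - 1] + ins_cost        # insert seq2[j-1]
            northwest = previous[j - 1]             # replace (or match)
            if seq1[i - 1] != seq2[j - 1]:
                northwest += sub_cost
            current[j] = min(north, west, northwest)
        previous = current
    return previous[-1]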
Re: Driving win32 GUIs with Python
In article [EMAIL PROTECTED], Fredrik Lundh [EMAIL PROTECTED] writes:
> Andrew McLean wrote:
>> I have a requirement to drive a Windows GUI program from a Python script. The program was originally a DOS program written in Turbo Pascal, and was recently translated to Delphi. I don't think it exposes an OLE or other automation interface. I don't have access to the source. A bit of Googling turned up some blog entries, which look useful:
>>
>> http://www.brunningonline.net/simon/blog/archives/000652.html
>>
>> Before ploughing ahead I wanted to check whether any useful Python tools are available now, which weren't when the articles above were written.
>
> watsup is winGuiAuto plus lots of other stuff (focused on testing):
>
> http://www.tizmoi.net/watsup/intro.html
>
> /F

Excellent. That looks like just the sort of thing I was looking for.

--
Andrew McLean
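P.S. For simple cases, the raw pywin32 calls may also be enough. A sketch; the window title and button caption are of course specific to whatever application is being driven, and this only works for standard Win32 controls:

import win32con
import win32gui

hwnd = win32gui.FindWindow(None, "My Legacy App")            # top-level window by title
button = win32gui.FindWindowEx(hwnd, 0, "Button", "&Run")    # child control by class/caption
win32gui.SetForegroundWindow(hwnd)
win32gui.SendMessage(button, win32con.BM_CLICK, 0, 0)        # simulate clicking the button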