Re: What extended ASCII character set uses 0x9D?

2017-08-18 Thread John Nagle

On 08/17/2017 05:53 PM, Chris Angelico wrote:

On Fri, Aug 18, 2017 at 10:30 AM, John Nagle  wrote:

On 08/17/2017 05:14 PM, John Nagle wrote:

  I'm cleaning up some data which has text description fields from
multiple sources.

A few more cases:

bytearray(b'\xe5\x81ukasz zmywaczyk')


This one has to be Polish, and the first character should be the
letter Ł U+0141 or ł U+0142. In UTF-8, U+0141 becomes C5 81, which is
very similar to the E5 81 that you have.

So here's an insane theory: something attempted to lower-case the byte
stream as if it were ASCII. If you ignore the high bit, 0xC5 looks
like 0x45 or "E", which lower-cases by having 32 added to it, yielding
0xE5. Reversing this transformation yields sane data for several of
your strings - they then decode as UTF-8:

miguel Ángel santos
lidija kmetič
Łukasz zmywaczyk
jiří urbančík
Ľubomír mičko
petr urbančík


   You're exactly right.  The database has columns "name" and
"normalized name".  Normalizing the name was done by forcing it
to lower  case as if in ASCII, even for UTF-8. That resulted in
errors like

KACMAZLAR MEKANİK  -> kacmazlar mekanä°k

Anita Calçados -> anita calã§ados

Felfria Resor för att Koh Lanta -> felfria resor fã¶r att koh lanta

   The "name" field is OK; it's just the "normalized name" field
that is sometimes garbaged. Now that I know this, and have properly
captured the "name" field in UTF-8 where appropriate, I can
regenerate the "normalized name" field.  MySQL/MariaDB know how
to lower-case UTF-8 properly.
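The corruption is easy to reproduce. Here is a sketch of the suspected bug (not the actual normalizer, which I haven't seen): lower-case each byte as if it were ASCII, ignoring the high bit, then view the result as Latin-1:

```python
def buggy_ascii_lower(data: bytes) -> bytes:
    # Add 0x20 to any byte whose low seven bits look like 'A'..'Z',
    # as an ASCII-only lower-caser would, ignoring the high bit.
    return bytes(b + 0x20 if 0x41 <= (b & 0x7F) <= 0x5A else b
                 for b in data)

mangled = buggy_ascii_lower("Anita Calçados".encode("utf-8"))
print(mangled.decode("latin-1"))   # anita calã§ados
```

That reproduces the "anita calã§ados" garbage above exactly: the UTF-8 lead byte 0xC3 of ç gets "lower-cased" to 0xE3.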

   Clean data at last.  Thanks.

   The database, by the way, is a historical snapshot of startup
funding, from Crunchbase.

John Nagle
--
https://mail.python.org/mailman/listinfo/python-list


Re: What extended ASCII character set uses 0x9D?

2017-08-17 Thread John Nagle

On 08/17/2017 10:12 PM, Ian Kelly wrote:


Here's some more 0x9d usage, each from a different data item:


Guitar Pro, JamPlay, RedBana\\\'s Audition,\x9d Doppleganger\x99s The
Lounge\x9d or Heatwave Interactive\x99s Platinum Life Country,\\"


This one seems like a good hint since \x99 here looks like it should
be an apostrophe. But what character set has an apostrophe there? The
best I can come up with is that 0xE2 0x80 0x99 is "right single
quotation mark" in UTF-8. Also known as the "smart apostrophe", so it
could have been entered by a word processor.

The problem is that if that's what it is, then two out of the three
bytes are outright missing. If the same thing happened to \x9d then
who knows what's missing from it?

One possibility is that it's the same two bytes. That would make it
0xE2 0x80 0x9D which is "right double quotation mark". Since it keeps
appearing after ending double quotes that seems plausible, although
one has to wonder why it appears *in addition to* the ASCII double
quotes.
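The arithmetic behind that theory is easy to check: U+201D really does end in the mystery byte.

```python
# RIGHT DOUBLE QUOTATION MARK: three UTF-8 bytes, the last one 0x9d.
print("\u201d".encode("utf-8"))    # b'\xe2\x80\x9d'
```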


I was wondering if it was a signal to some word processor to
apply smart quote handling.


This has me puzzled.  It's often, but not always, after a close quote.
"TM" or "(R)" might make sense, but what non-Unicode character set
has those?  And "green"(tm) makes no sense.


CP-1252 has ™ at \x99, perhaps coincidentally. CP-1252 and Latin-1
both have ® at \xae.


   That's helpful.  All those text snippets failed Windows-1252
decoding, though, because 0x9d isn't in Windows-1252.

   I'm coming around to the idea that some of these snippets
have been previously mis-converted, which is why they make no sense.
Since, as someone pointed out, there was UTF-8 which had been
run through an ASCII-type lower casing algorithm, that's a reasonable
assumption.  Thanks for looking at this, everyone.  If a string won't
parse as either UTF-8 or Windows-1252, I'm just going to convert the
bogus stuff to the Unicode replacement character. I might remove
0x9d chars, since that never seems to affect readability.
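A sketch of that cleanup policy (the function name is mine, not from any library): try strict UTF-8, then strict Windows-1252, and as a last resort drop the 0x9d bytes and decode with the Unicode replacement character.

```python
def best_effort_decode(raw: bytes) -> str:
    """Try strict UTF-8, then strict Windows-1252; otherwise drop the
    unmappable 0x9d bytes and substitute U+FFFD for anything left."""
    for enc in ("utf-8", "windows-1252"):
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            pass
    return raw.replace(b"\x9d", b"").decode("utf-8", errors="replace")

print(best_effort_decode(b'caf\xe9'))                  # café (via Windows-1252)
print(best_effort_decode(b'\\"green\\"\x9d living'))   # \"green\" living
```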

John Nagle



Re: What extended ASCII character set uses 0x9D?

2017-08-17 Thread John Nagle
On 08/17/2017 05:53 PM, Chris Angelico wrote:
> On Fri, Aug 18, 2017 at 10:30 AM, John Nagle  wrote:

>> On 08/17/2017 05:14 PM, John Nagle wrote:
>>>   I'm cleaning up some data which has text description fields from
>>> multiple sources.
>> A few more cases:
>>
>> bytearray(b'\xe5\x81ukasz zmywaczyk')
>
> This one has to be Polish, and the first character should be the
> letter Ł U+0141 or ł U+0142. In UTF-8, U+0141 becomes C5 81, which is
> very similar to the E5 81 that you have.
>
> So here's an insane theory: something attempted to lower-case the byte
> stream as if it were ASCII. If you ignore the high bit, 0xC5 looks
> like 0x45 or "E", which lower-cases by having 32 added to it, yielding
> 0xE5. Reversing this transformation yields sane data for several of
> your strings - they then decode as UTF-8:
>
> miguel Ángel santos
> lidija kmetič
> Łukasz zmywaczyk
> jiří urbančík
> Ľubomír mičko
> petr urbančík

   I think you're right for those.  I'm working from a MySQL dump of
supposedly LATIN-1 data, but LATIN-1 will accept anything. I've
found UTF-8 and Windows-1252 in there. It's quite possible that someone
lower-cased UTF-8 stored in a LATIN-1 field.  There are lots of
questions on the web which complain about getting a Python decode error
on 0x9d, and the usual answer is "Use Latin-1". But that doesn't really
decode properly; it just doesn't generate an exception.
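The reversal Chris describes can be automated as a small backtracking search. This is a heuristic sketch (names are mine, no guarantees): any byte in 0xE1-0xFA may be a lower-cased 0xC1-0xDA UTF-8 lead byte, so try restoring candidates until the whole field decodes as UTF-8.

```python
def try_unlower_utf8(data):
    """Heuristic: an ASCII-style lower-caser would have added 0x20 to
    bytes whose low seven bits looked like 'A'..'Z'.  Try subtracting
    it back from each candidate byte; return the decoded string on the
    first combination that is valid UTF-8, or None if none works."""
    candidates = [i for i, b in enumerate(data) if 0xE1 <= b <= 0xFA]
    buf = bytearray(data)

    def search(i):
        if i == len(candidates):
            try:
                return buf.decode("utf-8")
            except UnicodeDecodeError:
                return None
        pos = candidates[i]
        for delta in (0x20, 0):      # prefer the "was a lead byte" guess
            buf[pos] = data[pos] - delta
            result = search(i + 1)
            if result is not None:
                return result
        return None

    return search(0)

print(try_unlower_utf8(b'\xe5\x81ukasz zmywaczyk'))   # Łukasz zmywaczyk
print(try_unlower_utf8(b'M\x81\x81\xfcnster'))        # None - still a puzzle
```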

> That doesn't work for everything, though. The 0x81 0x81 and 0x9d ones
> are still a puzzle.

   The 0x9d thing seems unrelated to the Polish names thing.  0x9d
shows up in the middle of English text that's otherwise ASCII.
Is this something that can appear as a result of cutting and
pasting from Microsoft Word?

   I'd like to get 0x9d right, because it comes up a lot. The
Polish name thing is rare.  There's only about a dozen of those
in 400MB of database dump. There are hundreds of 0x9d hits.

Here's some more 0x9d usage, each from a different data item:


Guitar Pro, JamPlay, RedBana\\\'s Audition,\x9d Doppleganger\x99s The 
Lounge\x9d or Heatwave Interactive\x99s Platinum Life Country,\\"


for example \\"I\\\'ve seen the bull run in Pamplona, Spain\x9d.\\" 
Everything


Netwise Depot is  a \\"One Stop Web Shop\\"\x9d that provides

sustainable \\"green\\"\x9d living

are looking for a \\"Do It for Me\\"\x9d solution


This has me puzzled.  It's often, but not always, after a close quote.
"TM" or "(R)" might make sense, but what non-Unicode character set
has those?  And "green"(tm) makes no sense.

John Nagle




Re: What extended ASCII character set uses 0x9D?

2017-08-17 Thread John Nagle

On 08/17/2017 05:14 PM, John Nagle wrote:
>  I'm cleaning up some data which has text description fields from
> multiple sources.
A few more cases:

bytearray(b'miguel \xe3\x81ngel santos')
bytearray(b'lidija kmeti\xe4\x8d')
bytearray(b'\xe5\x81ukasz zmywaczyk')
bytearray(b'M\x81\x81\xfcnster')
bytearray(b'ji\xe5\x99\xe3\xad urban\xe4\x8d\xe3\xadk')
bytearray(b'\xe4\xbdubom\xe3\xadr mi\xe4\x8dko')
bytearray(b'petr urban\xe4\x8d\xe3\xadk')

0x9d is the most common; that occurs in English text. The others
seem to be in some Eastern European character set.

Understand, there's no metadata available to disambiguate this. What I
have is a big CSV file in which different character sets are mixed.
Each field has a uniform character set, so I need character set
detection on a per-field basis.
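A minimal per-field detector along these lines (the ordering is a policy choice; Latin-1 comes last because it never fails, as noted elsewhere in the thread):

```python
def guess_field_encoding(raw: bytes) -> str:
    """Return the first encoding that strictly decodes the field.
    Latin-1 accepts any byte, so it is the fallback of last resort."""
    for enc in ("utf-8", "windows-1252"):
        try:
            raw.decode(enc)
            return enc
        except UnicodeDecodeError:
            pass
    return "latin-1"

print(guess_field_encoding('lidija kmetič'.encode('utf-8')))   # utf-8
print(guess_field_encoding(b'\\"Perfect Gift Idea\\"\x9d'))    # latin-1
```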

John Nagle



What extended ASCII character set uses 0x9D?

2017-08-17 Thread John Nagle

I'm cleaning up some data which has text description fields from
multiple sources. Some are in UTF-8. Some are in WINDOWS-1252.
And some are in some other character set. So I have to examine and
sanity check each field in a database dump, deciding which character
set best represents what's there.

   Here's a hard case:

 g1 = bytearray(b'\\"Perfect Gift Idea\\"\x9d Each time')

 g1.decode("utf8")
   UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9d in position 
21: invalid start byte


  g1.decode("windows-1252")
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position
21: character maps to <undefined>


0x9d is unmapped in "windows-1252", according to

https://en.wikipedia.org/wiki/Windows-1252

So the Python codec isn't wrong here.

Trying "latin-1"

  g1.decode("latin-1")
 '\\"Perfect Gift Idea\\"\x9d Each time'

That just converts 0x9d in the input to 0x9d in Unicode.
That's "Operating System Command" (the "Windows" key?)
That's clearly wrong; some kind of quote was intended.
Any ideas?


John Nagle


Unicode support in Python 2.7.8 - 16 bit

2017-03-07 Thread John Nagle

   How do I test if a Python 2.7.8 build was built for 32-bit
Unicode?  (I'm dealing with shared hosting, and I'm stuck
with their provided versions.)

If I give this to Python 2.7.x:

sy = u'\U0001f60f'

len(sy) is 1 on a Ubuntu 14.04LTS machine, but 2 on the
Red Hat shared hosting machine.  I assume "1" indicates
32-bit Unicode capability, and "2" indicates 16-bit.
It looks like  Python 2.x in 16-bit mode is using a UTF-16
pair encoding, like Java. Is that right?  Is it documented
somewhere?
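There is a documented way to check: sys.maxunicode distinguishes narrow (UTF-16) builds from wide (UCS-4) builds of Python 2. PEP 393 removed the distinction in Python 3.3+, which always reports the wide value.

```python
import sys

# Narrow (UTF-16) builds report 0xFFFF; wide (UCS-4) builds, and every
# Python 3.3+ interpreter, report 0x10FFFF.
if sys.maxunicode == 0x10FFFF:
    print("wide build: non-BMP characters have len 1")
else:
    print("narrow build: non-BMP characters are surrogate pairs, len 2")
```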

(Annoyingly, while the shared host has a Python 3, it's
3.2.3, which rejects "u" Unicode string constants and
has other problems in the MySQL area.)

John Nagle




Who still supports recent Python on shared hosting

2017-03-05 Thread John Nagle
I'm looking for shared hosting that supports
at least Python 3.4.

Hostgator: Highest version is Python 3.2.
Dreamhost: Highest version is Python 2.7.
Bluehost: Install Python yourself.
InMotion: Their documentation says 2.6.

Is Python on shared hosting dead?
I don't need a whole VM and something I
have to sysadmin, just a small shared
hosting account.

    John Nagle


input vs. readline

2016-07-08 Thread John Nagle
   If "readline" is imported, "input" gets "readline" capabilities.
It also loses the ability to input control characters.  It doesn't
matter where "readline" is imported; an import in some library
module can trigger this.  You can try this with a simple test
case:

   print(repr(input()))

as a .py file, run in a console.  Try typing "aaa", then ESC, then "bbb".
On Windows 7, the output is "bbb".  On Linux, it's "aaa\x1bbbb".

So it looks like "readline" is implicitly imported on Windows.

   I have a multi-threaded Python program which recognizes ESC as
a command to stop something.  This works on Linux, but not on
Windows.  Apparently something in Windows land pulls in "readline".

   What's the best way to get input from the console (not any
enclosing shell script) that's cross-platform, cross-version
(Python 2.7, 3.x), and doesn't do "readline" processing?

(No, I don't want to use signals, a GUI, etc.  This is simulating
a serial input device while logging messages appear.  It's a debug
facility to be able to type input in the console window.)
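One approach (an untested sketch, using only standard-library calls: msvcrt on Windows, termios/tty elsewhere) is to bypass line editing entirely and read raw keystrokes from the real console:

```python
import sys

def read_line_raw():
    """Read one line from the console without readline processing,
    keeping control characters such as ESC (\\x1b) in the result."""
    if sys.platform == "win32":
        import msvcrt
        chars = []
        while True:
            ch = msvcrt.getwch()        # one keystroke, no line editing
            if ch in ("\r", "\n"):
                return "".join(chars)
            chars.append(ch)
    else:
        import termios
        import tty
        fd = sys.stdin.fileno()
        old = termios.tcgetattr(fd)
        try:
            tty.setcbreak(fd)           # disable canonical-mode editing
            chars = []
            while True:
                ch = sys.stdin.read(1)
                if ch == "\n":
                    return "".join(chars)
                chars.append(ch)
        finally:
            termios.tcsetattr(fd, termios.TCSADRAIN, old)
```

This only handles the "no readline" requirement; echo handling and Python 2/3 differences would still need attention.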

John Nagle


Re: py2exe crashes on simple program

2016-07-05 Thread John Nagle
On 7/4/2016 11:13 PM, Steven D'Aprano wrote:
> If you change it to "library.exe" does it work? Also, I consider this
> a bug in py2exe: - it's an abuse of assert, using it to check
> user-supplied input; - it's a failing assertion, which by definition
> is a bug.

   I'm not trying to build "library.zip". That's a work
file py2exe created.  If I delete it, it's re-created by py2exe.

The problem seems to be that my "setup.py" file didn't include
a "console" entry, which tells py2exe the build target.
Apparently, without that py2exe tries to build something bogus and
blows up.

After fixing that, the next error is

Building 'dist\baudotrss.exe'.
error: [Errno 2] No such file or directory: 'C:\\Program
Files\\Python35\\lib\\site-packages\\py2exe\\run-py3.5-win-amd64.exe'

Looks like Pip installed (yesterday) a version of py2exe that doesn't
support Python 3.5.  The py2exe directory contains
"run-py3.3-win-amd64.exe" and "run-py3.4-win-amd64.exe", but
not 3.5 versions.

That's what PyPi says at https://pypi.python.org/pypi/py2exe.
The last version of py2exe was uploaded two years ago (2014-05-09)
and is for Python 3.4.  So of course it doesn't have the 3.5 binary
executable it needs.

Known problem. Stack Overflow reports py2exe is now broken for Python
3.5.  Apparently it's a non-trivial fix, too.

http://stackoverflow.com/questions/32963057/is-there-a-py2exe-version-thats-compatible-with-python-3-5

cx_freeze has been suggested as an alternative, but its own
documents indicate it's only been tested through Python 3.4.
Someone reported success with a development version.

I guess people don't create Python executables much.

John Nagle




py2exe crashes on simple program

2016-07-04 Thread John Nagle
  I'm trying to create an executable with py2exe.
The program runs fine in interpretive mode.  But
when I try to build an executable, py2exe crashes with
an assertion error. See below.

  This is an all-Python program; no binary modules
other than ones that come with the Python 3.5.2
distribution. Running "python setup.py bdist"
works, so "setup.py" is sane.  What's giving
py2exe trouble?

U:\>python setup.py py2exe
running py2exe
running build_py
Building shared code archive 'dist\library.zip'.
Traceback (most recent call last):
  File "setup.py", line 14, in <module>
packages=['baudotrss'],
  File "C:\Program Files\Python35\lib\distutils\core.py", line 148, in setup
dist.run_commands()
  File "C:\Program Files\Python35\lib\distutils\dist.py", line 955, in
run_commands
self.run_command(cmd)
  File "C:\Program Files\Python35\lib\distutils\dist.py", line 974, in
run_command
cmd_obj.run()
  File "C:\Program
Files\Python35\lib\site-packages\py2exe\distutils_buildexe.py", line
188, in run
self._run()
  File "C:\Program
Files\Python35\lib\site-packages\py2exe\distutils_buildexe.py", line
268, in _run
builder.build()
  File "C:\Program Files\Python35\lib\site-packages\py2exe\runtime.py",
line 261, in build
self.build_archive(libpath, delete_existing_resources=True)
  File "C:\Program Files\Python35\lib\site-packages\py2exe\runtime.py",
line 426, in build_archive
assert mod.__file__.endswith(EXTENSION_SUFFIXES[0])
AssertionError


Python 3.5.2 / Win7 / AMD64.

John Nagle


Re: Python 3 lack of support for fcgi/wsgi.

2015-03-30 Thread John Nagle
On 3/29/2015 7:11 PM, John Nagle wrote:
> Meanwhile, I've found two more variants on "flup"
> 
>   https://pypi.python.org/pypi/flipflop
>   https://pypi.python.org/pypi/flup6
> 
> All of these are descended from the original "flup" code base.
> 
> PyPi also has
> 
>   fcgi-python (Python 2.6, Windows only.)
>   fcgiapp (circa 2005)
>   superfcgi (circa 2009)
> 
> Those can probably be ignored.
> 
> One of the "flup" variants may do the job, but since there
> are so many, and no single version has won out, testing is
> necessary.  "flipflop" looks promising, simply because the
> author took all the code out that you don't need on a server.

   "flipflop" works well with Apache.  It does log
"WARNING: SCRIPT_NAME does not match REQUEST_URI" for any URL
renamed using mod_rename with Apache, but other than that,
it seems to do the job.  The warning message was copied
over from "flup", and there's an issue for it for one of the
"flup" variants.  So I referenced that issue for "flipflop":

https://github.com/Kozea/flipflop/issues

That's part of the problem of having all those forks - now
each bug has to be fixed in each fork.

After all this, the production system is now running entirely
on Python 3.

John Nagle



Re: Python 3 lack of support for fcgi/wsgi.

2015-03-29 Thread John Nagle
On 3/29/2015 6:03 PM, Paul Rubin wrote:
> Those questions seem unfair to me.  Nagle posted an experience report
> about a real-world project to migrate a Python 2 codebase to Python 3.
> He reported hitting more snags than some of us might expect purely from
> the Python 3 propaganda ("oh, just run the 2to3 utility and it does
> everything for you").  The report presented info worth considering for
> anyone thinking of doing a 2-to-3 migration of their own, or maybe even
> choosing between 2 and 3 for a new project.  I find reports like that to
> be valuable whether or not they suggest fixes for the snags.

Thanks.

Meanwhile, I've found two more variants on "flup"

https://pypi.python.org/pypi/flipflop
https://pypi.python.org/pypi/flup6

All of these are descended from the original "flup" code base.

PyPi also has

fcgi-python (Python 2.6, Windows only.)
fcgiapp (circa 2005)
superfcgi (circa 2009)

Those can probably be ignored.

One of the "flup" variants may do the job, but since there
are so many, and no single version has won out, testing is
necessary.  "flipflop" looks promising, simply because the
author took all the code out that you don't need on a server.

CPAN, the Perl module archive, has some curation and testing.
PyPi lacks that, which is how we end up with situations like
this, where there are 11 ways to do something, most of which
don't work.

Incidentally, in my last report, I reported problems with BS4,
PyMySQL, and Pickle. I now have workarounds for all of those,
but not fixes. The bug reports I listed last time contain the
workaround code.

John Nagle







Re: Python 3 lack of support for fcgi/wsgi.

2015-03-29 Thread John Nagle
On 3/29/2015 1:19 PM, John Nagle wrote:
> On 3/29/2015 12:11 PM, Ben Finney wrote:
>> John Nagle  writes:
>>
>>> The Python 3 documentation at
>>> https://docs.python.org/3/howto/webservers.html
>>>
>>> recommends "flup"
>>
>> I disagree. In a section where it describes FastCGI, it presents a tiny
>> example as a way to test the packages installed. The example happens to
>> use ‘flup’.
>>
>> That's quite different from a recommendation.
>>
>>> I get the feeling, again, that nobody actually uses this stuff.
> 
> So do others. See http://www.slideshare.net/mitsuhiko/wsgi-on-python-3
> 
> "A talk about the current state of WSGI on Python 3. Warning:
> depressing. But it does not have to stay that way"
> 
> "wsgiref on Python 3 is just broken."
> 
> "Python 3 that is supposed to make unicode easier is causing a lot more
> problems than unicode environments on Python 2"
> 
> "The Python 3 stdlib is currently incredible broken but because there
> are so few users, these bugs stay under the radar."
> 
> That was written in 2010.  Most of that stuff is still broken.
> Here's his detailed critique:
> 
> http://lucumr.pocoo.org/2010/5/25/wsgi-on-python-3/
> 
>> You have found yet another poorly-maintained package which is not at all
>> the responsibility of Python 3.
>> Why are you discussing it as though Python 3 is at fault?
> 
>That's a denial problem.  Uncritical fanboys are part of the problem,
> not part of the solution.
> 
>Practical problems: the version of "flup" on PyPi is so out of date
> as to be useless.  The original author abandoned the software.  There
> are at least six forks of "flup" on Github:
> 
> https://github.com/Pyha/flup-py3.3
> https://github.com/Janno/flup-py3.3
> https://github.com/pquentin/flup-py3
> https://github.com/SmartReceipt/flup-server
> https://github.com/dnephin/TreeOrg/tree/master/app-root/flup
> https://github.com/noxan/flup
> 
> The first three look reasonably promising; the last three look
> abandoned.  But why are there so many, and what are the
> differences between the first three?   Probably nobody
> was able to fix all the Python 3 related problems documented by
> Ronacher in 2010.  None of the versions have much usage.  Nobody
> thought their version was good enough to push it to Pypi.
> 
> All those people had to struggle to try to get a basic capability for
> web development using Python to work. To use WSGI with Python 3, you
> need to do a lot of work.  Or stay with Python 2.
> 
> Python 3 still isn't ready for prime time.
> 
>   John Nagle
> 



Python 3 lack of support for fcgi/wsgi.

2015-03-29 Thread John Nagle
The Python 2 module "fcgi" is gone in Python 3.

The Python 3 documentation at

https://docs.python.org/3/howto/webservers.html

recommends "flup" and links here:

https://pypi.python.org/pypi/flup/1.0

That hasn't been updated since 2007, and the SVN repository linked there
is gone.  The recommended version is abandoned. pip3
tries to install version 1.0.2, from 2009.  That's here:
https://pypi.python.org/pypi/flup/1.0.2 That version is supported
only for Python 2.5 and 2.6.

There's a later version on Github:

https://github.com/Pyha/flup-py3.3

But that's not what "pip3" is installing.

I get the feeling, again, that nobody actually uses this stuff.

"pip3" seems perfectly happy to install modules that don't work with
Python 3.  Try "pip3 install dnspython", for example.  You need
"dnspython3", but pip3 doesn't know that.

There's "wsgiref", which looks more promising, but has a different
interface.  That's not what the Python documentation recommends as
the first choice, but it's a standard module.
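For reference, the wsgiref interface is small; a minimal app looks like this (standard library only, names and port are illustrative):

```python
from wsgiref.simple_server import make_server

def app(environ, start_response):
    # The WSGI callable: report status and headers, return an
    # iterable of byte strings as the response body.
    start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8")])
    return [b"hello from wsgiref\n"]

# make_server("", 8000, app).serve_forever()  # uncomment to serve on :8000
```

The interface differs from the old "fcgi" module, but it is standardized (PEP 3333) and ships with Python.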

I keep thinking I'm almost done with Python 3 hell, but then I
get screwed by Python 3 again.

John Nagle



Workaround for BeautifulSoup/HTML5parser bug

2015-03-21 Thread John Nagle
   BeautifulSoup 4 and HTML5parser are known to not play well together.
I have a workaround for that.  See

https://bugs.launchpad.net/beautifulsoup/+bug/1430633

This isn't a fix; it's a postprocessor to fix broken BS4 trees.
This is for use until the BS4 maintainers fix the bug.

    John Nagle


Re: Python 2 to 3 conversion - embrace the pain

2015-03-17 Thread John Nagle
On 3/15/2015 4:43 PM, Roy Smith wrote:
> In article ,
>  Mario Figueiredo  wrote:
> 
>> What makes you think your anecdotal bugs constitute any sort of
>> evidence this programming language isn't ready to be used by the
>> public?
> 
> There's several levels of "ready".
> 
> I'm sure the core language is more than ready for production use for a 
> project starting from scratch which doesn't rely on any third party 
> libraries.
> 
> The next step up on the "ready" ladder would be a new project which will 
> require third-party libraries.  And that pretty much means any 
> non-trivial project.  I'm reasonably confident that most common use 
> cases can now be covered by p3-ready third party modules. 

   If only that were true.  Look what I'm reporting bugs on:

ssl - a core Python module.
cPickle - a core Python module.
pymysql - the pure-Python recommended way to talk to MySQL.
bs4/html5parser - a popular parser for HTML5

We're not in exotic territory here.  I've done lots of exotic
projects, but this isn't one of them.

There's progress.  The fix to "ssl" has been made and committed.
I have a workaround for the cPickle bug - use pure-Python Pickle.
I have a workaround for the pymysql problem, and a permanent fix
is going into the next release of pymysql. I have a tiny test case
for bs4/html5parser that reproduces the bug on a tiny snippet of
HTML, and that's been uploaded to the BS4 issues tracker.
I don't have a workaround for that.

All this has cost me about two weeks of work so far.

The "everything is just fine" comments are not helpful.
Denial is not a river in Egypt.

John Nagle




Re: Python 2 to 3 conversion - embrace the pain

2015-03-15 Thread John Nagle
On 3/14/2015 1:00 AM, Marko Rauhamaa wrote:
> John Nagle :
>>   I'm approaching the end of converting a large system from Python 2
>> to Python 3. Here's why you don't want to do this.
> 
> A nice report, thanks. Shows that the slowness of Python 3 adoption is
> not only social inertia.
> Marko

Thanks.

Some of the bugs I listed are so easy to hit that I suspect those
packages aren't used much.  Those bugs should have been found years
ago.  Fixed, even.  I shouldn't be discovering them in 2015.

I appreciate all the effort put in by developers in fixing these
problems.  Python 3 is still a long way from being ready for prime
time, though.

John Nagle



Re: Python 2 to 3 conversion - embrace the pain

2015-03-13 Thread John Nagle
On 3/13/2015 3:27 PM, INADA Naoki wrote:
> Hi, John.  I'm maintainer of PyMySQL.
> 
> I'm sorry about bug of PyMySQL.  But the bug is completely unrelated
> to Python 3.
> You may encounter the bug on Python 2 too.

   True.  But much of the pain of converting to Python 3
comes from having to switch packages because the Python 2
package didn't make it to Python 3.

   All the bugs I'm discussing reflect forced package
changes or upgrades.  None were voluntary on my part.

John Nagle



Python 2 to 3 conversion - embrace the pain

2015-03-13 Thread John Nagle
  I'm approaching the end of converting a large system from Python 2 to
Python 3.  Here's why you don't want to do this.

  The language changes aren't that bad, and they're known and
documented.  It's the package changes that are the problem.
Discovering and fixing all the new bugs takes a while.


BeautifulSoup:

BeautifulSoup 3 has been phased out. I had my own version of
BeautifulSoup 3, modified for greater robustness.  But that was
years ago.  So I converted to BeautifulSoup 4, as the documentation
says to do.

The HTML5parser module is claimed to parse as a browser does, with
all the error tolerance specified in the HTML5 spec. (The spec
actually specifies how to handle bad HTML consistently across
browsers in great detail, and HTML5parser has code in it for that.)

It doesn't deliver on that promise, though. Some sites crash
BeautifulSoup 4/HTML5parser.  Try "kroger.com", which has HTML with
.  The parse tree constructed has a bad link,
and trying to use the parse tree results in exceptions.
Submitted bug report.  Appears to be another case of
a known bug.  No workaround at this time.

https://bugs.launchpad.net/beautifulsoup/+bug/1270611
https://bugs.launchpad.net/beautifulsoup/+bug/1430633


PyMySQL:

"Pymysql is a pure Python drop-in replacement for MySQLdb".
Sounds good.  Then I discover that LOAD DATA LOCAL wasn't
implemented in the version on PyPi.  It's on Github, though,
and I got the authors to push that out to PyPi.  It
works on test cases.  But it doesn't work on a big job,
because the default size of MySQL packets was set to 16MB.
This made the LOAD DATA LOCAL code try to send the entire
file being loaded as one giant MySQL packet.  Unless you
configure the MySQL server with 16MB buffers, this fails, with
an obscure "server has gone away" message.  Found the
problem, came up with a workaround, submitted a bug report,
and it's being fixed.

https://github.com/PyMySQL/PyMySQL/issues/317
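For anyone hitting the same "server has gone away" failure before the fix lands, the server-side mitigation is raising the packet limit in my.cnf (64M here is only an illustrative value; it must exceed the size of the file being loaded as one packet):

```ini
[mysqld]
# Allow client packets (e.g. a whole LOAD DATA LOCAL payload) up to 64 MB.
max_allowed_packet=64M
```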


SSL:

All the new TLS/SSL support is in Python 3. That's good.
Unfortunately, using Firefox's set of SSL certs, some
important sites (such as "verisign.com") don't validate.
This turned out to be a complex problem involving Verisign
cross-signing a certificate, which created a certificate
hierarchy that some versions of OpenSSL can't handle.
There's now a version of OpenSSL that can handle it, but
the Python library has to make a call to use it, and
that's going in but isn't deployed yet.  This bug
resulted in much finger-pointing between the Python
and OpenSSL developers, the Mozilla certificate store
maintainers, and Verisign.  It's now been sorted out,
but not all the fixes are deployed.  Because "ssl" is
a core Python module, this will remain broken until the
next Python release, on both the 2.7 and 3.4 lines.

Also, for no particularly good reason, the exception
"SSL.CertificateError" is not a subclass of "SSL.Error",
resulting in a routine exception not being recognized.

Bug reports submitted for both OpenSSL and Python SSL.
Much discussion.  Problem fixed, but fix is in next
version of Python.  No workaround at this time.

http://bugs.python.org/issue23476


Pickle:

As I just posted recently, CPickle on Python 3.4 seems to
have a memory corruption bug.  Pure-Python Pickle is fine.
So a workaround is possible.  Bug report submitted.

http://bugs.python.org/issue23655


Converting a large application program to Python 3
thus required diagnosing four library bugs and filing
bug reports on all of them.  Workarounds are known
for two of the problems.  I can't deploy the Python 3
version on the servers yet.

John Nagle


Re: Python3 "pickle" vs. stdin/stdout - unable to get clean byte streams in Python 3

2015-03-12 Thread John Nagle
On 3/12/2015 5:18 PM, John Nagle wrote:
> On 3/12/2015 2:56 PM, Cameron Simpson wrote:
>> On 12Mar2015 12:55, John Nagle  wrote:
>>> I have working code from Python 2 which uses "pickle" to talk to a
>>> subprocess via stdin/stdout.  I'm trying to make that work in Python
>>> 3.

   I'm starting to think that the "cpickle" module, which Python 3
uses by default, has a problem. After the program has been
running for a while, I start seeing errors such as

  File "C:\projects\sitetruth\InfoSiteRating.py", line 200, in scansite
if len(self.badbusinessinfo) > 0 :  # if bad stuff
NameError: name 'len' is not defined

which ought to be impossible in Python, and

  File "C:\projects\sitetruth\subprocesscall.py", line 129, in send
self.writer.dump(args)  # send data
OSError: [Errno 22] Invalid argument

from somewhere deep inside CPickle.

I got

  File "C:\projects\sitetruth\InfoSiteRating.py", line 223, in
get_rating_text
(ratingsmalliconurl, ratinglargiconurl, ratingalttext) =
DetailsPageBuilder.getratingiconinfo(rating)
NameError: name 'DetailsPageBuilder' is not defined
(That's an imported module.  It worked earlier in the run.)

and finally, even after I deleted all .pyc files and all Python
cache directories:  

Fatal Python error: GC object already tracked

Current thread 0x1a14 (most recent call first):
  File "C:\python34\lib\site-packages\pymysql\connections.py", line 411
in description
  File "C:\python34\lib\site-packages\pymysql\connections.py", line 1248
in _get_descriptions
  File "C:\python34\lib\site-packages\pymysql\connections.py", line 1182
in _read_result_packet
  File "C:\python34\lib\site-packages\pymysql\connections.py", line 1132
in read
  File "C:\python34\lib\site-packages\pymysql\connections.py", line 929
in _read_query_result
  File "C:\python34\lib\site-packages\pymysql\connections.py", line 768
in query
  File "C:\python34\lib\site-packages\pymysql\cursors.py", line 282 in
_query
  File "C:\python34\lib\site-packages\pymysql\cursors.py", line 134 in
execute
  File "C:\projects\sitetruth\domaincacheitem.py", line 128 in select
  File "C:\projects\sitetruth\domaincache.py", line 30 in search
  File "C:\projects\sitetruth\ratesite.py", line 31 in ratedomain
  File "C:\projects\sitetruth\RatingProcess.py", line 68 in call
  File "C:\projects\sitetruth\subprocesscall.py", line 140 in docall
  File "C:\projects\sitetruth\subprocesscall.py", line 158 in run
  File "C:\projects\sitetruth\RatingProcess.py", line 89 in main
  File "C:\projects\sitetruth\RatingProcess.py", line 95 in <module>

That's a definite memory error.

So something is corrupting memory.  Probably CPickle.

All my code is in Python. Every library module came in via "pip", into a
clean Python 3.4.3 (32 bit) installation on Win7/x86-64.
Currently installed packages:

beautifulsoup4 (4.3.2)
dnspython3 (1.12.0)
html5lib (0.999)
pip (6.0.8)
PyMySQL (0.6.6)
pyparsing (2.0.3)
setuptools (12.0.5)
six (1.9.0)

And it works fine with Python 2.7.9.

Is there some way to force the use of the pure Python pickle module?
My guess is that there's something about reusing "pickle" instances
that botches memory use in CPython 3's C code for "cpickle".
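One way to force the pure-Python implementation for debugging is to mask the C accelerator module before pickle is imported. This is a sketch relying on the documented behavior that a None entry in sys.modules makes an import fail, so pickle.py falls back to its Python classes:

```python
import sys

# Debugging sketch: mask the C accelerator so pickle.py falls back to
# its pure-Python Pickler/Unpickler.  A None entry in sys.modules makes
# "from _pickle import ..." raise ImportError inside pickle.py.
sys.modules.pop('pickle', None)   # ensure pickle is (re)imported fresh
sys.modules['_pickle'] = None     # importing the accelerator now fails
import pickle

# pickle.Pickler is now the class defined in pickle.py, not in _pickle.
```

This has to run before anything else imports pickle in the process, which is why the existing entry is popped first.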

John Nagle  

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Python3 "pickle" vs. stdin/stdout - unable to get clean byte streams in Python 3

2015-03-12 Thread John Nagle
On 3/12/2015 2:56 PM, Cameron Simpson wrote:
> On 12Mar2015 12:55, John Nagle  wrote:
>> I have working code from Python 2 which uses "pickle" to talk to a
>> subprocess via stdin/stdout.  I'm trying to make that work in Python
>> 3. First, the subprocess Python is invoked with the "-d" option, so
>> stdin and stdout are supposed to be unbuffered binary streams.
> 
> You shouldn't need to use unbuffered streams specifically. It should
> be enough to .flush() the output stream (at whichever end) after you
> have written the pickle data.

Doing that.

It's a repeat-transaction thing.  Main process sends a pickled
item to subprocess, subprocess reads item, subprocess does work,
subprocess writes a pickled item to parent.  This repeats.

I call writer.clear_memo() and set reader.memo = {} at the
end of each cycle, to clear Pickle's cache.  That all worked
fine in Python 2.  Are there any known problems with reusing
Python 3 "pickle" streams?

The identical code works with Python 2.7.9; it's converted to Python
3 using "six" so I can run on both Python versions and look for
differences.  I'm using Pickle format 2, for compatibility.
(Tried 0, the ASCII format; it didn't help.)

> I'm skipping some of your discussion; I can see nothing wrong. I
> don't use pickle itself so aside from saying that your use seems to
> conform to the python 3 docs I can't comment more deeply. That said,
> I do use subprocess a fair bit.

 I'll have to put in more logging and see exactly what's going
over the pipes.

John Nagle

-- 
https://mail.python.org/mailman/listinfo/python-list


Python3 "pickle" vs. stdin/stdout - unable to get clean byte streams in Python 3

2015-03-12 Thread John Nagle
  I have working code from Python 2 which uses "pickle"
to talk to a subprocess via stdin/stdout.  I'm trying to
make that work in Python 3.

  First, the subprocess Python is invoked with the "-d" option,
so stdin and stdout are supposed to be unbuffered binary streams.
That was enough in Python 2, but it's not enough in Python 3.

The subprocess and its connections are set up with

  proc = subprocess.Popen(launchargs,stdin=subprocess.PIPE,
stdout=subprocess.PIPE, env=env)

  ...
  self.reader = pickle.Unpickler(self.proc.stdout)
  self.writer = pickle.Pickler(self.proc.stdin, 2)

after which I get

  result = self.reader.load()
TypeError: 'str' does not support the buffer interface

That's as far as traceback goes, so I assume this is
disappearing into C code.

OK, I know I need a byte stream.  I tried

  self.reader = pickle.Unpickler(self.proc.stdout.buffer)
  self.writer = pickle.Pickler(self.proc.stdin.buffer, 2)

That's not allowed.  The "stdin" and "stdout" that are
fields of "proc" do not have "buffer".  So I can't do that
in the parent process.  In the child, though, where
stdin and stdout come from "sys", "sys.stdin.buffer" is valid.
That fixes the "'str' does not support the buffer interface"
error.  But now I get the pickle error "Ran out of input"
on the process child side.  Probably because there's a
str/bytes incompatibility somewhere.

So how do I get clean binary byte streams between parent
and child process?
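A sketch of what should work under these assumptions: in Python 3 the pipes subprocess.Popen returns are already binary file objects, so only the child needs the .buffer unwrapping on sys.stdin/sys.stdout. The inline child program here is illustrative:

```python
import pickle
import subprocess
import sys

# Illustrative child program: it must use sys.stdin.buffer /
# sys.stdout.buffer to get raw byte streams on its side.
CHILD = (
    "import pickle, sys\n"
    "obj = pickle.Unpickler(sys.stdin.buffer).load()\n"
    "pickle.Pickler(sys.stdout.buffer, 2).dump(obj)\n"
    "sys.stdout.buffer.flush()\n"
)

def roundtrip(obj):
    # Parent side: subprocess pipes are binary file objects in Python 3
    # (absent universal_newlines), so pickle can use them directly.
    proc = subprocess.Popen([sys.executable, "-c", CHILD],
                            stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    pickle.Pickler(proc.stdin, 2).dump(obj)
    proc.stdin.flush()
    proc.stdin.close()
    result = pickle.Unpickler(proc.stdout).load()
    proc.wait()
    return result
```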

John Nagle
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Python 2.7.9, 3.4.2 won't verify SSL cert for "verisign.com"

2015-02-17 Thread John Nagle
On 2/17/2015 3:42 PM, Laura Creighton wrote:
> Possibly this bug?
> https://bugs.launchpad.net/ubuntu/+source/openssl/+bug/1014640
> 
> Laura

  Probably that bug in OpenSSL.  Some versions of OpenSSL are
known to be broken for cases where there multiple valid certificate
trees.

  Python ships with its own copy of OpenSSL on Windows.  Tests
for "www.verisign.com"

Win7, x64:

   Python 2.7.9 with OpenSSL 1.0.1j 15 Oct 2014  FAIL
   Python 3.4.2 with OpenSSL 1.0.1i 6 Aug 2014   FAIL
   openssl s_client - OpenSSL 1.0.1h 5 Jun 2014  FAIL

Ubuntu 14.04 LTS, using distro's versions of Python:

   Python 2.7.6 - test won't run, needs create_default_context
   Python 3.4.0 with OpenSSL 1.0.1f 6 Jan 2014.  FAIL
   openssl s_client  OpenSSL 1.0.1f 6 Jan 2014   PASS

   That's with the same cert file in all cases.
The OpenSSL version for Python programs comes from
ssl.OPENSSL_VERSION.

   The Linux situation has me puzzled.  On Linux,
Python is supposedly using the system version of OpenSSL.
The versions match.  Why do Python and the command line
client disagree?  Different options passed to OpenSSL
by Python?

   Here's the little test program:

http://www.animats.com/private/sslbug

   Please try that and let me know what happens on
other platforms.  Works with Python 2.7.9 or 3.x.

John Nagle





-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Python 2.7.9, 3.4.2 won't verify SSL cert for "verisign.com"

2015-02-17 Thread John Nagle
If I remove certs from my "cacert.pem" file passed to
create_default_context, the Python test program rejects domains
it will pass with the certs present.  It's using that file.

So that's not it.  It seems to be an OpenSSL or cert file
problem.  I can reproduce the problem with the OpenSSL command
line client:

   openssl s_client -connect www.verisign.com:443 -CAfile cacert.pem

fails for "www.verisign.com", where "cacert.pem" has been extracted
from Firefox's cert store.

   The error message from OpenSSL

Verify return code: 20 (unable to get local issuer certificate)

Try the same OpenSSL command for other domains ("google.com",
"python.org") and no errors are reported.  More later on this.

So it's not a Python level issue.  The only Python-specific
problem is that the Python library doesn't pass detailed
OpenSSL error codes through in exceptions.  The Python exception
text is "[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed
(_ssl.c:581).", which is a generic message for most OpenSSL errors.

John Nagle

On 2/17/2015 12:00 AM, Laura Creighton wrote:
> I've seen something like this:
> 
> The requests module http://docs.python-requests.org/en/latest/
> ships with its own set of certificates "cacert.pem"
> and ignores the system wide ones -- so, for instance, adding certificates
> to /etc/ssl/certs on your debian or ubuntu system won't work.  I edited
> it by hand and then changed the REQUESTS_CA_BUNDLE environment variable
> to point to it.
> 
> Perhaps your problem is along the same lines?
> 
> Laura 
> 

-- 
https://mail.python.org/mailman/listinfo/python-list


Python 2.7.9, 3.4.2 won't verify SSL cert for "verisign.com"

2015-02-16 Thread John Nagle
Python 2.7.9, Windows 7 x64.
(also 3.4.2 on Win7, and 3.4.0 on Ubuntu 14.04)

   There's something about the SSL cert for "https://www.verisign.com"
that won't verify properly from Python.  The current code looks
like this:

def testurlopen(host, certfile):
    port = httplib.HTTPS_PORT
    sk = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    context = ssl.create_default_context(cafile=certfile)
    sock = context.wrap_socket(sk, server_hostname=host)
    try:
        sock.connect((host, port))
    except EnvironmentError as message:
        print("Connection to \"%s\" failed: %s." % (host, message))
        return False
    print("Connection to \"%s\" succeeded." % (host,))
    return True

Works for "python.org", "google.com", etc.  I can connect to and
dump the server's certificate for those sites.  But for "verisign.com"
and "www.verisign.com", I get

Connection to "verisign.com" failed: [SSL: CERTIFICATE_VERIFY_FAILED]
certificate verify failed (_ssl.c:581).

The certificate file, "cacert.pem", comes from Mozilla's list of
approved certificates, obtained from here:

http://curl.haxx.se/ca/cacert.pem

It has the cert for
"VeriSign Class 3 Public Primary Certification Authority - G5"
which is the root cert for "verisign.com".

After loading that cert file into an SSL context, I can dump the
context from Python with context.get_ca_certs()
and get this dict for that cert:

Cert: {'notBefore': u'Nov  8 00:00:00 2006 GMT',
'serialNumber': u'18DAD19E267DE8BB4A2158CDCC6B3B4A',
'notAfter': 'Jul 16 23:59:59 2036 GMT',
'version': 3L,
'subject': ((('countryName', u'US'),), (('organizationName', u'VeriSign,
Inc.'),),
(('organizationalUnitName', u'VeriSign Trust Network'),),
(('organizationalUnitName', u'(c) 2006 VeriSign, Inc. - For authorized
use only'),),
(('commonName', u'VeriSign Class 3 Public Primary Certification
Authority - G5'),)),
'issuer': ((('countryName', u'US'),), (('organizationName', u'VeriSign, Inc.'),),
(('organizationalUnitName', u'VeriSign Trust Network'),),
(('organizationalUnitName', u'(c) 2006 VeriSign, Inc. - For authorized
use only'),), (('commonName', u'VeriSign Class 3 Public Primary
Certification Authority - G5'),))}
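(Aside: the date strings in that dict are in the fixed format the stdlib helper ssl.cert_time_to_seconds understands, which is handy for sanity-checking validity periods. The values below are copied from the dump above.)

```python
import ssl

# Parse the notBefore/notAfter strings of a getpeercert-style dict
# into POSIX timestamps with the stdlib helper ssl.cert_time_to_seconds.
cert = {'notBefore': 'Nov  8 00:00:00 2006 GMT',
        'notAfter': 'Jul 16 23:59:59 2036 GMT'}

not_before = ssl.cert_time_to_seconds(cert['notBefore'])
not_after = ssl.cert_time_to_seconds(cert['notAfter'])
```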

Firefox is happy with that cert.  The serial number of the root
cert matches the root cert Firefox displays.  So the root cert file
being used has the right cert for the cert chain back from
"www.verisign.com".

If I dump ssl.OPENSSL_VERSION from Python, I get "OpenSSL 1.0.1j 15 Oct
2014".  That's an OK version.

Something about that cert is unacceptable to the Python SSL module, but
what?  "CERTIFICATE VERIFY FAILED" doesn't tell me enough to
diagnose the problem.


John Nagle
-- 
https://mail.python.org/mailman/listinfo/python-list


SSLsocket.getpeercert - request to return ALL the fields of the certificate.

2014-11-12 Thread John Nagle
  In each revision of "getpeercert", a few more fields are returned.
Python 3.2 added "issuer" and "notBefore".  Python 3.4 added
"crlDistributionPoints", "caIssuers", and OCSP URLS. But some fields
still aren't returned.  I happen to need CertificatePolicies, which
is how you distinguish DV, OV, and EV certs.

   Here's what you get now:

{'OCSP': ('http://EVSecure-ocsp.verisign.com',),
 'caIssuers': ('http://EVSecure-aia.verisign.com/EVSecure2006.cer',),
 'crlDistributionPoints':
('http://EVSecure-crl.verisign.com/EVSecure2006.crl',),
 'issuer': ((('countryName', 'US'),),
(('organizationName', 'VeriSign, Inc.'),),
(('organizationalUnitName', 'VeriSign Trust Network'),),
(('organizationalUnitName',
  'Terms of use at https://www.verisign.com/rpa (c)06'),),
(('commonName', 'VeriSign Class 3 Extended Validation SSL
CA'),)),
 'notAfter': 'Mar 22 23:59:59 2015 GMT',
 'notBefore': 'Feb 20 00:00:00 2014 GMT',
 'serialNumber': '69A7BC85C106DDE1CF4FA47D5ED813DC',
 'subject': ((('1.3.6.1.4.1.311.60.2.1.3', 'US'),),
 (('1.3.6.1.4.1.311.60.2.1.2', 'Delaware'),),
 (('businessCategory', 'Private Organization'),),
 (('serialNumber', '2927442'),),
 (('countryName', 'US'),),
 (('postalCode', '60603'),),
 (('stateOrProvinceName', 'Illinois'),),
 (('localityName', 'Chicago'),),
 (('streetAddress', '135 S La Salle St'),),
 (('organizationName', 'Bank of America Corporation'),),
 (('organizationalUnitName', 'Network Infrastructure'),),
 (('commonName', 'www.bankofamerica.com'),)),
 'subjectAltName': (('DNS', 'mobile.bankofamerica.com'),
('DNS', 'www.bankofamerica.com')),
 'version': 3}

   How about just returning ALL the remaining fields and finishing
the job?  Thanks.

John Nagle
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Show off your Python chops and compete with others

2013-11-06 Thread John Nagle
On 11/6/2013 5:04 PM, Chris Angelico wrote:
> On Thu, Nov 7, 2013 at 11:00 AM, Nathaniel Sokoll-Ward
>  wrote:
>> Thought this group would appreciate this: 
>> www.metabright.com/challenges/python
>>
>> MetaBright makes skill assessments to measure how talented people are at 
>> different skills. And recruiters use MetaBright to find outrageously skilled 
>> job candidates.

   With tracking cookies blocked, you get 0 points.

John Nagle
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Python Front-end to GCC

2013-10-26 Thread John Nagle
On 10/25/2013 12:18 PM, Mark Janssen wrote:
>> As for the hex value for Nan who really gives a toss?  The whole point is
>> that you initialise to something that you do not expect to see.  Do you not
>> have a text book that explains this concept?
> 
> No, I don't think there is a textbook that explains such a concept of
> initializing memory to anything but 0 -- UNLESS you're from Stupid
> University.
> 
> Thanks for providing fodder...
> 
> Mark Janssen, Ph.D.
> Tacoma, WA

What a mess of a discussion.

First off, this is mostly a C/C++ issue, not a Python issue,
because Python generally doesn't let you see uninitialized memory.

Second, filling newly allocated memory with an illegal value
is a classic debugging technique.  Visual C/C++ uses it
when you build in debug mode.  Wikipedia has an explanation:

http://en.wikipedia.org/wiki/Magic_number_%28programming%29#Magic_debug_values

Microsoft Visual C++ uses 0xBAADF00D.  In valgrind, there's
a "-malloc-fill" option, and you can specify a hex value.

There's a performance penalty for filling large areas of memory
so it's usually done in debug mode only, and is sometimes causes
programs with bugs to behave differently when built in debug
vs. release mode.

Sigh.

John Nagle





-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Global Variable In Multiprocessing

2013-10-23 Thread John Nagle
On 10/22/2013 11:22 PM, Chandru Rajendran wrote:
>  CAUTION - Disclaimer *
> This e-mail contains PRIVILEGED AND CONFIDENTIAL INFORMATION intended solely 
> for the use of the addressee(s). If you are not the intended recipient, 
> please 
> notify the sender by e-mail and delete the original message. Further, you are 
> not 
> to copy, disclose, or distribute this e-mail or its contents to any other 
> person and 
> any such actions are unlawful. This e-mail may contain viruses. Infosys has 
> taken 
> every reasonable precaution to minimize this risk, but is not liable for any 
> damage 
> you may sustain as a result of any virus in this e-mail. You should carry out 
> your 
> own virus checks before opening the e-mail or attachment. Infosys reserves 
> the 
> right to monitor and review the content of all messages sent to or from this 
> e-mail 
> address. Messages sent to or from this e-mail address may be stored on the 
> Infosys e-mail system.
> ***INFOSYS End of Disclaimer INFOSYS***

Because of the above restriction, we are unable to reply to your
question.

John Nagle
SiteTruth


-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Python Front-end to GCC

2013-10-23 Thread John Nagle
On 10/23/2013 12:25 AM, Philip Herron wrote:
> On Wednesday, 23 October 2013 07:48:41 UTC+1, John Nagle  wrote:
>> On 10/20/2013 3:10 PM, victorgarcia...@gmail.com wrote:
>> 
>>> On Sunday, October 20, 2013 3:56:46 PM UTC-2, Philip Herron
>>> wrote:
> Nagle replies:
>> 
>>>> Documentation can be found
>>>> http://gcc.gnu.org/wiki/PythonFrontEnd.
...
> 
> I think your analysis is probably grossly unfair for many reasons.
> But your entitled to your opinion.
> 
> Current i do not use Bohem-GC (I dont have one yet), 

You included it in your project:

http://sourceforge.net/p/gccpy/code/ci/master/tree/boehm-gc


> i re-use
> principles from gccgo in the _compiler_ not the runtime. At runtime
> everything is a gpy_object_t, everything does this. Yeah you could do
> a little of dataflow analysis for some really really specific code
> and very specific cases and get some performance gains. But the
> problem is that the libpython.so it was designed for an interpreter.
> 
> So first off your comparing a project done on my own to something
> like cPython loads of developers 20 years on my project or something
> PyPy has funding loads of developers.
> 
> Where i speed up is absolutely no runtime lookups on data access.
> Look at cPython its loads of little dictionaries. All references are
> on the Stack at a much lower level than C. All constructs are
> compiled in i can reuse C++ native exceptions in the whole thing. I
> can hear you shouting at the email already but the middle crap that a
> VM and interpreter have to do and fast lookup is _NOT_ one of them.
> If you truely understand how an interpreter works you know you cant
> do this
> 
> Plus your referencing really old code on sourceforge is another
> thing.

That's where you said to look:

http://gcc.gnu.org/wiki/PythonFrontEnd

"To follow gccpy development see: Gccpy SourceForge
https://sourceforge.net/projects/gccpy"

> And i dont want to put out bench marks (I would get so much
> shit from people its really not worth it) but it i can say it is
> faster than everything in the stuff i compile so far. So yeah... not
> only that but your referncing a strncmp to say no its slow yeah it
> isn't 100% ideal but in my current git tree i have changed that. 

So the real source code isn't where you wrote that it is?
Where is it, then?

> So i
> think its completely unfair to reference tiny things and pretend you
> know everything about my project.

If you wrote more documentation about what you're doing,
people might understand what you are doing.

> One thing people might find interesting is class i do data flow
> analysis to generate a complete type for that class and each member
> function is a compiled function like C++ but at a much lower level
> than C++.

It's not clear what this means.  Are you trying to determine, say,
which items are integers, lists, or specific object types?
Shed Skin tries to do that.  It's hard to do, but very effective
if you can do it.  In CPython, every time "x = a + b" is
executed, the interpreter has to invoke the general case for
"+", which can handle integers, floats, strings, NumPy, etc.
If you can infer types, and know it's a float, the run
time code can be float-specific and about three machine
instructions.
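A Python-level caricature of the difference (the names here are illustrative, not CPython's actual C code):

```python
# Illustrative only: a Python-level picture of generic "+" dispatch
# versus what type-specialized code gets to skip.

def generic_add(a, b):
    # Roughly what a generic BINARY_ADD must do on every execution:
    # try the left operand's type, then fall back to the right
    # operand's reflected method.  (CPython does this in C, with
    # more cases.)
    result = type(a).__add__(a, b)
    if result is NotImplemented:
        result = type(b).__radd__(b, a)
    return result

def float_add(a, b):
    # What a compiler that has inferred "both operands are floats"
    # can emit: a bare add, no dispatch at all.
    return a + b
```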

> The whole project has been about stripping out the crap
> needed to run user code and i have been successful so far but your
> comparing a in my spare time project to people who work on their
> stuff full time. With loads of people etc.

Shed Skin is one guy.

> Anyways i am just going to stay out of this from now but your email
> made me want to reply and rage.

You've made big claims without giving much detail.  So people
are trying to find out if you've done something worth paying
attention to.

John Nagle



-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Python Front-end to GCC

2013-10-22 Thread John Nagle
On 10/20/2013 3:10 PM, victorgarcia...@gmail.com wrote:
> On Sunday, October 20, 2013 3:56:46 PM UTC-2, Philip Herron wrote:
>> I've been working on GCCPY since roughly november 2009 at least in its
>> concept. It was announced as a Gsoc 2010 project and also a Gsoc 2011
>> project. I was mentored by Ian Taylor who has been an extremely big
>> influence on my software development carrer.
> 
> Cool!
> 
>> Documentation can be found http://gcc.gnu.org/wiki/PythonFrontEnd.
>> (Although this is sparse partialy on purpose since i do not wan't
>> people thinking this is by any means ready to compile real python
>> applications)
> 
> Is there any document describing what it can already compile and, if 
> possible, showing some benchmarks?

After reading through a vast amount of drivel below on irrelevant
topics, looking at the nonexistent documentation, and finally reading
some of the code, I think I see what's going on here.  Here's
the run-time code for integers:

http://sourceforge.net/p/gccpy/code/ci/master/tree/libgpython/runtime/gpy-object-integer.c

   The implementation approach seems to be that, at runtime,
everything is a struct which represents a general Python object.
The compiler is, I think, just cranking out general subroutine
calls that know nothing about type information. All the
type handling is at run time.  That's basically what CPython does,
by interpreting a pseudo-instruction set to decide which
subroutines to call.

   It looks like integers and lists have been implemented, but
not much else.  Haven't found source code for strings yet.
Memory management seems to rely on the Boehm garbage collector.
Much code seems to have been copied over from the GCC library
for Go. Go, though, is strongly typed at compile time.

   There's no inherent reason this "compiled" approach couldn't work,
but I don't know if it actually does. The performance has to be
very low.  Each integer add involves a lot of code, including two calls
of "strcmp (x->identifier, "Int")".  A performance win over CPython
is unlikely.

   Compare Shed Skin, which tries to infer the type of Python
objects so it can generate efficient type-specific C++ code.  That's
much harder to do, and has trouble with very dynamic code, but
what comes out is fast.

John Nagle
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Complex literals (was Re: I am never going to complain about Python again)

2013-10-15 Thread John Nagle
On 10/10/2013 6:27 PM, Steven D'Aprano wrote:
> For what it's worth, there is no three-dimensional extension to complex 
> numbers, but there is a four-dimensional one, the quaternions or 
> hypercomplex numbers. They look like 1 + 2i + 3j + 4k, where i, j and k 
> are all distinct but i**2 == j**2 == k**2 == -1. Quaternions had a brief 
> period of popularity during the late 19th century but fell out of 
> popularity in the 20th. In recent years, they're making something of a 
> comeback, as using quaternions for calculating rotations is more 
> numerically stable than traditional matrix calculations.

I've done considerable work with quaternions in physics engines
for simulation.  Nobody in that area calls them "hypercomplex numbers".
The geometric concept is simple.  Consider an angle represented
as a 2-element unit vector.  It's a convenient angle representation.
It's homogeneous - there's no special case at 0 degrees.

Then upgrade to 3D.  You can represent latitude and longitude
as a 3-element unit vector.  (GPS systems do this; latitude and
longitude are only generated at the end, for output.)

Then upgrade to 4D.  Now you have a 4-element unit vector
that represents latitude, longitude, and heading. It can
also be thought of as a point on the surface of a 4D sphere,
although that isn't too useful.

If you have to numerically integrate torques to get
angular velocity, and angular velocity to get angular position,
quaternions are the way to go.  If you want to understand
all this, there's a good writeup in one of the Graphics Gems
books.

Unlike complex numbers, these quaternions are always unit vectors.
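A minimal sketch of the unit-quaternion arithmetic involved (Hamilton product in (w, x, y, z) order; renormalizing after each multiply is what keeps the numerical integration stable):

```python
import math

def qmul(a, b):
    # Hamilton product of two quaternions (w, x, y, z).
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (aw*bw - ax*bx - ay*by - az*bz,
            aw*bx + ax*bw + ay*bz - az*by,
            aw*by - ax*bz + ay*bw + az*bx,
            aw*bz + ax*by - ay*bx + az*bw)

def qnormalize(q):
    # Pull the quaternion back onto the unit 4-sphere.
    n = math.sqrt(sum(c*c for c in q))
    return tuple(c/n for c in q)

# A 90-degree rotation about Z, applied twice, is 180 degrees about Z:
half = math.radians(90) / 2.0
qz90 = (math.cos(half), 0.0, 0.0, math.sin(half))
qz180 = qnormalize(qmul(qz90, qz90))
```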

John Nagle
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: PID tuning.

2013-10-15 Thread John Nagle
On 10/14/2013 2:03 PM, Ben Finney wrote:
> Renato Barbosa Pim Pereira 
> writes:
> 
>> I am looking for some software for PID tuning that would take the
>> result of a step response, and calculates Td, Ti, Kp, any suggestion
>> or hint of where to start?, thanks.
> 
> Is this related to Python? What is “PID tuning”, and what have you
> tried already?

See
"http://sts.bwk.tue.nl/7y500/readers/.%5CInstellingenRegelaars_ExtraStof.pdf"

You might also try the OpenHRP3 forums.

John Nagle


-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Python was designed (was Re: Multi-threading in Python vs Java)

2013-10-14 Thread John Nagle
On 10/12/2013 3:37 PM, Chris Angelico wrote:
> On Sat, Oct 12, 2013 at 7:10 AM, Peter Cacioppi 
>  wrote:
>> Along with "batteries included" and "we're all adults", I think
>> Python needs a pithy phrase summarizing how well thought out it is.
>> That is to say, the major design decisions were all carefully
>> considered, and as a result things that might appear to be
>> problematic are actually not barriers in practice. My suggestion
>> for this phrase is "Guido was here".
> 
> "Designed".
> 
> You simply can't get a good clean design if you just let it grow by 
> itself, one feature at a time.

No, Python went through the usual design screwups.  Look at how
painful the slow transition to Unicode was, from just "str" to
Unicode strings, ASCII strings, byte strings, byte arrays,
16 and 31 bit character builds, and finally automatic switching
between rune widths. Old-style classes vs. new-style classes.  Adding a
boolean type as an afterthought (that was avoidable; C went through
that painful transition before Python was created).  Operator "+"
as concatenation for built-in arrays but addition for NumPy
arrays.

Each of those reflects a design error in the type system which
had to be corrected.

The type system is now in good shape. The next step is to
make Python fast.  Python objects have dynamic operations suited
to a naive interpreter like CPython.  These make many compile
time optimizations hard. At any time, any thread can monkey-patch
any code, object, or variable in any other thread.  The ability
for anything to use "setattr()" on anything carries a high
performance price.  That's part of why Unladen Swallow failed
and why PyPy development is so slow.

John Nagle
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: web scraping

2013-10-13 Thread John Nagle
On 10/12/2013 1:35 PM, dvgh...@gmail.com wrote:
> On Saturday, October 12, 2013 7:12:38 AM UTC-7, Ronald Routt wrote:
>> I am new to programming and trying to figure out python.
>> 
>> 
>> 
>> I am trying to learn which tools and tutorials I need to use along
>> with some good beginner tutorials in scraping the the web.  The end
>> result I am trying to come up with is scraping auto dealership
>> sites for the following:
>> 
>> 1.Name of dealership 
>> 2.  State where dealership is located 
>> 3.  Name of Owner, President or General Manager 
>> 4.  Email address of number 3 above
>> 5.  Phone number of dealership

   If you really want that data, and aren't just hacking, buy it.
There are data brokers that will sell it to you. D&B, FindTheCompany,
Infot, etc.

   Sounds like you want to spam. Don't.

John Nagle
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: What version of glibc is Python using?

2013-10-13 Thread John Nagle
On 10/12/2013 1:28 PM, Ian Kelly wrote:
> Reading the docs more closely, I think that the function is actually
> working as intended.  It says that it determines "the libc version
> against which the file executable (defaults to the Python interpreter)
> is linked" -- or in other words, the minimum compatible libc version,
> NOT the libc version that is currently loaded.

   A strange interpretation.

> So I think that a patch to replace this with gnu_get_libc_version()
> should be rejected, since it would change the documented behavior of
> the function.  It may be worth considering adding an additional
> function that matches the OP's expectations, but since it would just
> be a simple ctypes wrapper it is probably best done by the user.

   Ah, the apologist approach.

   The documentation is badly written.  The next line,
"Note that this function has intimate knowledge of how different libc
versions add symbols to the executable is probably only usable for
executables compiled using gcc" isn't even a sentence.

   The documentation needs to be updated.  Please submit a patch.

John Nagle


-- 
https://mail.python.org/mailman/listinfo/python-list


Re: What version of glibc is Python using?

2013-10-13 Thread John Nagle
On 10/12/2013 4:43 AM, Ian Kelly wrote:
> On Sat, Oct 12, 2013 at 2:46 AM, Terry Reedy  wrote:
>> On 10/12/2013 3:53 AM, Christian Gollwitzer wrote:
>>>
>>> That function is really bogus. It states itself, that it has "intimate
>>> knowledge of how different libc versions add symbols to the executable
>>> and thus is probably only useable for executables compiled using gcc"
>>> which is just another way of saying "it'll become outdated and broken
>>> soon". It's not even done by reading the symbol table, it opens the
>>> binary and matches a RE *shocked* I would have expected such hacks in a
>>> shell script.
>>>
>>> glibc has a function for this:
>>>
>>>  gnu_get_libc_version ()
>>>
>>> which should be used.
>>
>>
>> So *please* submit a patch with explanation.
> 
> Easier said than done.  The module is currently written in pure
> Python, and the comment "Note: Please keep this module compatible to
> Python 1.5.2" would appear to rule out the use of ctypes to call the
> glibc function.  I wonder though whether that comment is really still
> appropriate.

   What a mess.  It only "works" on Linux,
it only works with GCC, and there it returns bogus results.

   Amusingly, there was a fix in 2011 to speed up
platform.libc_ver () by having it read bigger blocks:

http://code.activestate.com/lists/python-checkins/100109/

It still got the wrong answer, but it's faster.

There's a bug report that it doesn't work right on Solaris:

http://comments.gmane.org/gmane.comp.python.gis/870

It fails on Cygwin ("wontfix")
http://bugs.python.org/issue928297

The result under GenToo is bogus:

http://archives.gentoo.org/gentoo-user/msg_b676eccb5fc00cb051b7423db1b5a9f7.xml

There are several programs which fetch this info and
display it, or send it in with crash reports, but
I haven't found any that actually use the result
for anything.  I'd suggest deprecating it and
documenting that.

John Nagle



-- 
https://mail.python.org/mailman/listinfo/python-list


Re: What version of glibc is Python using?

2013-10-12 Thread John Nagle
On 10/11/2013 11:50 PM, Christian Gollwitzer wrote:
> Am 12.10.13 08:34, schrieb John Nagle:
>> I'm trying to find out which version of glibc Python is using.
>> I need a fix that went into glibc 2.10 back in 2009.
>> (http://udrepper.livejournal.com/20948.html)
>>
>> So I try the recommended way to do this, on a CentOS server:
>>
>> /usr/local/bin/python2.7
>> Python 2.7.2 (default, Jan 18 2012, 10:47:23)
>> [GCC 4.4.6 20110731 (Red Hat 4.4.6-3)] on linux2
>> Type "help", "copyright", "credits" or "license" for more information.
>>>>> import platform
>>>>> platform.libc_ver()
>> ('glibc', '2.3')
> 
> Try
> 
> ldd /usr/local/bin/python2.7
> 
> Then execute the reported libc.so, which gives you some information.
> 
> Christian
> 
Thanks for the quick reply. That returned:

 /lib64/libc.so.6
GNU C Library stable release version 2.12, by Roland McGrath et al.
Copyright (C) 2010 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.
Compiled by GNU CC version 4.4.6 20110731 (Red Hat 4.4.6-3).
Compiled on a Linux 2.6.32 system on 2011-12-06.
Available extensions:
The C stubs add-on version 2.1.2.
crypt add-on version 2.1 by Michael Glad and others
GNU Libidn by Simon Josefsson
Native POSIX Threads Library by Ulrich Drepper et al
BIND-8.2.3-T5B
RT using linux kernel aio
libc ABIs: UNIQUE IFUNC
For bug reporting instructions, please see:
<http://www.gnu.org/software/libc/bugs.html>.

Much more helpful.  I have a good version of libc, and
can now work on my DNS resolver problem.

Why is the info from "plaform.libc_ver()" so bogus?

John Nagle

-- 
https://mail.python.org/mailman/listinfo/python-list


What version of glibc is Python using?

2013-10-11 Thread John Nagle
I'm trying to find out which version of glibc Python is using.
I need a fix that went into glibc 2.10 back in 2009.
(http://udrepper.livejournal.com/20948.html)

So I try the recommended way to do this, on a CentOS server:

/usr/local/bin/python2.7
Python 2.7.2 (default, Jan 18 2012, 10:47:23)
[GCC 4.4.6 20110731 (Red Hat 4.4.6-3)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import platform
>>> platform.libc_ver()
('glibc', '2.3')

This is telling me that the Python distribution built in 2012,
with a version of GCC released April 16, 2011, is using
glibc 2.3, released in October 2002.  That can't be right.

I tried this on a different Linux machine, a desktop running
Ubuntu 12.04 LTS:

Python 2.7.3 (default, Apr 10 2013, 06:20:15)
[GCC 4.6.3] on linux2
('glibc', '2.7')

That version of glibc is from October 2007.

Where are these ancient versions coming from?  They're
way out of sync with the GCC version.

John Nagle
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Applying 4x4 transformation to 3-element vector with numpy

2013-10-08 Thread John Nagle
On 10/8/2013 10:36 PM, Christian Gollwitzer wrote:
> Dear John,
> 
> Am 09.10.13 07:28, schrieb John Nagle:
>> This is the basic transformation of 3D graphics.  Take
>> a 3D point, make it 4D by adding a 1 on the end, multiply
>> by a transformation matrix to get a new 4-element vector,
>> discard the last element.
>>
>> Is there some way to do that in numpy without
>> adding the extra element and then discarding it?
>>
> 
> if you can discard the last element, the matrix has a special structure:
> It is an affine transform, where the last row is unity, and it can be
> rewritten as
> 
> A*x+b
> 
> where A is the 3x3 upper left submatrix and b is the column vector. You
> can do this by simple slicing - with C as the 4x4 matrix it is something
> like
> 
> dot(C[0:3, 0:3], x) + C[3, 0:3]
> 
> (untested, you need to check if I got the indices right)
> 
> *IF* however, your transform is perspective, then this is incorrect -
> you must divide the result vector by the last element before discarding
> it, if it is a 3D-point. For a 3D-vector (enhanced by a 0) you might
> still find a shortcut.

I only need affine transformations.  This is just moving
the coordinate system of a point, not perspective rendering.
I have to do this for a lot of points, and I'm hoping numpy
has some way to do this without generating extra garbage on the
way in and the way out.

I've done this before in C++.
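For reference, the affine shortcut Christian describes can be vectorized over many points at once, with no padding round-trip; the matrix and points below are made-up examples, and the translation lives in C[:3, 3] under the column-vector convention (new = C @ [x, y, z, 1]):

```python
import numpy as np

# Hypothetical 4x4 affine transform: rotation/scale in the upper-left
# 3x3 block, translation in the last column (column-vector convention).
C = np.eye(4)
C[:3, 3] = [10.0, 20.0, 30.0]           # pure translation, for illustration

pts = np.array([[1.0, 2.0, 3.0],
                [4.0, 5.0, 6.0]])       # N x 3 points, one per row

# Same result as pad-with-1 / multiply / drop-last, with no temporaries:
out = pts.dot(C[:3, :3].T) + C[:3, 3]

# Cross-check one point against the explicit homogeneous version:
ref = C.dot(np.append(pts[0], 1.0))[:3]
assert np.allclose(out[0], ref)
print(out)
```

With row vectors and x @ C instead, the translation would come from C[3, :3]; checking which convention the matrices use is the main thing to get right here.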

John Nagle


-- 
https://mail.python.org/mailman/listinfo/python-list


Applying 4x4 transformation to 3-element vector with numpy

2013-10-08 Thread John Nagle
   This is the basic transformation of 3D graphics.  Take
a 3D point, make it 4D by adding a 1 on the end, multiply
by a transformation matrix to get a new 4-element vector,
discard the last element.

   Is there some way to do that in numpy without
adding the extra element and then discarding it?

John Nagle
-- 
https://mail.python.org/mailman/listinfo/python-list


Python FTP timeout value not effective

2013-09-02 Thread John Nagle
I'm reading files from an FTP server at the U.S. Securities and
Exchange Commission.  This code has been running successfully for
years.  Recently, they imposed a consistent connection delay
of 20 seconds at FTP connection, presumably because they're having
some denial of service attack.  Python 2.7 urllib2 doesn't
seem to use the timeout specified.  After 20 seconds, it
gives up and times out.

Here's the traceback:

Internal error in EDGAR update: 

  File "./edgar/edgarnetutil.py", line 53, in urlopen
return(urllib2.urlopen(url,timeout=timeout))
  File "/opt/python27/lib/python2.7/urllib2.py", line 126, in urlopen
return _opener.open(url, data, timeout)
  File "/opt/python27/lib/python2.7/urllib2.py", line 394, in open
response = self._open(req, data)
  File "/opt/python27/lib/python2.7/urllib2.py", line 412, in _open
'_open', req)
  File "/opt/python27/lib/python2.7/urllib2.py", line 372, in _call_chain
result = func(*args)
  File "/opt/python27/lib/python2.7/urllib2.py", line 1379, in ftp_open
fw = self.connect_ftp(user, passwd, host, port, dirs, req.timeout)
  File "/opt/python27/lib/python2.7/urllib2.py", line 1400, in connect_ftp
fw = ftpwrapper(user, passwd, host, port, dirs, timeout)
  File "/opt/python27/lib/python2.7/urllib.py", line 860, in __init__
self.init()
  File "/opt/python27/lib/python2.7/urllib.py", line 866, in init
self.ftp.connect(self.host, self.port, self.timeout)
  File "/opt/python27/lib/python2.7/ftplib.py", line 132, in connect
self.sock = socket.create_connection((self.host, self.port),
self.timeout)
  File "/opt/python27/lib/python2.7/socket.py", line 571, in create_connection
raise err
URLError: 

Periodic update completed in 21.1 seconds.
--

Here's the relevant code:

TIMEOUTSECS = 60    ## give up waiting for server after 60 seconds
...
def urlopen(url, timeout=TIMEOUTSECS) :
    if url.endswith(".gz") :                # gzipped file, must decompress first
        nd = urllib2.urlopen(url, timeout=timeout)      # get connection
        ...                                 # (NOT .gz FILE, DOESN'T TAKE THIS PATH)
    else :
        return(urllib2.urlopen(url, timeout=timeout))   # (OPEN FAILS)


TIMEOUTSECS used to be 20 seconds, and I increased it to 60. It didn't
help.

This isn't an OS problem. The above traceback was on a Linux system.
On Windows 7, it fails with

"URLError: "

But in both cases, the command line FTP client will work, after a
consistent 20 second delay before the login prompt.  So the
Python timeout parameter isn't working.
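As a diagnostic sketch (not a verified fix), the FTP fetch can be driven with ftplib directly, which takes the timeout explicitly at connect time and takes urllib2 out of the equation entirely; the host and path arguments below are placeholders:

```python
import ftplib
import io

def ftp_fetch(host, path, timeout=60):
    """Fetch one file over anonymous FTP with an explicit timeout,
    bypassing urllib2 (a sketch; host/path are placeholders)."""
    ftp = ftplib.FTP()
    ftp.connect(host, 21, timeout=timeout)  # timeout covers connect and later ops
    ftp.login()                             # anonymous login
    buf = io.BytesIO()
    ftp.retrbinary("RETR " + path, buf.write)
    ftp.quit()
    return buf.getvalue()
```

If this honors the 60-second timeout while urllib2 does not, that localizes the problem to urllib2's FTP plumbing rather than the socket layer.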

John Nagle



-- 
http://mail.python.org/mailman/listinfo/python-list


Re: [RELEASED] Python 2.7.5

2013-06-03 Thread John Nagle
On 5/15/2013 9:19 PM, Benjamin Peterson wrote:
> It is my greatest pleasure to announce the release of Python 2.7.5.
> 
> 2.7.5 is the latest maintenance release in the Python 2.7 series.

Thanks very much.  It's important that Python 2.x be maintained.

3.x is a different language, with different libraries, and lots of
things that still don't work.  Many old applications will never
be converted.

    John Nagle
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Getting USB volume serial number from inserted device on OSX

2013-04-04 Thread John Nagle
On 4/2/2013 3:18 PM, Sven wrote:
> Hello,
> 
> I am using Python 2.7 with pyobjc on Lion and NSNotification center to
> monitor any inserted USB volumes. This works fine.
> 
> I've also got some ideas how to get a device's serial number, but these
> involve just parsing all the USB devices ('system_profiler SPUSBDataType'
> command). However I'd like to specifically query the inserted device only
> (the one creating the notification) rather than checking the entire USB
> device list. The latter seems a little inefficient and "wrong".

   That would be useful to have as a portable function for all USB
devices.  Serial port devices are particularly annoying, because their
port number is somewhat random when there's more than one, and changes
on hot-plugging.

John Nagle
-- 
http://mail.python.org/mailman/listinfo/python-list


Distributing a Python program hell

2013-04-03 Thread John Nagle
I'm struggling with radio hams who are trying to get my
antique Teletype program running.  I hate having to write
instructions like this:

  Installation instructions (Windows):

  Download and install Python 2.7 (32-bit) if not already installed.
  (Python 2.6 or 2.7 is required; "pyserial" will not work correctly on
  older versions, and "feedparser" is not supported in 3.x versions.)

  Install the Python module "setuptools" from the Python Package Index.
  (Needed by other installers. Has a Windows installer.)

  Install the Python module "feedparser" from Google Code.
  (Unpack ZiP  file, run "setup.py install")

  Install the Python module "pyserial" from SourceForge.
  (Windows installer, but 32-bit only)

  Install the Python module "pygooglevoice" from Google Code.
  (Requires 7Zip to unpack the .tar.gz file. Then "setup.py install")

  Download "BaudotRSS" from SourceForge. (ZIP file, put in your
  chosen directory for this program.)

  Run: python baudotrss.py --help

I'm thinking of switching to Go.
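For what it's worth, most of these manual steps could be collapsed into a pip requirements file plus one command (a sketch; the version pins are assumptions, and pygooglevoice was not on PyPI at the time, so it would still need manual handling):

```text
# requirements.txt (hypothetical)
pyserial>=2.6
feedparser>=5.1
```

Then "pip install -r requirements.txt" replaces the per-package installers.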

John Nagle
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Unhelpful traceback

2013-03-07 Thread John Nagle
On 3/7/2013 10:42 AM, John Nagle wrote:
> On 3/7/2013 5:10 AM, Dave Angel wrote:
>> On 03/07/2013 01:33 AM, John Nagle wrote:
>>> Here's a traceback that's not helping:
>>>
>>
>> A bit more context would be helpful.  Starting with Python version.
> 
> Sorry, Python 2.7.

The trouble comes from here:

decoder = codecs.getreader('utf-8')  # UTF-8 reader
with decoder(infdraw,errors="replace") as infd :

It's not the CSV module that's blowing up.  If I just feed the
raw unconverted bytes from the ZIP module into the CSV module,
the CSV module runs without complaint.

I've tried 'utf-8', 'ascii', and 'windows-1252' as codecs.
They all blow up. 'errors="replace"' doesn't help.

John Nagle
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Unhelpful traceback

2013-03-07 Thread John Nagle
On 3/7/2013 5:10 AM, Dave Angel wrote:
> On 03/07/2013 01:33 AM, John Nagle wrote:

>>
>> "infdraw" is a stream from the zip module, create like this:
>>
>>  with inzip.open(zipelt.filename,"r") as infd :
> 
> You probably need a 'rb' rather than 'r', since the file is not ASCII.
> 
>>  self.dofilecsv(infile, infd)
>>
>> This works for data records that are pure ASCII, but as soon as some
>> non-ASCII character comes through, it fails.

   No, the ZIP module gives you back the bytes you
put in.  "rb" is not accepted there:

  File "InfoCompaniesHouse.py", line 197, in dofilezip
    with inzip.open(zipelt.filename,"rb") as infd :     # do this file
  File "C:\python27\lib\zipfile.py", line 872, in open
raise RuntimeError, 'open() requires mode "r", "U", or "rU"'
RuntimeError: open() requires mode "r", "U", or "rU"

   "b" for files is about end of line handling (CR LF -> LF), anyway.

John Nagle
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Unhelpful traceback

2013-03-07 Thread John Nagle
On 3/7/2013 5:10 AM, Dave Angel wrote:
> On 03/07/2013 01:33 AM, John Nagle wrote:
>> Here's a traceback that's not helping:
>>
> 
> A bit more context would be helpful.  Starting with Python version.

Sorry, Python 2.7.

> 
> If that isn't enough, then please give the whole context, such as where
> zipelt and filename came from.  And don't forget to specify Python
> version.  Version 3.x treats nonbinary files very differently than 2.x

 Here it is, with some email wrap problems.

John Nagle


    def dofilecsv(self, infilename, infdraw) :
        """
        Loader for Companies House company data, with files already open.
        """
        self.logger.info('Converting "%s"' % (infilename,))     # log
        (pathpart, filepart) = os.path.split(infilename)        # split off file part to construct output file
        (outfile, ext) = os.path.splitext(filepart)             # remove extension
        outfile += ".sql"                                       # add SQL suffix
        outfilename = os.path.abspath(os.path.join(self.options.destdir, outfile))
        #   ***NEED TO INSURE UNIQUE OUTFILENAME EVEN IF DUPLICATED IN ZIP FILES***
        decoder = codecs.getreader('utf-8')                     # UTF-8 reader
        with decoder(infdraw, errors="replace") as infd :
            with codecs.open(outfilename, encoding='utf-8', mode='w') as outfd :
                headerline = infd.readline()                    # read header line
                self.doheaderline(headerline)                   # process header line
                reader = csv.reader(infd, delimiter=',', quotechar='"')     # CSV file
                for fields in reader :                          # read entire CSV file
                    self.doline(outfd, fields)                  # copy fields
        self.logstats(infilename)                               # log statistics of this file

    def dofilezip(self, infilename) :
        """
        Do a ZIP file containing CSV files.
        """
        try :
            inzip = zipfile.ZipFile(infilename, "r", allowZip64=True)   # try to open
            zipdir = inzip.infolist()                           # get objects in file
            for zipelt in zipdir :                              # for all objects in file
                self.logger.debug('ZIP file "%s" contains "%s".' % (infilename, zipelt.filename))
                (infile, ext) = os.path.splitext(zipelt.filename)   # remove extension
                if ext.lower() == ".csv" :                      # if a CSV file
                    with inzip.open(zipelt.filename, "r") as infd :     # do this file
                        self.dofilecsv(infile, infd)            # as a CSV file
                else :
                    self.logger.error('Non-CSV file in ZIP file: "%s"' % (zipelt.filename,))
                    self.errorcount += 1                        # tally

        except zipfile.BadZipfile as message :                  # if trouble
            self.logger.error('Bad ZIP file: "%s"' % (infilename,))     # note trouble
            self.errorcount += 1                                # tally

    def dofile(self, infilename) :
        """
        Loader for Companies House company data
        """
        (sink, ext) = os.path.splitext(infilename)              # get extension
        if ext == ".zip" :                                      # if .ZIP file
            self.dofilezip(infilename)                          # do ZIP file
        elif ext == ".csv" :
            self.logger.info('Converting "%s"' % (infilename,)) # log
            with open(infilename, "rb") as infd :
                self.dofilecsv(infilename, infd)                # do CSV file
            self.logstats(infilename)                           # log statistics of this file
        else :
            self.logger.error('File of unexpected type (not .csv or .zip): %s ' % (infilename,))
            self.errorcount += 1



-- 
http://mail.python.org/mailman/listinfo/python-list


Unhelpful traceback

2013-03-06 Thread John Nagle
Here's a traceback that's not helping:

Traceback (most recent call last):
  File "InfoCompaniesHouse.py", line 255, in <module>
    main()
  File "InfoCompaniesHouse.py", line 251, in main
    loader.dofile(infile)           # load this file
  File "InfoCompaniesHouse.py", line 213, in dofile
    self.dofilezip(infilename)      # do ZIP file
  File "InfoCompaniesHouse.py", line 198, in dofilezip
    self.dofilecsv(infile, infd)    # as a CSV file
  File "InfoCompaniesHouse.py", line 182, in dofilecsv
    for fields in reader :          # read entire CSV file
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in
position 14: ordinal not in range(128)

This is weird, because "for fields in reader" isn't directly
doing a decode. That's further down somewhere, and the backtrace
didn't tell me where.

The program is converting some .CSV files that come packaged in .ZIP
files.  The files are big, so rather than expanding them, they're
read directly from the ZIP files and processed through the ZIP
and CSV modules.

Here's the code that's causing the error above:

decoder = codecs.getreader('utf-8')
with decoder(infdraw, errors="replace") as infd :
    with codecs.open(outfilename, encoding='utf-8', mode='w') as outfd :
        headerline = infd.readline()
        self.doheaderline(headerline)
        reader = csv.reader(infd, delimiter=',', quotechar='"')
        for fields in reader :
            pass

Normally, the "pass" is a call to something that
uses the data, but for test purposes, I put a "pass" in there.  It still
fails.   With that "pass", nothing is ever written to the
output file, and no "encoding" should be taking place.

"infdraw" is a stream from the zip module, create like this:

with inzip.open(zipelt.filename,"r") as infd :
self.dofilecsv(infile, infd)

This works for data records that are pure ASCII, but as soon as some
non-ASCII character comes through, it fails.

Where is the error being generated?  I'm not seeing any place
where there's a conversion to ASCII.  Not even a print.
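For reference, the Python 2 csv module is byte-oriented: handing it unicode lines (which the codecs reader produces) makes it coerce each line back through the default ASCII codec inside the reader's iterator, which is exactly where the traceback points. A sketch of the working arrangement under Python 3, where io.TextIOWrapper does the decoding and csv consumes text (the archive name and fields below are made up):

```python
import csv
import io
import zipfile

# Build a small in-memory ZIP holding a UTF-8 CSV member (made-up fixture).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("data.csv", u"name,price\nwidget,\u00a31.50\n".encode("utf-8"))
buf.seek(0)

with zipfile.ZipFile(buf) as zf:
    with zf.open("data.csv", "r") as raw:               # raw byte stream
        text = io.TextIOWrapper(raw, encoding="utf-8", errors="replace")
        rows = list(csv.reader(text))                   # csv consumes text

print(rows)     # [['name', 'price'], ['widget', '£1.50']]
```

Under Python 2 the equivalent fix is the opposite direction: feed csv.reader the raw UTF-8 bytes and decode each field after parsing, as the csv module's own documentation recommends.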

John Nagle




-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Parsing ISO date/time strings - where did the parser go?

2012-09-08 Thread John Nagle
On 9/8/2012 5:20 PM, John Gleeson wrote:
> 
> On 2012-09-06, at 2:34 PM, John Nagle wrote:
>>  Yes, it should.  There's no shortage of implementations.
>> PyPi has four.  Each has some defect.
>>
>>   PyPi offers:
>>
>> iso8601 0.1.4 Simple module to parse ISO 8601 dates
>> iso8601.py 0.1dev Parse utilities for iso8601 encoding.
>> iso8601plus 0.1.6 Simple module to parse ISO 8601 dates
>> zc.iso8601 0.2.0 ISO 8601 utility functions
> 
> 
> Here are three more on PyPI you can try:
> 
> iso-8601 0.2.3   Flexible ISO 8601 parser...
> PySO8601 0.1.7   PySO8601 aims to parse any ISO 8601 date...
> isodate 0.4.8An ISO 8601 date/time/duration parser and formater
> 
> All three have been updated this year.

   There's another one inside feedparser, and there used to be
one in the xml module.

   Filed issue 15873: "datetime" cannot parse ISO 8601 dates and times
http://bugs.python.org/issue15873

   This really should be handled in the standard library, instead of
everybody rolling their own, badly.  Especially since in Python 3.x,
there's finally a useful "tzinfo" subclass for fixed time zone
offsets.  That provides a way to directly represent ISO 8601 date/time
strings with offsets as "time zone aware" date time objects.

John Nagle
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Parsing ISO date/time strings - where did the parser go?

2012-09-06 Thread John Nagle
On 9/6/2012 12:51 PM, Paul Rubin wrote:
> John Nagle  writes:
>> There's an iso8601 module on PyPi, but it's abandoned; it hasn't been
>> updated since 2007 and has many outstanding issues.
> 
> Hmm, I have some code that uses ISO date/time strings and just checked
> to see how I did it, and it looks like it uses iso8601-0.1.4-py2.6.egg .
> I don't remember downloading that module (I must have done it and
> forgotten).  I'm not sure what its outstanding issues are, as it works
> ok in the limited way I use it.
> 
> I agree that this functionality ought to be in the stdlib.

   Yes, it should.  There's no shortage of implementations.
PyPi has four.  Each has some defect.

   PyPi offers:

iso8601 0.1.4   Simple module to parse ISO 8601 dates
iso8601.py 0.1dev   Parse utilities for iso8601 encoding.
iso8601plus 0.1.6   Simple module to parse ISO 8601 dates
zc.iso8601 0.2.0ISO 8601 utility functions

Unlike CPAN, PyPi has no quality control.

Looking at the first one, it's in Google Code.

http://code.google.com/p/pyiso8601/source/browse/trunk/iso8601/iso8601.py

The first bug is at line 67.  For a timestamp with a "Z"
at the end, the offset should always be zero, regardless of the default
timezone.  See "http://en.wikipedia.org/wiki/ISO_8601".
timezone.  See "http://en.wikipedia.org/wiki/ISO_8601";.
The code uses the default time zone in that case, which is wrong.
So don't call that code with your local time zone as the default;
it will return bad times.

Looking at the second one, it's on github:

https://github.com/accellion/iso8601.py/blob/master/iso8601.py

Giant regular expressions!  The code to handle the offset
is present, but it doesn't make the datetime object a
timezone-aware object.  It returns a naive object in UTC.

The third one is at

https://github.com/jimklo/pyiso8601plus

This is a fork of the first one, because the first one is abandonware.
The bug in the first one, mentioned above, isn't fixed.  However, if
a time zone is present, it does return an "aware" datetime object.

The fourth one is the Zope version.  This brings in the pytz
module, which brings in the Olson database of named time zones and
their historical conversion data. None of that information is
used, or necessary, to parse ISO dates and times.  Somebody
just wanted the pytz.fixedOffset() function, which does something
datetime already does.

(For all the people who keep saying "use strptime", that doesn't
handle time zone offsets at all.)

John Nagle


-- 
http://mail.python.org/mailman/listinfo/python-list


Parsing ISO date/time strings - where did the parser go?

2012-09-06 Thread John Nagle
In Python 2.7:

   I want to parse standard ISO date/time strings such as

2012-09-09T18:00:00-07:00

into Python "datetime" objects.  The "datetime" object offers
an output method, datetimeobj.isoformat(), but not an input
parser.  There ought to be

classmethod datetime.fromisoformat(s)

but there isn't.  I'd like to avoid adding a dependency on
a third party module like "dateutil".

The "Working with time" section of the Python wiki is so
ancient it predates "datetime", and says so.

There's an iso8601 module on PyPi, but it's abandoned; it hasn't been
updated since 2007 and has many outstanding issues.

There are mentions of "xml.utils.iso8601.parse" in
various places, but the "xml" module that comes
with Python 2.7 doesn't have xml.utils.

http://www.seehuhn.de/pages/pdate
says:

"Unfortunately there is no easy way to parse full ISO 8601 dates using
the Python standard library."

It looks like this was taken out of "xml" at some point,
but not moved into "datetime".

John Nagle
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: python 6 compilation failure on RHEL

2012-08-20 Thread John Nagle
On 8/20/2012 2:50 PM, Emile van Sebille wrote:
> On 8/20/2012 1:55 PM Walter Hurry said...
>> On Mon, 20 Aug 2012 12:19:23 -0700, Emile van Sebille wrote:
>>
>>> Package dependencies.  If the OP intends to install a package that
>>> doesn't support other than 2.6, you install 2.6.
>>
>> It would be a pretty poor third party package which specified Python 2.6
>> exactly, rather than (say) "Python 2.6 or later, but not Python 3"

After a thread of clueless replies, it's clear that nobody
responding actually read the build log.  Here's the problem:

  Failed to find the necessary bits to build these modules:
bsddb185
dl
imageop
sunaudiodev

What's wrong is that the Python 2.6 build script is looking for
some antiquated packages that aren't in a current RHEL.  Those
need to be turned off.

This is a known problem (see
http://pythonstarter.blogspot.com/2010/08/bsddb185-sunaudiodev-python-26-ubuntu.html)
but, unfortunately, the site with the patch for it
(http://www.lysium.de/sw/python2.6-disable-old-modules.patch)
is no longer in existence.  

But someone archived it on Google Code, at

http://code.google.com/p/google-earth-enterprise-compliance/source/browse/trunk/googleclient/geo/earth_enterprise/src/third_party/python/python2.6-disable-old-modules.patch

so if you apply that patch to the setup.py file for Python 2.6, that
ought to help.

You might be better off building Python 2.7, but you asked about 2.6.

John Nagle



-- 
http://mail.python.org/mailman/listinfo/python-list


Re: On-topic: alternate Python implementations

2012-08-06 Thread John Nagle
On 8/4/2012 7:19 PM, Steven D'Aprano wrote:
> On Sat, 04 Aug 2012 18:38:33 -0700, Paul Rubin wrote:
> 
>> Steven D'Aprano  writes:
>>> Runtime optimizations that target the common case, but fall back to
>>> unoptimized code in the rare cases that the optimization doesn't apply,
>>> offer the opportunity of big speedups for most code at the cost of
>>> trivial slowdowns when you do something unusual.
>>
>> The problem is you can't always tell if the unusual case is being
>> exercised without an expensive dynamic check, which in some cases must
>> be repeated in every iteration of a critical inner loop, even though it
>> turns out that the program never actually uses the unusual case.

   There are other approaches. PyPy uses two interpreters and a JIT
compiler to handle the hard cases.  When code does something unexpected
to other code, the backup interpreter is used to get control out of
the trouble spot so that the JIT compiler can then recompile the
code.  (I think; I've read the paper but haven't looked at the
internals.)

   This is hard to implement and hard to get right.

John Nagle
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Implicit conversion to boolean in if and while statements

2012-07-17 Thread John Nagle

On 7/15/2012 1:34 AM, Andrew Berg wrote:

This has probably been discussed before, but why is there an implicit
conversion to a boolean in if and while statements?

if not None:
print('hi')
prints 'hi' since bool(None) is False.

If this was discussed in a PEP, I would like a link to it. There are so
many PEPs, and I wouldn't know which ones to look through.

Converting 0 and 1 to False and True seems reasonable, but I don't see
the point in converting other arbitrary values.


   Because Boolean types were an afterthought in Python.  See PEP 285.
If a language starts out with a Boolean type, it tends towards
Pascal/Ada/Java semantics in this area.  If a language backs
into needing a Boolean type, as Python and C did, it tends to have
the somewhat weird semantics of a language which can't quite decide 
what's a Boolean.  C and C++ have the same problem, for exactly the

same reason - boolean types were an afterthought there, too.

    John Nagle
--
http://mail.python.org/mailman/listinfo/python-list


Re: How to safely maintain a status file

2012-07-09 Thread John Nagle

On 7/8/2012 2:52 PM, Christian Heimes wrote:

You are contradicting yourself. Either the OS is providing a fully
atomic rename or it doesn't. All POSIX compatible OS provide an atomic
rename functionality that renames the file atomically or fails without
losing the target side. On POSIX OS it doesn't matter if the target exists.


Rename on some file system types (particularly NFS) may not be atomic.


You don't need locks or any other fancy stuff. You just need to make
sure that you flush the data and metadata correctly to the disk and
force a re-write of the directory inode, too. It's a standard pattern on
POSIX platforms and well documented in e.g. the maildir RFC.

You can use the same pattern on Windows but it doesn't work as well.


  That's because you're using the wrong approach. See how to use
ReplaceFile under Win32:

http://msdn.microsoft.com/en-us/library/aa365512%28VS.85%29.aspx
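For reference, the write/flush/rename pattern under discussion can be sketched as follows (os.replace is Python 3.3+ and maps to ReplaceFile-style semantics on Windows; the NFS caveat above still applies, and on Python 2 POSIX the equivalent call is os.rename):

```python
import os
import tempfile

def atomic_write(path, data):
    """Write data to path so readers see either the old or the new
    contents, never a partial file (POSIX sketch; NFS not guaranteed)."""
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirname)     # temp file on same filesystem
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())                # data on disk before the rename
        os.replace(tmp, path)                   # atomic rename (Python 3.3+)
    except BaseException:
        os.unlink(tmp)                          # clean up on failure
        raise
```

A fully paranoid version also fsyncs the containing directory after the rename, as the maildir discussion mentioned, so the directory entry itself survives a crash.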

Renaming files is the wrong way to synchronize a
crawler.  Use a database that has ACID properties, such as
SQLite.  Far fewer I/O operations are required for small updates.
It's not the 1980s any more.

I use a MySQL database to synchronize multiple processes
which crawl web sites.  The tables of past activity are InnoDB
tables, which support transactions.  The table of what's going
on right now is a MEMORY table.  If the database crashes, the
past activity is recovered cleanly, the MEMORY table comes back
empty, and all the crawler processes lose their database
connections, abort, and are restarted.  This allows multiple
servers to coordinate through one database.

John Nagle




--
http://mail.python.org/mailman/listinfo/python-list


Re: Socket code not executing properly in a thread (Windows)

2012-07-07 Thread John Nagle

On 7/8/2012 3:55 AM, Andrew D'Angelo wrote:

Hi, I've been writing an IRC chatbot that an relay messages it receives as
an SMS.


   We have no idea what IRC module you're using.


As it stands, I can retrieve and parse SMSs from Google Voice perfectly


   The Google Voice code you have probably won't work once you have
enough messages stored that Google Voice returns them on multiple
pages.  You have to read all the pages.  If there's any significant
amount of traffic, the completed messages have to be moved or deleted,
or each polling cycle returns more data than the last one.

   Google Voice isn't a very good SMS gateway.  I used to use it,
but switched to Twilio (which costs, but works) two years ago.

    John Nagle
--
http://mail.python.org/mailman/listinfo/python-list


Re: simpler increment of time values?

2012-07-05 Thread John Nagle

On 7/4/2012 5:29 PM, Vlastimil Brom wrote:

Hi all,
I'd like to ask about the possibilities to do some basic manipulation
on timestamps - such as incrementing a given time (hour.minute -
string) by some minutes.
Very basic notion of "time" is assumed, i.e. dateless,
timezone-unaware, DST-less etc.
I first thought, it would be possible to just add a timedelta to a
time object, but, it doesn't seem to be the case.


   That's correct.  A datetime.time object is a time within a day.
A datetime.date object is a date without a time.  A datetime.datetime
object contains both.

  You can add a datetime.timedelta object to a datetime.datetime
object, which will yield a datetime.datetime object.

  You can also call time.time(), and get the number of seconds
since the epoch (usually 1970-01-01 00:00:00 UTC). That's just
a number, and you can do arithmetic on that.

  Adding a datetime.time to a datetime.timedelta isn't that
useful.  It would have to return a value error if the result
crossed a day boundary.
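One standard workaround for dateless time arithmetic is to combine the time with a throwaway date, add the timedelta, and strip the date back off; the result wraps silently at midnight instead of raising:

```python
import datetime

def add_minutes(t, minutes):
    """Add minutes to a dateless datetime.time, wrapping at midnight
    (a sketch: attach a dummy date, add, take .time() back off)."""
    anchor = datetime.datetime.combine(datetime.date(2000, 1, 1), t)
    return (anchor + datetime.timedelta(minutes=minutes)).time()

print(add_minutes(datetime.time(23, 50), 25))   # 00:15:00 -- wrapped past midnight
```

Whether silent wrapping is acceptable depends on the application; if crossing a day boundary is an error, the caller can compare the result against the input instead.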

    John Nagle


--
http://mail.python.org/mailman/listinfo/python-list


Re: when "normal" parallel computations in CPython will be implemented at last?

2012-07-02 Thread John Nagle

On 7/1/2012 10:51 AM, dmitrey wrote:

hi all,
are there any information about upcoming availability of parallel
computations in CPython without modules like  multiprocessing? I mean
something like parallel "for" loops, or, at least, something without
forking with copying huge amounts of RAM each time and possibility to
involve unpiclable data (vfork would be ok, but AFAIK it doesn't work
with CPython due to GIL).

AFAIK in PyPy some progress have been done (
http://morepypy.blogspot.com/2012/06/stm-with-threads.html )

Thank you in advance, D.



   It would be "un-Pythonic" to have real concurrency in Python.
You wouldn't be able to patch code running in one thread from
another thread.  Some of the dynamic features of Python
would break.   If you want fine-grained concurrency, you need
controlled isolation between concurrent tasks, so they interact
only at well-defined points.  That's un-Pythonic.

    John Nagle

--
http://mail.python.org/mailman/listinfo/python-list


Re: PySerial could not open port COM4: [Error 5] Access is denied - please help

2012-06-26 Thread John Nagle

On 6/26/2012 9:12 PM, Adam wrote:

Host OS:Ubuntu 10.04 LTS
Guest OS:Windows XP Pro SP3


I am able to open port COM4 with Terminal emulator.

So, what can cause PySerial to generate the following error ...

C:\Wattcher>python wattcher.py
Traceback (most recent call last):
   File "wattcher.py", line 56, in <module>
 ser.open()
   File "C:\Python25\Lib\site-packages\serial\serialwin32.py", line 56, in open
 raise SerialException("could not open port %s: %s" % (self.portstr,
ctypes.WinError()))
serial.serialutil.SerialException: could not open port COM4: [Error 5]
Access is denied.


Are you trying to access serial ports from a virtual machine?
Which virtual machine environment?  Xen?  VMware? QEmu?  VirtualBox?
I wouldn't expect that to work in most of those.

What is "COM4", anyway?   Few machines today actually have four
serial ports.  Is some device emulating a serial port?

John Nagle


--
http://mail.python.org/mailman/listinfo/python-list


Re: Why has python3 been created as a seperate language where there is still python2.7 ?

2012-06-26 Thread John Nagle

On 6/25/2012 1:36 AM, Stefan Behnel wrote:

gmspro, 24.06.2012 05:46:

Why has python3 been created as a seperate language where there is still 
python2.7 ?



The intention of Py3 was to deliberately break backwards compatibility in
order to clean up the language. The situation is not as bad as you seem to
think, a huge amount of packages have been ported to Python 3 already
and/or work happily with both language dialects.


The syntax changes in Python 3 are a minor issue for
serious programmers.  The big headaches come from packages that
aren't being ported to Python 3 at all.  In some cases, there's
a replacement package from another author that performs the
same function, but has a different API.  Switching packages
involves debugging some new package with, probably, one
developer and a tiny user community.

The Python 3 to MySQL connection is still a mess.
The original developer of MySQLdb doesn't want to support
Python 3.  There's "pymysql", but it hasn't been updated
since 2010 and has a long list of unfixed bugs.
There was a "MySQL-python-1.2.3-py3k" port by a third party,
but the domain that hosted it 
("http://www.elecmor.mooo.com/python/MySQL-python-1.2.3-py3k.zip") is 
dead.  There's

MySQL for Python 3 (https://github.com/davispuh/MySQL-for-Python-3)
but it doesn't work on Windows.  MySQL Connector
(https://code.launchpad.net/myconnpy) hasn't been updated in a
while, but at least has some users.  OurSQL has a different
API than MySQLdb, and isn't quite ready for prime time yet.

    That's why I'm still on Python 2.7.

John Nagle
--
http://mail.python.org/mailman/listinfo/python-list


Re: Internationalized domain names not working with URLopen

2012-06-13 Thread John Nagle

On 6/12/2012 11:42 PM, Andrew Berg wrote:

On 6/13/2012 1:17 AM, John Nagle wrote:

What does "urllib2" want?  Percent escapes?  Punycode?

Looks like Punycode is the correct answer:
https://en.wikipedia.org/wiki/Internationalized_domain_name#ToASCII_and_ToUnicode

I haven't tried it, though.


   This is Python bug #9679:

http://bugs.python.org/issue9679

It's been open for years, and the maintainers offer elaborate
excuses for not fixing the problem.

The socket module accepts Unicode domains, as does httplib.
But urllib2, which is a front end to both, is still broken.
It's failing when it constructs the HTTP headers.  Domains
in HTTP headers have to be in punycode.

The code in stackoverflow doesn't really work right.  Only
the domain part of a URL should be converted to punycode.
Path, port, and query parameters need to be converted to
percent-encoding.  (Unclear if urllib2 or httplib does this
already.  The documentation doesn't say.)

While HTTP content can be in various character sets, the
headers are currently required to be ASCII only, since the
header has to be processed to determine the character code.
(http://lists.w3.org/Archives/Public/ietf-http-wg/2011OctDec/0155.html)

Here's a workaround, for the domain part only.


#
#   idnaurlworkaround  --  workaround for Python defect 9679
#
PYTHONDEFECT9679FIXED = False # Python defect #9679 - change when fixed

import urlparse                     # Python 2; "urllib.parse" in Python 3
import encodings.idna

def idnaurlworkaround(url) :
    """
    Convert a URL to a form the currently broken urllib2 will accept.
    Converts the domain to "punycode" if necessary.
    This is a workaround for Python defect #9679.
    """
    if PYTHONDEFECT9679FIXED :      # if defect fixed
        return url                  # use unmodified URL
    url = unicode(url)              # force to Unicode
    (scheme, accesshost, path, params,
        query, fragment) = urlparse.urlparse(url)       # parse URL
    if scheme == '' and accesshost == '' and path != '' :   # bare domain
        accesshost = path           # use path as access host
        path = ''                   # no path
    labels = accesshost.split('.')  # split domain into sections ("labels")
    labels = [encodings.idna.ToASCII(w) for w in labels]    # convert each label to punycode if necessary
    accesshost = '.'.join(labels)   # reassemble domain
    url = urlparse.urlunparse((scheme, accesshost, path,
                               params, query, fragment))    # reassemble URL
    return url                      # return complete URL with punycode domain
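For reference, the stdlib "idna" codec does the same per-label ToASCII conversion on a whole domain in one call; here's a minimal Python 3 sketch (the helper name is invented, and this is not part of the workaround above):

```python
# The "idna" codec splits on ".", converts each label to its
# ASCII-compatible ("xn--") form, and rejoins the labels.
def domain_to_ascii(domain):
    return domain.encode("idna").decode("ascii")

encoded = domain_to_ascii(
    u"\u043f\u0440\u0438\u043c\u0435\u0440.\u0438\u0441\u043f\u044b\u0442\u0430\u043d\u0438\u0435")
print(encoded)   # each label comes back as an "xn--" ACE label
```

Only the domain should be passed through this; paths and query strings still need percent-encoding.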

John Nagle
--
http://mail.python.org/mailman/listinfo/python-list


Internationalized domain names not working with URLopen

2012-06-12 Thread John Nagle

I'm trying to open

http://пример.испытание

with

urllib2.urlopen(s1)

in Python 2.7 on Windows 7. This produces a Unicode exception:

>>> s1
u'http://\u043f\u0440\u0438\u043c\u0435\u0440.\u0438\u0441\u043f\u044b\u0442\u0430\u043d\u0438\u0435'
>>> fd = urllib2.urlopen(s1)
Traceback (most recent call last):
  File "", line 1, in 
  File "C:\python27\lib\urllib2.py", line 126, in urlopen
return _opener.open(url, data, timeout)
  File "C:\python27\lib\urllib2.py", line 394, in open
response = self._open(req, data)
  File "C:\python27\lib\urllib2.py", line 412, in _open
'_open', req)
  File "C:\python27\lib\urllib2.py", line 372, in _call_chain
result = func(*args)
  File "C:\python27\lib\urllib2.py", line 1199, in http_open
return self.do_open(httplib.HTTPConnection, req)
  File "C:\python27\lib\urllib2.py", line 1168, in do_open
h.request(req.get_method(), req.get_selector(), req.data, headers)
  File "C:\python27\lib\httplib.py", line 955, in request
self._send_request(method, url, body, headers)
  File "C:\python27\lib\httplib.py", line 988, in _send_request
self.putheader(hdr, value)
  File "C:\python27\lib\httplib.py", line 935, in putheader
hdr = '%s: %s' % (header, '\r\n\t'.join([str(v) for v in values]))
UnicodeEncodeError: 'ascii' codec can't encode characters in position 
0-5: ordinal not in range(128)

>>>

The HTTP library is trying to put the URL in the header as ASCII.  Why 
isn't "urllib2" handling that?


What does "urllib2" want?  Percent escapes?  Punycode?

John Nagle
--
http://mail.python.org/mailman/listinfo/python-list


Re: sqlite INSERT performance

2012-05-30 Thread John Nagle

On 5/30/2012 6:57 PM, duncan smith wrote:

Hello,
I have been attempting to speed up some code by using an sqlite
database, but I'm not getting the performance gains I expected.


SQLite is a "lite" database.  It's good for data that's read a
lot and not changed much.  It's good for small data files.  It's
so-so for large database loads.  It's terrible for a heavy load of 
simultaneous updates from multiple processes.


However, wrapping the inserts into a transaction with BEGIN
and COMMIT may help.
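To make the BEGIN/COMMIT point concrete, here's a minimal sketch with the stdlib sqlite3 module (in-memory database; the table name and row contents are invented for illustration). Using the connection as a context manager wraps the whole batch in a single transaction:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (k INTEGER, v TEXT)")

rows = [(i, "x") for i in range(1000)]
with conn:  # one BEGIN ... COMMIT around all 1000 inserts
    conn.executemany("INSERT INTO t VALUES (?, ?)", rows)

count = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
print(count)
```

Without the transaction, each INSERT commits (and syncs to disk) individually, which is where most of the time goes.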

If you have 67 columns in a table, you may be approaching the
problem incorrectly.

    John Nagle
--
http://mail.python.org/mailman/listinfo/python-list


Re: Email Id Verification

2012-05-24 Thread John Nagle

On 5/24/2012 5:32 AM, niks wrote:

Hello everyone..
I am new to asp.net...
I want to use Regular Expression validator in Email id verification..
Can anyone tell me how to use this and what is the meaning of
this
\w+([-+.']\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*


   Not a Python question.

   It matches anything that looks like a mail user name followed by
an @ followed by anything that looks more or less like a domain name.
The domain name must contain at least one ".", and cannot end with
a ".", which is not strictly correct but usually works.

        John Nagle
--
http://mail.python.org/mailman/listinfo/python-list


Re: escaping/encoding/formatting in python

2012-05-23 Thread John Nagle

On 4/5/2012 10:10 PM, Steve Howell wrote:

On Apr 5, 9:59 pm, rusi  wrote:

On Apr 6, 6:56 am, Steve Howell  wrote:



You've one-upped me with 2-to-the-N backspace escaping.


   Early attempts at UNIX word processing, "nroff" and "troff",
suffered from that problem, due to a badly designed macro system.

   A question in language design is whether to escape or quote.
Do you write

"X = %d" % (n,))

or

"X = " + str(n)

In general, for anything but output formatting, the second scales
better.  Regular expressions have a bad case of the first.
For a quoted alternative to regular expression syntax, see
SNOBOL or Icon.   SNOBOL allows naming patterns, and those patterns
can then be used as components of other patterns.  SNOBOL
is obsolete, but that approach produced much more readable
code.
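The naming-and-composing idea can be imitated in Python by building regular expressions out of named fragments; a small sketch (the fragment names are invented for illustration):

```python
import re

# Name the pieces, then compose them, SNOBOL-style.
digits = r"\d+"
sign = r"[+-]?"
number = sign + digits                  # "[+-]?\d+"
pair = r"\(" + number + "," + number + r"\)"

assert re.fullmatch(number, "-42")
assert re.fullmatch(pair, "(3,-7)")
assert not re.fullmatch(pair, "(3, -7)")   # no space allowed in this sketch
```

The composed pattern is still a regular expression underneath, but the intent reads off the names rather than the punctuation.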

John Nagle
--
http://mail.python.org/mailman/listinfo/python-list


Re: serial module

2012-05-22 Thread John Nagle

On 5/22/2012 2:07 PM, Paul Rubin wrote:

John Nagle  writes:

If a device is registered as /dev/ttyUSBnn, one would hope that
the Linux USB insertion event handler, which assigns that name,
determined that the device was a serial port emulator.  Unfortunately,
the USB standard device classes
(http://www.usb.org/developers/defined_class) don't have "serial port
emulator" as a standardized device.  So there's more variation in this
area than in keyboards, mice, or storage devices.


Hmm, I've been using USB-to-serial adapters and so far they've worked
just fine.  I plug the USB end of adapter into a Ubuntu box, see
/dev/ttyUSB* appear, plug the serial end into the external serial
device, and just use pyserial like with an actual serial port.  I didn't
realize there were issues with this.


   There are.  See "http://wiki.debian.org/usbserial".  Because there's
no standard USB class for such devices, the specific vendor ID/product
ID pair has to be known to the OS.  In Linux, there's a file of these,
but not all USB to serial adapters are in it.  In Windows, there
tends to be a vendor-provided driver for each brand of USB to
serial converter.  This all would have been much simpler if the USB
Consortium had defined a USB class for these devices, as they did
for keyboards, mice, etc.

   However, this is not the original poster's problem.

John Nagle
--
http://mail.python.org/mailman/listinfo/python-list


Re: serial module

2012-05-22 Thread John Nagle

On 5/22/2012 8:42 AM, Grant Edwards wrote:

On 2012-05-22, Albert van der Horst  wrote:



It is anybody's guess what they do in USB.


They do exactly what they're supposed to regardless of what sort of
bus is used to connect the CPU and the UART (ISA, PCI, PCI-express,
USB, Ethernet, etc.).


   If a device is registered as /dev/ttyUSBnn, one would hope that
the Linux USB insertion event handler, which assigns that name,
determined that the device was a serial port emulator.  Unfortunately,
the USB standard device classes
(http://www.usb.org/developers/defined_class) don't have "serial port
emulator" as a standardized device.  So there's more variation in this
area than in keyboards, mice, or storage devices.



The best answers is probably that it depends on the whim of whoever
implements the usb device.


It does not depend on anybody's whim.  The meaning of those parameters
is well-defined.


Certainly this stuff is system dependant,


No, it isn't.


   It is, a little.  There's a problem with the way Linux does
serial ports.   The only speeds allowed are the ones nailed into the
kernel as named constants.  This is a holdover from UNIX, which is a
holdover from DEC PDP-11 serial hardware circa mid 1970s, which had
14 standard baud rates encoded in 4 bits.  Really.

   In the Windows world, the actual baud rate is passed to the
driver.  Serial ports on the original IBM PC were loaded with
a clock rate, so DOS worked that way.

   This only matters if you need non-standard baud rates.  I've
had to deal with that twice, for a SICK LMS LIDAR, (1,000,000 baud)
and 1930s Teletype machines (45.45 baud).

   If you need non-standard speeds, see this:

http://www.aetherltd.com/connectingusb.html

   If 19,200 baud is enough for you, don't worry about it.

John Nagle
--
http://mail.python.org/mailman/listinfo/python-list


Re: Creating a directory structure and modifying files automatically in Python

2012-05-07 Thread John Nagle

On 5/7/2012 9:09 PM, Steve Howell wrote:

On May 7, 8:46 pm, John Nagle  wrote:

On 5/6/2012 9:59 PM, Paul Rubin wrote:


Javierwrites:

Or not... Using directories may be a way to do rapid prototyping, and
check quickly how things are going internally, without needing to resort
to complex database interfaces.



dbm and shelve are extremely simple to use.  Using the file system for a
million item db is ridiculous even for prototyping.


 Right.  Steve Bellovin wrote that back when UNIX didn't have any
database programs, let alone free ones.



It's kind of sad that the Unix file system doesn't serve as an
effective key-value store at any kind of nontrivial scale.  It would
simplify a lot of programming if filenames were keys and file contents
were values.


   You don't want to go there in a file system.  Some people I know
tried that around 1970.  "A bit is a file.  An ordered collection of 
files is a file".  Didn't work out.


   There are file models other than the UNIX one.  Many older systems
had file versioning.  Tandem built their file system on top of their
distributed, redundant database system.  There are backup systems
where the name of the file is its hash, allowing elimination of
duplicates.  Most of the "free online storage" sites do that.
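The hash-as-filename scheme reduces to a tiny idea: the "name" of a blob is a digest of its content, so identical content collapses to one entry. A toy in-memory sketch (class and method names invented for illustration):

```python
import hashlib

class HashStore:
    """Toy content-addressed store: the "filename" is the content hash."""
    def __init__(self):
        self._blobs = {}

    def put(self, data):
        key = hashlib.sha256(data).hexdigest()
        self._blobs[key] = data          # duplicates collapse to one entry
        return key

    def get(self, key):
        return self._blobs[key]

store = HashStore()
k1 = store.put(b"same bytes")
k2 = store.put(b"same bytes")            # second copy deduplicates
```

A real backup system stores the blobs on disk and keeps a separate index from human-readable names to hashes.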

    John Nagle
--
http://mail.python.org/mailman/listinfo/python-list


Re: Creating a directory structure and modifying files automatically in Python

2012-05-07 Thread John Nagle

On 5/6/2012 9:59 PM, Paul Rubin wrote:

Javier  writes:

Or not... Using directories may be a way to do rapid prototyping, and
check quickly how things are going internally, without needing to resort
to complex database interfaces.


dbm and shelve are extremely simple to use.  Using the file system for a
million item db is ridiculous even for prototyping.


   Right.  Steve Bellovin wrote that back when UNIX didn't have any
database programs, let alone free ones.

    John Nagle

--
http://mail.python.org/mailman/listinfo/python-list


Re: key/value store optimized for disk storage

2012-05-06 Thread John Nagle

On 5/4/2012 12:14 AM, Steve Howell wrote:

On May 3, 11:59 pm, Paul Rubin  wrote:

Steve Howell  writes:

 compressor = zlib.compressobj()
 s = compressor.compress("foobar")
 s += compressor.flush(zlib.Z_SYNC_FLUSH)



 s_start = s
 compressor2 = compressor.copy()


   That's awful. There's no point in compressing six characters
with zlib.  Zlib has a minimum overhead of 11 bytes.  You just
made the data bigger.
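The overhead is easy to see directly; for a six-byte input, zlib's header and checksum outweigh any savings:

```python
import zlib

raw = b"foobar"
packed = zlib.compress(raw)
# Tiny inputs grow: the container overhead exceeds the compression gain.
print(len(raw), len(packed))
```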

    John Nagle
--
http://mail.python.org/mailman/listinfo/python-list


"

2012-05-03 Thread John Nagle

  An HTML page for a major site (http://www.chase.com) has
some incorrect HTML.  It contains


Re: Python SOAP library

2012-05-02 Thread John Nagle

On 5/2/2012 8:35 AM, Alec Taylor wrote:

What's the best SOAP library for Python?
I am creating an API converter which will be serialising to/from a variety of 
sources, including REST and SOAP.
Relevant parsing is XML [incl. SOAP] and JSON.
Would you recommend: http://code.google.com/p/soapbox/

Or suggest another?
Thanks for all information,


   Are you implementing the client or the server?

   Python "Suds" is a good client-side library. It's strict SOAP;
you must have a WSDL file, and the XML queries and replies must
verify against the WSDL file.

https://fedorahosted.org/suds/

    John Nagle
--
http://mail.python.org/mailman/listinfo/python-list


Re: Creating a directory structure and modifying files automatically in Python

2012-04-30 Thread John Nagle

On 4/30/2012 8:19 AM, deltaquat...@gmail.com wrote:

Hi,

I would like to automate the following task under Linux. I need to create a set 
of directories such as

075
095
100
125

The directory names may be read from a text file foobar, which also contains a 
number corresponding to each dir, like this:

075 1.818
095 2.181
100 2.579
125 3.019


In each directory I must copy a text file input.in. This file contains  two 
lines which need to be edited:


   Learn how to use a database.  Creating and managing a
big collection of directories to handle small data items is the
wrong approach to data storage.

John Nagle
--
http://mail.python.org/mailman/listinfo/python-list


Re: why () is () and [] is [] work in other way?

2012-04-29 Thread John Nagle

On 4/28/2012 4:47 AM, Kiuhnm wrote:

On 4/27/2012 17:39, Adam Skutt wrote:

On Apr 27, 8:07 am, Kiuhnm wrote:

Useful... maybe, conceptually sound... no.
Conceptually, NaN is the class of all elements which are not numbers,
therefore NaN = NaN.


NaN isn't really the class of all elements which aren't numbers. NaN
is the result of a few specific IEEE 754 operations that cannot be
computed, like 0/0, and for which there's no other reasonable
substitute (e.g., infinity) for practical applications.

In the real world, if we were doing the math with pen and paper, we'd
stop as soon as we hit such an error. Equality is simply not defined
for the operations that can produce NaN, because we don't know to
perform those computations. So no, it doesn't conceptually follow
that NaN = NaN, what conceptually follows is the operation is
undefined because NaN causes a halt.


Mathematics is more than arithmetics with real numbers. We can use FP
too (we actually do that!). We can say that NaN = NaN but that's just an
exception we're willing to make. We shouldn't say that the equivalence
relation rules shouldn't be followed just because *sometimes* we break
them.


This is what programming languages ought to do if NaN is compared to
anything other than a (floating-point) number: disallow the operation
in the first place or toss an exception.


   If you do a signaling floating point comparison on IEEE floating
point numbers, you do get an exception.  On some FPUs, though,
signaling operations are slower.  On superscalar CPUs, exact
floating point exceptions are tough to implement.  They are
done right on x86 machines, mostly for backwards compatibility.
This requires an elaborate "retirement unit" to unwind the
state of the CPU after a floating point exception.  DEC Alphas
didn't have that; SPARC and MIPS machines varied by model.
ARM machines in their better modes do have that.
Most game console FPUs do not have a full IEEE implementation.

   Proper language support for floating point exceptions varies
with the platform.  Microsoft C++ on Windows does support
getting it right.  (I had to deal with this once in a physics
engine, where an overflow or a NaN merely indicated that a
shorter time step was required.)  But even there, it's
an OS exception, like a signal, not a language-level
exception.  Other than Ada, which requires it, few
languages handle such exceptions as language level
exceptions.
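For what it's worth, CPython's float comparisons use the quiet (non-signaling) IEEE operations, so comparing NaN never raises; it simply compares unequal to everything, including itself:

```python
import math

nan = float("nan")
# Quiet-NaN semantics: comparisons return False rather than raising.
assert nan != nan
assert not (nan == nan)
assert math.isnan(nan)       # the reliable way to test for NaN
```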


John Nagle
--
http://mail.python.org/mailman/listinfo/python-list


Re: CPython thread starvation

2012-04-29 Thread John Nagle

On 4/28/2012 1:04 PM, Paul Rubin wrote:

Roy Smith  writes:

I agree that application-level name cacheing is "wrong", but sometimes
doing it the wrong way just makes sense.  I could whip up a simple
cacheing wrapper around getaddrinfo() in 5 minutes.  Depending on the
environment (both technology and bureaucracy), getting a cacheing
nameserver installed might take anywhere from 5 minutes to a few days to ...


IMHO this really isn't one of those times.  The in-app wrapper would
only be usable to just that process, and we already know that the OP has
multiple processes running the same app on the same machine.  They would
benefit from being able to share the cache, so now your wrapper gets
more complicated.  If it's not a nameserver then it's something that
fills in for one.  And then, since the application appears to be a large
scale web spider, it probably wants to run on a cluster, and the cache
should be shared across all the machines.  So you really probably want
an industrial strength nameserver with a big persistent cache, and maybe
a smaller local cache because of high locality when crawling specific
sites, etc.


Each process is analyzing one web site, and has its own cache.
Once the site is analyzed, which usually takes about a minute,
the cache disappears.  Multiple threads are reading multiple pages
from the web site during that time.

A local cache is enough to fix the huge overhead problem of
doing a DNS lookup for every link found.  One site with a vast
number of links took over 10 hours to analyze before this fix;
now it takes about four minutes.  That solved the problem.
We can probably get an additional minor performance boost with a real
local DNS daemon, and will probably configure one.

We recently changed servers from Red Hat to CentOS, and management
from CPanel to Webmin.  Before the change, we had a local DNS daemon
with cacheing, so we didn't have this problem.  Webmin's defaults
tend to be on the minimal side.

The DNS information is used mostly to help decide whether two URLs
actually point to the same IP address, as part of deciding whether a
link is on-site or off-site.  Most of those links will never be read.
We're not crawling the entire site, just looking at likely pages to
find the name and address of the business behind the site.  (It's
part of our "Know who you're dealing with" system, SiteTruth.)
    
John Nagle

--
http://mail.python.org/mailman/listinfo/python-list


Re: CPython thread starvation

2012-04-27 Thread John Nagle

On 4/27/2012 9:55 PM, Paul Rubin wrote:

John Nagle  writes:

I may do that to prevent the stall.  But the real problem was all
those DNS requests.  Parallelizing them wouldn't help much when it took
hours to grind through them all.


True dat.  But building a DNS cache into the application seems like a
kludge.  Unless the number of requests is insane, running a caching
nameserver on the local box seems cleaner.


   I know.  When I have a bit more time, I'll figure out why
CentOS 5 and Webmin didn't set up a caching DNS resolver by
default.

   Sometimes the number of requests IS insane.  When the
system hits a page with a thousand links, it has to resolve
all of them.  (Beyond a thousand links, we classify it as
link spam and stop.  The record so far is a page with over
10,000 links.)

    John Nagle

--
http://mail.python.org/mailman/listinfo/python-list


Re: CPython thread starvation

2012-04-27 Thread John Nagle

On 4/27/2012 9:20 PM, Paul Rubin wrote:

John Nagle  writes:


The code that stored them looked them up with "getaddrinfo()", and
did this while a lock was set.


Don't do that!!


Added a local cache in the program to prevent this.
Performance much improved.


Better to release the lock while the getaddrinfo is running, if you can.


   I may do that to prevent the stall.  But the real problem was all
those DNS requests.  Parallelizing them wouldn't help much when it took
hours to grind through them all.

    John Nagle

--
http://mail.python.org/mailman/listinfo/python-list


Re: CPython thread starvation

2012-04-27 Thread John Nagle

On 4/27/2012 6:25 PM, Adam Skutt wrote:

On Apr 27, 2:54 pm, John Nagle  wrote:

  I have a multi-threaded CPython program, which has up to four
threads.  One thread is simply a wait loop monitoring the other
three and waiting for them to finish, so it can give them more
work to do.  When the work threads, which read web pages and
then parse them, are compute-bound, I've had the monitoring thread
starved of CPU time for as long as 120 seconds.


How exactly are you determining that this is the case?


   Found the problem.  The threads, after doing their compute
intensive work of examining pages, stored some URLs they'd found.
The code that stored them looked them up with "getaddrinfo()", and
did this while a lock was set.  On CentOS, "getaddrinfo()" at the
glibc level doesn't always cache locally (ref
https://bugzilla.redhat.com/show_bug.cgi?id=576801).  Python
doesn't cache either.  So huge numbers of DNS requests were being
made.  For some pages being scanned, many of the domains required
accessing a rather slow  DNS server.  The combination of thousands
of instances of the same domain, a slow DNS server, and no caching
slowed the crawler down severely.

   Added a local cache in the program to prevent this.
Performance much improved.
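A minimal sketch of such a per-process cache, holding the lock only around cache access so the slow lookup itself runs unlocked (names are invented; a real crawler would also want a size limit and TTL):

```python
import socket
import threading

_dns_lock = threading.Lock()
_dns_cache = {}

def cached_getaddrinfo(host, port=80):
    key = (host, port)
    with _dns_lock:                  # quick check under the lock
        if key in _dns_cache:
            return _dns_cache[key]
    # Slow network lookup runs with the lock released, so other
    # threads aren't stalled behind it.
    result = socket.getaddrinfo(host, port)
    with _dns_lock:
        _dns_cache[key] = result
    return result
```

Two threads may race on a cold key and both do the lookup; that's harmless here, since both store the same answer.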

    John Nagle
--
http://mail.python.org/mailman/listinfo/python-list


CPython thread starvation

2012-04-27 Thread John Nagle

I have a multi-threaded CPython program, which has up to four
threads.  One thread is simply a wait loop monitoring the other
three and waiting for them to finish, so it can give them more
work to do.  When the work threads, which read web pages and
then parse them, are compute-bound, I've had the monitoring thread
starved of CPU time for as long as 120 seconds.
It's sleeping for 0.5 seconds, then checking on the other threads
and for new work to do, so the monitoring thread isn't using much
compute time.

   I know that the CPython thread dispatcher sucks, but I didn't
realize it sucked that bad.  Is there a preference for running
threads at the head of the list (like UNIX, circa 1979) or
something like that?

   (And yes, I know about "multiprocessing".  These threads are already
in one of several service processes.  I don't want to launch even more
copies of the Python interpreter.  The threads are usually I/O bound,
but when they hit unusually long web pages, they go compute-bound
during parsing.)

   Setting "sys.setcheckinterval" from the default to 1 seems
to have little effect.  This is on Windows 7.

John Nagle
--
http://mail.python.org/mailman/listinfo/python-list


Re: why () is () and [] is [] work in other way?

2012-04-26 Thread John Nagle

On 4/26/2012 4:45 AM, Adam Skutt wrote:

On Apr 26, 1:48 am, John Nagle  wrote:

On 4/25/2012 5:01 PM, Steven D'Aprano wrote:


On Wed, 25 Apr 2012 13:49:24 -0700, Adam Skutt wrote:



Though, maybe it's better to use a different keyword than 'is' though,
due to the plain English
connotations of the term; I like 'sameobj' personally, for whatever
little it matters.  Really, I think taking away the 'is' operator
altogether is better, so the only way to test identity is:
  id(x) == id(y)



Four reasons why that's a bad idea:



1) The "is" operator is fast, because it can be implemented directly by
the interpreter as a simple pointer comparison (or equivalent).


 This assumes that everything is, internally, an object.  In CPython,
that's the case, because Python is a naive interpreter and everything,
including numbers, is "boxed".  That's not true of PyPy or Shed Skin.
So does "is" have to force the creation of a temporary boxed object?


That's what C# does AFAIK.  Java defines '==' as value comparison for
primitives and '==' as identity comparison for objects, but I don't
exactly know how one would do that in Python.


   I would suggest that "is" raise ValueError for the ambiguous cases.
If both operands are immutable, "is" should raise ValueError.
That's the case where the internal representation of immutables
shows through.

   If this breaks a program, it was broken anyway.  It will
catch bad comparisons like

if x is 1000 :
...

which is implementation dependent.

John Nagle
--
http://mail.python.org/mailman/listinfo/python-list


Re: why () is () and [] is [] work in other way?

2012-04-25 Thread John Nagle

On 4/25/2012 5:01 PM, Steven D'Aprano wrote:

On Wed, 25 Apr 2012 13:49:24 -0700, Adam Skutt wrote:


Though, maybe it's better to use a different keyword than 'is' though,
due to the plain English
connotations of the term; I like 'sameobj' personally, for whatever
little it matters.  Really, I think taking away the 'is' operator
altogether is better, so the only way to test identity is:
 id(x) == id(y)


Four reasons why that's a bad idea:

1) The "is" operator is fast, because it can be implemented directly by
the interpreter as a simple pointer comparison (or equivalent).


   This assumes that everything is, internally, an object.  In CPython,
that's the case, because Python is a naive interpreter and everything,
including numbers, is "boxed".  That's not true of PyPy or Shed Skin.
So does "is" have to force the creation of a temporary boxed object?

   The concept of "object" vs. the implementation of objects is
one reason you don't necessarily want to expose the implementation.

John Nagle
--
http://mail.python.org/mailman/listinfo/python-list


Re: why () is () and [] is [] work in other way?

2012-04-23 Thread John Nagle

On 4/22/2012 9:34 PM, Steven D'Aprano wrote:

On Sun, 22 Apr 2012 12:43:36 -0700, John Nagle wrote:


On 4/20/2012 9:34 PM, john.tant...@gmail.com wrote:

On Friday, April 20, 2012 12:34:46 PM UTC-7, Rotwang wrote:


I believe it says somewhere in the Python docs that it's undefined and
implementation-dependent whether two identical expressions have the
same identity when the result of each is immutable


 Bad design.  Where "is" is ill-defined, it should raise ValueError.


"is" is never ill-defined. "is" always, without exception, returns True
if the two operands are the same object, and False if they are not. This
is literally the simplest operator in Python.

John, you've been using Python for long enough that you should know this.
I can only guess that you are trolling, although I can't imagine why.


   Because the language definition should not be what CPython does.
As PyPy advances, we need to move beyond that.

John Nagle
--
http://mail.python.org/mailman/listinfo/python-list


Re: why () is () and [] is [] work in other way?

2012-04-22 Thread John Nagle

On 4/22/2012 3:17 PM, John Roth wrote:

On Sunday, April 22, 2012 1:43:36 PM UTC-6, John Nagle wrote:

On 4/20/2012 9:34 PM, john.tant...@gmail.com wrote:

On Friday, April 20, 2012 12:34:46 PM UTC-7, Rotwang wrote:


I believe it says somewhere in the Python docs that it's
undefined and implementation-dependent whether two identical
expressions have the same identity when the result of each is
immutable


Bad design.  Where "is" is ill-defined, it should raise
ValueError.

A worse example, one which is very implementation-dependent:

http://stackoverflow.com/questions/306313/python-is-operator-behaves-unexpectedly-with-integers




>>> a = 256
>>> b = 256
>>> a is b
True   # this is an expected result
>>> a = 257
>>> b = 257
>>> a is b
False

Operator "is" should be be an error between immutables unless one
is a built-in constant.  ("True" and "False" should be made hard
constants, like "None". You can't assign to None, but you can
assign to True, usually with unwanted results.  It's not clear why
True and False weren't locked down when None was.)

John Nagle


Three points. First, since there's no obvious way of telling whether
an arbitrary user-created object is immutable, trying to make "is"
fail in that case would be a major change to the language.


   If a program fails because such a comparison becomes invalid, it
was broken anyway.

   The idea was borrowed from LISP, which has both "eq" (pointer
equality) and "equal" (compared equality).  It made somewhat
more sense in the early days of LISP, when the underlying
representation of everything was well defined.


Second: the definition of "is" states that it determines whether two
objects are the same object; this has nothing to do with mutability
or immutability.

The id([]) == id([]) thing is a place where cPython's implementation
is showing through. It won't work that way in any implementation that
uses garbage collection and object compaction. I think Jython does it
that way, I'm not sure about either IronPython or PyPy.


   That represents a flaw in the language design - the unexpected
exposure of an implementation dependency.


Third: True and False are reserved names and cannot be assigned to in
the 3.x series. They weren't locked down in the 2.x series when they
were introduced because of backward compatibility.


That's one of the standard language designer fuckups.  Somebody
starts out thinking that 0 and 1 don't have to be distinguished from
False and True.  When they discover that they do, the backwards
compatibility sucks.  C still suffers from this.

John Nagle
--
http://mail.python.org/mailman/listinfo/python-list


Re: global vars across modules

2012-04-22 Thread John Nagle

On 4/22/2012 12:39 PM, mambokn...@gmail.com wrote:



Question:
How can I access to the global 'a' in file_2 without resorting to the whole 
name 'file_1.a' ?


   Actually, it's better to use the fully qualified name "file_1.a".
Using "import *" brings in everything in the other module, which often
results in a name clash.

Just do

import file_1

and, if desired

localnamefora = file_1.a



    John Nagle
--
http://mail.python.org/mailman/listinfo/python-list


Re: why () is () and [] is [] work in other way?

2012-04-22 Thread John Nagle

On 4/20/2012 9:34 PM, john.tant...@gmail.com wrote:

On Friday, April 20, 2012 12:34:46 PM UTC-7, Rotwang wrote:


I believe it says somewhere in the Python docs that it's undefined and
implementation-dependent whether two identical expressions have the same
identity when the result of each is immutable


   Bad design.  Where "is" is ill-defined, it should raise ValueError.

A worse example, one which is very implementation-dependent:

http://stackoverflow.com/questions/306313/python-is-operator-behaves-unexpectedly-with-integers

>>> a = 256
>>> b = 256
>>> a is b
True   # this is an expected result
>>> a = 257
>>> b = 257
>>> a is b
False

Operator "is" should be be an error between immutables
unless one is a built-in constant.  ("True" and "False"
should be made hard constants, like "None". You can't assign
to None, but you can assign to True, usually with
unwanted results.  It's not clear why True and False
weren't locked down when None was.)
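The portable takeaway: compare integer values with "==", because identity of equal immutables is implementation-defined. A quick sketch (the int-from-string construction just forces a second object to exist):

```python
x = 257
y = int("257")          # build a second object with the same value
assert x == y           # value equality always holds
# "is" means exactly id-equality -- no more, no less:
assert (x is y) == (id(x) == id(y))
```

Whether `x is y` is True here depends on the interpreter's caching, which is exactly why code shouldn't rely on it.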

John Nagle

--
http://mail.python.org/mailman/listinfo/python-list


Re: Deep merge two dicts?

2012-04-12 Thread John Nagle

On 4/12/2012 10:41 AM, Roy Smith wrote:

Is there a simple way to deep merge two dicts?  I'm looking for Perl's
Hash::Merge (http://search.cpan.org/~dmuey/Hash-Merge-0.12/Merge.pm)
in Python.


def dmerge(a, b) :
    """Recursively merge dict b into dict a, in place."""
    for k in b :
        if k in a and isinstance(a[k], dict) and isinstance(b[k], dict) :
            dmerge(a[k], b[k])      # merge nested dicts in place
        else :
            a[k] = b[k]             # otherwise b's value replaces a's



--
http://mail.python.org/mailman/listinfo/python-list


Re: python module development workflow

2012-04-12 Thread John Nagle

On 4/11/2012 1:04 PM, Miki Tebeka wrote:

Could any expert suggest an authoritative and complete guide for
developing python modules? Thanks!

I'd start with http://docs.python.org/distutils/index.html


Make sure that

python setup.py build
python setup.py install

works.

Don't use the "rotten egg" distribution system.
(http://packages.python.org/distribute/easy_install.html)

    John Nagle

--
http://mail.python.org/mailman/listinfo/python-list


Re: Donald E. Knuth in Python, cont'd

2012-04-11 Thread John Nagle

On 4/11/2012 6:03 AM, Antti J Ylikoski wrote:


I wrote about a straightforward way to program D. E. Knuth in Python,
and received an excellent communcation about programming Deterministic
Finite Automata (Finite State Machines) in Python.

The following stems from my Knuth in Python programming exercises,
according to that very good communication. (By Roy Smith.)

I'm in the process of delving carefully into Knuth's brilliant and
voluminous work The Art of Computer Programming, Parts 1--3 plus the
Fascicles in Part 4 -- the back cover of Part 1 reads:

"If you think you're a really good programmer -- read [Knuth's] Art of
Computer Programming... You should definitely send me a résumé if you
can read the whole thing." -- Bill Gates.

(Microsoft may in the future receive some e-mail from me.)


You don't need those books as much as you used to.
You don't have to write collections, hash tables, and sorts much
any more.  Those are solved problems and there are good libraries.
Most of the basics are built into Python.

Serious programmers should read those books, much as they should
read von Neumann's "First Draft of a Report on the EDVAC", for
background on how things work down at the bottom.  But they're
no longer essential desk references for most programmers.

John Nagle
--
http://mail.python.org/mailman/listinfo/python-list


Re: Python Gotcha's?

2012-04-08 Thread John Nagle

On 4/8/2012 10:55 AM, Miki Tebeka wrote:

9.  Opening a URL can result in an unexpected prompt on
standard input if the URL has authentication.  This can
stall servers.

Can you give an example? I don't think anything in the standard library does 
that.


   It's in "urllib".  See

http://docs.python.org/library/urllib.html

"When performing basic authentication, a FancyURLopener instance calls 
its prompt_user_passwd() method. The default implementation asks the 
users for the required information on the controlling terminal. A 
subclass may override this method to support more appropriate behavior 
if needed."


A related "gotcha" is knowing that "urllib" sucks and you should use
"urllib2".
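
For modern code, the prompt can be avoided entirely by supplying credentials programmatically, so nothing ever reads from the controlling terminal. A sketch using the Python 3 urllib.request API; the URL and credentials here are placeholders:

```python
import urllib.request

# Placeholder URL and credentials -- substitute your own.
password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None, "http://example.com/", "user", "secret")

# An opener built this way authenticates without touching stdin,
# so a server process can never stall waiting for a password.
opener = urllib.request.build_opener(
    urllib.request.HTTPBasicAuthHandler(password_mgr))
```
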

John Nagle
--
http://mail.python.org/mailman/listinfo/python-list


Re: Python Gotcha's?

2012-04-07 Thread John Nagle

On 4/4/2012 3:34 PM, Miki Tebeka wrote:

Greetings,

I'm going to give a "Python Gotcha's" talk at work.
If you have an interesting/common "Gotcha" (warts/dark corners ...) please 
share.

(Note that I went over http://wiki.python.org/moin/PythonWarts already).

Thanks,
--
Miki


A few Python "gotchas":

1.  Nobody is really in charge of third party packages.  In the
Perl world, there's a central repository, CPAN, and quality
control.  Python's "pypi" is just a collection of links.  Many
major packages are maintained by one person, and if they lose
interest, the package dies.

2.  C extensions are closely tied to the exact version of CPython
you're using, and finding a properly built version may be difficult.

3.  "eggs".  The "distutils" system has certain assumptions built into
it about where things go, and tends to fail in obscure ways.  There's
no uniform way to distribute a package.

4.  The syntax for expression-IF is just weird.

5.  "+" as concatenation.  This leads to strange numerical
semantics, such as (1,2) + (3,4) is (1,2,3,4).  But, for
"numarray" arrays, "+" does addition.  What does a mixed
mode expression of a numarray and a tuple do?  Guess.

6.  It's really hard to tell what's messing with the
attributes of a class, since anything can store into
anything.  This creates debugging problems.

7.  Multiple inheritance is a mess.  Especially "super".

8.  Using attributes as dictionaries can backfire.  The
syntax of attributes is limited.  So turning XML or HTML
structures into Python objects creates problems.

9.  Opening a URL can result in an unexpected prompt on
standard input if the URL has authentication.  This can
stall servers.

10. Some libraries aren't thread-safe.  Guess which ones.

11. Python 3 isn't upward compatible with Python 2.
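
The mixed-mode surprise in the "+" item can be shown concretely. A sketch assuming NumPy as the modern stand-in for numarray:

```python
import numpy as np  # assumed here as numarray's modern successor

print((1, 2) + (3, 4))                       # (1, 2, 3, 4) -- concatenation
print(np.array([1, 2]) + np.array([3, 4]))   # [4 6] -- elementwise addition
# Mixed mode: the tuple is silently coerced to an array and ADDED,
# not concatenated -- the "guess" the post alludes to.
print(np.array([1, 2]) + (3, 4))             # [4 6]
```
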

John Nagle


--
http://mail.python.org/mailman/listinfo/python-list


Re: getaddrinfo NXDOMAIN exploit - please test on CentOS 6 64-bit

2012-04-04 Thread John Nagle

On 4/2/2012 6:53 PM, John Nagle wrote:

On 4/1/2012 1:41 PM, John Nagle wrote:

On 4/1/2012 9:26 AM, Michael Torrie wrote:

On 03/31/2012 04:58 PM, John Nagle wrote:



Removed all "search" and "domain" entries from /etc/resolv.conf


It's a design bug in glibc. I just submitted a bug report.

http://sourceware.org/bugzilla/show_bug.cgi?id=13935


  The same bug is in "dnspython". Submitted a bug report there,
too.

   https://github.com/rthalley/dnspython/issues/6

    John Nagle
--
http://mail.python.org/mailman/listinfo/python-list


Re: Best way to structure data for efficient searching

2012-04-03 Thread John Nagle

On 3/28/2012 11:39 AM, larry.mart...@gmail.com wrote:

I have the following use case:

I have a set of data that contains 3 fields, K1, K2 and a
timestamp. There are duplicates in the data set, and they all have to
be processed.

Then I have another set of data with 4 fields: K3, K4, K5, and a
timestamp. There are also duplicates in that data set, and they also
all have to be processed.

I need to find all the items in the second data set where K1==K3 and
K2==K4 and the 2 timestamps are within 20 seconds of each other.

I have this working, but the way I did it seems very inefficient - I
simply put the data in 2 arrays (as tuples) and then walked through
the entire second data set once for each item in the first data set,
looking for matches.

Is there a better, more efficient way I could have done this?


   How big are the data sets?  Millions of entries?  Billions?
Trillions?  Will all the data fit in memory, or will this need
files or a database.

   In-memory, it's not hard.  First, decide which data set is smaller.
That one gets a dictionary keyed by K1 or K3, with each entry being
a list of tuples.  Then go through the other data set linearly.

   You can also sort one database by K1, the other by K3, and
match.  Then take the matches, sort by K2 and K4, and match again.
Sort the remaining matches by timestamp and pull the ones within
the threshold.

   Or you can load all the data into a database with a query
optimizer, like MySQL, and let it figure out, based on the
index sizes, how to do the join.

   All of these approaches are roughly O(N log N), which
beats the O(N^2) approach you have now.
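
The in-memory dictionary approach can be sketched as follows (field layout assumed from the problem statement: (K1, K2, t) tuples in the first set, (K3, K4, K5, t) in the second):

```python
from collections import defaultdict

def match_within(set1, set2, window=20):
    # Index the first data set by its key pair, then scan the second
    # set linearly -- dict lookups replace the O(N*M) double loop.
    index = defaultdict(list)
    for k1, k2, t in set1:
        index[(k1, k2)].append(t)
    matches = []
    for k3, k4, k5, t in set2:
        for t1 in index.get((k3, k4), ()):
            if abs(t - t1) <= window:
                matches.append((k3, k4, k5, t, t1))
    return matches
```

For example, match_within([("a", "b", 100)], [("a", "b", "x", 110)]) pairs the two records because the keys match and the timestamps differ by 10 seconds.
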

    John Nagle
--
http://mail.python.org/mailman/listinfo/python-list


Re: getaddrinfo NXDOMAIN exploit - please test on CentOS 6 64-bit

2012-04-02 Thread John Nagle

On 4/1/2012 1:41 PM, John Nagle wrote:

On 4/1/2012 9:26 AM, Michael Torrie wrote:

On 03/31/2012 04:58 PM, John Nagle wrote:



Removed all "search" and "domain" entries from /etc/resolv.conf


It's a design bug in glibc. I just submitted a bug report.

http://sourceware.org/bugzilla/show_bug.cgi?id=13935

It only appears if you have a machine with a two-component domain
name ending in ".com" as the actual machine name. Most hosting
services generate some long arbitrary name as the primary name,
but I happen to have a server set up as "companyname.com".

The default rule for looking up domains in glibc is that the
"domain" is everything after the FIRST ".". Failed lookups
are retried with that "domain" appended. The idea, back
in the 1980s, was that if you're on "foo.bigcompany.com",
and look up "bar", it's looked up as "bar.bigcompany.com".
This idea backfires when the actual hostname only
has two components, and the search just appends ".com".

There is a "com.com" domain, and this gets them traffic.
They exploit this to send you (where else) to an ad-heavy page.
Try "python.com.com", for example, and you'll get an ad for a
Java database.

The workaround in Python is to add the AI_CANONNAME flag
to getaddrinfo calls, then check that the returned domain
name matches the one put in.


   That workaround won't work for some domains.  For example,

>>> socket.getaddrinfo(s,"http",0,0,socket.SOL_TCP,socket.AI_CANONNAME)
[(2, 1, 6, 'orig-10005.themarker.cotcdn.net', ('208.93.137.80', 80))]

   Nor will addiing options to /etc/resolv.conf work well, because
that file is overwritten by some system administration programs.

   I may have to bring in "dnspython" to get a reliable DNS lookup.

John Nagle
--
http://mail.python.org/mailman/listinfo/python-list


Re: Will MySQL ever be supported for Python 3.x?

2012-04-01 Thread John Nagle

On 3/31/2012 10:54 PM, Tim Roberts wrote:

John Nagle  wrote:


On 3/30/2012 2:32 PM, Irmen de Jong wrote:

Try Oursql instead  http://packages.python.org/oursql/
"oursql is a new set of MySQL bindings for python 2.4+, including python 3.x"


Not even close to being compatible with existing code.   Every SQL
statement has to be rewritten, with the parameters expressed
differently.  It's a good approach, but very incompatible.


Those changes can be automated, given an adequate editor.  "Oursql" is a
far better product than the primitive MySQLdb wrapper.  It is worth the
trouble.


It's an interesting approach.  As it matures, and a few big sites
use it, it will become worth looking at.

The emphasis on server-side buffering seems strange.  Are there
benchmarks indicating this is worth doing?  Does it keep transactions
locked longer?  This bug report

https://answers.launchpad.net/oursql/+question/191256

indicates a performance problem.  I'd expect server side buffering
to slow things down.  Usually, you want to drain results out of
the server as fast as possible, then close out the command,
releasing server resources and locks.

    John Nagle
--
http://mail.python.org/mailman/listinfo/python-list


Re: [OT] getaddrinfo NXDOMAIN exploit - please test on CentOS 6 64-bit

2012-04-01 Thread John Nagle

On 4/1/2012 9:26 AM, Michael Torrie wrote:

On 03/31/2012 04:58 PM, John Nagle wrote:

If you can make this happen, report back the CentOS version and
the library version, please.


CentOS release 6.2 (Final)
glibc-2.12-1.47.el6_2.9.x86_64

example does not ping
example.com does not resolve to example.com.com

Removed all "search" and "domain" entries from /etc/resolv.conf


It's a design bug in glibc. I just submitted a bug report.

  http://sourceware.org/bugzilla/show_bug.cgi?id=13935

It only appears if you have a machine with a two-component domain
name ending in ".com" as the actual machine name.  Most hosting
services generate some long arbitrary name as the primary name,
but I happen to have a server set up as "companyname.com".

The default rule for looking up domains in glibc is that the
"domain" is everything after the FIRST ".".  Failed lookups
are retried with that "domain" appended.  The idea, back
in the 1980s, was that if you're on "foo.bigcompany.com",
and look up "bar", it's looked up as "bar.bigcompany.com".
This idea backfires when the actual hostname only
has two components, and the search just appends ".com".

There is a "com.com" domain, and this gets them traffic.
They exploit this to send you (where else) to an ad-heavy page.
Try "python.com.com", for example, and you'll get an ad for a
Java database.

The workaround in Python is to add the AI_CANONNAME flag
to getaddrinfo calls, then check that the returned domain
name matches the one put in.

Good case:
>>> s = "python.org"
>>> socket.getaddrinfo(s, 80, 0,0, 0, socket.AI_CANONNAME)
[(2, 1, 6, 'python.org', ('82.94.164.162', 80)), (2, 2, 17, '', 
('82.94.164.162', 80)), (2, 3, 0, '', ('82.94.164.162', 80)), (10, 1, 6, 
'', ('2001:888:2000:d::a2', 80, 0, 0)), (10, 2, 17, '', 
('2001:888:2000:d::a2', 80, 0, 0)), (10, 3, 0, '', 
('2001:888:2000:d::a2', 80, 0, 0))]


Bad case:
>>> s = "noexample.com"
>>> socket.getaddrinfo(s, 80, 0,0, 0, socket.AI_CANONNAME)
[(2, 1, 6, 'phx1-ss-2-lb.cnet.com', ('64.30.224.112', 80)), (2, 2, 17, 
'', ('64.30.224.112', 80)), (2, 3, 0, '', ('64.30.224.112', 80))]


Note that what went in isn't what came back.  getaddrinfo has
been pwned.

Again, you only get this if you're on a machine whose primary host
name is "something.com", with exactly two components ending in ".com".
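
The AI_CANONNAME workaround can be sketched as below. It is a heuristic only: as noted elsewhere in the thread, legitimate CNAMEs (CDN-hosted domains, for instance) also fail the check:

```python
import socket

def resolve_strict(host, port=80):
    # Request the canonical name and reject answers whose canonical
    # name is unrelated to the name we asked for -- a rough guard
    # against wildcard ".com" search-path hijacking.
    infos = socket.getaddrinfo(host, port, 0, 0, 0, socket.AI_CANONNAME)
    canon = infos[0][3]
    if canon and canon != host and not canon.endswith("." + host):
        raise OSError("suspicious canonical name %r for %r" % (canon, host))
    return infos
```
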


John Nagle

--
http://mail.python.org/mailman/listinfo/python-list


Re: getaddrinfo NXDOMAIN exploit - please test on CentOS 6 64-bit

2012-03-31 Thread John Nagle

On 3/31/2012 9:26 PM, Owen Jacobson wrote:

On 2012-03-31 22:58:45 +, John Nagle said:


Some versions of CentOS 6 seem to have a potential
getaddrinfo exploit. See

To test, try this from a command line:

ping example

If it fails, good. If it returns pings from "example.com", bad.
The getaddrinfo code is adding ".com" to the domain.


There is insufficient information in your diagnosis to make that
conclusion. For example: what network configuration services (DHCP
clients and whatnot, along with various desktop-mode configuration tools
and services) are running? What kernel and libc versions are you
running? What are the contents of /etc/nsswitch.conf? Of
/etc/resolv.conf (particularly, the 'search' entries)? What do
/etc/hosts, LDAP, NIS+, or other hostname services say about the names
you're resolving? Does a freestanding C program that directly calls
getaddrinfo and that runs in a known-good loader environment exhibit the
same surprises? Name resolution is not so simple that you can conclude
"getaddrinfo is misbehaving" from the behaviour of ping, or of your
Python sample, alone.

In any case, this seems more appropriate for a Linux or a CentOS
newsgroup/mailing list than a Python one. Please do not reply to this
post in comp.lang.python.

-o


   I expected that some noob would have a reply like that.

   A more detailed discussion appears here:

http://serverfault.com/questions/341383/possible-nxdomain-hijacking

    John Nagle
--
http://mail.python.org/mailman/listinfo/python-list


getaddrinfo NXDOMAIN exploit - please test on CentOS 6 64-bit

2012-03-31 Thread John Nagle

   Some versions of CentOS 6 seem to have a potential
getaddrinfo exploit.  See

To test, try this from a command line:

ping example

If it fails, good.  If it returns pings from "example.com", bad.
The getaddrinfo code is adding ".com" to the domain.

If that returns pings, please try

ping noexample.com

There is no "noexample.com" domain in DNS.  This should time out.
But if you get ping replies from a CNET site, let me know.
Some implementations try "noexample.com", get a NXDOMAIN error,
and try again, adding ".com".  This results in a ping of
"noexample.com.com".  "com.com" is a real domain, run by a
unit of CBS, and they have their DNS set up to catch all
subdomains and divert them to, inevitably, an ad-oriented
junk search page.  (You can view the junk page at
"http://slimeball.com.com".  Replace "slimeball" with anything
else you like; it will still resolve.)

If you find a case where "ping noexample.com" returns a reply,
then try it in Python:


import socket
socket.getaddrinfo("noexample.com", 80)

That should return an error.  If it returns the IP address of
CNET's ad server, there's trouble.

This isn't a problem with the upstream DNS.  Usually, this sort
of thing means you're using some sleazy upstream DNS provider
like Comcast.  That's not the case here.  "host" and "nslookup"
aren't confused.  Only programs that use getaddrinfo, like "ping",
"wget", and Python, have this ".com" appending thing.  Incidentally,
if you try "noexample.net", there's no problem, because the
owner of "net.com" hasn't set up their DNS to exploit this.

And, of course, it has nothing to do with browser toolbars.  This
is at a much lower level.

If you can make this happen, report back the CentOS version and
the library version, please.

John Nagle
--
http://mail.python.org/mailman/listinfo/python-list


Re: Will MySQL ever be supported for Python 3.x?

2012-03-30 Thread John Nagle

On 3/30/2012 2:32 PM, Irmen de Jong wrote:

Try Oursql instead  http://packages.python.org/oursql/
"oursql is a new set of MySQL bindings for python 2.4+, including python 3.x"


   Not even close to being compatible with existing code.   Every SQL
statement has to be rewritten, with the parameters expressed
differently.  It's a good approach, but very incompatible.

    John Nagle

--
http://mail.python.org/mailman/listinfo/python-list


Will MySQL ever be supported for Python 3.x?

2012-03-30 Thread John Nagle

The MySQLdb entry on SourceForge
(http://sourceforge.net/projects/mysql-python/)
web site still says the last supported version of Python is 2.6.
PyPi says the last supported version is Python 2.5.  The
last download is from 2007.

I realize there are unsupported fourth-party versions from other
sources. (http://www.lfd.uci.edu/~gohlke/pythonlibs/) But those
are just blind builds; they haven't been debugged.

MySQL Connector (http://forge.mysql.com/projects/project.php?id=302)
is still pre-alpha.

    John Nagle

--
http://mail.python.org/mailman/listinfo/python-list


  1   2   3   4   5   6   7   8   9   10   >