Re: In-place memory manager, mmap (was: Fastest way to store ints and floats on disk)

2008-08-24 Thread Kris Kennaway

castironpi wrote:

Hi,

I've got an in-place memory manager that uses a disk-backed memory-
mapped buffer.  Among its possibilities are: storing variable-length
strings and structures for persistence and interprocess communication
with mmap.

It allocates segments of a generic buffer by length and returns an
offset to the reserved block, which can then be used with struct to
pack values to store.  The data structure is adapted from the GNU PAVL
binary tree.

Allocated blocks can be cast to ctypes.Structure instances using some
monkey patching, which is optional.
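
To make the idea concrete, here is a rough sketch of the kind of usage being
described (not the actual API on offer; the toy alloc() and the record layout
are made-up placeholders):

import mmap
import os
import struct

REC = struct.Struct('if')          # example record: an int and a float

fd = os.open('backing.dat', os.O_RDWR | os.O_CREAT)
os.ftruncate(fd, 1024 * 1024)      # 1 MB disk-backed buffer
buf = mmap.mmap(fd, 1024 * 1024)

def alloc(size, _state={'next': 0}):
    # toy bump allocator standing in for the real free-list/PAVL code
    off = _state['next']
    _state['next'] += size
    return off

off = alloc(REC.size)
buf[off:off + REC.size] = REC.pack(42, 3.14)    # store
print REC.unpack(buf[off:off + REC.size])       # load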

Want to open-source it.  Any interest?


Just do it.  That way users can come along later.

Kris
--
http://mail.python.org/mailman/listinfo/python-list


Re: In-place memory manager, mmap

2008-08-24 Thread Kris Kennaway

castironpi wrote:

On Aug 24, 9:52 am, Kris Kennaway [EMAIL PROTECTED] wrote:

castironpi wrote:

Hi,
I've got an in-place memory manager that uses a disk-backed memory-
mapped buffer.  Among its possibilities are: storing variable-length
strings and structures for persistence and interprocess communication
with mmap.
It allocates segments of a generic buffer by length and returns an
offset to the reserved block, which can then be used with struct to
pack values to store.  The data structure is adapted from the GNU PAVL
binary tree.
Allocated blocks can be cast to ctypes.Structure instances using some
monkey patching, which is optional.
Want to open-source it.  Any interest?

Just do it.  That way users can come along later.

Kris


How?  My website?  Google Code?  Too small for source forge, I think.
--
http://mail.python.org/mailman/listinfo/python-list




Any of those 3 would work fine, but the last two are probably better 
(sourceforge hosts plenty of tiny projects) if you don't want to have to 
manage your server and related infrastructure yourself.


Kris
--
http://mail.python.org/mailman/listinfo/python-list


Re: benchmark

2008-08-11 Thread Kris Kennaway

Peter Otten wrote:

[EMAIL PROTECTED] wrote:


On Aug 10, 10:10 pm, Kris Kennaway [EMAIL PROTECTED] wrote:

jlist wrote:

I think what makes more sense is to compare the code one most
typically writes. In my case, I always use range() and never use psyco.
But I guess for most of my work with Python performance hasn't been
an issue. I haven't got to write any large systems with Python yet,
where performance starts to matter.

Hopefully when you do you will improve your programming practices to not
make poor choices - there are few excuses for not using xrange ;)

Kris

And can you shed some light on how that relates with one of the zens
of python ?

There should be one-- and preferably only one --obvious way to do it.


For the record, the impact of range() versus xrange() is negligible -- on my
machine the xrange() variant even runs a tad slower. So it's not clear
whether Kris actually knows what he's doing.


You are only thinking in terms of execution speed.  Now think about 
memory use.  Using iterators instead of constructing lists is something 
that needs to permeate your thinking about python or you will forever be 
writing code that wastes memory, sometimes to a large extent.
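
A rough illustration (numbers vary by platform and Python version;
sys.getsizeof needs 2.6+):

import sys

n = 10 * 1000 * 1000

total = 0
for i in xrange(n):        # lazy: only one integer exists at a time
    total += i

big = range(n)             # eager: a ~10-million-element list, all at once
print sys.getsizeof(big)   # the list object alone is tens of MB,
                           # plus the int objects it points to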


Kris
--
http://mail.python.org/mailman/listinfo/python-list


Re: SSH utility

2008-08-11 Thread Kris Kennaway

James Brady wrote:

Hi all,
I'm looking for a python library that lets me execute shell commands
on remote machines.

I've tried a few SSH utilities so far: paramiko, PySSH and pssh;
unfortunately all been unreliable, and repeated questions on their
respective mailing lists haven't been answered...

It seems like the sort of commodity task that there should be a pretty
robust library for. Are there any suggestions for alternative
libraries or approaches?


Personally I just Popen ssh directly.  Things like paramiko make me 
concerned; getting the SSH protocol right is tricky and not something I 
want to trust to projects that have not had significant experience and 
auditing.
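
For what it's worth, the pattern is roughly this (just a sketch; the host,
command and ssh options are examples, and key-based auth is assumed to be
set up already):

import subprocess

def remote_run(host, command):
    p = subprocess.Popen(
        ['ssh', '-o', 'BatchMode=yes', host, command],
        stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = p.communicate()
    return p.returncode, out, err

rc, out, err = remote_run('build1.example.com', 'uname -a')
if rc != 0:
    print 'remote command failed:', err
else:
    print out,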


Kris
--
http://mail.python.org/mailman/listinfo/python-list


Re: benchmark

2008-08-11 Thread Kris Kennaway

Peter Otten wrote:

Kris Kennaway wrote:


Peter Otten wrote:

[EMAIL PROTECTED] wrote:


On Aug 10, 10:10 pm, Kris Kennaway [EMAIL PROTECTED] wrote:

jlist wrote:

I think what makes more sense is to compare the code one most
typically writes. In my case, I always use range() and never use
psyco. But I guess for most of my work with Python performance hasn't
been an issue. I haven't got to write any large systems with Python
yet, where performance starts to matter.

Hopefully when you do you will improve your programming practices to
not make poor choices - there are few excuses for not using xrange ;)

Kris

And can you shed some light on how that relates with one of the zens
of python ?

There should be one-- and preferably only one --obvious way to do it.

For the record, the impact of range() versus xrange() is negligible -- on
my machine the xrange() variant even runs a tad slower. So it's not clear
whether Kris actually knows what he's doing.
You are only thinking in terms of execution speed.  


Yes, because my remark was made in the context of the particular benchmark
supposed to be the topic of this thread.


No, you may notice that the above text has moved off onto another 
discussion.


Kris
--
http://mail.python.org/mailman/listinfo/python-list


Re: benchmark

2008-08-10 Thread Kris Kennaway

Angel Gutierrez wrote:

Steven D'Aprano wrote:


On Thu, 07 Aug 2008 00:44:14 -0700, alex23 wrote:


Steven D'Aprano wrote:

In other words, about 20% of the time he measures is the time taken to
print junk to the screen.

Which makes his claim that all the console outputs have been removed so
that the benchmarking activity is not interfered with by the IO
overheads somewhat confusing...he didn't notice the output? Wrote it
off as a weird Python side-effect?

Wait... I've just remembered, and a quick test confirms... Python only
prints bare objects if you are running in an interactive shell. Otherwise
output of bare objects is suppressed unless you explicitly call print.

Okay, I guess he is forgiven. False alarm, my bad.



Well... there must be something because this is what I got in a normal script
execution:

[EMAIL PROTECTED] test]$ python iter.py
Time per iteration = 357.467989922 microseconds
[EMAIL PROTECTED] test]$ vim iter.py
[EMAIL PROTECTED] test]$ python iter2.py
Time per iteration = 320.306909084 microseconds
[EMAIL PROTECTED] test]$ vim iter2.py
[EMAIL PROTECTED] test]$ python iter2.py
Time per iteration = 312.917997837 microseconds


What is the standard deviation on those numbers?  What is the confidence 
level that they are distinct?  In a thread complaining about poor 
benchmarking it's disappointing to see crappy test methodology being 
used to try and demonstrate flaws in the test.
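
Something along these lines (just a sketch; the repeat and number counts are
arbitrary) would at least give a mean and standard deviation to compare:

import math
import timeit

def bench(stmt, setup='pass', repeat=10, number=1000):
    # run the statement several times and report per-call mean and stddev
    t = timeit.Timer(stmt, setup)
    per_call = [x / number for x in t.repeat(repeat, number)]
    mean = sum(per_call) / len(per_call)
    var = sum((x - mean) ** 2 for x in per_call) / (len(per_call) - 1)
    return mean, math.sqrt(var)

mean, sd = bench('sum(xrange(1000))')
print 'Time per iteration = %.3f +/- %.3f microseconds' % (mean * 1e6, sd * 1e6)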


Kris

--
http://mail.python.org/mailman/listinfo/python-list


Re: benchmark

2008-08-10 Thread Kris Kennaway

jlist wrote:

I think what makes more sense is to compare the code one most
typically writes. In my case, I always use range() and never use psyco.
But I guess for most of my work with Python performance hasn't been
an issue. I haven't got to write any large systems with Python yet, where
performance starts to matter.


Hopefully when you do you will improve your programming practices to not 
make poor choices - there are few excuses for not using xrange ;)


Kris
--
http://mail.python.org/mailman/listinfo/python-list


Re: Constructing MIME message without loading message stream

2008-08-10 Thread Kris Kennaway

Diez B. Roggisch wrote:

Kris Kennaway schrieb:
I would like to MIME encode a message from a large file without first 
loading the file into memory.  Assume the file has been pre-encoded on 
disk (actually I am using encode_7or8bit, so the encoding should be 
null).  Is there a way to construct the flattened MIME message such 
that data is streamed from the file as needed instead of being 
resident in memory?  Do I have to subclass the MIMEBase class myself?


I don't know what you are after here - but I *do* know that anything 
above 10MB or so is most probably not transferable using mail, as MTAs 
impose limits on message-sizes. Or in other words: usually, whatever you 
want to encode should fit in memory as the network is limiting you.


MIME encoding is used for other things than emails.

Kris
--
http://mail.python.org/mailman/listinfo/python-list


Constructing MIME message without loading message stream

2008-08-09 Thread Kris Kennaway
I would like to MIME encode a message from a large file without first 
loading the file into memory.  Assume the file has been pre-encoded on 
disk (actually I am using encode_7or8bit, so the encoding should be 
null).  Is there a way to construct the flattened MIME message such that 
data is streamed from the file as needed instead of being resident in 
memory?  Do I have to subclass the MIMEBase class myself?
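
One possible approach (only a sketch, assuming a single-part message whose
body is already encoded on disk; stream_mime() and the chunk size are
illustrative, not an email-package streaming API) is to flatten just the
headers and copy the body across in chunks:

import shutil
from email import encoders
from email.mime.base import MIMEBase

def stream_mime(path, out, chunk_size=64 * 1024):
    # build and write the headers only; the real body is streamed below
    msg = MIMEBase('application', 'octet-stream')
    msg.set_payload('')
    encoders.encode_7or8bit(msg)           # sets Content-Transfer-Encoding
    msg.add_header('Content-Disposition', 'attachment', filename=path)
    out.write(msg.as_string())             # headers + blank separator line

    # copy the pre-encoded body from disk without loading it all at once
    fh = open(path, 'rb')
    try:
        shutil.copyfileobj(fh, out, chunk_size)
    finally:
        fh.close()

# e.g. stream_mime('bigfile.bin', open('message.out', 'wb'))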


Kris
--
http://mail.python.org/mailman/listinfo/python-list


Re: variable expansion with sqlite

2008-08-08 Thread Kris Kennaway

marc wyburn wrote:

Hi and thanks,

I was hoping to avoid having to weld qmarks together but I guess
that's why people use things like SQL alchemy instead.  It's a good
lesson anyway.


The '?' substitution is there to safely handle untrusted input.  You 
*don't* want to pass arbitrary user data into random parts of an SQL 
statement (or your database will get 0wned).  I think of it as a 
reminder that when you have to construct your own query template by 
using '... %s ...' % (foo,) to bypass this limitation, you had 
better be darn sure the parameters you are passing in are safe.
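
A quick illustration (just a sketch; the table, column names and the hostile
string are made up):

import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE users (name TEXT, age INTEGER)')

name = "Robert'); DROP TABLE users; --"   # hostile input

# Safe: the driver quotes the value, it can never become SQL syntax.
conn.execute('INSERT INTO users (name, age) VALUES (?, ?)', (name, 42))

# Dangerous: only acceptable for trusted, validated fragments such as a
# column name chosen from a whitelist -- never for raw user data.
column = 'age'
assert column in ('name', 'age')
conn.execute('SELECT %s FROM users' % column)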


Kris

--
http://mail.python.org/mailman/listinfo/python-list


Re: pyprocessing/multiprocessing for x64?

2008-08-07 Thread Kris Kennaway

Benjamin Kaplan wrote:

The only problem I can see is that 32-bit programs can't access 64-bit 
dlls, so the OP might have to install the 32-bit version of Python for 
it to work.


Anyway, all of this is beside the point, because the multiprocessing 
module works fine on amd64 systems.


Kris
--
http://mail.python.org/mailman/listinfo/python-list


Re: re.search much slower then grep on some regular expressions

2008-07-10 Thread Kris Kennaway

John Machin wrote:


Uh-huh ... try this, then:

http://hkn.eecs.berkeley.edu/~dyoo/python/ahocorasick/

You could use this to find the Str cases and the prefixes of the
re cases (which seem to be no more complicated than 'foo.*bar.*zot')
and use something slower like Python's re to search the remainder of
the line for 'bar.*zot'.


If it was just strings, then sure...with regexps it might be possible to 
make it work, but it doesn't sound particularly maintainable.  I will 
stick with my shell script until python gets a regexp engine of 
equivalent performance.


Kris
--
http://mail.python.org/mailman/listinfo/python-list


Re: re.search much slower then grep on some regular expressions

2008-07-10 Thread Kris Kennaway

J. Cliff Dyer wrote:

On Wed, 2008-07-09 at 12:29 -0700, samwyse wrote:

On Jul 8, 11:01 am, Kris Kennaway [EMAIL PROTECTED] wrote:

samwyse wrote:

You might want to look at Plex.
http://www.cosc.canterbury.ac.nz/greg.ewing/python/Plex/
Another advantage of Plex is that it compiles all of the regular
expressions into a single DFA. Once that's done, the input can be
processed in a time proportional to the number of characters to be
scanned, and independent of the number or complexity of the regular
expressions. Python's existing regular expression matchers do not have
this property. 

Hmm, unfortunately it's still orders of magnitude slower than grep in my
own application that involves matching lots of strings and regexps
against large files (I killed it after 400 seconds, compared to 1.5 for
grep), and that's leaving aside the much longer compilation time (over a
minute).  If the matching was fast then I could possibly pickle the
lexer though (but it's not).

That's funny, the compilation is almost instantaneous for me.
However, I just tested it on several files, the first containing
4875*'a', the rest each twice the size of the previous.  And you're
right, for each doubling of the file size, the match takes four times
as long, meaning O(n^2).  156000*'a' would probably take 8 hours.
Here are my results:

compile_lexicon() took 0.0236021580595 secs
test('file-0.txt') took 24.8322969831 secs
test('file-1.txt') took 99.3956799681 secs
test('file-2.txt') took 398.349623132 secs


Sounds like a good strategy would be to find the smallest chunk of the
file that matches can't cross, and iterate your search on units of those
chunks.  For example, if none of your regexes cross line boundaries,
search each line of the file individually.  That may help turn around
the speed degradation you're seeing.


That's what I'm doing.  I've also tried various other things like 
mmapping the file and searching it at once, etc, but almost all of the 
time is spent in the regexp engine so optimizing other things only gives 
marginal improvement.


Kris
--
http://mail.python.org/mailman/listinfo/python-list


Re: multithreading in python ???

2008-07-10 Thread Kris Kennaway

Laszlo Nagy wrote:

Abhishek Asthana wrote:


Hi all ,

I  have large set of data computation and I want to break it into 
small batches and assign it to different threads .I am implementing it 
in python only. Kindly help what all libraries should I refer to 
implement the multithreading in python.


You should not do this. Python can handle multiple threads but they 
always use the same processor. (at least in CPython.) In order to take 
advantage of multiple processors, use different processes.


Only partly true.  Threads executing in the python interpreter are 
serialized and only run on a single CPU at a time.  Depending on what 
modules you use they may be able to operate independently on multiple 
CPUs.  The term to research is GIL (Global Interpreter Lock).  There 
are many webpages discussing it, and the alternative strategies you can use.
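
As a starting point, something like this (a minimal sketch; the batch size
and the crunch() function are placeholders) farms CPU-bound work out to one
process per CPU using multiprocessing (or its Python 2.5-era backport,
pyprocessing/processing):

from multiprocessing import Pool

def crunch(batch):
    # stand-in for the real per-batch computation
    return sum(x * x for x in batch)

if __name__ == '__main__':
    data = range(1000000)
    batches = [data[i:i + 10000] for i in xrange(0, len(data), 10000)]
    pool = Pool()                      # defaults to one worker per CPU
    results = pool.map(crunch, batches)
    pool.close()
    pool.join()
    print sum(results)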


Kris
--
http://mail.python.org/mailman/listinfo/python-list


Re: re.search much slower then grep on some regular expressions

2008-07-09 Thread Kris Kennaway

John Machin wrote:


Hmm, unfortunately it's still orders of magnitude slower than grep in my
own application that involves matching lots of strings and regexps
against large files (I killed it after 400 seconds, compared to 1.5 for
grep), and that's leaving aside the much longer compilation time (over a
minute).  If the matching was fast then I could possibly pickle the
lexer though (but it's not).



Can you give us some examples of the kinds of patterns that you are
using in practice and are slow using Python re?


Trivial stuff like:

  (Str('error in pkg_delete'), ('mtree', 'mtree')),
  (Str('filesystem was touched prior to .make install'), ('mtree', 'mtree')),
  (Str('list of extra files and directories'), ('mtree', 'mtree')),
  (Str('list of files present before this port was installed'), ('mtree', 'mtree')),
  (Str('list of filesystem changes from before and after'), ('mtree', 'mtree')),

  (re('Configuration .* not supported'), ('arch', 'arch')),

  (re('(configure: error:|Script.*configure.*failed unexpectedly|script.*failed: here are the contents of)'),
   ('configure_error', 'configure')),
...

There are about 150 of them and I want to find which is the first match 
in a text file that ranges from a few KB up to 512MB in size.


 How large is large?

What kind of text?


It's compiler/build output.


Instead of grep, you might like to try nrgrep ... google(nrgrep
Navarro Raffinot): PDF paper about it on Citeseer (if it's up),
postscript paper and C source findable from Gonzalo Navarro's home-
page.


Thanks, looks interesting but I don't think it is the best fit here.  I 
would like to avoid spawning hundreds of processes to process each file 
(since I have tens of thousands of them to process).


Kris

--
http://mail.python.org/mailman/listinfo/python-list


Re: re.search much slower then grep on some regular expressions

2008-07-09 Thread Kris Kennaway

Jeroen Ruigrok van der Werven wrote:

-On [20080709 14:08], Kris Kennaway ([EMAIL PROTECTED]) wrote:

It's compiler/build output.


Sounds like the FreeBSD ports build cluster. :)


Yes indeed!


Kris, have you tried a PGO build of Python with your specific usage? I
cannot guarantee it will significantly speed things up though.


I am pretty sure the problem is algorithmic, not bad byte code :)  If it 
was a matter of a few % then that is in the scope of compiler tweaks, 
but we're talking orders of magnitude.


Kris


Also, a while ago I did tests with various GCC compilers and their effect on
Python running time as well as Intel's cc. Intel won on (nearly) all
accounts, meaning it was faster overall.

From the top of my mind: GCC 4.1.x was faster than GCC 4.2.x.



--
http://mail.python.org/mailman/listinfo/python-list


Re: re.search much slower then grep on some regular expressions

2008-07-09 Thread Kris Kennaway

samwyse wrote:

On Jul 8, 11:01 am, Kris Kennaway [EMAIL PROTECTED] wrote:

samwyse wrote:



You might want to look at Plex.
http://www.cosc.canterbury.ac.nz/greg.ewing/python/Plex/
Another advantage of Plex is that it compiles all of the regular
expressions into a single DFA. Once that's done, the input can be
processed in a time proportional to the number of characters to be
scanned, and independent of the number or complexity of the regular
expressions. Python's existing regular expression matchers do not have
this property. 



Hmm, unfortunately it's still orders of magnitude slower than grep in my
own application that involves matching lots of strings and regexps
against large files (I killed it after 400 seconds, compared to 1.5 for
grep), and that's leaving aside the much longer compilation time (over a
minute).  If the matching was fast then I could possibly pickle the
lexer though (but it's not).


That's funny, the compilation is almost instantaneous for me.


My lexicon was quite a bit bigger, containing about 150 strings and regexps.


However, I just tested it on several files, the first containing
4875*'a', the rest each twice the size of the previous.  And you're
right, for each doubling of the file size, the match takes four times
as long, meaning O(n^2).  156000*'a' would probably take 8 hours.
Here are my results:


The docs say it is supposed to be linear in the file size ;-) ;-(

Kris

--
http://mail.python.org/mailman/listinfo/python-list


Re: re.search much slower then grep on some regular expressions

2008-07-08 Thread Kris Kennaway

samwyse wrote:

On Jul 4, 6:43 am, Henning_Thornblad [EMAIL PROTECTED]
wrote:

What can be the cause of the large difference between re.search and
grep?



While doing a simple grep:
grep '[^ =]*/' input  (input contains 156.000 a in
one row)
doesn't even take a second.

Is this a bug in python?


You might want to look at Plex.
http://www.cosc.canterbury.ac.nz/greg.ewing/python/Plex/

Another advantage of Plex is that it compiles all of the regular
expressions into a single DFA. Once that's done, the input can be
processed in a time proportional to the number of characters to be
scanned, and independent of the number or complexity of the regular
expressions. Python's existing regular expression matchers do not have
this property. 


Very interesting!  Thanks very much for the pointer.

Kris

--
http://mail.python.org/mailman/listinfo/python-list


Re: re.search much slower then grep on some regular expressions

2008-07-08 Thread Kris Kennaway

samwyse wrote:

On Jul 4, 6:43 am, Henning_Thornblad [EMAIL PROTECTED]
wrote:

What can be the cause of the large difference between re.search and
grep?



While doing a simple grep:
grep '[^ =]*/' input   (input contains 156.000 "a" characters in
one row)
doesn't even take a second.

Is this a bug in python?


You might want to look at Plex.
http://www.cosc.canterbury.ac.nz/greg.ewing/python/Plex/

Another advantage of Plex is that it compiles all of the regular
expressions into a single DFA. Once that's done, the input can be
processed in a time proportional to the number of characters to be
scanned, and independent of the number or complexity of the regular
expressions. Python's existing regular expression matchers do not have
this property. 

I haven't tested this, but I think it would do what you want:

from Plex import *

lexicon = Lexicon([
    (Rep(AnyBut(' =')) + Str('/'), TEXT),
    (AnyBut('\n'), IGNORE),
])

filename = "my_file.txt"
f = open(filename, "r")
scanner = Scanner(lexicon, f, filename)
while 1:
    token = scanner.read()
    print token
    if token[0] is None:
        break


Hmm, unfortunately it's still orders of magnitude slower than grep in my 
own application that involves matching lots of strings and regexps 
against large files (I killed it after 400 seconds, compared to 1.5 for 
grep), and that's leaving aside the much longer compilation time (over a 
minute).  If the matching was fast then I could possibly pickle the 
lexer though (but it's not).


Kris

--
http://mail.python.org/mailman/listinfo/python-list


Re: re.search much slower then grep on some regular expressions

2008-07-07 Thread Kris Kennaway

Paddy wrote:

On Jul 4, 1:36 pm, Peter Otten [EMAIL PROTECTED] wrote:

Henning_Thornblad wrote:

What can be the cause of the large difference between re.search and
grep?

grep uses a smarter algorithm ;)




This script takes about 5 min to run on my computer:
#!/usr/bin/env python
import re
row=""
for a in range(156000):
    row+="a"
print re.search('[^ =]*/',row)
While doing a simple grep:
grep '[^ =]*/' input   (input contains 156.000 "a" characters in
one row)
doesn't even take a second.
Is this a bug in python?

You could call this a performance bug, but it's not common enough in real
code to get the necessary brain cycles from the core developers.
So you can either write a patch yourself or use a workaround.

re.search('[^ =]*/', row) if '/' in row else None

might be good enough.

Peter


It is not a smarter algorithm that is used in grep. Python RE's have
more capabilities than grep RE's and so need a slower, more complex
algorithm.
You could argue that if the costly RE features are not used then maybe
simpler, faster algorithms should be automatically swapped in, but ...


I can and do :-)

It's a major problem that regular expression parsing in python has 
exponential complexity when polynomial algorithms (for a subset of 
regexp expressions, e.g. excluding back-references) are well-known.


It rules out using python for entire classes of applications where 
regexp parsing is on the critical path.
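
The original example makes the growth easy to measure -- a rough sketch
(absolute times will differ by machine):

import re
import time

# With no '/' present, '[^ =]*/' has to scan to the end and fail at every
# starting position, so the time roughly quadruples per doubling of the
# input (quadratic behaviour), while grep stays effectively linear.
pattern = re.compile('[^ =]*/')
n = 2000
for i in range(5):
    row = 'a' * n
    t = time.time()
    pattern.search(row)
    print '%7d characters -> %.3f seconds' % (n, time.time() - t)
    n *= 2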


Kris
--
http://mail.python.org/mailman/listinfo/python-list


Re: Bit substring search

2008-06-25 Thread Kris Kennaway

Scott David Daniels wrote:

Kris Kennaway wrote:
Thanks for the pointers, I think a C extension will end up being the 
way to go, unless someone has beaten me to it and I just haven't found 
it yet.


Depending on the pattern length you are targeting, it may be fastest to
increase the out-of-loop work.  For a 40-bit string, build an 8-target
Aho-Corasick machine, and at each match check the endpoints.  This will
only work well if 40 bits is at the low end of what you are hunting for.


Thanks, I wasn't aware of Aho-Corasick.

Kris

--
http://mail.python.org/mailman/listinfo/python-list


Bit substring search

2008-06-24 Thread Kris Kennaway
I am trying to parse a bit-stream file format (bzip2) that does not have 
byte-aligned record boundaries, so I need to do efficient matching of 
bit substrings at arbitrary bit offsets.


Is there a package that can do this?  This one comes close:

http://ilan.schnell-web.net/prog/bitarray/

but it only supports single bit substring match.
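
In the meantime, one brute-force fallback (a sketch only; it blows the data
up 8x in memory, so it only suits modest inputs) is to expand both data and
pattern to one character per bit and reuse ordinary string search.  The
file name below is a placeholder:

_bit_table = dict((chr(i), ''.join(str((i >> s) & 1) for s in range(7, -1, -1)))
                  for i in range(256))

def bits(data):
    # one '0'/'1' character per bit, most significant bit first
    return ''.join(_bit_table[c] for c in data)

def find_bit_substring(data, pattern_bits, start=0):
    # return the bit offset of pattern_bits within data, or -1
    return bits(data).find(pattern_bits, start)

# example: the 48-bit bzip2 block header magic 0x314159265359
magic = bits('\x31\x41\x59\x26\x53\x59')
print find_bit_substring(open('file.bz2', 'rb').read(), magic)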

Kris
--
http://mail.python.org/mailman/listinfo/python-list


Re: Bit substring search

2008-06-24 Thread Kris Kennaway

[EMAIL PROTECTED] wrote:

Kris Kennaway:

I am trying to parse a bit-stream file format (bzip2) that does not have
byte-aligned record boundaries, so I need to do efficient matching of
bit substrings at arbitrary bit offsets.
Is there a package that can do this?


You may take a look at Hachoir or some other modules:
http://hachoir.org/wiki/hachoir-core
http://pypi.python.org/pypi/construct/2.00


Thanks.  hachoir also comes close, but it also doesn't seem to be able to 
match substrings at a bit level (e.g. the included bzip2 parser just 
reads the header and hands the entire file off to libbzip2 to extract 
data from).


construct exports a bit stream but it's again pure python and matching 
substrings will be slow.  It will need C support to do that efficiently.



http://pypi.python.org/pypi/FmtRW/20040603
Etc. More:
http://pypi.python.org/pypi?%3Aaction=searchterm=binary


Unfortunately I didn't find anything else useful here yet :(

Kris

--
http://mail.python.org/mailman/listinfo/python-list


Re: Bit substring search

2008-06-24 Thread Kris Kennaway

[EMAIL PROTECTED] wrote:

Kris Kennaway:

Unfortunately I didnt find anything else useful here yet :(


I see, I'm sorry, I have found hachoir quite nice in the past. Maybe
there's no really efficient way to do it with Python, but you can
create a compiled extension, so you can see if it's fast enough for
your purposes.
To create such extension you can:
- One thing that requires very little time is to create an extension
with ShedSkin, once installed it just needs Python code.
- Cython (ex-Pyrex) too may be okay, but it's a bit trickier on Windows
machines.
- Using Pyd to create a D extension for Python is often the faster way
I have found to create extensions. I need just few minutes to create
them this way. But you need to know a bit of D.
- Then, if you want you can write a C extension, but if you have not
done it before you may need some hours to make it work.


Thanks for the pointers, I think a C extension will end up being the way 
to go, unless someone has beaten me to it and I just haven't found it yet.


Kris
--
http://mail.python.org/mailman/listinfo/python-list


ZFS bindings

2008-06-18 Thread Kris Kennaway
Is anyone aware of python bindings for ZFS?  I just want to replicate 
(or at least wrap) the command line functionality for interacting with 
snapshots etc.  Searches have turned up nothing.
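
Failing that, wrapping the zfs(8) command line is at least straightforward
(a sketch; the dataset and snapshot names are examples and error handling is
minimal):

import subprocess

def zfs(*args):
    p = subprocess.Popen(('zfs',) + args,
                         stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = p.communicate()
    if p.returncode != 0:
        raise RuntimeError(err.strip())
    return out

def snapshot(dataset, name):
    zfs('snapshot', '%s@%s' % (dataset, name))

def list_snapshots(dataset):
    out = zfs('list', '-H', '-t', 'snapshot', '-o', 'name', '-r', dataset)
    return out.splitlines()

snapshot('tank/home', 'backup-20080618')
print list_snapshots('tank/home')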


Kris
--
http://mail.python.org/mailman/listinfo/python-list


Re: Looking for lots of words in lots of files

2008-06-18 Thread Kris Kennaway

Calvin Spealman wrote:

Upload, wait, and google them.

Seriously though, aside from using a real indexer, I would build a set of 
the words I'm looking for, and then loop over each file, looping over 
its words and doing quick checks for containment in the set. If a word is 
in the set, add it to a dict of file names to lists of words found, until 
the list hits length 10. I don't think that would be a complicated solution 
and it shouldn't be terrible at performance.
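
In code that is roughly (a sketch; the word set, file list and the cut-off
of 10 are placeholders):

words = set(['foo', 'bar', 'baz'])
filenames = ['a.txt', 'b.txt']

found = {}                          # file name -> words from the set seen in it
for fn in filenames:
    hits = found[fn] = []
    for line in open(fn):
        for w in line.split():
            if w in words and w not in hits:
                hits.append(w)
        if len(hits) >= 10:         # stop early once we have enough
            break
print found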


If you need to run this more than once, use an indexer.

If you only need to use it once, use an indexer, so you learn how for 
next time.


If you can't use an indexer, and performance matters, evaluate using 
grep and a shell script.  Seriously.


grep is a couple of orders of magnitude faster at pattern matching 
strings in files (and especially regexps) than python is.  Even if you 
are invoking grep multiple times it is still likely to be faster than a 
maximally efficient single pass over the file in python.  This 
realization was disappointing to me :)


Kris
--
http://mail.python.org/mailman/listinfo/python-list


Re: Faster I/O in a script

2008-06-04 Thread Kris Kennaway

Gary Herron wrote:

[EMAIL PROTECTED] wrote:

On Jun 2, 2:08 am, kalakouentin [EMAIL PROTECTED] wrote:

 

 Do you know a way to actually load my data in a more
batch-like way so I will avoid the constant line by line reading?



If your files will fit in memory, you can just do

text = file.readlines()

and Python will read the entire file into a list of strings named
'text,' where each item in the list corresponds to one 'line' of the
file.
  


No that won't help.  That has to do *all* the same work (reading blocks 
and finding line endings) as the iterator PLUS allocate and build a list.

Better to just use the iterator.

for line in file:
 ...


Actually this *can* be much slower.  Suppose I want to search a file to 
see if a substring is present.


st = "some substring that is not actually in the file"
f = "..."  # path to a 50 MB log file

Method 1:

for i in file(f):
    if st in i:
        break

-- 0.472416 seconds

Method 2:

Read whole file:

fh = file(f)
rl = fh.read()
fh.close()

-- 0.098834 seconds

st in rl test -- 0.037251 (total: .136 seconds)

Method 3:

mmap the file:

mm = mmap.mmap(fh.fileno(), 0, mmap.MAP_SHARED, mmap.PROT_READ)
st in mm test -- 3.589938 (-- see my post the other day)

mm.find(st) -- 0.186895

Summary:

If you can afford the memory, it can be more efficient (more than 3 
times faster in this example) to read the file into memory and process 
it at once (if possible).


Mmapping the file and processing it at once is roughly as fast (I didn't 
measure the difference carefully), but has the advantage that if there 
are parts of the file you do not touch you don't fault them into memory. 
You could also play more games and mmap chunks at a time to limit the 
memory use (but you'd have to be careful with mmapping that doesn't 
match record boundaries).
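
A middle ground between line-at-a-time and slurping the whole file is to
read fixed-size chunks and keep a small overlap so a match spanning a chunk
boundary isn't missed -- a sketch (the 1 MB chunk size is arbitrary):

def chunked_contains(path, st, chunk_size=1024 * 1024):
    # keep len(st) - 1 bytes of overlap so a match straddling a chunk
    # boundary is still found
    overlap = len(st) - 1
    tail = ''
    fh = open(path, 'rb')
    try:
        while True:
            block = fh.read(chunk_size)
            if not block:
                return False
            if st in tail + block:
                return True
            tail = (tail + block)[-overlap:] if overlap > 0 else ''
    finally:
        fh.close()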


Kris
--
http://mail.python.org/mailman/listinfo/python-list


Re: UNIX credential passing

2008-05-30 Thread Kris Kennaway

Sebastian 'lunar' Wiesner wrote:

[ Kris Kennaway [EMAIL PROTECTED] ]


I want to make use of UNIX credential passing on a local domain socket
to verify the identity of a user connecting to a privileged service.
However it looks like the socket module doesn't implement
sendmsg/recvmsg wrappers, and I can't find another module that does this
either.  Is there something I have missed?


http://pyside.blogspot.com/2007/07/unix-socket-credentials-with-python.html

Illustrates, how to use socket credentials without sendmsg/recvmsg and so
without any need for patching.




Thanks to both you and Paul for your suggestions.  For the record, the 
URL above is linux-specific, but it put me on the right track.  Here is 
an equivalent FreeBSD implementation:


import struct

def getpeereid(sock):
    """Get peer credentials on a UNIX domain socket.

    Returns a nested tuple: (uid, (gids))"""

    LOCAL_PEERCRED = 0x001
    NGROUPS = 16

    # struct xucred {
    #     u_int   cr_version;          /* structure layout version */
    #     uid_t   cr_uid;              /* effective user id */
    #     short   cr_ngroups;          /* number of groups */
    #     gid_t   cr_groups[NGROUPS];  /* groups */
    #     void    *_cr_unused1;        /* compatibility with old ucred */
    # };

    xucred_fmt = '2ih16iP'
    res = tuple(struct.unpack(xucred_fmt,
        sock.getsockopt(0, LOCAL_PEERCRED, struct.calcsize(xucred_fmt))))

    # Check this is the above version of the structure
    if res[0] != 0:
        raise OSError

    return (res[1], res[3:3 + res[2]])


Kris
--
http://mail.python.org/mailman/listinfo/python-list


mmap class has slow in operator

2008-05-29 Thread Kris Kennaway

If I do the following:

import mmap

def mmap_search(f, string):
    fh = file(f)
    mm = mmap.mmap(fh.fileno(), 0, mmap.MAP_SHARED, mmap.PROT_READ)
    return mm.find(string)

def mmap_is_in(f, string):
    fh = file(f)
    mm = mmap.mmap(fh.fileno(), 0, mmap.MAP_SHARED, mmap.PROT_READ)
    return string in mm

then a sample mmap_search() call on a 50MB file takes 0.18 seconds, but 
the mmap_is_in() call takes 6.6 seconds.  Is the mmap class missing an 
operator and falling back to a slow default implementation?  Presumably 
I can implement the latter in terms of the former.
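
In the meantime the find()-based workaround is simple enough (a sketch;
mmap_is_in_fast() is just an illustrative name):

import mmap

def mmap_is_in_fast(f, string):
    # same as mmap_is_in() above, but routed through find() to avoid the
    # slow generic fallback behind the 'in' operator
    fh = file(f)
    mm = mmap.mmap(fh.fileno(), 0, mmap.MAP_SHARED, mmap.PROT_READ)
    return mm.find(string) != -1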


Kris
--
http://mail.python.org/mailman/listinfo/python-list


UNIX credential passing

2008-05-29 Thread Kris Kennaway
I want to make use of UNIX credential passing on a local domain socket 
to verify the identity of a user connecting to a privileged service. 
However it looks like the socket module doesn't implement 
sendmsg/recvmsg wrappers, and I can't find another module that does this 
either.  Is there something I have missed?


Kris
--
http://mail.python.org/mailman/listinfo/python-list