Re: Creating Long Lists

2011-02-23 Thread Jorgen Grahn
On Tue, 2011-02-22, Ben Finney wrote:
 Kelson Zawack zawack...@gis.a-star.edu.sg writes:

 I have a large (10gb) data file for which I want to parse each line
 into an object and then append this object to a list for sorting and
 further processing.

 What is the nature of the further processing?

 Does that further processing access the items sequentially? If so, they
 don't all need to be in memory at once, and you can produce them with a
 generator <URL:http://docs.python.org/glossary.html#term-generator>.

He mentioned sorting them -- you need all of them for that.

If that's the *only* such use, I'd experiment with writing them as
sortable text to file, and run GNU sort (the Unix utility) on the file.
It seems to have a clever file-backed sort algorithm.

/Jorgen

-- 
  // Jorgen Grahn grahn@  Oo  o.   .  .
\X/ snipabacken.se   O  o   .
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Creating Long Lists

2011-02-23 Thread Tim Wintle
On Wed, 2011-02-23 at 13:57 +, Jorgen Grahn wrote:
 If that's the *only* such use, I'd experiment with writing them as
 sortable text to file, and run GNU sort (the Unix utility) on the file.
 It seems to have a clever file-backed sort algorithm.

+1 - and experiment with the different flags to sort (compression of
intermediate results, intermediate batch size, etc)
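
For anyone trying that, a minimal sketch of driving it from Python (the
file names, key field, and flag values here are only illustrative
assumptions, not tuned settings):

    import subprocess

    # Sort a tab-separated text file with GNU sort, spilling to disk as needed.
    subprocess.check_call([
        "sort",
        "-t", "\t",                   # tab-separated fields
        "-k", "2,2n",                 # sort numerically on the 2nd field
        "-S", "4G",                   # in-memory buffer before spilling to temp files
        "--batch-size=64",            # how many temp files to merge at once
        "--compress-program=gzip",    # compress intermediate results
        "-T", "/scratch/tmp",         # where the temp files go
        "-o", "records.sorted.txt",
        "records.txt",
    ])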

Tim


-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Creating Long Lists

2011-02-22 Thread Kelson Zawack
The answer, it turns out, is the garbage collector.  When I disable the
garbage collector before the loop that loads the data into the list
and then enable it after the loop, the program runs without issue.
This raises a question, though: can the logic of the garbage collector
be changed so that it is not triggered in instances like this, where you
really do want to put lots and lots of stuff in memory?  Turning the
garbage collector on and off is not a big deal, but it would
obviously be nicer not to have to.
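
Roughly what I mean, in case it helps anyone else (parse_line and the
file name are just placeholders for my real code):

    import gc

    records = []
    gc.disable()                    # skip cyclic-GC passes while the list grows
    try:
        with open('data.txt') as f:               # placeholder file name
            for line in f:
                records.append(parse_line(line))  # placeholder parser
    finally:
        gc.enable()                 # turn collection back on afterwards

    records.sort()
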
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Creating Long Lists

2011-02-22 Thread Ben Finney
Kelson Zawack zawack...@gis.a-star.edu.sg writes:

 This raises a question, though: can the logic of the garbage collector
 be changed so that it is not triggered in instances like this, where you
 really do want to put lots and lots of stuff in memory?

Have you considered using a more specialised data type for such large
data sets, such as ‘array.array’ or the NumPy array types?
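
For example, a minimal sketch with ‘array.array’ (note that it stores
primitive numeric values compactly, not arbitrary Python objects):

    from array import array

    values = array('d')            # a compact, typed array of C doubles
    values.append(3.14)
    values.extend([2.71, 1.41])
    print(values[1])               # 2.71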

-- 
 \ “True greatness is measured by how much freedom you give to |
  `\  others, not by how much you can coerce others to do what you |
_o__)   want.” —Larry Wall |
Ben Finney
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Creating Long Lists

2011-02-22 Thread Peter Otten
Kelson Zawack wrote:

 The answer, it turns out, is the garbage collector.  When I disable the
 garbage collector before the loop that loads the data into the list
 and then enable it after the loop, the program runs without issue.
 This raises a question, though: can the logic of the garbage collector
 be changed so that it is not triggered in instances like this, where you
 really do want to put lots and lots of stuff in memory?  Turning the
 garbage collector on and off is not a big deal, but it would
 obviously be nicer not to have to.

What Python version are you using? The garbage collection heuristic has been 
tweaked in 2.7, see

http://svn.python.org/view/python/trunk/Modules/gcmodule.c?r1=67832&r2=68462
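
If upgrading is not an option, the collection frequency can also be
tuned by hand instead of switching the collector off entirely; a small
sketch (the threshold value is only an example):

    import gc

    print(gc.get_threshold())         # default is (700, 10, 10)
    gc.set_threshold(100000, 10, 10)  # trigger generation-0 collection far less often
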
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Creating Long Lists

2011-02-22 Thread Kelson Zawack
I am using Python 2.6.2, so it may no longer be a problem in 2.7.

I am open to using another data type, but the way I read the
documentation, array.array only supports numeric types, not arbitrary
objects.  I also tried playing around with numpy arrays, albeit for
only a short time, and it seems that although they do support
arbitrary objects, they are geared toward numbers as well, and I
found it cumbersome to manipulate objects with them.  It could be,
though, that if I understood them better they would work fine.  Also,
do numpy arrays support sorting arbitrary objects?  I only saw a
method that sorts numbers.
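
To illustrate the kind of sorting I mean (made-up fields): numpy can
sort a structured array by a named field, which looks fine for
record-like numeric/string data, but I don't see how to apply it to
arbitrary objects:

    import numpy as np

    # Hypothetical record layout, not my real data.
    records = np.array([('b', 2.0), ('a', 1.0)],
                       dtype=[('name', 'S10'), ('score', 'f8')])
    print(np.sort(records, order='score'))   # sorts the records by the 'score' field
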
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Creating Long Lists

2011-02-22 Thread Terry Reedy

On 2/22/2011 4:40 AM, Kelson Zawack wrote:

The answer, it turns out, is the garbage collector.  When I disable the
garbage collector before the loop that loads the data into the list
and then enable it after the loop, the program runs without issue.
This raises a question, though: can the logic of the garbage collector
be changed so that it is not triggered in instances like this, where you
really do want to put lots and lots of stuff in memory?  Turning the
garbage collector on and off is not a big deal, but it would
obviously be nicer not to have to.


Heuristics, by their very nature, are not correct in all situations.

--
Terry Jan Reedy

--
http://mail.python.org/mailman/listinfo/python-list


Creating Long Lists

2011-02-21 Thread Kelson Zawack
I have a large (10gb) data file for which I want to parse each line into 
an object and then append this object to a list for sorting and further 
processing.  I have noticed, however, that as the length of the list 
increases, the rate at which objects are added to it decreases 
dramatically.  My first thought was that I was nearing the memory 
capacity of the machine and the decrease in performance was due to the 
OS swapping things in and out of memory.  When I looked at the memory 
usage, this was not the case.  My process was the only job running and 
was consuming 40gb of the total 130gb, and no swapping processes were 
running.  To make sure there was not some problem with the rest of my 
code, or the server's file system, I ran my program again as it was but 
without the line that was appending items to the list, and it completed 
without problem, indicating that the decrease in performance is the 
result of some part of the process of appending to the list.  Since 
other people have observed this problem as well 
(http://tek-tips.com/viewthread.cfm?qid=1096178&page=13, 
http://stackoverflow.com/questions/2473783/is-there-a-way-to-circumvent-python-list-append-becoming-progressively-slower-i) 
I did not bother to further analyze or benchmark it.  Since the answers 
in the above forums do not seem very definitive, I thought I would 
inquire here about what the reason for this decrease in performance is, 
and whether there is a way, or another data structure, that would avoid 
this problem.
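
For concreteness, the loop in question looks roughly like this (Record
and the field layout are simplified placeholders for my real parsing
code):

    # Simplified placeholder for the real parsing code.
    class Record(object):
        def __init__(self, line):
            fields = line.rstrip('\n').split('\t')
            self.key = fields[0]
            self.rest = fields[1:]

    records = []
    with open('data.txt') as f:            # the real file is ~10gb
        for line in f:
            records.append(Record(line))   # this append slows down as the list grows

    records.sort(key=lambda r: r.key)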


--
http://mail.python.org/mailman/listinfo/python-list


Re: Creating Long Lists

2011-02-21 Thread alex23
On Feb 22, 12:57 pm, Kelson Zawack zawack...@gis.a-star.edu.sg
wrote:
 I did not bother to further analyze or benchmark it.  Since the answers
 in the above forums do not seem very definitive  I thought  I would
 inquire here about what the reason for this decrease in performance is,
 and if there is a way, or another data structure, that would avoid this
 problem.

The first link is 6 years old and refers to Python 2.4. Unless you're
using 2.4 you should probably ignore it.

The first answer on the stackoverflow link was accepted by the poster
as resolving his issue: try disabling garbage collection.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Creating Long Lists

2011-02-21 Thread John Bokma
alex23 wuwe...@gmail.com writes:

 On Feb 22, 12:57 pm, Kelson Zawack zawack...@gis.a-star.edu.sg
 wrote:
 I did not bother to further analyze or benchmark it.  Since the answers
 in the above forums do not seem very definitive  I thought  I would
 inquire here about what the reason for this decrease in performance is,
 and if there is a way, or another data structure, that would avoid this
 problem.

 The first link is 6 years old and refers to Python 2.4. Unless you're
 using 2.4 you should probably ignore it.

 The first answer on the stackoverflow link was accepted by the poster
 as resolving his issue. Try disabling garbage collection.

I just read http://bugs.python.org/issue4074, which discusses a patch
that was included two years ago. So a recent Python 2.x shouldn't have
this issue either?

-- 
John Bokma   j3b

Blog: http://johnbokma.com/    Facebook: http://www.facebook.com/j.j.j.bokma
Freelance Perl & Python Development: http://castleamber.com/
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Creating Long Lists

2011-02-21 Thread Dan Stromberg
On Mon, Feb 21, 2011 at 6:57 PM, Kelson Zawack
zawack...@gis.a-star.edu.sg wrote:

 I have a large (10gb) data file for which I want to parse each line into an
 object and then append this object to a list for sorting and further
 processing.  I have noticed however that as the length of the list increases
 the rate at which objects are added to it decreases dramatically.  My first
 thought was that  I was nearing the memory capacity of the machine and the
 decrease in performance was due to the os swapping things in and out of
 memory.  When I looked at the memory usage this was not the case.  My
 process was the only job running and was consuming 40gb of the total
 130gb and no swapping processes were running.  To make sure there was not
 some problem with the rest of my code, or the server's file system, I ran my
 program again as it was but without the line that was appending items to the
 list and it completed without problem indicating that the decrease in
 performance is the result of some part of the process of appending to the
 list.  Since other people have observed this problem as well (
 http://tek-tips.com/viewthread.cfm?qid=1096178&page=13,
 http://stackoverflow.com/questions/2473783/is-there-a-way-to-circumvent-python-list-append-becoming-progressively-slower-i)
 I did not bother to further analyze or benchmark it.  Since the answers in
 the above forums do not seem very definitive  I thought  I would inquire
 here about what the reason for this decrease in performance is, and if there
 is a way, or another data structure, that would avoid this 
 problem.


Do you have 130G of physical RAM, or 130G of virtual memory?  That makes a
big difference.  (Yeah, I know, 130G of physical RAM is probably pretty rare
today)

Disabling garbage collection is a good idea, but if you don't have well over
10G of physical RAM, you'd probably better also use a (partially) disk-based
sort.  To do otherwise would pretty much beg for swapping and a large
slowdown.

Merge sort works very well for very large datasets.
http://en.wikipedia.org/wiki/Merge_sort  Just make your sublists be disk
files, not in-memory lists - until you get down to a small enough sublist
that you can sort it in memory, without thrashing.  Timsort (list_.sort())
is excellent for in memory sorting.
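
A rough sketch of that pattern (chunk size, file handling, and sorting
whole lines as the key are illustrative choices, not a tuned
implementation):

    import heapq
    import itertools
    import os
    import tempfile

    def external_sort(in_path, out_path, chunk_size=1000000):
        # Pass 1: sort fixed-size chunks in memory and spill each to a temp file.
        chunk_paths = []
        with open(in_path) as infile:
            while True:
                chunk = list(itertools.islice(infile, chunk_size))
                if not chunk:
                    break
                chunk.sort()                      # Timsort, in memory
                fd, path = tempfile.mkstemp(text=True)
                with os.fdopen(fd, 'w') as tmp:
                    tmp.writelines(chunk)
                chunk_paths.append(path)
        # Pass 2: merge the sorted chunk files with a heap.
        files = [open(p) for p in chunk_paths]
        try:
            with open(out_path, 'w') as out:
                out.writelines(heapq.merge(*files))
        finally:
            for f in files:
                f.close()
            for p in chunk_paths:
                os.remove(p)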

Actually, GNU sort is very good at sorting huge datasets - you could
probably just open a subprocess to it, as long as you can make your data fit
the line-oriented model GNU sort expects, and you have enough temporary disk
space.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Creating Long Lists

2011-02-21 Thread Ben Finney
Kelson Zawack zawack...@gis.a-star.edu.sg writes:

 I have a large (10gb) data file for which I want to parse each line
 into an object and then append this object to a list for sorting and
 further processing.

What is the nature of the further processing?

Does that further processing access the items sequentially? If so, they
don't all need to be in memory at once, and you can produce them with a
generator <URL:http://docs.python.org/glossary.html#term-generator>.

Note that, if you just want lines of text from a file, the file object
itself is a generator for the lines of text within it.
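
For example (a sketch; ‘parse’ and ‘process’ stand in for whatever your
own code does with each line):

    def parsed_records(path):
        # Lazily yield one parsed object per line; the whole file is never in memory.
        with open(path) as infile:
            for line in infile:
                yield parse(line)      # 'parse' is a placeholder for your own code

    for record in parsed_records('data.txt'):
        process(record)                # 'process' is likewise a placeholder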

If, on the other hand, you need arbitrary access all over that large
data set, you probably want a data type better suited. The standard
library has the ‘array’ module for this purpose; the third-party NumPy
library provides even more power.

-- 
 \   “Remember: every member of your ‘target audience’ also owns a |
  `\   broadcasting station. These ‘targets’ can shoot back.” —Michael |
_o__)   Rathbun to advertisers, news.admin.net-abuse.email |
Ben Finney
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Creating Long Lists

2011-02-21 Thread Dan Stromberg
On Mon, Feb 21, 2011 at 7:24 PM, Dan Stromberg drsali...@gmail.com wrote:


 On Mon, Feb 21, 2011 at 6:57 PM, Kelson Zawack 
 zawack...@gis.a-star.edu.sg wrote:

 I have a large (10gb) data file for which I want to parse each line into
 an object and then append this object to a list for sorting and further
 processing.  I have noticed however that as the length of the list increases
 the rate at which objects are added to it decreases dramatically.  My first
 thought was that  I was nearing the memory capacity of the machine and the
 decrease in performance was due to the os swapping things in and out of
 memory.  When I looked at the memory usage this was not the case.  My
 process was the only job running and was consuming 40gb of the total
 130gb and no swapping processes were running.  To make sure there was not
 some problem with the rest of my code, or the server's file system, I ran my
 program again as it was but without the line that was appending items to the
 list and it completed without problem indicating that the decrease in
 performance is the result of some part of the process of appending to the
 list.  Since other people have observed this problem as well (
 http://tek-tips.com/viewthread.cfm?qid=1096178&page=13,
 http://stackoverflow.com/questions/2473783/is-there-a-way-to-circumvent-python-list-append-becoming-progressively-slower-i)
 I did not bother to further analyze or benchmark it.  Since the answers in
 the above forums do not seem very definitive  I thought  I would inquire
 here about what the reason for this decrease in performance is, and if there
 is a way, or another data structure, that would avoid this 
 problem.


 Do you have 130G of physical RAM, or 130G of virtual memory?  That makes a
 big difference.  (Yeah, I know, 130G of physical RAM is probably pretty rare
 today)

 Disabling garbage collection is a good idea, but if you don't have well
 over 10G of physical RAM, you'd probably better also use a (partially)
 disk-based sort.  To do otherwise would pretty much beg for swapping and a
 large slowdown.

 Merge sort works very well for very large datasets.
 http://en.wikipedia.org/wiki/Merge_sort  Just make your sublists be disk
 files, not in-memory lists - until you get down to a small enough sublist
 that you can sort it in memory, without thrashing.  Timsort (list_.sort())
 is excellent for in memory sorting.

 Actually, GNU sort is very good at sorting huge datasets - you could
 probably just open a subprocess to it, as long as you can make your data fit
 the line-oriented model GNU sort expects, and you have enough temporary disk
 space.


Depending on what you're doing after the sort, you might also look at
bsddb.btopen
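
A quick sketch of how that might look (the padded-string keys are just
an assumption; bsddb stores byte-string keys and keeps them in sorted
order):

    import bsddb

    # Store records keyed by a sortable string (hypothetical key format),
    # then read them back in key order without holding everything in memory.
    db = bsddb.btopen('records.db', 'c')       # 'c': create if it doesn't exist
    db['000042\tfoo'] = 'rest of the record'
    db['000007\tbar'] = 'another record'
    db.sync()

    for key in db.keys():                      # a BTree keeps keys sorted
        print(key + ' -> ' + db[key])

    db.close()
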
-- 
http://mail.python.org/mailman/listinfo/python-list