Re: removing duplication from a huge list.

2009-03-03 Thread Adam Olsen
On Feb 27, 9:55 am, Falcolas wrote: > If order did matter, and the list itself couldn't be stored in memory, > I would personally do some sort of hash of each item (or something as > simple as first 5 bytes, last 5 bytes and length), keeping a reference > to which item the hash belongs, sort and i

Re: removing duplication from a huge list.

2009-03-03 Thread bearophileHUGS
odeits: > Although this is true, that is more of an answer to the question "How > do I remove duplicates from a huge list in Unix?". Don't you like cygwin? Bye, bearophile -- http://mail.python.org/mailman/listinfo/python-list

Re: removing duplication from a huge list.

2009-02-27 Thread Paul Rubin
Tim Rowe writes: > We were told in the original question: more than 15 million records, > and it won't all fit into memory. So your observation is pertinent. That is not terribly many records by today's standards. The knee-jerk approach is to sort them externally, then make a linear pass skippin
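The sort-then-scan idea mentioned above can be sketched roughly as follows. This is only an illustration, not code from the thread: it assumes the records have already been sorted by some external tool, and the name `unique_sorted` is made up for the example.

```python
from itertools import groupby


def unique_sorted(sorted_records):
    """Yield each distinct record once from an already-sorted iterable.

    After sorting, equal records sit next to each other, so one linear
    pass that keeps a single record per run of equal values is enough.
    Only the current record is held in memory.
    """
    for record, _run in groupby(sorted_records):
        yield record


# Any sorted iterable works here, including an open file of sorted lines.
deduped = list(unique_sorted(["a", "a", "b", "c", "c", "c"]))
print(deduped)
```

Because the function is a generator, it can be fed straight from a sorted file and written straight to an output file without building a list.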

Re: removing duplication from a huge list.

2009-02-27 Thread Steve Holden
Falcolas wrote: > On Feb 27, 10:07 am, Steve Holden wrote: >> Assuming no duplicates, how does this help? You still have to verify >> collisions. > > Absolutely. But a decent hashing function (particularly since you know > quite a bit about the data beforehand) will have very few collisions > (th

Re: removing duplication from a huge list.

2009-02-27 Thread Falcolas
On Feb 27, 10:07 am, Steve Holden wrote: > Assuming no duplicates, how does this help? You still have to verify > collisions. Absolutely. But a decent hashing function (particularly since you know quite a bit about the data beforehand) will have very few collisions (theoretically no collisions, a
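One way to combine the hashing idea with the collision check raised above is sketched below. It is a guess at the general shape, not anyone's posted code: it assumes the records are lines in a seekable binary file, uses SHA-256 as the hash purely for illustration, and `dedupe_stream` is a hypothetical name.

```python
import hashlib
import io


def dedupe_stream(src, dst):
    """Copy lines from src to dst, dropping exact duplicates, keeping order.

    src must be a seekable binary file object. Only a digest and a file
    offset per distinct line are kept in memory, not the lines themselves.
    """
    seen = {}  # digest -> offsets in src of distinct lines with that digest
    offset = src.tell()
    line = src.readline()
    while line:
        next_offset = src.tell()
        digest = hashlib.sha256(line).digest()
        is_duplicate = False
        for earlier in seen.get(digest, ()):
            # An equal digest is not proof of equality: re-read and compare.
            src.seek(earlier)
            if src.readline() == line:
                is_duplicate = True
                break
        if not is_duplicate:
            seen.setdefault(digest, []).append(offset)
            dst.write(line)
        # Return to where the sequential scan left off.
        src.seek(next_offset)
        offset = next_offset
        line = src.readline()


# Small in-memory demonstration with BytesIO standing in for real files.
source = io.BytesIO(b"a\nb\na\nc\nb\n")
target = io.BytesIO()
dedupe_stream(source, target)
print(target.getvalue())
```

The extra seek on every digest match is the price of verifying collisions; with a good hash almost every match is a true duplicate, so the check rarely changes the outcome.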

Re: removing duplication from a huge list.

2009-02-27 Thread Tim Rowe
2009/2/27 Steve Holden: > Assuming no duplicates, how does this help? You still have to verify > collisions. > >> Pretty brutish and slow, but it's the first algorithm which comes to >> mind. Of course, I'm assuming that the list items are long enough to >> warrant using a hash and not the values

Re: removing duplication from a huge list.

2009-02-27 Thread Tim Chase
How big of a list are we talking about? If the list is so big that the entire list cannot fit in memory at the same time this approach won't work, e.g. removing duplicate lines from a very large file. We were told in the original question: more than 15 million records, and it won't all fit into me

Re: removing duplication from a huge list.

2009-02-27 Thread Steve Holden
Falcolas wrote: > On Feb 27, 8:33 am, Tim Rowe wrote: >> 2009/2/27 odeits: >> >>> How big of a list are we talking about? If the list is so big that the >>> entire list cannot fit in memory at the same time this approach won't >>> work, e.g. removing duplicate lines from a very large file. >> We we

Re: removing duplication from a huge list.

2009-02-27 Thread Falcolas
On Feb 27, 8:33 am, Tim Rowe wrote: > 2009/2/27 odeits: > > > How big of a list are we talking about? If the list is so big that the > > entire list cannot fit in memory at the same time this approach won't > > work, e.g. removing duplicate lines from a very large file. > > We were told in the orig

Re: removing duplication from a huge list.

2009-02-27 Thread Tim Rowe
2009/2/27 odeits: > How big of a list are we talking about? If the list is so big that the > entire list cannot fit in memory at the same time this approach won't > work, e.g. removing duplicate lines from a very large file. We were told in the original question: more than 15 million records, and

Re: removing duplication from a huge list.

2009-02-27 Thread odeits
On Feb 27, 1:18 am, Stefan Behnel wrote: > bearophileh...@lycos.com wrote: > > odeits: > >> How big of a list are we talking about? If the list is so big that the > >> entire list cannot fit in memory at the same time this approach won't > >> work, e.g. removing duplicate lines from a very large fil

Re: removing duplication from a huge list.

2009-02-27 Thread Stefan Behnel
bearophileh...@lycos.com wrote: > odeits: >> How big of a list are we talking about? If the list is so big that the >> entire list cannot fit in memory at the same time this approach won't >> work, e.g. removing duplicate lines from a very large file. > > If the data are lines of a file, and keeping

Re: removing duplication from a huge list.

2009-02-27 Thread bearophileHUGS
odeits: > How big of a list are we talking about? If the list is so big that the > entire list cannot fit in memory at the same time this approach won't > work, e.g. removing duplicate lines from a very large file. If the data are lines of a file, and keeping the original order isn't important, then

Re: removing duplication from a huge list.

2009-02-26 Thread odeits
On Feb 26, 9:15 pm, Chris Rebert wrote: > On Thu, Feb 26, 2009 at 8:49 PM, Benjamin Peterson > wrote: > > Shanmuga Rajan writes: > > >> If anyone suggests a better solution then I will be very happy. Advance thanks > > for any help. Shan > > > Use a set. > > To expand on that a bit: > >

Re: removing duplication from a huge list.

2009-02-26 Thread Chris Rebert
On Thu, Feb 26, 2009 at 8:49 PM, Benjamin Peterson wrote: > Shanmuga Rajan writes: > >> If anyone suggests a better solution then I will be very happy. Advance thanks > for any help. Shan > > Use a set. To expand on that a bit: counted_recs = set(rec[0] for rec in some_fun()) #or in Pyth
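If the distinct keys fit in memory and the original order of the records should be kept, a set of already-seen keys can be consulted while streaming over the generator. A minimal sketch under those assumptions: `some_fun` is the original poster's own generator, so a plain list stands in for it here, and `unique_by_key` is a hypothetical name.

```python
def unique_by_key(records, key=lambda rec: rec[0]):
    """Yield the first record seen for each key, preserving input order.

    Membership tests on a set are constant time on average, unlike the
    linear scan that `x in some_list` performs.
    """
    seen = set()
    for rec in records:
        k = key(rec)
        if k not in seen:
            seen.add(k)
            yield rec


# Stand-in data; in the thread the records come from some_fun().
records = [(1, "a"), (2, "b"), (1, "c"), (3, "d"), (2, "e")]
print(list(unique_by_key(records)))
```

Unlike building a bare set of keys, this keeps the whole first record for each key and does not reorder anything.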

Re: removing duplication from a huge list.

2009-02-26 Thread Benjamin Peterson
Shanmuga Rajan writes: > If anyone suggests a better solution then I will be very happy. Advance thanks for any help. Shan Use a set.

removing duplication from a huge list.

2009-02-26 Thread Shanmuga Rajan
Hi, I have a list of records with some details (more than 15 million records), with duplication. I need to iterate through every record and eliminate the duplicate records. Currently I am using a script like this: counted_recs = [ ] x = some_fun() # will return a generator, this generator