Re: removing duplication from a huge list.

2009-03-03 Thread bearophileHUGS
odeits: Although this is true, that is more of an answer to the question "How do I remove duplicates from a huge list in Unix?" Don't you like cygwin? Bye, bearophile -- http://mail.python.org/mailman/listinfo/python-list

Re: removing duplication from a huge list.

2009-03-03 Thread Adam Olsen
On Feb 27, 9:55 am, Falcolas garri...@gmail.com wrote: If order did matter, and the list itself couldn't be stored in memory, I would personally do some sort of hash of each item (or something as simple as first 5 bytes, last 5 bytes and length), keeping a reference to which item the hash
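The cheap-fingerprint idea above can be sketched as follows; the helper names and sample data are illustrative, not from the thread. The point of keeping a reference to the full record is that only records sharing a fingerprint ever need a full comparison, which guards against collisions of such a weak hash:

```python
# Sketch of the fingerprint idea: reduce each record to
# (first 5 bytes, last 5 bytes, length), then fall back to a full
# comparison only inside a fingerprint bucket.
from collections import defaultdict

def fingerprint(rec: bytes) -> tuple:
    return (rec[:5], rec[-5:], len(rec))

def dedup_by_fingerprint(records):
    seen = defaultdict(list)        # fingerprint -> full records kept so far
    for rec in records:
        fp = fingerprint(rec)
        if rec not in seen[fp]:     # full comparison, but only within a bucket
            seen[fp].append(rec)
            yield rec

data = [b"apple pie", b"banana", b"apple pie", b"cherry"]
print(list(dedup_by_fingerprint(data)))
```

This trades memory for the fingerprints against the cost of comparing the (few) records that happen to share one.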

Re: removing duplication from a huge list.

2009-02-27 Thread bearophileHUGS
odeits: How big of a list are we talking about? If the list is so big that the entire list cannot fit in memory at the same time, this approach won't work, e.g. removing duplicate lines from a very large file. If the data are lines of a file, and keeping the original order isn't important, then

Re: removing duplication from a huge list.

2009-02-27 Thread Stefan Behnel
bearophileh...@lycos.com wrote: odeits: How big of a list are we talking about? If the list is so big that the entire list cannot fit in memory at the same time, this approach won't work, e.g. removing duplicate lines from a very large file. If the data are lines of a file, and keeping the

Re: removing duplication from a huge list.

2009-02-27 Thread odeits
On Feb 27, 1:18 am, Stefan Behnel stefan...@behnel.de wrote: bearophileh...@lycos.com wrote: odeits: How big of a list are we talking about? If the list is so big that the entire list cannot fit in memory at the same time, this approach won't work, e.g. removing duplicate lines from a very

Re: removing duplication from a huge list.

2009-02-27 Thread Tim Rowe
2009/2/27 odeits ode...@gmail.com: How big of a list are we talking about? If the list is so big that the entire list cannot fit in memory at the same time, this approach won't work, e.g. removing duplicate lines from a very large file. We were told in the original question: more than 15 million

Re: removing duplication from a huge list.

2009-02-27 Thread Falcolas
On Feb 27, 8:33 am, Tim Rowe digi...@gmail.com wrote: 2009/2/27 odeits ode...@gmail.com: How big of a list are we talking about? If the list is so big that the entire list cannot fit in memory at the same time, this approach won't work, e.g. removing duplicate lines from a very large file.

Re: removing duplication from a huge list.

2009-02-27 Thread Steve Holden
Falcolas wrote: On Feb 27, 8:33 am, Tim Rowe digi...@gmail.com wrote: 2009/2/27 odeits ode...@gmail.com: How big of a list are we talking about? If the list is so big that the entire list cannot fit in memory at the same time, this approach won't work, e.g. removing duplicate lines from a very

Re: removing duplication from a huge list.

2009-02-27 Thread Tim Chase
How big of a list are we talking about? If the list is so big that the entire list cannot fit in memory at the same time, this approach won't work, e.g. removing duplicate lines from a very large file. We were told in the original question: more than 15 million records, and it won't all fit into

Re: removing duplication from a huge list.

2009-02-27 Thread Tim Rowe
2009/2/27 Steve Holden st...@holdenweb.com: Assuming no duplicates, how does this help? You still have to verify collisions. Pretty brutish and slow, but it's the first algorithm which comes to mind. Of course, I'm assuming that the list items are long enough to warrant using a hash and not

Re: removing duplication from a huge list.

2009-02-27 Thread Falcolas
On Feb 27, 10:07 am, Steve Holden st...@holdenweb.com wrote: Assuming no duplicates, how does this help? You still have to verify collisions. Absolutely. But a decent hashing function (particularly since you know quite a bit about the data beforehand) will have very few collisions
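The "hash first, then verify collisions" approach debated above can be sketched as a two-pass scan; the file names are hypothetical, and the sketch assumes the line hashes fit in memory even when the lines themselves do not:

```python
# Two-pass dedup of a large file: pass 1 counts line-hash occurrences,
# pass 2 writes unique-hash lines directly and verifies repeated-hash
# lines by keeping only those full lines in memory.
import hashlib
from collections import defaultdict

def dedup_file(src_path, dst_path):
    # Pass 1: count how many times each line hash occurs.
    counts = defaultdict(int)
    with open(src_path, "rb") as src:
        for line in src:
            counts[hashlib.md5(line).digest()] += 1

    # Pass 2: a unique hash means a unique line; a repeated hash is
    # either a true duplicate or a collision, so verify with the full line.
    seen_full = set()
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for line in src:
            h = hashlib.md5(line).digest()
            if counts[h] == 1:
                dst.write(line)
            elif line not in seen_full:
                seen_full.add(line)
                dst.write(line)
```

Only the lines whose hashes repeat are ever held in full, which is what keeps memory bounded when duplicates are a small fraction of the data.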

Re: removing duplication from a huge list.

2009-02-27 Thread Steve Holden
Falcolas wrote: On Feb 27, 10:07 am, Steve Holden st...@holdenweb.com wrote: Assuming no duplicates, how does this help? You still have to verify collisions. Absolutely. But a decent hashing function (particularly since you know quite a bit about the data beforehand) will have very few

Re: removing duplication from a huge list.

2009-02-27 Thread Paul Rubin
Tim Rowe digi...@gmail.com writes: We were told in the original question: more than 15 million records, and it won't all fit into memory. So your observation is pertinent. That is not terribly many records by today's standards. The knee-jerk approach is to sort them externally, then make a
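The external-sort-then-unique approach Paul Rubin describes can be sketched in Python with sorted runs merged lazily; the file names and chunk size are illustrative, and the sketch assumes every line ends with a newline:

```python
# External dedup: sort fixed-size chunks in memory, spill each sorted
# run to a temp file, merge the runs lazily with heapq.merge, and drop
# adjacent duplicates in one pass with itertools.groupby.
import heapq
import itertools
import tempfile

def external_dedup(src_path, dst_path, chunk_size=100_000):
    runs = []
    with open(src_path) as src:
        while True:
            chunk = list(itertools.islice(src, chunk_size))
            if not chunk:
                break
            chunk.sort()
            run = tempfile.TemporaryFile("w+")
            run.writelines(chunk)
            run.seek(0)
            runs.append(run)
    with open(dst_path, "w") as dst:
        merged = heapq.merge(*runs)
        # After a full sort duplicates are adjacent, so groupby emits
        # each distinct line exactly once.
        for line, _group in itertools.groupby(merged):
            dst.write(line)
    for run in runs:
        run.close()
```

Memory use is bounded by one chunk plus one buffered line per run, which is why 15 million records is not a problem for this scheme.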

removing duplication from a huge list.

2009-02-26 Thread Shanmuga Rajan
Hi, I have a list of records with some details (more than 15 million records), with duplication. I need to iterate through every record and eliminate duplicate records. Currently I am using a script like this: counted_recs = [] x = some_fun() # will return a generator, this generator
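The truncated script above can be completed with a set, which is the fix the replies converge on; some_fun() here is a stand-in for the original poster's generator. Membership tests on a set are O(1), versus O(n) on the list the script started with:

```python
# Order-preserving dedup over a generator of records, using a set of
# already-seen records instead of a list for the membership test.
def some_fun():
    # Stand-in for the original poster's generator.
    for rec in ["a", "b", "a", "c", "b"]:
        yield rec

seen = set()
unique_recs = []
for rec in some_fun():
    if rec not in seen:
        seen.add(rec)
        unique_recs.append(rec)

print(unique_recs)   # order of first appearance is preserved
```

This still assumes the distinct records fit in memory; when they don't, the hashing and external-sort approaches later in the thread apply.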

Re: removing duplication from a huge list.

2009-02-26 Thread Benjamin Peterson
Shanmuga Rajan m.shanmugarajan at gmail.com writes: If anyone suggests a better solution then I will be very happy. Advance thanks for any help. Shan Use a set. -- http://mail.python.org/mailman/listinfo/python-list

Re: removing duplication from a huge list.

2009-02-26 Thread Chris Rebert
On Thu, Feb 26, 2009 at 8:49 PM, Benjamin Peterson benja...@python.org wrote: Shanmuga Rajan m.shanmugarajan at gmail.com writes: If anyone suggests a better solution then I will be very happy. Advance thanks for any help. Shan Use a set. To expand on that a bit: counted_recs = set(rec[0] for
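Chris Rebert's snippet is cut off above; a hedged reconstruction of the likely shape follows, assuming each record is a tuple whose first field identifies it (the sample data is illustrative, not from the thread):

```python
# Plausible completion of the truncated expression: collect the first
# field of each record from the generator into a set, which keeps one
# entry per distinct key.
records = [(1, "alpha"), (2, "beta"), (1, "alpha")]   # stand-in data
counted_recs = set(rec[0] for rec in records)
print(counted_recs)
```

A set built this way discards duplicates but also the original order; the order-preserving loop earlier in the thread is the alternative when order matters.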

Re: removing duplication from a huge list.

2009-02-26 Thread odeits
On Feb 26, 9:15 pm, Chris Rebert c...@rebertia.com wrote: On Thu, Feb 26, 2009 at 8:49 PM, Benjamin Peterson benja...@python.org wrote: Shanmuga Rajan m.shanmugarajan at gmail.com writes: If anyone suggests a better solution then I will be very happy. Advance thanks for any help. Shan Use