odeits:
Although this is true, that is more of an answer to the question "How
do I remove duplicates from a huge list in Unix?".
Don't you like cygwin?
Bye,
bearophile
--
http://mail.python.org/mailman/listinfo/python-list
On Feb 27, 9:55 am, Falcolas garri...@gmail.com wrote:
If order did matter, and the list itself couldn't be stored in memory,
I would personally do some sort of hash of each item (or something as
simple as first 5 bytes, last 5 bytes and length), keeping a reference
to which item the hash
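The archive cuts the message off here, but the cheap-signature idea it describes can be sketched as follows (a hedged illustration, not Falcolas's actual code; the function name is made up). Note that a signature match only flags a *candidate* duplicate, which still has to be verified against the full records, as the thread goes on to discuss:

```python
def find_candidates(items):
    """Return (signatures seen, items flagged as possible duplicates)."""
    seen = set()
    candidates = []
    for item in items:
        # Cheap stand-in for a hash, as described above:
        # first 5 bytes, last 5 bytes, and the length.
        sig = (item[:5], item[-5:], len(item))
        if sig in seen:
            candidates.append(item)  # possible duplicate; verify before dropping
        else:
            seen.add(sig)
    return seen, candidates
```

The point of the signature is that it is tiny and fixed-size, so millions of them fit in memory even when the records themselves do not.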
odeits:
How big of a list are we talking about? If the list is so big that the
entire list cannot fit in memory at the same time, this approach won't
work, e.g. removing duplicate lines from a very large file.
If the data are lines of a file, and keeping the original order isn't
important, then
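The archived message is truncated here; given the follow-up at the top of the thread about Unix and cygwin, it presumably pointed at an external `sort`. A hedged sketch of that route, driven from Python (the function and file names are illustrative):

```python
import subprocess

def dedup_lines(src: str, dst: str) -> None:
    # GNU sort does an external (disk-backed) sort, so the input need
    # not fit in memory; -u emits each distinct line exactly once.
    subprocess.run(["sort", "-u", src, "-o", dst], check=True)
```

This loses the original line order, which is exactly the caveat bearophile states.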
2009/2/27 odeits ode...@gmail.com:
How big of a list are we talking about? If the list is so big that the
entire list cannot fit in memory at the same time, this approach won't
work, e.g. removing duplicate lines from a very large file.
We were told in the original question: more than 15 million records,
and it won't all fit into memory. So your observation is pertinent.
2009/2/27 Steve Holden st...@holdenweb.com:
Assuming no duplicates, how does this help? You still have to verify
collisions.
Pretty brutish and slow, but it's the first algorithm which comes to
mind. Of course, I'm assuming that the list items are long enough to
warrant using a hash and not
On Feb 27, 10:07 am, Steve Holden st...@holdenweb.com wrote:
Assuming no duplicates, how does this help? You still have to verify
collisions.
Absolutely. But a decent hashing function (particularly since you know
quite a bit about the data beforehand) will have very few collisions
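The verify step Steve Holden is asking about can be sketched like this (a hedged outline, not anyone's posted code): bucket records by hash, and on a hash match compare the full records before declaring a duplicate. In the on-disk variant you would store a file offset per bucket entry instead of the record itself and seek back to re-read it.

```python
from collections import defaultdict

def dedup_with_verify(records):
    # hash -> list of distinct records seen with that hash. A hash match
    # alone is not proof of duplication; the full records are compared.
    seen = defaultdict(list)
    for rec in records:
        bucket = seen[hash(rec)]
        if rec in bucket:      # verified duplicate: same hash AND same data
            continue
        bucket.append(rec)
        yield rec
```

With a decent hash function almost every bucket holds one record, so the full comparison runs only on the rare collision or genuine duplicate.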
Tim Rowe digi...@gmail.com writes:
We were told in the original question: more than 15 million records,
and it won't all fit into memory. So your observation is pertinent.
That is not terribly many records by today's standards. The knee-jerk
approach is to sort them externally, then make a
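The message is cut off, but the external-sort route it starts to describe can be sketched as follows (a hedged outline under the assumption of newline-free string records): sort fixed-size chunks in memory, spill each sorted run to a temporary file, then stream a merged pass in which duplicates are adjacent and trivially dropped.

```python
import heapq
import tempfile

def _spill(sorted_lines):
    # Write one sorted run to a temp file and return a lazy reader over it.
    # Assumes the lines carry no trailing newline of their own.
    f = tempfile.TemporaryFile(mode="w+")
    f.writelines(line + "\n" for line in sorted_lines)
    f.seek(0)
    return (line.rstrip("\n") for line in f)

def external_dedup(lines, chunk_size=100_000):
    """Yield distinct lines without holding them all in memory at once."""
    chunks, buf = [], []
    for line in lines:
        buf.append(line)
        if len(buf) >= chunk_size:
            chunks.append(_spill(sorted(buf)))
            buf = []
    if buf:
        chunks.append(_spill(sorted(buf)))
    # Merge the sorted runs; duplicates are now adjacent, so comparing
    # each line against the previous one removes them in a single pass.
    prev = None
    for line in heapq.merge(*chunks):
        if line != prev:
            yield line
        prev = line
```

Only one chunk plus one line per run is in memory at any moment, which is why 15 million records is unremarkable for this approach.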
Hi,
I have a list of records with some details (more than 15 million
records), with duplication.
I need to iterate through every record and eliminate the duplicate
records.
Currently I am using a script like this:
counted_recs = [ ]
x = some_fun() # will return a generator, this generator
Shanmuga Rajan m.shanmugarajan at gmail.com writes:
If anyone suggests a better solution then I will be very happy. Advance
thanks for any help. Shan
Use a set.
On Thu, Feb 26, 2009 at 8:49 PM, Benjamin Peterson benja...@python.org wrote:
Shanmuga Rajan m.shanmugarajan at gmail.com writes:
If anyone suggests a better solution then I will be very happy. Advance
thanks for any help. Shan
Use a set.
To expand on that a bit:
counted_recs = set(rec[0] for rec in x)
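To flesh out the set-based suggestion when the original order does matter (a minimal sketch, assuming the records are hashable and the *distinct* records fit in memory even if a second full list would not):

```python
def unique_everseen(records):
    # Yield each record the first time it appears, preserving order;
    # only the set of already-seen records is held in memory.
    seen = set()
    for rec in records:
        if rec not in seen:
            seen.add(rec)
            yield rec
```

Because it is a generator, it composes directly with the generator returned by `some_fun()` in the original question.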