On Feb 27, 9:55 am, Falcolas wrote:
> If order did matter, and the list itself couldn't be stored in memory,
> I would personally do some sort of hash of each item (or something as
> simple as first 5 bytes, last 5 bytes and length), keeping a reference
> to which item the hash belongs, sort and iterate through looking for
> duplicate hashes.
odeits:
> Although this is true, that is more of an answer to the question "How
> do I remove duplicates from a huge list in Unix?".
Don't you like Cygwin?
Bye,
bearophile
Tim Rowe writes:
> We were told in the original question: more than 15 million records,
> and it won't all fit into memory. So your observation is pertinent.
That is not terribly many records by today's standards. The knee-jerk
approach is to sort them externally, then make a linear pass skipping
the duplicates.
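A minimal sketch of that external-sort approach, assuming the records are
lines of a text file (the file names, chunk size and helper name are only
illustrative, not from the thread):

import heapq
import itertools
import os
import tempfile

def dedupe_external(in_path, out_path, chunk_size=1000000):
    # Pass 1: sort the input in memory-sized chunks, writing each
    # sorted run to its own temporary file.
    runs = []
    with open(in_path) as src:
        while True:
            chunk = list(itertools.islice(src, chunk_size))
            if not chunk:
                break
            chunk.sort()
            run = tempfile.NamedTemporaryFile(mode="w+", delete=False)
            run.writelines(chunk)
            run.seek(0)
            runs.append(run)
    # Pass 2: merge the sorted runs; equal lines are now adjacent,
    # so comparing with the previous line is enough to skip repeats.
    with open(out_path, "w") as dst:
        previous = None
        for line in heapq.merge(*runs):
            if line != previous:
                dst.write(line)
                previous = line
    for run in runs:
        run.close()
        os.unlink(run.name)

Only one chunk is ever held in memory at a time, which is the whole point
for 15 million records.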
Falcolas wrote:
> On Feb 27, 10:07 am, Steve Holden wrote:
>> Assuming no duplicates, how does this help? You still have to verify
>> collisions.
>
> Absolutely. But a decent hashing function (particularly since you know
> quite a bit about the data beforehand) will have very few collisions
> (theoretically no collisions, assuming a well-chosen hash).
On Feb 27, 10:07 am, Steve Holden wrote:
> Assuming no duplicates, how does this help? You still have to verify
> collisions.
Absolutely. But a decent hashing function (particularly since you know
quite a bit about the data beforehand) will have very few collisions
(theoretically no collisions, assuming a well-chosen hash).
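A sketch of that fingerprint-and-verify idea, again assuming the records
are lines of a file; MD5 as the fingerprint and all the names here are
just for illustration:

import hashlib
import itertools

def fingerprint(line):
    # A strong hash as a compact surrogate for the full record.
    return hashlib.md5(line.encode("utf-8")).digest()

def unique_offsets(path):
    # Pass 1: keep only (fingerprint, file offset) pairs in memory,
    # never the records themselves.
    pairs = []
    with open(path) as f:
        while True:
            offset = f.tell()
            line = f.readline()
            if not line:
                break
            pairs.append((fingerprint(line), offset))
    pairs.sort()  # equal fingerprints become adjacent

    # Pass 2: within each run of equal fingerprints, re-read the real
    # records to verify the collision before dropping anything.
    keep = []
    with open(path) as f:
        for _, group in itertools.groupby(pairs, key=lambda p: p[0]):
            contents = set()
            for _, offset in group:
                f.seek(offset)
                record = f.readline()
                if record not in contents:
                    contents.add(record)
                    keep.append(offset)
    return sorted(keep)  # offsets of the unique records, in file order

The second pass is exactly Steve Holden's point above: a matching hash
alone proves nothing, so records that share a fingerprint are re-read
and compared for real.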
2009/2/27 Steve Holden:
> Assuming no duplicates, how does this help? You still have to verify
> collisions.
>
>> Pretty brutish and slow, but it's the first algorithm which comes to
>> mind. Of course, I'm assuming that the list items are long enough to
>> warrant using a hash and not the values themselves.
How big of a list are we talking about? If the list is so big that the
entire list cannot fit in memory at the same time this approach won't
work, e.g. removing duplicate lines from a very large file.
We were told in the original question: more than 15 million records,
and it won't all fit into memory. So your observation is pertinent.
Falcolas wrote:
> On Feb 27, 8:33 am, Tim Rowe wrote:
>> 2009/2/27 odeits:
>>
>>> How big of a list are we talking about? If the list is so big that the
>>> entire list cannot fit in memory at the same time this approach won't
>>> work, e.g. removing duplicate lines from a very large file.
>> We were told in the original question: more than 15 million records,
>> and it won't all fit into memory. So your observation is pertinent.
On Feb 27, 8:33 am, Tim Rowe wrote:
> 2009/2/27 odeits:
>
> > How big of a list are we talking about? If the list is so big that the
> > entire list cannot fit in memory at the same time this approach won't
> > work, e.g. removing duplicate lines from a very large file.
>
> We were told in the original question: more than 15 million records,
> and it won't all fit into memory. So your observation is pertinent.
2009/2/27 odeits:
> How big of a list are we talking about? If the list is so big that the
> entire list cannot fit in memory at the same time this approach won't
> work, e.g. removing duplicate lines from a very large file.
We were told in the original question: more than 15 million records,
and it won't all fit into memory. So your observation is pertinent.
On Feb 27, 1:18 am, Stefan Behnel wrote:
> bearophileh...@lycos.com wrote:
> > odeits:
> >> How big of a list are we talking about? If the list is so big that the
> >> entire list cannot fit in memory at the same time this approach won't
> >> work, e.g. removing duplicate lines from a very large file.
bearophileh...@lycos.com wrote:
> odeits:
>> How big of a list are we talking about? If the list is so big that the
>> entire list cannot fit in memory at the same time this approach won't
>> work, e.g. removing duplicate lines from a very large file.
>
> If the data are lines of a file, and keeping the original order isn't
> important, then the Unix sort command (which copes with files larger
> than available memory) followed by uniq will do the job.
odeits:
> How big of a list are we talking about? If the list is so big that the
> entire list cannot fit in memory at the same time this approach won't
> work, e.g. removing duplicate lines from a very large file.
If the data are lines of a file, and keeping the original order isn't
important, then the Unix sort command (which copes with files larger
than available memory) followed by uniq will do the job.
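In script form that route is a single call out to the external sort (a
sketch: the -u flag of GNU/POSIX sort emits each distinct line once, and
the file names are placeholders):

import subprocess

# GNU sort does a disk-based merge sort, so the input need not fit
# in RAM; -u ("unique") drops the duplicate lines after sorting.
subprocess.check_call(["sort", "-u", "-o", "records.unique.txt",
                       "records.txt"])

On Windows the same thing runs under Cygwin, which is presumably the
point of the Cygwin remark above.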
On Feb 26, 9:15 pm, Chris Rebert wrote:
> On Thu, Feb 26, 2009 at 8:49 PM, Benjamin Peterson wrote:
> > Shanmuga Rajan writes:
>
> >> If any one suggests a better solution then I will be very happy. Advance
> >> thanks for any help. Shan
>
> > Use a set.
>
> To expand on that a bit:
>
> counted_recs = set(rec[0] for rec in some_fun())
On Thu, Feb 26, 2009 at 8:49 PM, Benjamin Peterson wrote:
> Shanmuga Rajan writes:
>
>> If any one suggests a better solution then I will be very happy. Advance
>> thanks for any help. Shan
>
> Use a set.
To expand on that a bit:
counted_recs = set(rec[0] for rec in some_fun())
#or in Python 3.0:
counted_recs = {rec[0] for rec in some_fun()}
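If the keys (though not the whole records) fit in memory and the original
order matters, the same set can back a small generator; a sketch, with
process() standing in for whatever is done with each record:

def unique_first_seen(records, key=lambda rec: rec[0]):
    # Remembers only each record's key, not the record itself;
    # yields records in input order, keeping the first occurrence.
    seen = set()
    for rec in records:
        k = key(rec)
        if k not in seen:
            seen.add(k)
            yield rec

# Usage with the original poster's generator:
# for rec in unique_first_seen(some_fun()):
#     process(rec)  # process() is a placeholder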
Shanmuga Rajan writes:
> If any one suggests a better solution then I will be very happy. Advance
> thanks for any help. Shan
Use a set.
Hi,
I have a list of records with some details (more than 15 million records),
with duplication.
I need to iterate through every record and eliminate the duplicate
records.
Currently I am using a script like this:
counted_recs = [ ]
x = some_fun() # will return a generator, this generator yields the records
for rec in x:
    if rec[0] not in counted_recs:
        counted_recs.append(rec[0])