John Doe:
> Dr.Ruud:
>> John Doe:

>>> In contrast to the other presented solutions, which use a hash
>>> slurping all data from the input file, the advantage of my
>>> solution is that it is *capable of handling 300 GB input files*
>>> even with 128 MB of RAM.
>>
>> Also in contrast is that your solution assumes sorted input, which
>> wasn't announced.
>
> But it was present in the sample data. If it hadn't been, I would
> have suggested a preliminary non-in-memory (external) sort technique.

It also wasn't mentioned that the dataset is several orders of magnitude
larger than the sample data. The sample averages two values per key and
its highest key is 'id999'; that adds up to roughly 5 KB + 2 * 6 KB, a
maximum of about 20 KB, not 300 GB.
;)

I once did this in SQL (see the sketch below). It uses a subquery to
number the values (sorted) per key, and a cross-tab query that uses
those numbers for column names. Adding a column with the maximum index
per key is then also easy. When you know tricks like that, you hardly
ever need a scripting language, even with the largest sets of data. But
that is just what I believe.
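
Roughly along these lines (a minimal sketch only; it assumes a table
data(id, val), a dialect with window functions such as ROW_NUMBER(),
and a known maximum number of values per key; the names and the three
value columns are illustrative, and a correlated subquery could do the
numbering instead where window functions aren't available):

  SELECT id,
         -- highest per-key index, i.e. how many values this key has
         MAX(n)                            AS max_index,
         -- cross-tab: pick the value whose per-key number matches
         MAX(CASE WHEN n = 1 THEN val END) AS val1,
         MAX(CASE WHEN n = 2 THEN val END) AS val2,
         MAX(CASE WHEN n = 3 THEN val END) AS val3
  FROM (
      -- subquery: number the values (sorted) within each key
      SELECT id, val,
             ROW_NUMBER() OVER (PARTITION BY id ORDER BY val) AS n
      FROM data
  ) AS numbered
  GROUP BY id;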

-- 
Grtz, Ruud

