Re: Sorting an extremely LARGE file

2011-08-07 Thread Shawn H Corey

On 11-08-07 11:28 AM, Ramprasad Prasad wrote:

I have a file that contains records of customer interaction.
The first column of the file is the batch number (INT), and the other
columns are date/time, close time, etc.

I have to sort the entire file in order of the first column ... but the
problem is that the file is extremely huge.

For the largest customer it contains 1100 million records and the file is
44 GB!
How can I sort this big a file?



First, consider putting it in a database.

Split the file into little ones, sort them, merge-sort them back together.
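
A rough Perl sketch of the splitting step, assuming comma-separated records
with the numeric batch number in the first column (the chunk size and file
names here are just for illustration):

#!/usr/bin/perl
use strict;
use warnings;

my $chunk_size = 1_000_000;    # records per in-memory chunk
my $chunk_no   = 0;
my @buffer;

open my $in, '<', 'interactions.csv' or die $!;
while ( my $line = <$in> ) {
    push @buffer, $line;
    write_chunk() if @buffer == $chunk_size;
}
write_chunk() if @buffer;      # the last, partly filled chunk
close $in;

sub write_chunk {
    my $file = sprintf 'chunk%04d.sorted', $chunk_no++;
    open my $out, '>', $file or die "$file: $!";
    # sort this chunk in memory on the numeric first column
    print {$out} sort { ( split /,/, $a )[0] <=> ( split /,/, $b )[0] } @buffer;
    close $out;
    @buffer = ();
}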


--
Just my 0.0002 million dollars worth,
  Shawn

Confusion is the first step of understanding.

Programming is as much about organization and communication
as it is about coding.

The secret to great software:  Fail early & often.

Eliminate software piracy:  use only FLOSS.

"Make something worthwhile."  -- Dear Hunter

--
To unsubscribe, e-mail: beginners-unsubscr...@perl.org
For additional commands, e-mail: beginners-h...@perl.org
http://learn.perl.org/




Re: Sorting an extremely LARGE file

2011-08-07 Thread Shawn H Corey

On 11-08-07 11:46 AM, Ramprasad Prasad wrote:

I used a MySQL database, but the ORDER BY clause used to hang the
process indefinitely.
If I sort files in smaller chunks, how can I merge them back?



Please use "Reply All" when responding to a message on this list.

You need two temporary files and lots of disk space.

1. Open the first and second sorted files.
2. Read one record from each.
3. Write the lesser record to the first temporary file.
4. Read another record from the file that supplied the record you just wrote.
5. If not eof, goto 3.
6. Write the record still in hand, and then the rest of the other file, to
the end of the temporary file.

Repeat the above with the first temporary file and the third sorted 
file, writing the result to the second temporary file.


Repeat the above with the second temporary file and the fourth sorted 
file, writing the result to the first temporary file.


And so on...

Rename the final temporary file to your sorted file name.
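
A minimal Perl sketch of one such merge pass, assuming comma-separated
records with the numeric key in the first column (the chunk and temporary
file names are just for illustration):

#!/usr/bin/perl
use strict;
use warnings;

sub merge_two {
    my ( $file_a, $file_b, $out_file ) = @_;
    open my $fh_a, '<', $file_a   or die "$file_a: $!";
    open my $fh_b, '<', $file_b   or die "$file_b: $!";
    open my $out,  '>', $out_file or die "$out_file: $!";

    my $rec_a = <$fh_a>;
    my $rec_b = <$fh_b>;
    while ( defined $rec_a and defined $rec_b ) {
        if ( ( split /,/, $rec_a )[0] <= ( split /,/, $rec_b )[0] ) {
            print {$out} $rec_a;
            $rec_a = <$fh_a>;
        }
        else {
            print {$out} $rec_b;
            $rec_b = <$fh_b>;
        }
    }
    # one input is exhausted; copy the record still in hand and the rest
    print {$out} $rec_a, <$fh_a> if defined $rec_a;
    print {$out} $rec_b, <$fh_b> if defined $rec_b;
    close $_ for $fh_a, $fh_b, $out;
}

merge_two( 'chunk0000.sorted', 'chunk0001.sorted', 'tmp_a' );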


--
Just my 0.0002 million dollars worth,
  Shawn

Confusion is the first step of understanding.

Programming is as much about organization and communication
as it is about coding.

The secret to great software:  Fail early & often.

Eliminate software piracy:  use only FLOSS.

"Make something worthwhile."  -- Dear Hunter

--
To unsubscribe, e-mail: beginners-unsubscr...@perl.org
For additional commands, e-mail: beginners-h...@perl.org
http://learn.perl.org/




Re: Sorting an extremely LARGE file

2011-08-07 Thread Ramprasad Prasad
On 7 August 2011 21:24, Shawn H Corey  wrote:

> On 11-08-07 11:46 AM, Ramprasad Prasad wrote:
>
>> I used a mysql database , but the order by clause used to hang the
>> process indefinitely
>> If I sort files in smaller chunks how can I merge them back ??
>>
>>
> Please use "Reply All" when responding to a message on this list.
>
> You need two temporary files and lots of disk space.
>
> 1. Open the first and second sorted files.
> 2. Read one record from each.
> 3. Write the lesser record to the first temporary file.
> 4. Read another record from the file where you got the record you wrote.
> 5. If not eof, goto 3.
> 6. Write the remaining of the other file to the end of the temporary file.
>
> Repeat the above with the first temporary file and the third sorted file,
> writing the result to the second temporary file.
>
> Repeat the above with the second temporary file and the fourth sorted file,
> writing the result to the first temporary file.
>
> And so on...
>
> Rename the final temporary file to your sorted file name.
>
>
>
Is there already a CPAN module that does this?


Re: Sorting an extremely LARGE file

2011-08-07 Thread Dr.Ruud

On 2011-08-07 17:28, Ramprasad Prasad wrote:


I have a file that contains records of customer interaction
The first column of the file is the batch number(INT) , and other columns
are date time , close time etc etc

I have to sort the entire file in order of the first column .. but the
problem is that the file is extremely huge.

For the largest customer it contains 1100 million records and the file is
44GB !
how can I sort this big a file


I would use MySQL.

An alternative is the Linux sort executable.

To split up the file, as Shawn suggested, you could use Perl.
Split for example based on a few initial characters.
Then sort each file independently, and concat them.
(BTW, are the rows representing fixed-width records?)

Using a database is fine for this. I think you must have been using it 
wrongly.


--
Ruud

--
To unsubscribe, e-mail: beginners-unsubscr...@perl.org
For additional commands, e-mail: beginners-h...@perl.org
http://learn.perl.org/




Re: Sorting an extremely LARGE file

2011-08-07 Thread Rajeev Prasad
hi, you can try this: first get only that field (sed/awk/perl) which you want
to sort on into a file. sort that file, which i assume would be a lot less in
size than your current file/table. then run a loop on the main file using the
sorted file as a variable.

here is the logic in shell (using column 1 as the sort field):

awk '{print $1}' large-file > tmp-file

sort tmp-file > sorted-tmp-file

for id in `cat sorted-tmp-file`; do grep $id large-file >> sorted-large-file; done

From: Ramprasad Prasad 
To: Shawn H Corey 
Cc: Perl Beginners 
Sent: Sunday, August 7, 2011 11:01 AM
Subject: Re: Sorting an extremely LARGE file

On 7 August 2011 21:24, Shawn H Corey  wrote:

> On 11-08-07 11:46 AM, Ramprasad Prasad wrote:
>
>> I used a mysql database , but the order by clause used to hang the
>> process indefinitely
>> If I sort files in smaller chunks how can I merge them back ??
>>
>>
> Please use "Reply All" when responding to a message on this list.
>
> You need two temporary files and lots of disk space.
>
> 1. Open the first and second sorted files.
> 2. Read one record from each.
> 3. Write the lesser record to the first temporary file.
> 4. Read another record from the file where you got the record you wrote.
> 5. If not eof, goto 3.
> 6. Write the remaining of the other file to the end of the temporary file.
>
> Repeat the above with the first temporary file and the third sorted file,
> writing the result to the second temporary file.
>
> Repeat the above with the second temporary file and the fourth sorted file,
> writing the result to the first temporary file.
>
> And so on...
>
> Rename the final temporary file to your sorted file name.
>
>
>
There would be a CPAN module already doing this ??

Re: Sorting an extremely LARGE file

2011-08-07 Thread Paul Johnson
On Sun, Aug 07, 2011 at 08:58:14PM +0530, Ramprasad Prasad wrote:

> I have a file that contains records of customer interaction
> The first column of the file is the batch number(INT) , and other columns
> are date time , close time etc etc
> 
> I have to sort the entire file in order of the first column .. but the
> problem is that the file is extremely huge.
> 
> For the largest customer it contains 1100 million records and the file is
> 44GB !
> how can I sort this big a file

Is there any reason not to use the system sort?  GNU sort uses an
external R-way merge.  It's designed for this sort of thing.

-- 
Paul Johnson - p...@pjcj.net
http://www.pjcj.net

-- 
To unsubscribe, e-mail: beginners-unsubscr...@perl.org
For additional commands, e-mail: beginners-h...@perl.org
http://learn.perl.org/




Re: Sorting an extremely LARGE file

2011-08-07 Thread shawn wilson
On Aug 7, 2011 1:15 PM, "Paul Johnson"  wrote:
>
> On Sun, Aug 07, 2011 at 08:58:14PM +0530, Ramprasad Prasad wrote:
>
> > I have a file that contains records of customer interaction
> > The first column of the file is the batch number(INT) , and other
columns
> > are date time , close time etc etc
> >
> > I have to sort the entire file in order of the first column .. but the
> > problem is that the file is extremely huge.
> >
> > For the largest customer it contains 1100 million records and the file
is
> > 44GB !
> > how can I sort this big a file
>
> Is there any reason not to use the system sort?  GNU sort uses an
> external R-way merge.  It's designed for this sort of thing.
>

The Unix sort is pretty fast and it will work. The problem with it is that
it seems to buffer overflow somewhere between 2 and 4 gigs, IIRC. A database
is perfect for this. However, I think the problem was that mysql's order by
is slow as hell. It can be sped up (slightly) with an index. You might
consider postgresql as their order by /should/ be quite a bit faster. You
might also try mongo or couch - though you'll put the sort logic in the
script and I haven't used either in perl.

If you've already got it in a db, I'd create the index, start the query,
watch your resources get pegged, and wait. You'll get it eventually. :)


Re: Sorting an extremely LARGE file

2011-08-07 Thread Shawn H Corey

On 11-08-07 03:20 PM, shawn wilson wrote:

It can be sped up (slightly) with an index.


Indexes in SQL don't normally speed up sorting.  What they're best at is 
selecting a limited number of records, usually less than 10% of the 
total.  Otherwise, they just get in the way.


The best you can do with a database is to keep the table sorted by the 
key most commonly used.  This is different than an index.  An index is 
an additional file that records the keys and the offset to the record in 
the table file.  The index file is sorted by its key.



--
Just my 0.0002 million dollars worth,
  Shawn

Confusion is the first step of understanding.

Programming is as much about organization and communication
as it is about coding.

The secret to great software:  Fail early & often.

Eliminate software piracy:  use only FLOSS.

"Make something worthwhile."  -- Dear Hunter

--
To unsubscribe, e-mail: beginners-unsubscr...@perl.org
For additional commands, e-mail: beginners-h...@perl.org
http://learn.perl.org/




Re: Sorting an extremely LARGE file

2011-08-07 Thread Rob Dixon

On 07/08/2011 20:30, Shawn H Corey wrote:

On 11-08-07 03:20 PM, shawn wilson wrote:


It can be sped up (slightly) with an index.


Indexes in SQL don't normally speed up sorting. What they're best at is
selecting a limited number of records, usually less than 10% of the
total. Otherwise, they just get in the way.

The best you can do with a database is to keep the table sorted by the
key most commonly used. This is different than an index. An index is an
additional file that records the keys and the offset to the record in
the table file. The index file is sorted by its key.


Exactly. So to sort a database in the order of its key field all that is
necessary is to read sequentially through the index and pull out the
corresponding record.

I would suggest that the OP could do this 'manually', i.e. build a
separate index file with just the key fields and pointers into the
primary file. Once that is done the operation is trivial: even more so
if the primary file has fixed-length records (and if not I would like a
word with the person who decided on a 44G file that must be read
sequentially!).
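
A rough Perl sketch of that approach, assuming comma-separated lines with the
batch number first (file names are invented; the in-memory sort of the index
also assumes the index itself fits in RAM, otherwise sort(1) could order it
instead):

#!/usr/bin/perl
use strict;
use warnings;

# pass 1: record "batch-number <tab> byte-offset" for every line
open my $in,  '<', 'interactions.csv' or die $!;
open my $idx, '>', 'interactions.idx' or die $!;
while (1) {
    my $offset = tell $in;
    my $line   = <$in>;
    last unless defined $line;
    my ($batch) = split /,/, $line, 2;
    print {$idx} "$batch\t$offset\n";
}
close $idx;

# pass 2: walk the index in key order and pull each record out of the big file
open $idx, '<', 'interactions.idx' or die $!;
my @index = map { chomp; [ split /\t/ ] } <$idx>;
close $idx;

open my $out, '>', 'interactions.sorted' or die $!;
for my $entry ( sort { $a->[0] <=> $b->[0] } @index ) {
    seek $in, $entry->[1], 0 or die $!;
    print {$out} scalar <$in>;
}
close $_ for $in, $out;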

Cheers,

Rob

--
To unsubscribe, e-mail: beginners-unsubscr...@perl.org
For additional commands, e-mail: beginners-h...@perl.org
http://learn.perl.org/




Re: Sorting an extremely LARGE file

2011-08-07 Thread shawn wilson
On Sun, Aug 7, 2011 at 15:58, Rob Dixon  wrote:
> On 07/08/2011 20:30, Shawn H Corey wrote:
>>
>> On 11-08-07 03:20 PM, shawn wilson wrote:
>>>
>>> It can be sped up (slightly) with an index.
>>
>> Indexes in SQL don't normally speed up sorting. What they're best at is
>> selecting a limited number of records, usually less than 10% of the
>> total. Otherwise, they just get in the way.
>>
>> The best you can do with a database is to keep the table sorted by the
>> key most commonly used. This is different than an index. An index is an
>> additional file that records the keys and the offset to the record in
>> the table file. The index file is sorted by its key.
>
> Exactly. So to sort a database in the order of its key field all that is
> necessary is to read sequentially through the index and pull out the
> corresponding record.
>
> I would suggest that the OP could do this 'manually'. i.e. build a
> separate index file with just the key fields and pointers into the
> primary file. Once that is done the operation is trivial: even more so
> if the primary file has fixed-length records (and if not I would like a
> word with the person who decided on a 44G file that must be read
> sequentially!).
>

i really do think it could be done in perl pretty easily:

use Text::CSV;
my $csv = Text::CSV->new( { binary => 1 } );
my %idx;                                       # key => line numbers where it occurs
while (<>) {
    $csv->parse($_) or next;
    push @{ $idx{ ( $csv->fields )[0] } }, $.; # first column is the sort key
}

then you have a nice data structure of your values and duplicates
along with line numbers. you can then go and loop again and pull out
your lines.

i still think this is the wrong approach: the data is already in a db, should
stay in a db, and should never have been put in a 44G flat file in the first
place. but ...

-- 
To unsubscribe, e-mail: beginners-unsubscr...@perl.org
For additional commands, e-mail: beginners-h...@perl.org
http://learn.perl.org/




Re: Sorting an extremely LARGE file

2011-08-07 Thread Uri Guttman
> "RP" == Rajeev Prasad  writes:

  RP> hi, you can try this: first get only that field (sed/awk/perl)
  RP> whihc you want to sort on in a file. sort that file which i assume
  RP> would be lot less in size then your current file/table. then run a
  RP> loop on the main file using sorted file as variable.

  RP>  
  RP> here is the logic in shell:
  RP>  
  RP> awk '{print $}'  > tmp-file
  RP>  
  RP> sort 
  RP>  

  RP> for id in `cat `;do grep $id  >> 
sorted-large-file;done

have you thought about the time this will take? you are doing an O( N**2
) grep there. you are looping over all N keys and then scanning the file
N lines for each key. that will take a very long time for such a large
file. as others have said, either use the sort utility or do a
merge/sort on the records. your way is effectively a slow bubble sort!

uri

-- 
Uri Guttman  --  uri AT perlhunter DOT com  ---  http://www.perlhunter.com --
  Perl Developer Recruiting and Placement Services  -
-  Perl Code Review, Architecture, Development, Training, Support ---

--
To unsubscribe, e-mail: beginners-unsubscr...@perl.org
For additional commands, e-mail: beginners-h...@perl.org
http://learn.perl.org/




Re: Sorting an extremely LARGE file

2011-08-07 Thread Ramprasad Prasad
Using the system Linux sort ... does not help.
On my dual quad-core machine (8 GB RAM), sort -n file takes 10
minutes and in the end produces no output.

When I put this data in MySQL, there is an index on the order-by
field ... but I guess keys don't help when you are selecting the
entire table.

I guess there is a serious need for re-architecting rather than
creating such monstrous files, but when people work with legacy systems
which worked fine when there was lower usage, and now you tell them you
need an overhaul because the current system doesn't scale ... that
takes a lot of convincing.

On 8/8/11, Uri Guttman  wrote:
>> "RP" == Rajeev Prasad  writes:
>
>   RP> hi, you can try this: first get only that field (sed/awk/perl)
>   RP> whihc you want to sort on in a file. sort that file which i assume
>   RP> would be lot less in size then your current file/table. then run a
>   RP> loop on the main file using sorted file as variable.
>
>   RP>
>   RP> here is the logic in shell:
>   RP>
>   RP> awk '{print $}'  > tmp-file
>   RP>
>   RP> sort 
>   RP>
>
>   RP> for id in `cat `;do grep $id  >>
> sorted-large-file;done
>
> have you thought about the time this will take? you are doing an O( N**2
> ) grep there. you are looping over all N keys and then scanning the file
> N lines for each key. that will take a very long time for such a large
> file. as others have said, either use the sort utility or do a
> merge/sort on the records. your way is effectively a slow bubble sort!
>
> uri
>
> --
> Uri Guttman  --  uri AT perlhunter DOT com  ---  http://www.perlhunter.com
> --
>   Perl Developer Recruiting and Placement Services
> -
> -  Perl Code Review, Architecture, Development, Training, Support
> ---
>

-- 
Sent from my mobile device

Thanks
Ram
  





-- 
To unsubscribe, e-mail: beginners-unsubscr...@perl.org
For additional commands, e-mail: beginners-h...@perl.org
http://learn.perl.org/




Re: Sorting an extremely LARGE file

2011-08-07 Thread Kenneth Wolcott
On Sun, Aug 7, 2011 at 22:10, Ramprasad Prasad  wrote:
>

[snip]

> I guess there is a serious need for re-architecting , rather than
> create such monstrous files, but when people work with legacy systems
> which worked fine when there was lower usage and now you tell then you
> need a overhaul because the current system doesn't scale ... That
> takes a lot of convincing

That's the nature of many jobs: you get what the people before you did.

They might not have been very good at what they did, or maybe they had
very short-sighted management, but that's the job: to do your best
to work with what you have.

I have to undo/fix/replace ten years (plus) of short-sighted damage in
my work.  Hey, I have a job and I'm thrilled.

Do what you can so that the people who replace you won't curse you and
throw poisoned darts at a picture of you.

Ken Wolcott

-- 
To unsubscribe, e-mail: beginners-unsubscr...@perl.org
For additional commands, e-mail: beginners-h...@perl.org
http://learn.perl.org/




Re: Sorting an extremely LARGE file

2011-08-08 Thread Paul Johnson
On Mon, Aug 08, 2011 at 10:40:12AM +0530, Ramprasad Prasad wrote:

> Using the system linux sort ... Does not help.
> On my dual quad core machine , (8 gb ram) sort -n file takes 10
> minutes and in the end produces no output.

Did you set any other options?

At a minimum you should set -T to tell sort where to put its temporary
files.  Otherwise they will go into /tmp which you probably don't want.
I expect this was your problem here.

You probably want to set --compress-program=gzip too.  This will
compress the temporary files, reducing IO (which would likely be the
limiting factor otherwise) and making use of some of those cores (which
would likely be sitting idle otherwise).  This will probably both speed
up the sort and reduce the disk space required.

This really is your solution if you just want to sort that file.
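
For example, a rough sketch of such an invocation driven from Perl (the file
names, the comma delimiter, and the temporary directory are assumptions):

#!/usr/bin/perl
use strict;
use warnings;

my @cmd = (
    'sort',
    '-n', '-t', ',', '-k', '1,1',  # numeric sort on the first comma-separated field
    '-T', '/data/sort-tmp',        # temporary files on a disk with plenty of space
    '--compress-program=gzip',     # compress the temporary runs to cut down on I/O
    '-o', 'interactions.sorted',
    'interactions.csv',
);
system(@cmd) == 0 or die "sort failed: $?";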

-- 
Paul Johnson - p...@pjcj.net
http://www.pjcj.net

-- 
To unsubscribe, e-mail: beginners-unsubscr...@perl.org
For additional commands, e-mail: beginners-h...@perl.org
http://learn.perl.org/




Re: Sorting an extremely LARGE file

2011-08-08 Thread shawn wilson
On Aug 8, 2011 12:11 AM, "Ramprasad Prasad"  wrote:
>
> Using the system linux sort ... Does not help.
> On my dual quad core machine , (8 gb ram) sort -n file takes 10
> minutes and in the end produces no output.
>

I had a smaller file and 32g to play with on a dual quad core (dl320). Sort
just can't handle more than 2~4 gigs.

> when I put this data in mysql , there is an index on the order by
> field ... But I guess keys don't help when you are selecting the
> entire table.
>
> I guess there is a serious need for re-architecting , rather than
> create such monstrous files, but when people work with legacy systems
> which worked fine when there was lower usage and now you tell then you
> need a overhaul because the current system doesn't scale ... That
> takes a lot of convincing
>

You're dealing with a similar issue to the one I had in this respect. The only
difference is that I created my own issue out of ignorance (having never
dealt with that much data; by setting my dl320 to splice, sort, and merge I
got through that). Well, with this data I just threw 30+ fields of a hundred
thousand lines (yes, you've still got more data to deal with) into one
table. This worked OK until my queries got a bit more complex, at which
point it took me 8+ hours to generate a report. I rethought the tables (or,
more like, read a bit and thought about what the hell I was doing), created a
half dozen relationships, and got the report down to a little under 2 hours.

My advice is to rethink your db. This is probably going to
mean rethinking software too (or, at least the queries it makes).

You might want to check out the #mysql freenode irc channel - most of them
are pompous but you'll get your answers. I think perl is less related to
your issue but the people in the #dbi and dbic perl irc channels are much
more easy going with their business.

> On 8/8/11, Uri Guttman  wrote:
> >> "RP" == Rajeev Prasad  writes:
> >
> >   RP> hi, you can try this: first get only that field (sed/awk/perl)
> >   RP> whihc you want to sort on in a file. sort that file which i assume
> >   RP> would be lot less in size then your current file/table. then run a
> >   RP> loop on the main file using sorted file as variable.
> >
> >   RP>
> >   RP> here is the logic in shell:
> >   RP>
> >   RP> awk '{print $}'  > tmp-file
> >   RP>
> >   RP> sort 
> >   RP>
> >
> >   RP> for id in `cat `;do grep $id  >>
> > sorted-large-file;done
> >
> > have you thought about the time this will take? you are doing an O( N**2
> > ) grep there. you are looping over all N keys and then scanning the file
> > N lines for each key. that will take a very long time for such a large
> > file. as others have said, either use the sort utility or do a
> > merge/sort on the records. your way is effectively a slow bubble sort!
> >
> > uri
> >
> > --
> > Uri Guttman  --  uri AT perlhunter DOT com  ---
http://www.perlhunter.com
> > --
> >   Perl Developer Recruiting and Placement Services
> > -
> > -  Perl Code Review, Architecture, Development, Training, Support
> > ---
> >
>
> --
> Sent from my mobile device
>
> Thanks
> Ram
>  
>
>
>
>
>
> --
> To unsubscribe, e-mail: beginners-unsubscr...@perl.org
> For additional commands, e-mail: beginners-h...@perl.org
> http://learn.perl.org/
>
>


Re: Sorting an extremely LARGE file

2011-08-08 Thread Paul Johnson
On Mon, Aug 08, 2011 at 09:25:48AM -0400, shawn wilson wrote:
> On Aug 8, 2011 12:11 AM, "Ramprasad Prasad"  wrote:
> >
> > Using the system linux sort ... Does not help.
> > On my dual quad core machine , (8 gb ram) sort -n file takes 10
> > minutes and in the end produces no output.
> 
> I had a smaller file and 32g to play with on a dual quad core (dl320). Sort
> just can't handle more than 2~4 gigs.

You keep saying this ...

GNU sort really can handle very large files.  I have even tested it to
make sure.  You may need to configure things slightly.  You may need to
locate some temporary disk space.  You may prefer to do things another
way.  But if you just want to sort a file, sort will do it for you.

-- 
Paul Johnson - p...@pjcj.net
http://www.pjcj.net

-- 
To unsubscribe, e-mail: beginners-unsubscr...@perl.org
For additional commands, e-mail: beginners-h...@perl.org
http://learn.perl.org/




Re: Sorting an extremely LARGE file

2011-08-08 Thread Shlomi Fish
Hi Ramprasad,

On Sun, 7 Aug 2011 20:58:14 +0530
Ramprasad Prasad  wrote:

> I have a file that contains records of customer interaction
> The first column of the file is the batch number(INT) , and other columns
> are date time , close time etc etc
> 
> I have to sort the entire file in order of the first column .. but the
> problem is that the file is extremely huge.
> 
> For the largest customer it contains 1100 million records and the file is
> 44GB !
> how can I sort this big a file
> 

I suggest splitting the file into bins. Each bin will contain the records with
batch numbers in a certain range (say 0-999,999; 1,000,000-1,999,999;
etc.). You should select the bins so the numbers are spread more or less
evenly. Then you sort each bin separately and append the bins in order.
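
A rough sketch of that, assuming comma-separated records with the batch
number first, bins 1,000,000 wide, and invented file names:

#!/usr/bin/perl
use strict;
use warnings;

use constant BIN_WIDTH => 1_000_000;

# pass 1: scatter the records into bin files by batch-number range
my %fh;
open my $in, '<', 'interactions.csv' or die $!;
while ( my $line = <$in> ) {
    my ($batch) = split /,/, $line, 2;
    my $bin = int( $batch / BIN_WIDTH );
    open $fh{$bin}, '>', "bin.$bin" or die $! unless $fh{$bin};
    print { $fh{$bin} } $line;
}
close $_ for $in, values %fh;

# pass 2: sort each bin on its own and append them in ascending order
open my $out, '>', 'interactions.sorted' or die $!;
for my $bin ( sort { $a <=> $b } keys %fh ) {
    open my $sorted, '-|', 'sort', '-n', '-t', ',', '-k', '1,1', "bin.$bin"
        or die $!;
    while ( my $line = <$sorted> ) { print {$out} $line }
    close $sorted;
}
close $out;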

Let me know if there's anything else you don't understand, and if you're
interested, I can be commissioned to write it for you (but it shouldn't be too
hard).

Regards,

Shlomi Fish

> 
> 
> 
> 



-- 
-
Shlomi Fish   http://www.shlomifish.org/
Why I Love Perl - http://shlom.in/joy-of-perl

Chuck Norris refactors 10 million lines of Perl code before lunch.

Please reply to list if it's a mailing list post - http://shlom.in/reply .

-- 
To unsubscribe, e-mail: beginners-unsubscr...@perl.org
For additional commands, e-mail: beginners-h...@perl.org
http://learn.perl.org/




Re: Sorting an extremely LARGE file

2011-08-08 Thread Shawn H Corey

On 11-08-08 10:23 AM, Shlomi Fish wrote:

I suggest splitting the files into bins. Each bin will contain the records with
the batch numbers in a certain range (say 0-999,999 ; 1,000,000-1,999,999,
etc.). You should select the bins so the numbers are spread more or less
evenly. Then you sort each bin separately, and then append the bins in order.


Well, if you want a Linux version rather than Perl, see:

man split
man sort
man comm

When you use comm(1), set its --output-delimiter to the empty string.

  --output-delimiter=''


--
Just my 0.0002 million dollars worth,
  Shawn

Confusion is the first step of understanding.

Programming is as much about organization and communication
as it is about coding.

The secret to great software:  Fail early & often.

Eliminate software piracy:  use only FLOSS.

"Make something worthwhile."  -- Dear Hunter

--
To unsubscribe, e-mail: beginners-unsubscr...@perl.org
For additional commands, e-mail: beginners-h...@perl.org
http://learn.perl.org/




Re: Sorting an extremely LARGE file

2011-08-08 Thread shawn wilson
On Mon, Aug 8, 2011 at 10:10, Paul Johnson  wrote:
> On Mon, Aug 08, 2011 at 09:25:48AM -0400, shawn wilson wrote:
>> On Aug 8, 2011 12:11 AM, "Ramprasad Prasad"  wrote:
>> >
>> > Using the system linux sort ... Does not help.
>> > On my dual quad core machine , (8 gb ram) sort -n file takes 10
>> > minutes and in the end produces no output.
>>
>> I had a smaller file and 32g to play with on a dual quad core (dl320). Sort
>> just can't handle more than 2~4 gigs.
>
> You keep saying this ...
>
> Gnu sort really can handle very large files.  I have even tested it to
> make sure.  You may need to configure things slightly.  You may need to
> locate some temporary disk space.  You may prefer to do things another
> way.  But if you just want to sort a file, sort will do it for you.
>

very well then (let's assume you're right). what are you saying is the
maximum file size that sort can handle, and with what amount of RAM and disk?

--
To unsubscribe, e-mail: beginners-unsubscr...@perl.org
For additional commands, e-mail: beginners-h...@perl.org
http://learn.perl.org/