Re: [Israel.pm] Perl Sort vs. GNU Sort

Assaf Gordon Mon, 20 Dec 2010 17:30:59 -0800

Hello Mikhael,

This is becoming off-topic, I hope it doesn't bother the perl people...

Mikhael Goikhman wrote, On 12/20/2010 06:45 PM:
> On 20 Dec 2010 09:14:55 -0500, Assaf Gordon wrote:
>> 
>> And as I've mentioned, sooner or later a patch will be accepted 
>> that will add this feature to sort, rendering my script obsolete.
> 
> I think your "sort --header N" addition can be useful for some
> cases, but in practice I don't think I miss such option too much.
> BTW, why no "--footer N" for completeness then?

I guess it depends on your use cases. Here, we work with many textual/tabular 
files, coming from many different sources.
It helps a lot if each file has a header line, explaining what each column 
means. Also a good portion of our analysis programs produce complicated text 
files with many many columns.
Having a header line keeps everything inside one file, and avoids the need for 
a meta-file (or worse - forgetting what some numeric columns mean after X 
months of not touching the file).

A footer line - I must admit that I have only seen it used in one scenario: 
when a CGI program produces big output (big = GBs of data),
and you want to make sure that the program successfully completed rather than 
timed-out.
I'm sure that if that becomes a common scenario - people will add/request a 
patch for that too.

> I would rather just have "--ignore-first-lines N" without preserving 
> these lines in the output. I don't mind to specify two options 
> "--header N --skip-header" for this instead, but it can be 
> confusing.

Again, that depends on your use case. For mine, I want those lines in the 
output file.
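
For illustration, keeping the header lines in place can also be approximated 
in plain shell. This is only a minimal sketch, assuming the input arrives on 
stdin; the function name sort_with_header is made up for this example, and it 
is neither the wrapper script discussed in this thread nor the proposed patch:

```shell
# Hypothetical helper: sort_with_header N [sort options...]
# Copies the first N lines of stdin through unchanged, then sorts the rest.
sort_with_header() {
    n=$1
    shift
    i=0
    # The shell's "read" consumes exactly one line at a time,
    # so the remaining input is left untouched for sort.
    while [ "$i" -lt "$n" ] && IFS= read -r line; do
        printf '%s\n' "$line"
        i=$((i + 1))
    done
    # Any remaining arguments are passed straight to sort.
    sort "$@"
}
```

Usage would look something like "sort_with_header 1 -k 2,2n < INPUT.txt".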

If you just want to skip the first X lines, then this will work (with recent 
versions of 'tail'; note that "+N" means "start output at line N", hence X+1):
  tail -n +$((X+1)) INPUT.txt | sort

> Actually, I always wondered why noone felt a need to add dozens of 
> useful (well designed) options like these to core utilities ages
> ago. Personally my scripts (mostly in perl) have options for many
> life cases. I think I know why, besides the obvious "be minimalistic
> and clean" approach.

I can't speak for the coreutil people, but it seems they are trying to be very 
conservative, adhere as closely as possible to POSIX, and only add something 
if it's not easily done with some minimal piping.
But...

> 
> About 15 years ago I wanted to patch head(1) and maybe tail(1) too to
> support something like "--ignore-first-lines N", because this is a so
> frequent task. Just like you want to get the first or last 2 lines 
> (head/tail -2) you may want to get all but the first or last 2 
> lines.
> 

There are many nice new features to many of the coreutils programs, including 
the ability to do what you've just described.
These features are probably less known, because the stable branches of many 
Linux distributions do not ship a recent-enough coreutils version,
or simply because they aren't mentioned in popular books.
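
As a rough sketch of the "all but the first or last N lines" case, assuming 
GNU coreutils (the helper names drop_first and drop_last are invented for 
this example):

```shell
# Hypothetical helpers; both read stdin and write stdout.

# Print everything except the first N lines.
# "tail -n +K" starts output at line K, hence N + 1.
drop_first() {
    tail -n "+$(( $1 + 1 ))"
}

# Print everything except the last N lines.
# A negative count for "head -n" is a GNU extension, not POSIX.
drop_last() {
    head -n "-$1"
}
```

For example, "drop_first 2 < INPUT.txt" and "drop_last 2 < INPUT.txt".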

> 
> Now some comments about your wrapper. The lastest "sort" version has 
> more options, like "--random-sort", "--batch-size", "--parallel".

Good catch, I'll add those (although "--batch-size" is not supported because 
"--merge" is not supported, but I'll add it for completeness).

> Using AGPL and not GPL for command line utilities is not justified.

This script is (partially) targeted for a web-based platform called "Galaxy", 
which runs programs on the server side and gives the user only the output file. 
I specifically want this script to be AGPL, so that even if someone runs a 
modified version on his server, he will (hopefully) share his modifications.
This might change in the future, but for now - I like AGPL.

> And "sort" also supports fancy option "+2 -3" (sort by third field). 
> For this I suggest just to filter @ARGV before you call GetOptions.

I wouldn't call it "fancy", I'd call it obsolete. From "info sort", section 
"2.9 Standard conformance":
"   Newer versions of POSIX are occasionally incompatible with older
versions.  For example, older versions of POSIX required the command
`sort +1' to sort based on the second and succeeding fields in each
input line, but starting with POSIX 1003.1-2001 the same command is
required to sort the file named `+1', and you must instead use the
command `sort -k 2' to get the field-based sort."

So I'd rather not support it for now.
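
For reference, the obsolete "+2 -3" corresponds to "-k 3,3" in the current 
syntax: the old positions count fields from zero and the end position is 
exclusive. A small sketch, assuming space-separated input (the function name 
sort_by_third_field is just for this example):

```shell
# Hypothetical helper: sort stdin by its third space-separated field only.
sort_by_third_field() {
    # -t ' '   use a single space as the field separator
    # -k 3,3   the key starts and ends at field 3 (old style: +2 -3)
    # LC_ALL=C keeps the comparison byte-wise and locale-independent
    LC_ALL=C sort -t ' ' -k 3,3
}
```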

> Regards, Mikhael.

Thank you for your comments and your time, I appreciate it, and I'll post an 
updated version soon.

-assaf
_______________________________________________
Perl mailing list
[email protected]
http://mail.perl.org.il/mailman/listinfo/perl
