Re: join with header line support

Pádraig Brady Tue, 26 Jan 2010 03:50:47 -0800

On 05/11/09 09:29, Pádraig Brady wrote:
> Assaf Gordon wrote:
>> Hello,
>>
>> Here's an improved version of the '--header' feature for join, with
>> tests, NEWS, doc updates.
>>
>> Reminder: with this option, one can join files even if they contain a
>> header line as the first line.
>>
>> I'll be happy to provide more examples and use cases, if needed.
>>
>> The patch is also available here:
>> http://cancan.cshl.edu/labmembers/gordon/coreutils8/join_header.patch
>
> Thanks for providing the download as thunderbird is mangling your patch again.
> I'll review it and expect to push it soon, unless there are objections.


Sorry for the delay in merging this. Our recent releases were restricted to
bug fixes only.

This --header option essentially allows one to use --check-order with headings.
`join` without --check-order already handles the common case where the headings
match in each file. With --check-order, however, it will often fail, namely
whenever the header sorts after the first line of data.

Note also that --header will join header lines from each file even if
they don't match, with headings from the first file taking precedence.
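
To make the --check-order interaction concrete, here is a minimal sketch. It
assumes a join built with this patch (so that --header exists) and a C locale;
the file names are only illustrative, and the data is the same as in the
header-2 test in the patch.

```shell
# Sketch only: assumes a GNU join that includes the --header option.
export LC_ALL=C                      # predictable collation: digits sort before letters
tmp=$(mktemp -d)                     # scratch directory (illustrative file names below)
printf 'ID Name\n1 A\n2 B\n' > "$tmp/names"
printf 'ID Color\n2 green\n' > "$tmp/colors"

# Without --header the heading "ID" sorts after "1", so the order check fails.
join --check-order "$tmp/names" "$tmp/colors" >/dev/null 2>&1 \
  && plain_status=0 || plain_status=$?

# With --header the first line of each file is exempt from the order check.
header_out=$(join --header --check-order "$tmp/names" "$tmp/colors")
printf '%s\n' "$header_out"
# prints:
#   ID Name Color
#   2 B green

rm -rf "$tmp"
```

Here plain_status ends up non-zero, while the --header run succeeds and emits
the joined heading line first.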

This raises two questions.

1. Since this is really only relevant to --check-order, perhaps
we should instead add it as a parameter like --check-order=+N, where N
is the number of leading lines to exempt from the checks and output as
header lines.

2. Do we want to output headings from the first file
when they don't match the second?
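
For question 2, the behaviour as the patch stands can be seen with a sketch
like the following. Again this assumes a join built with this patch; the file
names are hypothetical and the data is taken from the header-5 test.

```shell
# Sketch only: assumes a GNU join that includes the --header option.
export LC_ALL=C
tmp=$(mktemp -d)                     # scratch directory (illustrative file names below)
printf 'ID1 Name\n1 A\n2 B\n' > "$tmp/names"
printf 'ID2 Color\n1 red\n'   > "$tmp/colors"

# The join fields of the two header lines differ ("ID1" vs "ID2"),
# yet the headers are still joined, and the first file's heading wins.
out=$(join --header "$tmp/names" "$tmp/colors")
printf '%s\n' "$out"
# prints:
#   ID1 Name Color
#   1 A red

rm -rf "$tmp"
```

The "ID2" heading from the second file is silently dropped, which is the
behaviour being questioned here.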

I'll push the attached patch (which has a few tweaks) in a while
unless others want changes as per the questions above.

cheers,
Pádraig.

From 5cfc59450891d3b3521bd6dd6c41eccf9858835e Mon Sep 17 00:00:00 2001
From: Assaf Gordon <[email protected]>
Date: Fri, 20 Nov 2009 15:24:07 +0000
Subject: [PATCH] join: new --header option to special case first line

This essentially allows one to use --check-order with headings.
Note join without --check-order will already handle the common case
where headings do match in each file, however using --check-order will fail
often when the header sorts after the first line of data.

Note also that this will join header lines from each file even if
they don't match, with headings from the first file taking precedence.

* NEWS: Mention the new option.
* doc/coreutils.texi (join invocation): Likewise.
* src/join.c (usage): Describe the new option.
(join): Join the header lines specially.
* tests/misc/join: Add 5 new tests.
---
 NEWS               |    5 +++++
 doc/coreutils.texi |    9 +++++++++
 src/join.c         |   23 ++++++++++++++++++++++-
 tests/misc/join    |   28 ++++++++++++++++++++++++++++
 4 files changed, 64 insertions(+), 1 deletions(-)

diff --git a/NEWS b/NEWS
index 530ff95..d3869ac 100644
--- a/NEWS
+++ b/NEWS
@@ -2,6 +2,11 @@ GNU coreutils NEWS                                    -*- outline -*-
 
 * Noteworthy changes in release ?.? (????-??-??) [?]
 
+** New features
+
+  join now accepts the --header option, to treat the first line of each
+  file as a header line to be joined and printed unconditionally.
+
 
 * Noteworthy changes in release 8.4 (2010-01-13) [stable]
 
diff --git a/doc/coreutils.texi b/doc/coreutils.texi
index 184b55a..2b3d32b 100644
--- a/doc/coreutils.texi
+++ b/doc/coreutils.texi
@@ -5515,6 +5515,15 @@ Do not check that both input files are in sorted order.  This is the default.
 Replace those output fields that are missing in the input with
 @var{string}.
 
+@item --header
+@opindex --header
+Treat the first line of each input file as a header line. The header lines will
+be joined and printed as the first output line.  If @option{-o} is used to
+specify output format, the header line will be printed according to the
+specified format.  The header lines will not be checked for ordering even if
+@option{--check-order} is specified.  Also if the header lines from each file
+do not match, the heading fields from the first file will be used.
+
 @item -i
 @itemx --ignore-case
 @opindex -i
diff --git a/src/join.c b/src/join.c
index 4792f16..cf8e98c 100644
--- a/src/join.c
+++ b/src/join.c
@@ -137,7 +137,8 @@ static enum
 enum
 {
   CHECK_ORDER_OPTION = CHAR_MAX + 1,
-  NOCHECK_ORDER_OPTION
+  NOCHECK_ORDER_OPTION,
+  HEADER_LINE_OPTION
 };
 
 
@@ -146,6 +147,7 @@ static struct option const longopts[] =
   {"ignore-case", no_argument, NULL, 'i'},
   {"check-order", no_argument, NULL, CHECK_ORDER_OPTION},
   {"nocheck-order", no_argument, NULL, NOCHECK_ORDER_OPTION},
+  {"header", no_argument, NULL, HEADER_LINE_OPTION},
   {GETOPT_HELP_OPTION_DECL},
   {GETOPT_VERSION_OPTION_DECL},
   {NULL, 0, NULL, 0}
@@ -157,6 +159,10 @@ static struct line uni_blank;
 /* If nonzero, ignore case when comparing join fields.  */
 static bool ignore_case;
 
+/* If true, treat the first line of each file as column headers:
+   join them without checking for ordering.  */
+static bool join_header_lines;
+
 void
 usage (int status)
 {
@@ -191,6 +197,8 @@ by whitespace.  When FILE1 or FILE2 (not both) is -, read standard input.\n\
   --check-order     check that the input is correctly sorted, even\n\
                       if all input lines are pairable\n\
   --nocheck-order   do not check that the input is correctly sorted\n\
+  --header          treat the first line in each file as field headers,\n\
+                      print them without trying to pair them\n\
 "), stdout);
       fputs (HELP_OPTION_DESCRIPTION, stdout);
       fputs (VERSION_OPTION_DESCRIPTION, stdout);
@@ -616,6 +624,15 @@ join (FILE *fp1, FILE *fp2)
   initseq (&seq2);
   getseq (fp2, &seq2, 2);
 
+  if (join_header_lines && seq1.count && seq2.count)
+    {
+      prjoin (seq1.lines[0], seq2.lines[0]);
+      prevline[0] = NULL;
+      prevline[1] = NULL;
+      advance_seq (fp1, &seq1, true, 1);
+      advance_seq (fp2, &seq2, true, 2);
+    }
+
   while (seq1.count && seq2.count)
     {
       size_t i;
@@ -1052,6 +1069,10 @@ main (int argc, char **argv)
                          &nfiles, &prev_optc_status, &optc_status);
           break;
 
+        case HEADER_LINE_OPTION:
+          join_header_lines = true;
+          break;
+
         case_GETOPT_HELP_CHAR;
 
         case_GETOPT_VERSION_CHAR (PROGRAM_NAME, AUTHORS);
diff --git a/tests/misc/join b/tests/misc/join
index 49194e0..4e7798f 100755
--- a/tests/misc/join
+++ b/tests/misc/join
@@ -185,6 +185,34 @@ my @tv = (
 # Before 6.10.143, this would mistakenly fail with the diagnostic:
 # join: File 1 is not in sorted order
 ['chkodr-7', '-12', ["2 a\n1 b\n", ""], "", 0],
+
+# Test '--header' feature
+['header-1', '--header',
+ [ "ID Name\n1 A\n2 B\n", "ID Color\n1 red\n"], "ID Name Color\n1 A red\n", 0],
+
+# '--header' with '--check-order' : The header line is out-of-order but the
+# actual data is in order. This join should succeed.
+['header-2', '--header --check-order',
+ ["ID Name\n1 A\n2 B\n", "ID Color\n2 green\n"],
+ "ID Name Color\n2 B green\n", 0],
+
+# '--header' with '--check-order' : The header line is out-of-order AND the
+# actual data out-of-order. This join should fail.
+['header-3', '--header --check-order',
+ ["ID Name\n2 B\n1 A\n", "ID Color\n2 blue\n"], "ID Name Color\n", 1,
+ "$prog: file 1 is not in sorted order\n"],
+
+# '--header' with specific output format '-o'.
+# output header line should respect the requested format
+['header-4', '--header -o "0,1.3,2.2"',
+ ["ID Group Name\n1 Foo A\n2 Bar B\n", "ID Color\n2 blue\n"],
+ "ID Name Color\n2 B blue\n", 0],
+
+# '--header' always outputs headers from the first file
+# even if the headers from the second file don't match
+['header-5', '--header',
+ [ "ID1 Name\n1 A\n2 B\n", "ID2 Color\n1 red\n"], "ID1 Name Color\n1 A red\n", 0],
+
 );
 
 # Convert the above old-style test vectors to the newer
-- 
1.6.2.5
