Re: csplit corrupts files with long lines

2005-09-10 Thread Jim Meyering
Tristan Miller <[EMAIL PROTECTED]> wrote:
> I'm reporting a bug with csplit (coreutils 5.2.1), compiled from sources on a
> SuSE 9.3 system.  It seems this bug was previously reported over a year
> ago (see
> http://lists.gnu.org/archive/html/bug-coreutils/2004-08/msg00112.html)
> but it was never squashed.
>
> In short, csplit produces corrupt output when the input file contains very
> long lines.  An example file is at
> http://www.dfki.uni-kl.de/~miller/tmp/wikipedia.xml, an XML file
> containing three articles from Wikipedia.  The second article was
> vandalized by a spammer who inserted a ridiculously long line (42280
> characters) full of links.
>
> If I try to split this file with
>
> $ csplit wikipedia.xml '/<page>/' '{*}'
>
> then the file with the second article, xx02, is garbled at the beginning of
> the long line.  See http://www.dfki.uni-kl.de/~miller/tmp/xx02.

Thanks for reporting that!
Here's the fix I've just checked in:
[there's still a minor memory leak, but we'll deal with that separately]

2005-09-10  Jim Meyering  <[EMAIL PROTECTED]>

	csplit could produce corrupt output, given input lines longer than 8KB

	* src/csplit.c (load_buffer): Don't read from free'd memory
	when handling lines longer than the initial buffer length.
	(save_to_hold_area): Don't leak the previous hold_area buffer.
	Reported by Tristan Miller and Luke Kendall.
	* NEWS: Mention this.
	* tests/misc/csplit: New test for this.

	* src/csplit.c (load_buffer): Avoid integer overflow in buffer
	size calculations for very long lines.
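
To make the first part of the fix concrete: before this change, save_to_hold_area
kept a raw pointer into the current input buffer, and that buffer is freed and
replaced by a larger one whenever a line does not fit; the held-over bytes were
therefore read from freed memory, which is exactly the garbling Tristan saw.
Below is a distilled sketch of the corrected ownership rule (my illustration,
not coreutils source; xmemdup_sketch stands in for gnulib's xmemdup):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static char *hold_area;    /* partial line carried over between reads */
static size_t hold_count;  /* number of bytes currently held */

/* Copy N bytes of P into fresh storage; gnulib's xmemdup does the
   same but calls xalloc_die () instead of abort () on failure.  */
static void *
xmemdup_sketch (void const *p, size_t n)
{
  void *copy = malloc (n);
  if (!copy)
    abort ();
  return memcpy (copy, p, n);
}

/* After the fix: the hold area owns its bytes, so free the previous
   copy (the leak fix) and take ownership of the new one.  */
static void
save_to_hold_area (char *start, size_t num)
{
  free (hold_area);
  hold_area = start;
  hold_count = num;
}

int
main (void)
{
  char buffer[] = "partial line, no trailing newline";

  /* Callers now pass a private copy, never a pointer into a buffer
     that will be freed when a longer line forces a reallocation.  */
  save_to_hold_area (xmemdup_sketch (buffer, sizeof buffer - 1),
                     sizeof buffer - 1);
  printf ("holding %zu bytes\n", hold_count);

  free (hold_area);
  return 0;
}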

Index: NEWS
===================================================================
RCS file: /fetish/cu/NEWS,v
retrieving revision 1.307
retrieving revision 1.308
diff -u -p -u -r1.307 -r1.308
--- NEWS	10 Sep 2005 00:08:28 -0000	1.307
+++ NEWS	10 Sep 2005 14:07:59 -0000	1.308
@@ -136,6 +136,8 @@ GNU coreutils NEWS
   permissions like =xX and =u, and did not properly diagnose some invalid
   strings like g+gr, ug,+x, and +1.  These bugs have been fixed.
 
+  csplit could produce corrupt output, given input lines longer than 8KB
+
   dd now computes statistics using a realtime clock (if available)
   rather than the time-of-day clock, to avoid glitches if the
   time-of-day is changed while dd is running.  Also, it avoids
Index: src/csplit.c
===================================================================
RCS file: /fetish/cu/src/csplit.c,v
retrieving revision 1.144
retrieving revision 1.145
diff -u -p -u -r1.144 -r1.145
--- src/csplit.c	9 Sep 2005 21:08:19 -0000	1.144
+++ src/csplit.c	10 Sep 2005 13:56:45 -0000	1.145
@@ -251,16 +251,12 @@ interrupt_handler (int sig)
 }
 
 /* Keep track of NUM bytes of a partial line in buffer START.
-   These bytes will be retrieved later when another large buffer is read.
-   It is not necessary to create a new buffer for these bytes; instead,
-   we keep a pointer to the existing buffer.  This buffer *is* on the
-   free list, and when the next buffer is obtained from this list
-   (even if it is this one), these bytes will be placed at the
-   start of the new buffer. */
+   These bytes will be retrieved later when another large buffer is read.  */
 
 static void
 save_to_hold_area (char *start, size_t num)
 {
+  free (hold_area);
   hold_area = start;
   hold_count = num;
 }
@@ -386,7 +382,7 @@ record_line_starts (struct buffer_record
           lines++;
         }
       else
-        save_to_hold_area (line_start, bytes_left);
+        save_to_hold_area (xmemdup (line_start, bytes_left), bytes_left);
     }
 
   b->num_lines = lines;
@@ -442,6 +438,7 @@ static void
 free_buffer (struct buffer_record *buf)
 {
   free (buf->buffer);
+  buf->buffer = NULL;
 }
 
 /* Append buffer BUF to the linked list of buffers that contain
@@ -495,7 +492,7 @@ load_buffer (void)
   if (bytes_wanted < hold_count)
     bytes_wanted = hold_count;
 
-  do
+  while (1)
     {
       b = get_new_buffer (bytes_wanted);
       bytes_avail = b->bytes_alloc; /* Size of buffer returned. */
@@ -504,8 +501,7 @@ load_buffer (void)
       /* First check the `holding' area for a partial line. */
       if (hold_count)
         {
-          if (p != hold_area)
-            memcpy (p, hold_area, hold_count);
+          memcpy (p, hold_area, hold_count);
           p += hold_count;
           b->bytes_used += hold_count;
           bytes_avail -= hold_count;
@@ -515,11 +511,18 @@ load_buffer (void)
       b->bytes_used += read_input (p, bytes_avail);
 
       lines_found = record_line_starts (b);
-      bytes_wanted = b->bytes_alloc * 2;
       if (!lines_found)
         free_buffer (b);
+
+      if (lines_found || have_read_eof)
+        break;
+
+      if (xalloc_oversized (2, b->bytes_alloc))
+        xalloc_die ();
+      bytes_wanted = 2 * b->bytes_alloc;
+      free_buffer (b);
+      free (b);
     }
-  while (!lines_found && !have_read_eof);
 
   if (lines_found)
     save_buffer (b);
Index: tests/misc/csplit
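
The other half of the change is the growth policy in load_buffer: the loop now
doubles the buffer only after finding that no complete line fit, and it refuses
a doubling that would overflow size_t.  Here is a standalone sketch of that
policy (my paraphrase, not the coreutils source; the SIZE_MAX test plays the
role of xalloc_oversized (2, b->bytes_alloc)):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Double N, refusing sizes whose doubling would wrap around.
   coreutils calls xalloc_die () where this sketch exits.  */
static size_t
double_checked (size_t n)
{
  if (n > SIZE_MAX / 2)
    {
      fprintf (stderr, "memory exhausted\n");
      exit (EXIT_FAILURE);
    }
  return 2 * n;
}

int
main (void)
{
  size_t bytes_wanted = 8192;   /* roughly csplit's initial buffer size,
                                   per the 8KB figure in NEWS */

  /* The 42280-byte line from the report fits after three doublings:
     8K -> 16K -> 32K -> 64K.  */
  while (bytes_wanted < 42280)
    bytes_wanted = double_checked (bytes_wanted);

  printf ("buffer grows to %zu bytes\n", bytes_wanted);
  return 0;
}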

csplit corrupts files with long lines

2005-09-09 Thread Tristan Miller
Greetings.

I'm reporting a bug with csplit (coreutils 5.2.1), compiled from sources on a 
SuSE 9.3 system.  It seems this bug was previously reported over a year 
ago (see 
http://lists.gnu.org/archive/html/bug-coreutils/2004-08/msg00112.html) 
but it was never squashed.

In short, csplit produces corrupt output when the input file contains very 
long lines.  An example file is at 
http://www.dfki.uni-kl.de/~miller/tmp/wikipedia.xml, an XML file 
containing three articles from Wikipedia.  The second article was 
vandalized by a spammer who inserted a ridiculously long line (42280 
characters) full of links.

If I try to split this file with

$ csplit wikipedia.xml '/<page>/' '{*}'

then the file with the second article, xx02, is garbled at the beginning of 
the long line.  See http://www.dfki.uni-kl.de/~miller/tmp/xx02.
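
A synthetic file shows the same effect.  The sketch below (my illustration,
not the file from the report) writes a middle record containing one line much
longer than 8KB:

#include <stdio.h>

int
main (void)
{
  FILE *f = fopen ("longline.xml", "w");
  if (!f)
    return 1;

  for (int rec = 0; rec < 3; rec++)
    {
      fputs ("<page>\n", f);
      if (rec == 1)
        {
          /* One 42280-character line, like the vandalized article.  */
          for (int i = 0; i < 42280; i++)
            fputc ('x', f);
          fputc ('\n', f);
        }
      fputs ("</page>\n", f);
    }
  fclose (f);

  /* Then run: csplit longline.xml '/<page>/' '{*}'
     With coreutils 5.2.1 the piece holding the long line (xx02) comes
     out garbled; after the fix, cat xx* reproduces the input.  */
  return 0;
}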

Regards,
Tristan
