Tristan Miller [EMAIL PROTECTED] wrote:
I'm reporting a bug with csplit coreutils 5.2.1, compiled from sources on a
SuSE 9.3 system. It seems this bug was previously reported over a year
ago (see
http://lists.gnu.org/archive/html/bug-coreutils/2004-08/msg00112.html)
but it was never squashed.
In short, csplit produces corrupt output when the input file contains very
long lines. An example file is at
http://www.dfki.uni-kl.de/~miller/tmp/wikipedia.xml, an XML file
containing three articles from Wikipedia. The second article was
vandalized by a spammer who inserted a ridiculously long line (42280
characters) full of links.
If I try to split this file with
$ csplit wikipedia.xml '/page/' '{*}'
then the file with the second article, xx02, is garbled at the beginning of
the long line. See http://www.dfki.uni-kl.de/~miller/tmp/xx02.
Thanks for reporting that!
Here's the fix I've just checked in:
[there's still a minor memory leak, but we'll deal with that separately]
2005-09-10 Jim Meyering [EMAIL PROTECTED]
csplit could produce corrupt output, given input lines longer than 8KB
* src/csplit.c (load_buffer): Don't read from free'd memory
when handling lines longer than the initial buffer length.
(save_to_hold_area): Don't leak the previous hold_area buffer.
Reported by Tristan Miller and Luke Kendall.
* NEWS: Mention this.
* tests/misc/csplit: New test for this.
* src/csplit.c (load_buffer): Avoid integer overflow in buffer
size calculations for very long lines.
Index: NEWS
===
RCS file: /fetish/cu/NEWS,v
retrieving revision 1.307
retrieving revision 1.308
diff -u -p -u -r1.307 -r1.308
--- NEWS10 Sep 2005 00:08:28 - 1.307
+++ NEWS10 Sep 2005 14:07:59 - 1.308
@@ -136,6 +136,8 @@ GNU coreutils NEWS
permissions like =xX and =u, and did not properly diagnose some invalid
strings like g+gr, ug,+x, and +1. These bugs have been fixed.
+ csplit could produce corrupt output, given input lines longer than 8KB
+
dd now computes statistics using a realtime clock (if available)
rather than the time-of-day clock, to avoid glitches if the
time-of-day is changed while dd is running. Also, it avoids
Index: src/csplit.c
===
RCS file: /fetish/cu/src/csplit.c,v
retrieving revision 1.144
retrieving revision 1.145
diff -u -p -u -r1.144 -r1.145
--- src/csplit.c9 Sep 2005 21:08:19 - 1.144
+++ src/csplit.c10 Sep 2005 13:56:45 - 1.145
@@ -251,16 +251,12 @@ interrupt_handler (int sig)
}
/* Keep track of NUM bytes of a partial line in buffer START.
- These bytes will be retrieved later when another large buffer is read.
- It is not necessary to create a new buffer for these bytes; instead,
- we keep a pointer to the existing buffer. This buffer *is* on the
- free list, and when the next buffer is obtained from this list
- (even if it is this one), these bytes will be placed at the
- start of the new buffer. */
+ These bytes will be retrieved later when another large buffer is read. */
static void
save_to_hold_area (char *start, size_t num)
{
+ free (hold_area);
hold_area = start;
hold_count = num;
}
@@ -386,7 +382,7 @@ record_line_starts (struct buffer_record
lines++;
}
else
- save_to_hold_area (line_start, bytes_left);
+ save_to_hold_area (xmemdup (line_start, bytes_left), bytes_left);
}
b-num_lines = lines;
@@ -442,6 +438,7 @@ static void
free_buffer (struct buffer_record *buf)
{
free (buf-buffer);
+ buf-buffer = NULL;
}
/* Append buffer BUF to the linked list of buffers that contain
@@ -495,7 +492,7 @@ load_buffer (void)
if (bytes_wanted hold_count)
bytes_wanted = hold_count;
- do
+ while (1)
{
b = get_new_buffer (bytes_wanted);
bytes_avail = b-bytes_alloc; /* Size of buffer returned. */
@@ -504,8 +501,7 @@ load_buffer (void)
/* First check the `holding' area for a partial line. */
if (hold_count)
{
- if (p != hold_area)
- memcpy (p, hold_area, hold_count);
+ memcpy (p, hold_area, hold_count);
p += hold_count;
b-bytes_used += hold_count;
bytes_avail -= hold_count;
@@ -515,11 +511,18 @@ load_buffer (void)
b-bytes_used += read_input (p, bytes_avail);
lines_found = record_line_starts (b);
- bytes_wanted = b-bytes_alloc * 2;
if (!lines_found)
free_buffer (b);
+
+ if (lines_found || have_read_eof)
+ break;
+
+ if (xalloc_oversized (2, b-bytes_alloc))
+ xalloc_die ();
+ bytes_wanted = 2 * b-bytes_alloc;
+ free_buffer (b);
+ free (b);
}
- while (!lines_found !have_read_eof);
if (lines_found)
save_buffer (b);
Index: tests/misc/csplit