Re: csplit corrupts files with long lines

2005-09-10 Thread Jim Meyering
Tristan Miller <[EMAIL PROTECTED]> wrote:
> I'm reporting a bug in csplit from coreutils 5.2.1, compiled from source on a
> SuSE 9.3 system.  It seems this bug was previously reported over a year
> ago (see
> )
> but it was never squashed.
>
> In short, csplit produces corrupt output when the input file contains very
> long lines.  An example file is at
> , an XML file
> containing three articles from Wikipedia.  The second article was
> vandalized by a spammer who inserted a ridiculously long line (42280
> characters) full of links.
>
> If I try to split this file with
>
> $ csplit wikipedia.xml '//' '{*}'
>
> then the file with the second article, xx02, is garbled at the beginning of
> the long line.  See .

Thanks for reporting that!
Here's the fix I've just checked in:
[there's still a minor memory leak, but we'll deal with that separately]

2005-09-10  Jim Meyering  <[EMAIL PROTECTED]>

csplit could produce corrupt output, given input lines longer than 8KB

* src/csplit.c (load_buffer): Don't read from free'd memory
when handling lines longer than the initial buffer length.
(save_to_hold_area): Don't leak the previous hold_area buffer.
Reported by Tristan Miller and Luke Kendall.
* NEWS: Mention this.
* tests/misc/csplit: New test for this.

* src/csplit.c (load_buffer): Avoid integer overflow in buffer
size calculations for very long lines.
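
For readers who want the mechanics rather than the diff: before this change,
record_line_starts handed save_to_hold_area a pointer into the current input
buffer.  When a line was longer than the buffer, that buffer was recycled and
refilled, so the "held" partial line was later read back from memory that had
already been reused -- hence the garbled output.  The patch copies the partial
line into storage of its own.  Below is a minimal, self-contained sketch of
the corrected pattern; the xmemdup stand-in here only mimics the gnulib
function of the same name that the real patch calls, and the surrounding
names are simplified from csplit.c.

/* Sketch only -- simplified from src/csplit.c.  This xmemdup is a
   local stand-in for the gnulib function used in the real patch.  */

#include <stdlib.h>
#include <string.h>

static char *hold_area;    /* partial line carried over between reads */
static size_t hold_count;  /* number of bytes in hold_area */

/* Copy N bytes of P into freshly malloc'd storage; gnulib's xmemdup
   calls xalloc_die on allocation failure rather than abort.  */
static void *
xmemdup (void const *p, size_t n)
{
  void *copy = malloc (n);
  if (copy == NULL)
    abort ();
  return memcpy (copy, p, n);
}

/* Keep track of NUM bytes of a partial line in buffer START.
   START must now own its storage; freeing the previous hold_area
   also plugs the leak.  */
static void
save_to_hold_area (char *start, size_t num)
{
  free (hold_area);
  hold_area = start;
  hold_count = num;
}

int
main (void)
{
  char buf[] = "tail of a line longer than the input buffer";
  /* As in the patch: duplicate the bytes instead of keeping a
     pointer into a buffer that is about to be freed.  */
  save_to_hold_area (xmemdup (buf, sizeof buf), sizeof buf);
  free (hold_area);
  return 0;
}
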

Index: NEWS
===================================================================
RCS file: /fetish/cu/NEWS,v
retrieving revision 1.307
retrieving revision 1.308
diff -u -p -u -r1.307 -r1.308
--- NEWS	10 Sep 2005 00:08:28 -0000	1.307
+++ NEWS	10 Sep 2005 14:07:59 -0000	1.308
@@ -136,6 +136,8 @@ GNU coreutils NEWS
   permissions like =xX and =u, and did not properly diagnose some invalid
   strings like g+gr, ug,+x, and +1.  These bugs have been fixed.
 
+  csplit could produce corrupt output, given input lines longer than 8KB
+
   dd now computes statistics using a realtime clock (if available)
   rather than the time-of-day clock, to avoid glitches if the
   time-of-day is changed while dd is running.  Also, it avoids
Index: src/csplit.c
===================================================================
RCS file: /fetish/cu/src/csplit.c,v
retrieving revision 1.144
retrieving revision 1.145
diff -u -p -u -r1.144 -r1.145
--- src/csplit.c	9 Sep 2005 21:08:19 -0000	1.144
+++ src/csplit.c	10 Sep 2005 13:56:45 -0000	1.145
@@ -251,16 +251,12 @@ interrupt_handler (int sig)
 }
 
 /* Keep track of NUM bytes of a partial line in buffer START.
-   These bytes will be retrieved later when another large buffer is read.
-   It is not necessary to create a new buffer for these bytes; instead,
-   we keep a pointer to the existing buffer.  This buffer *is* on the
-   free list, and when the next buffer is obtained from this list
-   (even if it is this one), these bytes will be placed at the
-   start of the new buffer. */
+   These bytes will be retrieved later when another large buffer is read.  */
 
 static void
 save_to_hold_area (char *start, size_t num)
 {
+  free (hold_area);
   hold_area = start;
   hold_count = num;
 }
@@ -386,7 +382,7 @@ record_line_starts (struct buffer_record
  lines++;
}
   else
-   save_to_hold_area (line_start, bytes_left);
+   save_to_hold_area (xmemdup (line_start, bytes_left), bytes_left);
 }
 
   b->num_lines = lines;
@@ -442,6 +438,7 @@ static void
 free_buffer (struct buffer_record *buf)
 {
   free (buf->buffer);
+  buf->buffer = NULL;
 }
 
 /* Append buffer BUF to the linked list of buffers that contain
@@ -495,7 +492,7 @@ load_buffer (void)
   if (bytes_wanted < hold_count)
 bytes_wanted = hold_count;
 
-  do
+  while (1)
 {
   b = get_new_buffer (bytes_wanted);
   bytes_avail = b->bytes_alloc; /* Size of buffer returned. */
@@ -504,8 +501,7 @@ load_buffer (void)
   /* First check the `holding' area for a partial line. */
   if (hold_count)
{
- if (p != hold_area)
-   memcpy (p, hold_area, hold_count);
+ memcpy (p, hold_area, hold_count);
  p += hold_count;
  b->bytes_used += hold_count;
  bytes_avail -= hold_count;
@@ -515,11 +511,18 @@ load_buffer (void)
   b->bytes_used += read_input (p, bytes_avail);
 
   lines_found = record_line_starts (b);
-  bytes_wanted = b->bytes_alloc * 2;
   if (!lines_found)
free_buffer (b);
+
+  if (lines_found || have_read_eof)
+   break;
+
+  if (xalloc_oversized (2, b->bytes_alloc))
+   xalloc_die ();
+  bytes_wanted = 2 * b->bytes_alloc;
+  free_buffer (b);
+  free (b);
 }
-  while (!lines_found && !have_read_eof);
 
   if (lines_found)
 save_buffer (b);
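
One more note on the load_buffer loop above: it doubles the buffer until a
complete line fits, and the new xalloc_oversized (2, b->bytes_alloc) test
ensures the doubling of a size_t cannot wrap around and silently request a
tiny buffer.  Here is a hedged sketch of that idiom in plain C, without the
gnulib helpers; double_or_die is an illustrative name, not from csplit.c.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Double a buffer size, refusing to overflow.  Roughly what
   xalloc_oversized (2, n) followed by xalloc_die () does in the
   patch; this function name is made up for the sketch.  */
static size_t
double_or_die (size_t n)
{
  if (n > SIZE_MAX / 2)
    {
      /* 2 * n would wrap; treat it as memory exhaustion, as
         xalloc_die does.  */
      fprintf (stderr, "memory exhausted\n");
      exit (EXIT_FAILURE);
    }
  return 2 * n;
}

int
main (void)
{
  size_t n = 8192;         /* csplit's initial buffer size */
  n = double_or_die (n);   /* 16384, or a diagnostic on overflow */
  printf ("%zu\n", n);
  return 0;
}
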

csplit corrupts files with long lines

2005-09-09 Thread Tristan Miller
Greetings.

I'm reporting a bug in csplit from coreutils 5.2.1, compiled from source on a
SuSE 9.3 system.  It seems this bug was previously reported over a year 
ago (see 
) 
but it was never squashed.

In short, csplit produces corrupt output when the input file contains very 
long lines.  An example file is at 
, an XML file 
containing three articles from Wikipedia.  The second article was 
vandalized by a spammer who inserted a ridiculously long line (42280 
characters) full of links.

If I try to split this file with

$ csplit wikipedia.xml '//' '{*}'

then the file with the second article, xx02, is garbled at the beginning of 
the long line.  See .

Regards,
Tristan

-- 
   _
  _V.-o  Tristan Miller [en,(fr,de,ia)]  ><  Space is limited
 / |`-'  -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=  <>  In a haiku, so it's hard
(7_\\    http://www.nothingisreal.com/   ><  To finish what you

