On Fri, Jun 16, 2006 at 10:40:29PM +0200, [EMAIL PROTECTED] wrote:
> On Fri, Jun 16, 2006 at 10:14:07PM +0200, Otto Moerbeek wrote:
> > On Fri, 16 Jun 2006, Jacob Yocom-Piatt wrote:
> >
> > > i've got some C code that is reading from a 800 MB CSV file and allocates
> > > memory
> > > for an array to store the data in. the method used is to read the CSV file
> > > line-by-line and realloc additional space with each line read. having
> > > timed this
> > > and found the realloc speed to be low when the array is large, i am
> > > aiming to
> > > make this faster but am not sure about the best way to proceed.
> > >
> > > the current code uses realloc in the manner suggested by the manpage:
> > >
> > > newsize = size + 1;
> > > time(&t1); // start timing realloc
> > > if ((newap = (int *)realloc(ap, newsize*sizeof(int))) == NULL) {
> > > free(ap);
> > > ap = NULL;
> > > size = 0;
> > > return (NULL);
> > > }
> > > time(&t2); // stop timing realloc; start timing
> > > fscanf
> > >
> > > as the size of ap grows, so does the time it takes to realloc the space.
> > >
> > > an alternative to this procedure would be to scan through the CSV file to
> > > determine how many array entries i would need, realloc it all at once,
> > > then go
> > > back through the CSV file again to read the data into the array. i'm not
> > > confident this is the only way to do this and would appreciate any
> > > suggestions
> > > for speeding up this procedure.
> >
> > You'll have to think how this works: realloc has to copy the data to
> > the newly allocated region if it is not able to extend the current
> > region.
> >
> > If you do that for each time you'll increase the allocation size and
> > your size increment is small, you'll do a lot of work. The way you are
> > doing it now has quadratic time complexity.
> >
> > So make your increment larger and start with a larger size. Maybe you
> > can estimate the initial size based on your file size. If you'll end
> > up allocating a too large area, just use realloc the decrease the size
> > after you're done.
> >
> > Also thing about other data structures: you might be better of with a
> > linked list or tree here, depending on what you are doing with your
> > data.
> >
> > -Otto
>
> Also, if the purpose of this is to load the file in memory, why not just
> mmap() it in the first place ?
>
Ooops ... I was doing two things at the same time and did not finish the
mail before sending it.
So basically, why not mmap() the file, go through the map counting \n
while replacing them by \0 until you reach end of map. allocate an
array the size of the counter and have each array entry point to where
it should in the memory map ?