Eddie Epstein wrote:
> On Wed, Jul 9, 2008 at 1:51 AM, Thilo Goetz <[EMAIL PROTECTED]> wrote:

>> Nothing so easy.  The CAS heap is one large int array.  We grow it
>> by allocating a new array with the new desired size and copying the
>> old values over to the new one.  There are several issues with this
>> method:
>>
>> * Copying the data takes a surprisingly long time.  There's a test
>> case in core that does nothing but add new FSs to the CAS, a lot of
>> them.  Marshall complained about how long it took to run when I
>> added it (about 20s on my machine).  If you profile that test case,
>> you will see that the vast majority of time is spent in copying
>> data from an old heap to a new heap.  If the CAS becomes sufficiently
>> large (in the hundreds of MBs), the time it takes to actually add
>> FSs to the CAS is completely dwarfed by the time it takes for the
>> heap to grow.
>>
>> * The heap lives in a single large array, and a new single large
>> array is allocated every time the heap grows.  This is a challenge
>> for the jvm, as it allocates this array in a contiguous block of
>> memory.  So there must be enough contiguous space on the jvm heap,
>> which likely means a full heap compaction before a new large array
>> can be allocated.  Sometimes the jvm fails to allocate that
>> contiguous space, even though there are enough free bytes on the
>> jvm heap.
>>
>> * Saved the best for last.  When allocating a new array, the old
>> one hangs around till we have copied the data.  So we're using twice
>> the necessary space for some period of time.  That space is often
>> not available.  So any time I see an out-of-memory error for large
>> documents (and it's not a bug in the annotator chain), it happens
>> when the CAS heap grows: not because there isn't enough room for
>> the larger heap, but because the old one is still there as well.
>> Because of that issue, the CAS can only grow to about half the size
>> we have memory for.
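
To make these issues concrete, here is a minimal sketch of such a
grow-by-copy int heap.  This is only an illustration under assumed names
(IntHeap, allocate, grow); it is not the actual UIMA implementation:

    // Minimal grow-by-copy int heap (hypothetical names, not UIMA code).
    final class IntHeap {
        private int[] heap = new int[100000]; // assumed initial capacity
        private int pos = 1;                  // next free cell; 0 is reserved

        // Reserve nCells contiguous cells, return their start address.
        int allocate(int nCells) {
            if (pos + nCells > heap.length) {
                grow(pos + nCells);
            }
            int addr = pos;
            pos += nCells;
            return addr;
        }

        private void grow(int minCapacity) {
            int newCapacity = Math.max(2 * heap.length, minCapacity);
            // Issue 2: one contiguous block of the full new size.
            int[] newHeap = new int[newCapacity];
            // Issue 1: an O(used) copy on every grow.
            System.arraycopy(heap, 0, newHeap, 0, pos);
            // Issue 3: until this assignment, both arrays are live, so
            // peak memory is old capacity plus new capacity.
            heap = newHeap;
        }
    }

The peak at grow time is the old capacity plus the new one, which is
the "twice the necessary space" window described in the third bullet.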


> The situation is more complicated than portrayed. The heap does not have to
> shrink, so the growth penalty is rare and can be eliminated entirely if the
> maximum necessary heap size is specified at startup. FS allocated in the heap do

You don't want to allocate a max heap size of 500M just because
you may need one that big.  You don't even want to allocate 10M
ahead of time, because with many small documents, smaller CAS heaps
let you do more parallel processing.  So no, I can't specify a
large-enough heap at start-up, and yes, the heap most certainly has
to shrink on CAS reset.

> not have any Java object memory overhead. Garbage collection for separate FS
> objects would be [much?] worse than the time it takes currently to clear the
> used part of a CAS heap.

I won't believe this until I see it, but I wasn't suggesting
this, so I'm not going to argue the point either.


>> Going forward, one approach to this problem could be not one
>> heap array, but a list of arrays.  Every time we grow the heap,
>> we would just add another array.  That approach solves all the
>> problems mentioned above while being minimally invasive to the
>> way the CAS currently works.  However, it raises a new issue:
>> how do you address cells across several arrays in an efficient
>> manner?  We don't want to improve performance for large docs at
>> the expense of small ones.  So heap addresses might stop being
>> the linear sequence of integers they are today.  Maybe we'll
>> use the high bits to address the array, and the low bits to
>> address cells in a given array.  And there goes the watermark.
>> Maybe this won't be necessary; I don't know at this point.
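
As a rough sketch of that high-bits/low-bits idea (the 16-bit split and
all names below are assumptions, not a settled design):

    // Segmented heap: high bits select the array, low bits the cell.
    // With a power-of-two segment size, decoding is a shift and a mask,
    // so small documents that fit in one segment pay almost nothing.
    final class SegmentedIntHeap {
        private static final int SEG_BITS = 16;            // 64K cells/segment
        private static final int SEG_SIZE = 1 << SEG_BITS;
        private static final int OFF_MASK = SEG_SIZE - 1;

        private final java.util.List<int[]> segments =
            new java.util.ArrayList<int[]>();

        SegmentedIntHeap() {
            segments.add(new int[SEG_SIZE]);
        }

        int get(int addr) {
            return segments.get(addr >>> SEG_BITS)[addr & OFF_MASK];
        }

        void set(int addr, int value) {
            segments.get(addr >>> SEG_BITS)[addr & OFF_MASK] = value;
        }

        // Growing adds a segment; existing data is never copied.
        void grow() {
            segments.add(new int[SEG_SIZE]);
        }
    }

One wrinkle: an FS that would straddle a segment boundary has to be padded
into the next segment, so addresses are no longer one dense linear sequence,
which is exactly the watermark complication mentioned above.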


> Each FS object could include an ID that would allow maintaining a high water
> mark, of course at the expense of another 4 bytes per FS. With a heap
> constructed from multiple discontiguous arrays, each array could include a
> relative ID. This is not to say that the high water mark is always the right
> approach :)
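
A minimal sketch of that per-array relative ID, assuming segments are
created in increasing ID order (all names hypothetical):

    // Each segment records the global ID of its first cell, so an FS's
    // global ID is baseId + offset.  A high-water-mark test then stays a
    // single int comparison, without storing a 4-byte ID in every FS.
    final class Segment {
        final int baseId;  // global ID of cell 0 in this segment
        final int[] cells;

        Segment(int baseId, int size) {
            this.baseId = baseId;
            this.cells = new int[size];
        }

        boolean aboveWatermark(int offset, int highWaterMark) {
            return baseId + offset > highWaterMark;
        }
    }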

I'm trying to decrease the memory overhead, not increase it.


>> Excellent suggestion, except why not have this discussion now?
>> We just need to put our heads together and figure out how to address
>> this requirement to everybody's satisfaction, case closed.  I'm
>> not disagreeing with the requirement, just the proposed implementation
>> thereof.  Doing this now may save us (ok, me) a lot of trouble later.


> Who is against having the discussion now :)

Marshall seemed to favor a discussion at a later point.  Maybe
I misinterpreted.


> Eddie
