[jira] Created: (UIMA-1107) Sofa mapping not applied when annotator loaded from PEAR

2008-07-09 Thread Aaron Kaplan (JIRA)
Sofa mapping not applied when annotator loaded from PEAR


 Key: UIMA-1107
 URL: https://issues.apache.org/jira/browse/UIMA-1107
 Project: UIMA
  Issue Type: Bug
  Components: Core Java Framework
Affects Versions: 2.2.2
Reporter: Aaron Kaplan


I have an aggregate annotator consisting of an annotator A1 that creates a new 
sofa, and an annotator A2 that annotates the new sofa.  A2 is not sofa-aware, 
so in the aggregate descriptor I have defined a sofa mapping.

In the delegateAnalysisEngine element of the aggregate descriptor, if I point 
to A2's component descriptor (A2/desc/A2.xml), the sofa mapping works: A2 
processes the new sofa created by A1.  If I point instead to A2's pear 
installation descriptor (A2/A2_pear.xml), the sofa mapping seems not to be 
applied: A2 processes the initial sofa instead.
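For reference, a sofa mapping in an aggregate descriptor has this general shape (element names are from the UIMA descriptor schema; the component key and sofa name used here are placeholders, not taken from the reporter's actual descriptors):

```xml
<sofaMappings>
  <sofaMapping>
    <!-- "A2" and "ProcessedView" are assumed example names -->
    <componentKey>A2</componentKey>
    <!-- componentSofaName is omitted because A2 is not sofa-aware -->
    <aggregateSofaName>ProcessedView</aggregateSofaName>
  </sofaMapping>
</sofaMappings>
```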

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (UIMA-1108) correct character offset for OpenCalais annotator

2008-07-09 Thread Michael Baessler (JIRA)
correct character offset for OpenCalais annotator
-

 Key: UIMA-1108
 URL: https://issues.apache.org/jira/browse/UIMA-1108
 Project: UIMA
  Issue Type: Bug
  Components: Sandbox-CalaisAnnotator
Reporter: Michael Baessler
Assignee: Michael Baessler


The Calais service does some text cleaning that manipulates the character 
offsets; this must be corrected.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Closed: (UIMA-1108) correct character offset for OpenCalais annotator

2008-07-09 Thread Michael Baessler (JIRA)

 [ 
https://issues.apache.org/jira/browse/UIMA-1108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Baessler closed UIMA-1108.
--

Resolution: Fixed

 correct character offset for OpenCalais annotator
 -

 Key: UIMA-1108
 URL: https://issues.apache.org/jira/browse/UIMA-1108
 Project: UIMA
  Issue Type: Bug
  Components: Sandbox-CalaisAnnotator
Reporter: Michael Baessler
Assignee: Michael Baessler

 The Calais service does some text cleaning that manipulates the character 
 offsets; this must be corrected.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Delta CAS

2008-07-09 Thread Thilo Goetz



Eddie Epstein wrote:

On Wed, Jul 9, 2008 at 1:51 AM, Thilo Goetz [EMAIL PROTECTED] wrote:


Nothing so easy.  The CAS heap is one large int array.  We grow it
by allocating a new array with the new desired size and copying the
old values over to the new one.  There are several issues with this
method:

* Copying the data takes a surprisingly long time.  There's a test
case in core that does nothing but add new FSs to the CAS, a lot of
them.  Marshall complained about how long it took to run when I
added it (about 20s on my machine).  If you profile that test case,
you will see that the vast majority of time is spent in copying
data from an old heap to a new heap.  If the CAS becomes sufficiently
large (in the hundreds of MBs), the time it takes to actually add
FSs to the CAS is completely dwarfed by the time it takes for the
heap to grow.

* The heap lives in a single large array, and a new single large
array is allocated every time the heap grows.  This is a challenge
for the jvm as it allocates this array in a contiguous block of
memory.  So there must be enough contiguous space on the jvm heap,
which likely means a full heap compaction before a new large array
can be allocated.  Sometimes the jvm fails to allocate that
contiguous space, even though there are enough free bytes on the
jvm heap.

* Saved the best for last.  When allocating a new array, the old
one hangs around till we have copied the data.  So we're using twice
the necessary space for some period of time.  That space is often
not available.  So any time I see an out-of-memory error for large
documents (and it's not a bug in the annotator chain), it happens
when the CAS heap grows; not because there isn't enough room for
the larger heap, but because the old one is still there as well.
The CAS can only grow to about half the size we have memory for
because of that issue.
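The grow-by-copy behavior described above can be sketched roughly as follows (a simplified illustration of the technique, not the actual UIMA implementation; the initial capacity and names are assumptions):

```java
// Simplified sketch of a grow-by-copy int-array heap (not the real UIMA code).
public class IntHeap {
    private int[] cells = new int[16];   // assumed small initial capacity
    private int used = 1;                // cell 0 unused; references start at 1

    // Returns the address (index) of a newly allocated block of n cells.
    public int allocate(int n) {
        if (used + n > cells.length) {
            // Grow: the old array stays live until the copy finishes,
            // so peak memory is roughly twice the new heap size.
            int[] bigger = new int[Math.max(cells.length * 2, used + n)];
            System.arraycopy(cells, 0, bigger, 0, used);  // the O(used) copy
            cells = bigger;
        }
        int addr = used;
        used += n;
        return addr;
    }

    public int get(int addr) { return cells[addr]; }
    public void set(int addr, int v) { cells[addr] = v; }
}
```

Every `allocate` that overflows the current array pays the full copy cost, which is why repeated small allocations into a very large heap end up dominated by copying.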



The situation is more complicated than portrayed. The heap does not have to
shrink, so the growth penalty is rare and can be eliminated entirely if the
max necessary size heap is specified at startup. FS allocated in the heap do


You don't want to allocate a max heap size of 500M just because
you may need one that big.  You don't even want to allocate 10M
ahead of time because if you have many small documents, you can
do more parallel processing.  So no, I can't specify a large enough
heap at start-up and yes, the heap most certainly has to shrink
on CAS reset.


not have any Java object memory overhead. Garbage collection for separate FS
objects would be [much?] worse than the time it takes currently to clear the
used part of a CAS heap.


I won't believe this until I see it, but I wasn't suggesting
this so I'm not going to argue the point, either.



Going forward, one approach to this problem could be not one

heap array, but a list of arrays.  Every time we grow the heap,
we would just add another array.  That approach solves all the
problems mentioned above while being minimally invasive to the
way the CAS currently works.  However, it raises a new issue:
how do you address cells across several arrays in an efficient
manner?  We don't want to improve performance for large docs at
the expense of small ones.  So heap addresses might stop being
the linear sequence of integers they are today.  Maybe we'll
use the high bits to address the array, and the low bits to
address cells in a given array.  And there goes the watermark.
Maybe this won't be necessary, I don't know at this point.



Each FS object could include an ID that would allow maintaining a high water
mark, of course at the expense of another 4 bytes per. With a heap
constructed from multiple discontiguous arrays, each array could include a
relative ID. This is not to say that the high water mark is always the right
approach :)


I'm trying to decrease the memory overhead, not increase it.



Excellent suggestion, except why not have this discussion now?

We just need to put our heads together and figure out how to address
this requirement to everybody's satisfaction, case closed.  I'm
not disagreeing with the requirement, just the proposed implementation
thereof.  Doing this now may save us (ok, me) a lot of trouble later.



Who is against having the discussion now :)


Marshall seemed to favor a discussion at a later point.  Maybe
I misinterpreted.



Eddie



Re: Delta CAS

2008-07-09 Thread Marshall Schor

Thilo Goetz wrote:



Eddie Epstein wrote:

On Wed, Jul 9, 2008 at 1:51 AM, Thilo Goetz [EMAIL PROTECTED] wrote:


Nothing so easy.  The CAS heap is one large int array.  We grow it
by allocating a new array with the new desired size and copying the
old values over to the new one.  There are several issues with this
method:

* Copying the data takes a surprisingly long time.  There's a test
case in core that does nothing but add new FSs to the CAS, a lot of
them.  Marshall complained about how long it took to run when I
added it (about 20s on my machine).  If you profile that test case,
you will see that the vast majority of time is spent in copying
data from an old heap to a new heap.  If the CAS becomes sufficiently
large (in the hundreds of MBs), the time it takes to actually add
FSs to the CAS is completely dwarfed by the time it takes for the
heap to grow.

* The heap lives in a single large array, and a new single large
array is allocated every time the heap grows.  This is a challenge
for the jvm as it allocates this array in a contiguous block of
memory.  So there must be enough contiguous space on the jvm heap,
which likely means a full heap compaction before a new large array
can be allocated.  Sometimes the jvm fails to allocate that
contiguous space, even though there are enough free bytes on the
jvm heap.

* Saved the best for last.  When allocating a new array, the old
one hangs around till we have copied the data.  So we're using twice
the necessary space for some period of time.  That space is often
not available.  So any time I see an out-of-memory error for large
documents (and it's not a bug in the annotator chain), it happens
when the CAS heap grows; not because there isn't enough room for
the larger heap, but because the old one is still there as well.
The CAS can only grow to about half the size we have memory for
because of that issue.



The situation is more complicated than portrayed. The heap does not 
have to
shrink, so the growth penalty is rare and can be eliminated entirely 
if the
max necessary size heap is specified at startup. FS allocated in the 
heap do


You don't want to allocate a max heap size of 500M just because
you may need one that big.  You don't even want to allocate 10M
ahead of time because if you have many small documents, you can
do more parallel processing.  So no, I can't specify a large enough
heap at start-up and yes, the heap most certainly has to shrink
on CAS reset.
Some intermediate approach might help here - such as an application or 
annotator being able to provide performance tuning hints to the 
framework.  For instance, a tokenizer might be able to guesstimate the 
number of tokens, based on some average token size estimate divided into 
the size of the document, and provide that as a hint.
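Such a hint might be computed along these lines (purely hypothetical: UIMA has no such hint API today, and the average-token-size constant is invented for illustration):

```java
// Hypothetical sizing hint for a tokenizer (all names and constants assumed).
public class TokenHint {
    static final int AVG_TOKEN_CHARS = 6;  // assumed average token length incl. whitespace

    static int guessTokenCount(String documentText) {
        return Math.max(1, documentText.length() / AVG_TOKEN_CHARS);
    }
}
```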


not have any Java object memory overhead. Garbage collection for 
separate FS
objects would be [much?] worse than the time it takes currently to 
clear the

used part of a CAS heap.


I won't believe this until I see it, but I wasn't suggesting
this so I'm not going to argue the point, either.



Going forward, one approach to this problem could be not one

heap array, but a list of arrays.  Every time we grow the heap,
we would just add another array.  That approach solves all the
problems mentioned above while being minimally invasive to the
way the CAS currently works.  However, it raises a new issue:
how do you address cells across several arrays in an efficient
manner?  We don't want to improve performance for large docs at
the expense of small ones.  So heap addresses might stop being
the linear sequence of integers they are today.  Maybe we'll
use the high bits to address the array, and the low bits to
address cells in a given array.  And there goes the watermark.
Maybe this won't be necessary, I don't know at this point.



Each FS object could include an ID that would allow maintaining a 
high water

mark, of course at the expense of another 4 bytes per. With a heap
constructed from multiple discontiguous arrays, each array could 
include a
relative ID. This is not to say that the high water mark is always 
the right

approach :)


I'm trying to decrease the memory overhead, not increase it.
Would there be a solution that would work for the multi-block heap, 
without adding 4 bytes per FS Object?


Excellent suggestion, except why not have this discussion now?

We just need to put our heads together and figure out how to address
this requirement to everybody's satisfaction, case closed.  I'm
not disagreeing with the requirement, just the proposed implementation
thereof.  Doing this now may save us (ok, me) a lot of trouble later.



Who is against having the discussion now :)


Marshall seemed to favor a discussion at a later point.  Maybe
I misinterpreted.
I did not intend to express favoring a discussion at a later point 
versus now.  Discussions at any point are good, IMHO.

-Marshall




Eddie








Re: Delta CAS

2008-07-09 Thread Eddie Epstein
On Wed, Jul 9, 2008 at 9:18 AM, Thilo Goetz [EMAIL PROTECTED] wrote:

 You don't want to allocate a max heap size of 500M just because
 you may need one that big.  You don't even want to allocate 10M
 ahead of time because if you have many small documents, you can
 do more parallel processing.  So no, I can't specify a large enough
 heap at start-up and yes, the heap most certainly has to shrink
 on CAS reset.


Sounds like your scenario has multiple threads, each with at least one CAS,
processing a mix of document sizes. Either there is enough Java heap space
to process multiple large documents at the same time or not. Pre-allocating
the CAS heap space and not letting them grow enables soft processing
failures of large documents rather than the unfortunate failure of the
entire JVM.

Can you say more about the scenario(s) we are optimizing for?


Re: Delta CAS

2008-07-09 Thread Thilo Goetz



Eddie Epstein wrote:

On Wed, Jul 9, 2008 at 9:18 AM, Thilo Goetz [EMAIL PROTECTED] wrote:


You don't want to allocate a max heap size of 500M just because
you may need one that big.  You don't even want to allocate 10M
ahead of time because if you have many small documents, you can
do more parallel processing.  So no, I can't specify a large enough
heap at start-up and yes, the heap most certainly has to shrink
on CAS reset.



Sounds like your scenario has multiple threads, each with at least one CAS,


I don't usually have the luxury of running just UIMA on a server.
Other processes want memory, too.


processing a mixed size of documents. Either there is enough Java heap space
to process multiple large documents at the same time or not. Pre-allocating
the CAS heap space and not letting them grow enables soft processing
failures of large documents rather than the unfortunate failure of the
entire JVM.

Can you say more about the scenario(s) we are optimizing for?


Variously sized documents, some of them very large, many very small.


Re: Delta CAS

2008-07-09 Thread Thilo Goetz

Marshall Schor wrote:
Some intermediate approach might help here - such as an application or 
annotator being able to provide performance tuning hints to the 
framework.  For instance, a tokenizer might be able to guesstimate the 
number of tokens, based on some average token size estimate divided into 
the size of the document, and provide that as a hint.


Tell me about it.  We've built a whole framework to try
and figure out ahead of time how much memory processing
a certain document is going to take, so we know how many
threads we can run in parallel before crashing the JVM.
This turns out to be quite difficult if you don't know
what kinds of documents you'll be getting, and you work
with many different languages.

--Thilo



Re: Delta CAS

2008-07-09 Thread Burn Lewis
I think we need another thread to discuss the heap.

Back to the high-water mark ... isn't it just the largest xmi id in the
serialized CAS?  Its relationship to the CAS heap is a matter of
implementation but presumably we can have a design that says any new FSs
must be given an xmi id above the high-water mark when serialized back from
a service.  We already have the requirement that ids must be preserved for
the merging of parallel replies.

Burn.


[jira] Updated: (UIMA-1096) Incorrect metaData returned when deployed as a separate process

2008-07-09 Thread Bhavani Iyer (JIRA)

 [ 
https://issues.apache.org/jira/browse/UIMA-1096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Iyer updated UIMA-1096:
---

Attachment: UIMA-1096.patch

Fixed the inconsistencies described.  One clarification - the default value for 
multiple references allowed is false.

 Incorrect metaData returned when deployed as a separate process
 ---

 Key: UIMA-1096
 URL: https://issues.apache.org/jira/browse/UIMA-1096
 Project: UIMA
  Issue Type: Bug
  Components: C++ Framework
Reporter: Burn Lewis
 Attachments: UIMA-1096.patch


 Comparing the getMeta reply from a service deployed as a separate C++ process 
 with that from one deployed via JNI I see the following:
 1)
 A typePriority index key is changed:
  <fsIndexKey> <typePriority/> </fsIndexKey>
 becomes:
  <fsIndexKey> <featureName> </featureName>
 <comparator>standard</comparator> </fsIndexKey>
 2)
  Invalid xml chars are not escaped, e.g.
 <description>NAMED &gt; NOMINAL &gt; PRONOMINAL.</description>
 becomes
 <description>NAMED > NOMINAL > PRONOMINAL.</description>
 3)
 The default of 
  <multipleReferencesAllowed>true</multipleReferencesAllowed>
 is inserted in many featureDescriptions
 4)
 These may be bugs in the JNI output:  both typePriorities and 
 operationalProperties are only in the JNI reply.
 5)
 First 2 lines are missing the encoding & xmlns attributes:
 <?xml version="1.0" encoding="UTF-8"?>
 <analysisEngineMetaData
 xmlns="http://uima.apache.org/resourceSpecifier">

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (UIMA-1096) Incorrect metaData returned when deployed as a separate process

2008-07-09 Thread Bhavani Iyer (JIRA)

 [ 
https://issues.apache.org/jira/browse/UIMA-1096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Iyer reassigned UIMA-1096:
--

Assignee: Bhavani Iyer

 Incorrect metaData returned when deployed as a separate process
 ---

 Key: UIMA-1096
 URL: https://issues.apache.org/jira/browse/UIMA-1096
 Project: UIMA
  Issue Type: Bug
  Components: C++ Framework
Reporter: Burn Lewis
Assignee: Bhavani Iyer
 Attachments: UIMA-1096.patch


 Comparing the getMeta reply from a service deployed as a separate C++ process 
 with that from one deployed via JNI I see the following:
 1)
 A typePriority index key is changed:
  <fsIndexKey> <typePriority/> </fsIndexKey>
 becomes:
  <fsIndexKey> <featureName> </featureName>
 <comparator>standard</comparator> </fsIndexKey>
 2)
  Invalid xml chars are not escaped, e.g.
 <description>NAMED &gt; NOMINAL &gt; PRONOMINAL.</description>
 becomes
 <description>NAMED > NOMINAL > PRONOMINAL.</description>
 3)
 The default of 
  <multipleReferencesAllowed>true</multipleReferencesAllowed>
 is inserted in many featureDescriptions
 4)
 These may be bugs in the JNI output:  both typePriorities and 
 operationalProperties are only in the JNI reply.
 5)
 First 2 lines are missing the encoding & xmlns attributes:
 <?xml version="1.0" encoding="UTF-8"?>
 <analysisEngineMetaData
 xmlns="http://uima.apache.org/resourceSpecifier">

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Delta CAS

2008-07-09 Thread Marshall Schor
Here's a suggestion prompted by previous posts, and common hardware 
design for segmented memory.


Take the int values that represent feature structure (fs) references.  
Today, these are positive numbers from 1 (I think) to around 4 billion.  
These values are used directly as an index into the heap.


Change this to split the bits in these int values into two parts, let's 
call them upper and lower.  For example

       xxxyyyyy

where the xxx's are the upper bits (each x represents a hex digit), and 
the y's the lower bits.  The y's in this case can represent numbers up 
to 1 million (approx), and the xxx's represent 4096 values.


Then allocate the heap using multiple 1 meg entry tables, and store each 
one in the 4096 entry reference array.  The heap reference would be some 
bit-wise shifting and indexed lookup in addition to what we have now and 
would probably be very fast, and could be optimized for the xxx=0 case 
to be even faster.


This breaks heaps of over 1 meg into separate parts, which would make 
them more manageable, I think, and keeps the high-water mark method 
viable, too.
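The bit-split addressing above can be sketched as follows (the constants are taken from the example in this message: 20 low bits give roughly 1M entries per segment, 12 high bits give 4096 segments; this is an illustration of the proposal, not existing code):

```java
// Sketch of the proposed segmented heap addressing (assumed 12-high/20-low split).
public class SegmentedHeap {
    static final int SEG_BITS = 20;
    static final int SEG_SIZE = 1 << SEG_BITS;  // 1,048,576 cells per segment
    static final int OFF_MASK = SEG_SIZE - 1;

    // An FS reference packs (segment index, offset) into one int.
    static int makeRef(int seg, int off) { return (seg << SEG_BITS) | off; }
    static int segmentOf(int ref)        { return ref >>> SEG_BITS; }
    static int offsetOf(int ref)         { return ref & OFF_MASK; }

    // Dereferencing costs one shift, one mask, and two array lookups.
    static int read(int[][] segments, int ref) {
        return segments[segmentOf(ref)][offsetOf(ref)];
    }
}
```

For the xxx=0 case the reference equals the plain offset, so a fast path could skip the segment lookup entirely.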


Opinions?

-Marshall




[jira] Updated: (UIMA-1104) Need a monitor component for UIMA-AS services to capture performance metrics

2008-07-09 Thread Jerry Cwiklik (JIRA)

 [ 
https://issues.apache.org/jira/browse/UIMA-1104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jerry Cwiklik updated UIMA-1104:


Attachment: uimaj-as-core-UIMA-1104-patch-04.txt
uimaj-as-activemq-UIMA-1104-patch-04.txt

Fixed idle time for remotes

 Need a monitor component for UIMA-AS services to capture performance metrics 
 -

 Key: UIMA-1104
 URL: https://issues.apache.org/jira/browse/UIMA-1104
 Project: UIMA
  Issue Type: New Feature
  Components: Async Scaleout
Reporter: Jerry Cwiklik
 Attachments: idleWithRemote.txt, 
 uimaj-as-activemq-UIMA-1104-patch-03.txt, 
 uimaj-as-activemq-UIMA-1104-patch-04.txt, 
 uimaj-as-activemq-UIMA-1104-patch.txt, uimaj-as-core-UIMA-1104-patch-02.txt, 
 uimaj-as-core-UIMA-1104-patch-03.txt, uimaj-as-core-UIMA-1104-patch-04.txt, 
 uimaj-as-core-UIMA-1104-patch.txt


 In complex uima-as deployments it is hard to find bottlenecks which need 
 scaleup. A JMX-based monitor is needed to collect runtime metrics from every 
 uima-as service. The metrics must include idle time, queue depth, amount of 
 time each service waits for a free CAS. The monitor should be an embeddable 
 component that can be deployed in a java application. The monitor should 
 allow custom formatting of metrics via pluggable extension. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (UIMA-1104) Need a monitor component for UIMA-AS services to capture performance metrics

2008-07-09 Thread Jerry Cwiklik (JIRA)

 [ 
https://issues.apache.org/jira/browse/UIMA-1104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jerry Cwiklik updated UIMA-1104:


Attachment: uimaj-as-core-UIMA-1104-patch-05.txt
uimaj-as-activemq-UIMA-1104-patch-05.txt

Removed debugging output from the code

 Need a monitor component for UIMA-AS services to capture performance metrics 
 -

 Key: UIMA-1104
 URL: https://issues.apache.org/jira/browse/UIMA-1104
 Project: UIMA
  Issue Type: New Feature
  Components: Async Scaleout
Reporter: Jerry Cwiklik
 Attachments: idleWithRemote.txt, 
 uimaj-as-activemq-UIMA-1104-patch-03.txt, 
 uimaj-as-activemq-UIMA-1104-patch-04.txt, 
 uimaj-as-activemq-UIMA-1104-patch-05.txt, 
 uimaj-as-activemq-UIMA-1104-patch.txt, uimaj-as-core-UIMA-1104-patch-02.txt, 
 uimaj-as-core-UIMA-1104-patch-03.txt, uimaj-as-core-UIMA-1104-patch-04.txt, 
 uimaj-as-core-UIMA-1104-patch-05.txt, uimaj-as-core-UIMA-1104-patch.txt


 In complex uima-as deployments it is hard to find bottlenecks which need 
 scaleup. A JMX-based monitor is needed to collect runtime metrics from every 
 uima-as service. The metrics must include idle time, queue depth, amount of 
 time each service waits for a free CAS. The monitor should be an embeddable 
 component that can be deployed in a java application. The monitor should 
 allow custom formatting of metrics via pluggable extension. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Delta CAS

2008-07-09 Thread Adam Lally
 Back to the high-water mark ... isn't it just the largest xmi id in the
 serialized CAS?  Its relationship to the CAS heap is a matter of
 implementation but presumably we can have a design that says any new FSs
 must be given an xmi id above the high-water mark when serialized back from
 a service.  We already have the requirement that ids must be preserved for
 the merging of parallel replies.


Yes - there are really two definitions of high-water mark floating
around in this thread and it would be good to split them apart.

(1) the largest xmi:id in the serialized CAS.  This is a requirement
that the service protocol places on the CAS serializer.  This is what
we already have for merging, and I don't think Thilo is objecting to
this.

(2) a dependency on the FS address being an indicator of which FS are
newer than others (an FS with a larger address is newer).

As I think about it now I am actually unclear on whether we are doing
#2 right now at all.  Bhavani said we were, but that's not how I
recall that the serializer currently works.  It keeps a table of all
the incoming FS, which is necessary in order to have the xmi:ids going
out be the same as the ones coming in.  So I thought the serializer
just used the fact that an FS was missing from this table to determine
that it was new, and *not* a high water mark of the FS address.
Bhavani, can you clarify?
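The two "is this FS new?" strategies being contrasted here can be sketched side by side (illustrative only; the real serializer's data structures and names differ):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the two newness checks discussed in this thread (not UIMA code).
public class NewFsCheck {
    // (1) Table-based: remember the xmi:id assigned to each incoming FS address.
    private final Map<Integer, Integer> incomingIds = new HashMap<>();

    void recordIncoming(int fsAddr, int xmiId) { incomingIds.put(fsAddr, xmiId); }

    boolean isNewByTable(int fsAddr) {
        // Absent from the table => the FS was created after deserialization.
        return !incomingIds.containsKey(fsAddr);
    }

    // (2) High-water-mark-based: any FS above the mark is assumed newer.
    boolean isNewByMark(int fsAddr, int highWaterMark) {
        return fsAddr > highWaterMark;
    }
}
```

Note that (1) keeps working even if heap addresses stop being a single linear sequence, while (2) depends on address order, which is exactly why the segmented-heap proposals interact with it.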

  -Adam


[jira] Updated: (UIMA-1105) CPE is stuck trying to retrieve a free CAS from the pool

2008-07-09 Thread Jerry Cwiklik (JIRA)

 [ 
https://issues.apache.org/jira/browse/UIMA-1105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jerry Cwiklik updated UIMA-1105:


Attachment: ProcessingUnit-patch.txt

Fixes the hang in the CPE. The CPE, during its error handling, was trying to 
fetch a free CAS instance from a CAS pool to convert the CasData representation 
to CasObject. This conversion has already been done prior to calling the 
process() method and is unnecessary. A call to notifyListeners() contained an 
incorrect argument value which forced the code to attempt to fetch a new CAS for 
conversion. Moved the code that sets the right parameter for notifyListeners() 
to just before the process() call. Now, when an exception happens, the code 
doesn't attempt to fetch a new CAS from the CAS pool.

 CPE is stuck trying to retrieve a free CAS from the pool
 

 Key: UIMA-1105
 URL: https://issues.apache.org/jira/browse/UIMA-1105
 Project: UIMA
  Issue Type: Bug
  Components: Collection Processing
Affects Versions: 2.2.1
 Environment: Windows XP 32 bits
Reporter: Olivier Terrier
 Attachments: cpe.xml, ProcessingUnit-patch.txt, uima.zip


 Buggy scenario is a CPE with a first remote processor deployed as a Vinci 
 service and an integrated CAS consumer that throws a ResourceProcessException 
 in its process method.
 It is quite easy to reproduce with a dummy consumer with this implementation
  
 public void processCas(CAS aCAS) throws ResourceProcessException {
   throw new ResourceProcessException(new FileNotFoundException("file not found"));
 }
 It looks like the CPE is stuck trying to retrieve a CAS from the CAS pool 
 that is apparently empty at some point.
 My feeling is that when you have an ResourceProcessException thrown in the 
 last component of the CPE, the code that is supposed to release the CAS from 
 the CAS pool is not properly called...
 If I suspend the process in Eclipse I can see that the CasConsumer and the 
 Collection Reader pipelines Threads are waiting on the
  CPECasPool.getCas(long) method
 I attach the uima.log set to the FINEST level

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Delta CAS

2008-07-09 Thread Eddie Epstein
No opinions, but a few observations:

1M is way too big for some applications that need very small, but very many
CASes.

Large arrays may be bigger than whatever segment size is chosen, making
segment management a bit more complicated.

There will be holes at the top of every segment when the next FS doesn't
fit.

Eddie

On Wed, Jul 9, 2008 at 2:37 PM, Marshall Schor [EMAIL PROTECTED] wrote:

 Here's a suggestion prompted by previous posts, and common hardware design
 for segmented memory.

 Take the int values that represent feature structure (fs) references.
  Today, these are positive numbers from 1 (I think) to around 4 billion.
  These values are used directly as an index into the heap.

 Change this to split the bits in these int values into two parts, let's
 call them upper and lower.  For example
        xxxyyyyy

 where the xxx's are the upper bits (each x represents a hex digit), and the
 y's the lower bits.  The y's in this case can represent numbers up to 1
 million (approx), and the xxx's represent 4096 values.

 Then allocate the heap using multiple 1 meg entry tables, and store each
 one in the 4096 entry reference array.  The heap reference would be some
 bit-wise shifting and indexed lookup in addition to what we have now and
 would probably be very fast, and could be optimized for the xxx=0 case to be
 even faster.

 This breaks heaps of over 1 meg into separate parts, which would make them
  more manageable, I think, and keeps the high-water mark method viable, too.

 Opinions?

 -Marshall





[jira] Resolved: (UIMA-1105) CPE is stuck trying to retrieve a free CAS from the pool

2008-07-09 Thread Marshall Schor (JIRA)

 [ 
https://issues.apache.org/jira/browse/UIMA-1105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marshall Schor resolved UIMA-1105.
--

Resolution: Fixed

Applied the patch.  I'll attach the modified Jar to this issue, for testing. - 
Olivier - can you test it?  I think you should just be able to do this by 
replacing the jar in your classpath (unless there are other changes... this was 
built on the latest version in the trunk, but I don't think this part of the 
code has been changing very much).

 CPE is stuck trying to retrieve a free CAS from the pool
 

 Key: UIMA-1105
 URL: https://issues.apache.org/jira/browse/UIMA-1105
 Project: UIMA
  Issue Type: Bug
  Components: Collection Processing
Affects Versions: 2.2.1
 Environment: Windows XP 32 bits
Reporter: Olivier Terrier
 Attachments: cpe.xml, ProcessingUnit-patch.txt, uima.zip


 Buggy scenario is a CPE with a first remote processor deployed as a Vinci 
 service and an integrated CAS consumer that throws a ResourceProcessException 
 in its process method.
 It is quite easy to reproduce with a dummy consumer with this implementation
  
 public void processCas(CAS aCAS) throws ResourceProcessException {
   throw new ResourceProcessException(new FileNotFoundException("file not found"));
 }
 It looks like the CPE is stuck trying to retrieve a CAS from the CAS pool 
 that is apparently empty at some point.
 My feeling is that when you have an ResourceProcessException thrown in the 
 last component of the CPE, the code that is supposed to release the CAS from 
 the CAS pool is not properly called...
 If I suspend the process in Eclipse I can see that the CasConsumer and the 
 Collection Reader pipelines Threads are waiting on the
  CPECasPool.getCas(long) method
 I attach the uima.log set to the FINEST level

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (UIMA-1105) CPE is stuck trying to retrieve a free CAS from the pool

2008-07-09 Thread Marshall Schor (JIRA)

 [ 
https://issues.apache.org/jira/browse/UIMA-1105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marshall Schor reassigned UIMA-1105:


Assignee: Marshall Schor

 CPE is stuck trying to retrieve a free CAS from the pool
 

 Key: UIMA-1105
 URL: https://issues.apache.org/jira/browse/UIMA-1105
 Project: UIMA
  Issue Type: Bug
  Components: Collection Processing
Affects Versions: 2.2.1
 Environment: Windows XP 32 bits
Reporter: Olivier Terrier
Assignee: Marshall Schor
 Attachments: cpe.xml, ProcessingUnit-patch.txt, uima.zip


 Buggy scenario is a CPE with a first remote processor deployed as a Vinci 
 service and an integrated CAS consumer that throws a ResourceProcessException 
 in its process method.
 It is quite easy to reproduce with a dummy consumer with this implementation
  
 public void processCas(CAS aCAS) throws ResourceProcessException {
   throw new ResourceProcessException(new FileNotFoundException("file not found"));
 }
 It looks like the CPE is stuck trying to retrieve a CAS from the CAS pool 
 that is apparently empty at some point.
 My feeling is that when you have an ResourceProcessException thrown in the 
 last component of the CPE, the code that is supposed to release the CAS from 
 the CAS pool is not properly called...
 If I suspend the process in Eclipse I can see that the CasConsumer and the 
 Collection Reader pipelines Threads are waiting on the
  CPECasPool.getCas(long) method
 I attach the uima.log set to the FINEST level

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (UIMA-1105) CPE is stuck trying to retrieve a free CAS from the pool

2008-07-09 Thread Marshall Schor (JIRA)

 [ 
https://issues.apache.org/jira/browse/UIMA-1105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marshall Schor updated UIMA-1105:
-

Fix Version/s: 2.3

 CPE is stuck trying to retrieve a free CAS from the pool
 

 Key: UIMA-1105
 URL: https://issues.apache.org/jira/browse/UIMA-1105
 Project: UIMA
  Issue Type: Bug
  Components: Collection Processing
Affects Versions: 2.2.1
 Environment: Windows XP 32 bits
Reporter: Olivier Terrier
Assignee: Marshall Schor
 Fix For: 2.3

 Attachments: cpe.xml, ProcessingUnit-patch.txt, uima.zip






[jira] Updated: (UIMA-1105) CPE is stuck trying to retrieve a free CAS from the pool

2008-07-09 Thread Marshall Schor (JIRA)

 [ 
https://issues.apache.org/jira/browse/UIMA-1105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marshall Schor updated UIMA-1105:
-

Attachment: uima-cpe.jar

build for testing

 CPE is stuck trying to retrieve a free CAS from the pool
 

 Key: UIMA-1105
 URL: https://issues.apache.org/jira/browse/UIMA-1105
 Project: UIMA
  Issue Type: Bug
  Components: Collection Processing
Affects Versions: 2.2.1
 Environment: Windows XP 32 bits
Reporter: Olivier Terrier
Assignee: Marshall Schor
 Fix For: 2.3

 Attachments: cpe.xml, ProcessingUnit-patch.txt, uima-cpe.jar, uima.zip






Re: Delta CAS

2008-07-09 Thread Bhavani Iyer
If we are thinking of Delta CAS only in the context of services, the largest
xmi id works. But we were also using the same mechanism to support tracking
CAS activity by component. I suppose in that second case the additional
overhead of maintaining a list of the FSs that are added may be acceptable.

On Wed, Jul 9, 2008 at 3:48 PM, Adam Lally [EMAIL PROTECTED] wrote:

  Back to the high-water mark ... isn't it just the largest xmi id in the
  serialized CAS?  Its relationship to the CAS heap is a matter of
  implementation but presumably we can have a design that says any new FSs
  must be given an xmi id above the high-water mark when serialized back
 from
  a service.  We already have the requirement that ids must be preserved
 for
  the merging of parallel replies.
 

 Yes - there are really two definitions of high-water mark floating
 around in this thread and it would be good to split them apart.

 (1) the largest xmi:id in the serialized CAS.  This is a requirement
 that the service protocol places on the CAS serializer.  This is what
 we already have for merging, and I don't think Thilo is objecting to
 this.

 (2) a dependency on the FS address being an indicator of which FS are
 newer than others (an FS with a larger address is newer).

 As I think about it now, I am actually unclear on whether we are doing
 #2 right now at all.  Bhavani said we were, but that's not how I
 recall the serializer currently working.  It keeps a table of all
 the incoming FSs, which is necessary in order to have the xmi:ids going
 out be the same as the ones coming in.  So I thought the serializer
 just used the fact that an FS was missing from this table to determine
 that it was new, and *not* a high-water mark of the FS address.
 Bhavani, can you clarify?

  -Adam
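The two detection strategies distinguished in this thread can be contrasted in a small sketch (hypothetical helper, not the actual UIMA serializer API): definition (2) flags an FS as new when its id exceeds the high-water mark, while the table-based approach flags it as new when its id is absent from the set of ids that arrived with the deserialized CAS.

```java
import java.util.Set;

public class DeltaSketch {
    // Definition (2): anything above the high-water mark is considered new.
    // This relies on new FSs always receiving larger ids than pre-existing ones.
    static boolean isNewByHighWaterMark(int fsId, int highWaterMark) {
        return fsId > highWaterMark;
    }

    // Table-based detection: an FS absent from the table of incoming xmi:ids
    // is new, regardless of its address. This makes no ordering assumption,
    // at the cost of keeping the whole table in memory.
    static boolean isNewByTable(int fsId, Set<Integer> incomingIds) {
        return !incomingIds.contains(fsId);
    }
}
```

The high-water-mark check is O(1) with no extra memory, but it bakes in the assumption that FS addresses/ids are monotonically assigned; the table-based check is the more conservative of the two.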