RE: Apache UIMA Java sdk 3.0.0 released

2018-03-21 Thread D. Heinze
Thanks.  I've spent a bit of time trying to ensure that all the XML
dependencies are up-to-date.  Xalan and Xerces (both of which are pulled in
as dependencies by other jars) are both up-to-date.  These warnings seem to
be something that comes around every few years.

I see from the developers list that a UIMA-AS release for 3.0.0 may be expected sometime soon?

Thanks / Dan

-Original Message-
From: Richard Eckart de Castilho [mailto:r...@apache.org] 
Sent: Wednesday, March 21, 2018 10:11 AM
To: user@uima.apache.org
Subject: Re: Apache UIMA Java sdk 3.0.0 released

On 20.03.2018, at 15:31, D. Heinze  wrote:
> 
> Yes, I fixed some old dependencies and got rid of the XML errors.  Now 
> I am getting the following warning when calling the code below...  
> Seems this was an old problem that was fixed several UIMA releases 
> ago.  The only old UIMA thing I have in dependencies is uimaFIT 1.4, 
> which is for some deprecated code that will eventually be removed.
> 
> WARN  [  uima] SAXTransformerFactory didn't
> recognize setting attribute
> http://javax.xml.XMLConstants/property/accessExternalDTD
> WARN  [  uima] SAXTransformerFactory didn't
> recognize setting attribute
> http://javax.xml.XMLConstants/property/accessExternalStylesheet

UIMA is trying to configure the XML parser to avoid potential security
problems. These messages are issued when the underlying XML parser does not
support the respective configuration properties. The code should work
nevertheless, although it might be a good idea to check whether you have any
additional "old" XML dependencies that might cause the default JDK XML
implementations not to be used. Maybe a Xalan?
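
One quick way to check which implementations are actually being picked up is
a generic JAXP probe (a sketch, nothing UIMA-specific):

import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;

public class XmlFactoryCheck {
    public static void main(String[] args) {
        // With only the JDK on the classpath these usually print
        // com.sun.org.apache.* (the JDK-internal Xerces/Xalan copies);
        // org.apache.xerces.* / org.apache.xalan.* means an external,
        // possibly outdated library is winning the factory lookup.
        System.out.println("TransformerFactory: "
                + TransformerFactory.newInstance().getClass().getName());
        System.out.println("SAXParserFactory:   "
                + SAXParserFactory.newInstance().getClass().getName());
    }
}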

-- Richard



RE: Apache UIMA Java sdk 3.0.0 released

2018-03-20 Thread D. Heinze
Yes, I fixed some old dependencies and got rid of the XML errors.  Now I am
getting the following warning when calling the code below...  Seems this was
an old problem that was fixed several UIMA releases ago.  The only old UIMA
thing I have in dependencies is uimaFIT 1.4, which is for some deprecated code
that will eventually be removed.

WARN  [  uima] SAXTransformerFactory didn't
recognize setting attribute
http://javax.xml.XMLConstants/property/accessExternalDTD
WARN  [  uima] SAXTransformerFactory didn't
recognize setting attribute
http://javax.xml.XMLConstants/property/accessExternalStylesheet

URL resource = getClass().getClassLoader().getResource(descriptorPath);
XMLInputSource source = new XMLInputSource(resource);
AnalysisEngineDescription aed =
    UIMAFramework.getXMLParser().parseAnalysisEngineDescription(source);
ResourceManager manager = UIMAFramework.newDefaultResourceManager();
Map<String, Object> paramMap = new HashMap<>();
// Set the initial CAS heap size.
paramMap.put(UIMAFramework.CAS_INITIAL_HEAP_SIZE, "100");
// Disable JCas cache.
paramMap.put(UIMAFramework.JCAS_CACHE_ENABLED, "false");
this.engine = UIMAFramework.produceAnalysisEngine(aed, manager, paramMap);

-Original Message-
From: Richard Eckart de Castilho [mailto:r...@apache.org] 
Sent: Tuesday, March 20, 2018 4:54 AM
To: user@uima.apache.org
Subject: Re: Apache UIMA Java sdk 3.0.0 released

On 19.03.2018, at 20:36, D. Heinze  wrote:
> 
> Got runtime errors that were fixed by finding a version of Xerces (version
2.4.0 in this case) that would keep all dependencies happy.

Hm... I think that the XML support coming with recent JDKs (1.8 in
particular) should be sufficient for the needs of UIMA. 

Did you check if your project setup might be drawing in "old" versions of
XML libraries?

Which runtime errors did you get?

Cheers,

-- Richard



RE: Apache UIMA Java sdk 3.0.0 released

2018-03-19 Thread D. Heinze
Last week I converted a large project to UIMA 3.0.0.  I updated the Maven 
dependencies, rebuilt all the JCas files and removed the old *_Type.java files. 
 Everything compiled without problem.  Got runtime errors that were fixed by 
finding a version of Xerces (version 2.4.0 in this case) that would keep all 
dependencies happy. For short runs with a single pipeline, I'm noticing about a 
15-20% reduction in processing time and less memory usage (I presume due to GC 
on the CAS).  I had already been running JDK 1.8 for some time.
Am I correct in thinking that there is not a version of UIMA-AS that will work 
with UIMA 3.0.0?  When might a UIMA-AS 3.0.0 be expected?
Thanks / Dan

-Original Message-
From: Marshall Schor [mailto:m...@schor.com] 
Sent: Monday, March 5, 2018 1:08 PM
To: annou...@apache.org
Cc: uima-user; uima-dev
Subject: Apache UIMA Java sdk 3.0.0 released

The Apache UIMA team is pleased to announce the release of the Apache UIMA Java 
SDK, version 3.0.0.  This is the first release of a major re-implementation of 
the UIMA Java SDK, aligning it with Java 8 and high performance multi-core 
processors.

Apache UIMA (https://uima.apache.org) is a component architecture and framework 
for the analysis of unstructured content like text, video and audio data.

This release is a major rewrite of the internals of core UIMA, and includes 
many new features, including:
 -- support for arbitrary Java objects in the CAS
 -- New semi-built-in UIMA types: FSArrayList, FSHashSet, IntegerArrayList, Int2FS map
 -- New "select" framework integrated with Java 8 Streams
 -- Elimination of concurrent modification exceptions while iterating over UIMA indexes
 -- Automatic Garbage Collection of unreferenced Feature Structures
 -- All around better integration into Java 8 idioms and generic typing
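
As a taste of the select framework above, a minimal sketch (Token stands in
for any annotation type generated from your own type system):

import org.apache.uima.jcas.JCas;

public class SelectDemo {
    // Iterate Tokens in index order; select results also compose with
    // Java 8 streams. Token is assumed to be a user-defined JCas type.
    static void printTokens(JCas jcas) {
        for (Token t : jcas.select(Token.class)) {
            System.out.println(t.getCoveredText());
        }
    }
}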

See the UIMA News item (https://uima.apache.org/news.html#05 Mar 2018) for more details.

A full description of the new and changed parts is here:
http://uima.apache.org/d/uimaj-3.0.0/version_3_users_guide.html

This release requires Java 8, and is intended to be backwards compatible with 
existing Version 2 pipeline code, except for the need to regenerate or migrate 
(tooling provided) user-defined JCas class definitions.

Please send feedback via the Apache UIMA project mailing lists.

 -Marshall Schor, for the Apache UIMA development team




RE: Change an <fileUrl> at run-time

2016-03-22 Thread D. Heinze
Richard, Burn... thanks for the replies.  I'll try the -D option, which
appears to be just what I wanted.
Prior to the replies, I had done what Richard also suggested: intercepted
the file input stream, modified the XML descriptor, and then sent it on to
parseResourceSpecifier.
I had also tried using uimaFIT, but it looked like ConceptMapper doesn't
implement what uimaFIT needs.
Thanks / Dan
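
A rough sketch of that in-memory edit (untested; the resource name matches
the descriptor quoted below, and dict.version is a hypothetical -D property):

import org.apache.uima.UIMAFramework;
import org.apache.uima.analysis_engine.AnalysisEngine;
import org.apache.uima.analysis_engine.AnalysisEngineDescription;
import org.apache.uima.resource.ExternalResourceDescription;
import org.apache.uima.resource.FileResourceSpecifier;
import org.apache.uima.util.XMLInputSource;

public class DictionarySwitcher {
    public static AnalysisEngine createEngine(String descriptorPath) throws Exception {
        AnalysisEngineDescription aed = UIMAFramework.getXMLParser()
                .parseAnalysisEngineDescription(new XMLInputSource(descriptorPath));
        // Locate the external resource and append the version suffix to its URL.
        for (ExternalResourceDescription res
                : aed.getResourceManagerConfiguration().getExternalResources()) {
            if ("DictionaryFileName".equals(res.getName())) {
                FileResourceSpecifier frs =
                        (FileResourceSpecifier) res.getResourceSpecifier();
                frs.setFileUrl(frs.getFileUrl() + "_" + System.getProperty("dict.version"));
            }
        }
        return UIMAFramework.produceAnalysisEngine(aed);
    }
}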

-Original Message-
From: Burn Lewis [mailto:burnle...@gmail.com] 
Sent: Monday, March 21, 2016 1:23 PM
To: user@uima.apache.org
Subject: Re: Change an <fileUrl> at run-time

The link Richard provided indicates that the substitution may be used for a
<fileUrl>, so use:
<fileUrl><envVarRef>DictionaryUrl</envVarRef></fileUrl>
with
-DDictionaryUrl=com/gnoetics/resources/dictionary/ConceptMapperRoots_dict

(I verified that it works with a file: URL)

For arbitrary modification of descriptors you can use -Duima.framework_impl to 
provide an implementation of UIMAFramework_impl that supports an XMLParser 
that modifies the XML before passing it to the UIMA parser.

~Burn

On Sat, Mar 19, 2016 at 5:25 PM, Richard Eckart de Castilho wrote:

> Hi,
>
> you might find this interesting (although I haven't used that in a
> long time):
>
>
> https://uima.apache.org/d/uimaj-current/references.html#ugr.ref.xml.component_descriptor.aes.environment_variable_references
>
> Also, after you have parsed the specifier in the last line of your 
> code, you can traverse it, locate the element you wish to change and 
> simply change it at runtime. In that case it doesn't matter what the 
> XML file contains.
>
> The Java object structure is very similar to the XML structure. It may 
> be a bit verbose to actually get to the point where you want to go and 
> *maybe* you'll have to cast some interfaces to their implementations 
> to get access to setters, but in general it should work.
>
> Cheers,
>
> -- Richard
>
> > On 16.03.2016, at 23:46, D. Heinze  wrote:
> >
> > I have separate xml engine descriptor files for a set of ConceptMapper
> > engines, each using a distinct compiled dictionary file that is
> > specified in <fileUrl> as shown below.
> > The ConceptMapper engines are configured and run programmatically by
> > one of the engines in the overall UIMA pipeline, where the engines are
> > run selectively on data that is constructed on the fly.
> > Because all the descriptor files are alike (with stop word definitions,
> > etc.) except for the <fileUrl> that specifies the dictionary, I would
> > like to be able to have only one xml descriptor and be able to set a
> > version number as a Java -D parameter when starting the engine and
> > then add it to the <fileUrl> name after the ResourceSpecifier is
> > created and before produceAnalysisEngine is called.
> >
> > Is this possible?  Or, can <fileUrl> somehow be omitted from the
> > descriptor file and specified at run-time as a configuration parameter?
> > Thanks / Dan
> >
> > ENGINE INITIALIZATION CODE:
> >   InputStream is = cls.getClassLoader().getResourceAsStream(desc);
> >   XMLInputSource source = new XMLInputSource(is, null);
> >   ResourceSpecifier specifier =
> >       UIMAFramework.getXMLParser().parseResourceSpecifier(source);
> >   ae = UIMAFramework.produceAnalysisEngine(specifier);
> >
> > DESCRIPTOR (desc):
> > <analysisEngineDescription xmlns="http://uima.apache.org/resourceSpecifier">
> >   <frameworkImplementation>org.apache.uima.java</frameworkImplementation>
> >   <primitive>true</primitive>
> >   <annotatorImplementationName>org.apache.uima.conceptMapper.ConceptMapper</annotatorImplementationName>
> >   ...
> >   <resourceManagerConfiguration>
> >     <externalResources>
> >       <externalResource>
> >         <name>DictionaryFileName</name>
> >         <description>A file containing the dictionary. Modify this URL
> >           to use a different dictionary.</description>
> >         <fileResourceSpecifier>
> >           <fileUrl>com/gnoetics/resources/dictionary/ConceptMapperRoots_dict</fileUrl>
> >         </fileResourceSpecifier>
> >         <implementationName>org.apache.uima.conceptMapper.support.dictionaryResource.CompiledDictionaryResource_impl</implementationName>
> >       </externalResource>
> >     </externalResources>
> >     <externalResourceBindings>
> >       <externalResourceBinding>
> >         <key>DictionaryFile</key>
> >         <resourceName>DictionaryFileName</resourceName>
> >       </externalResourceBinding>
> >     </externalResourceBindings>
> >   </resourceManagerConfiguration>
> > </analysisEngineDescription>
> >
>
>



Change an <fileUrl> at run-time

2016-03-19 Thread D. Heinze
I have separate xml engine descriptor files for a set of ConceptMapper
engines, each using a distinct compiled dictionary file that is specified in
<fileUrl> as shown below.
The ConceptMapper engines are configured and run programmatically by one of
the engines in the overall UIMA pipeline, where the engines are run
selectively on data that is constructed on the fly.
Because all the descriptor files are alike (with stop word definitions,
etc.) except for the <fileUrl> that specifies the dictionary, I would like
to be able to have only one xml descriptor and be able to set a version
number as a Java -D parameter when starting the engine and then add it to
the <fileUrl> name after the ResourceSpecifier is created and before
produceAnalysisEngine is called.

Is this possible?  Or, can <fileUrl> somehow be omitted from the descriptor
file and specified at run-time as a configuration parameter?
Thanks / Dan 

ENGINE INITIALIZATION CODE:
  InputStream is = cls.getClassLoader().getResourceAsStream(desc);
  XMLInputSource source = new XMLInputSource(is, null);
  ResourceSpecifier specifier =
      UIMAFramework.getXMLParser().parseResourceSpecifier(source);
  ae = UIMAFramework.produceAnalysisEngine(specifier);

DESCRIPTOR (desc):

<analysisEngineDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <frameworkImplementation>org.apache.uima.java</frameworkImplementation>
  <primitive>true</primitive>
  <annotatorImplementationName>org.apache.uima.conceptMapper.ConceptMapper</annotatorImplementationName>
  ...
  <resourceManagerConfiguration>
    <externalResources>
      <externalResource>
        <name>DictionaryFileName</name>
        <description>A file containing the dictionary. Modify this URL to
          use a different dictionary.</description>
        <fileResourceSpecifier>
          <fileUrl>com/gnoetics/resources/dictionary/ConceptMapperRoots_dict</fileUrl>
        </fileResourceSpecifier>
        <implementationName>org.apache.uima.conceptMapper.support.dictionaryResource.CompiledDictionaryResource_impl</implementationName>
      </externalResource>
    </externalResources>
    <externalResourceBindings>
      <externalResourceBinding>
        <key>DictionaryFile</key>
        <resourceName>DictionaryFileName</resourceName>
      </externalResourceBinding>
    </externalResourceBindings>
  </resourceManagerConfiguration>
</analysisEngineDescription>




UIMA memory management and persistence

2016-01-13 Thread D. Heinze
I have a couple of questions that I have only been able to find vague or
partial answers to online:
1. Does UIMA still use C for implementing the CAS, or is it all Java now?
2. Does UIMA use the file system to implement the CAS persistence layer
during runtime?
   a. I'm curious because my engine loads all its resources into memory,
but between the time that it reads in a document and when it outputs the
results, the system profiler shows a lot of disk activity.  This is plain
UIMA, not UIMA-AS.

Thanks / Dan



RE: CAS serializationWithCompression

2016-01-13 Thread D. Heinze
Yes.  That was the main reason I wanted to update from 2.6.0.  Being able to
examine the JSON CAS, it took about half an hour to track down the problem.
If I had to hunt blind, it would have taken forever.  I had already profiled
all the code for real and potential memory leaks, but this was one that
didn't show up. 

Thanks / Dan

-Original Message-
From: Marshall Schor [mailto:m...@schor.com] 
Sent: Wednesday, January 13, 2016 2:27 PM
To: user@uima.apache.org
Subject: Re: CAS serializationWithCompression

Great!  Glad to see some use is being made of JSON :-). 

-Marshall

On 1/13/2016 2:05 PM, D. Heinze wrote:
> Found the problem by serializing the CAS to JSON.  The CAS sofaText 
> was acting like a pushdown stack and accumulating the full text of 
> each successive document due to an input stream and buffer not getting 
> properly closed/cleared between iterations.
>
> Thanks / Dan
>
> -Original Message-
> From: D. Heinze [mailto:dhei...@gnoetics.com]
> Sent: Tuesday, January 12, 2016 2:13 PM
> To: user@uima.apache.org
> Subject: RE: CAS serializationWithCompression
>
> Thanks Marshall.  Will do.  I just completed upgrading from UIMA 2.6.0 
> to
> 2.8.1 just to make sure there were no issues there.  Will now get back 
> to the CAS serialization issue.  Yes, I've been trying to think of 
> where there could be retained junk that is getting added back into the 
> CAS with each iteration.
>
> -Dan
>
> -Original Message-
> From: Marshall Schor [mailto:m...@schor.com]
> Sent: Tuesday, January 12, 2016 11:56 AM
> To: user@uima.apache.org
> Subject: Re: CAS serializationWithCompression
>
> hmmm, seems like unusual behavior.
>
> It would help a lot to diagnose this if you could construct a small 
> test case - one which perhaps creates a cas, fills it with a bit of 
> data, does the compressed serialization, resets the cas, and loops and 
> see if that produces "expanding" serializations.
>
>   -- if it does, please post the test case to a Jira and we'll 
> diagnose / fix this :-)
>
>   -- if it doesn't, then you have to get closer to your actual use 
> case and iterate until you see what it is that you last added that 
> starts making it serialize ever-expanding instances.  That will be a big
> clue, I think.
>
> -Marshall
>
> On 1/12/2016 10:54 AM, D. Heinze wrote:
>> The CAS.size() starts as larger than the serializedWithCompression 
>> version, but eventually the serializedWithCompression version grows 
>> to be larger than the CAS.size().
>> The overall process is:
>> * Create a new CAS
>> * Read in an xml document and store the structure and content in the cas.
>> * Tokenize and parse the document and store that info in the cas.
>> * Run a number of lexical engines and ConceptMapper engines on the 
>> data and store that info in the cas
>> * Produce an xml document with the content of the original input 
>> document marked up with the analysis results and both write that out 
>> to a file and also store it in the cas
>> * serializeWithCompression to a FileOutputStream
>> * cas.reset()
>> * iterate on the next input document
>> All the work other than creating and cas.reset() is done using the JCas.
>> Even though the output CASes keep getting larger, they seem to 
>> deserialize just fine and are usable.
>> Thanks/Dan
>>
>> -Original Message-
>> From: Richard Eckart de Castilho [mailto:r...@apache.org]
>> Sent: Tuesday, January 12, 2016 2:45 AM
>> To: user@uima.apache.org
>> Subject: Re: CAS serializationWithCompression
>>
>> Is the CAS.size() larger than the serialized version or smaller?
>> What are you actually doing to the CAS? Just 
>> serializing/deserializing a couple of times in a row, or do you actually
>> add feature structures?
>> The sample code you show doesn't give any hint about where the CAS 
>> comes from and what is being done with it.
>>
>> -- Richard
>>
>>> On 12.01.2016, at 03:06, D. Heinze  wrote:
>>>
>>> I'm having a problem with CAS serializationWithCompression.  I am 
>>> processing a few million text documents on an IBM P8 with 16 physical 
>>> SMT-8 CPUs, 200GB RAM, Ubuntu 14.04.3 LTS and IBM Java 1.8.
>>>
>>> I run 55 UIMA pipelines concurrently.  I'm using UIMA 2.6.0.
>>>
>>> I use serializeWithCompression to save the final state of the 
>>> processing on each document to a file for later processing.
>>>
>>> However, the size of the serialized CAS just keeps growing.  The 
>>> size of the CAS is stable, but the serialized CASes just keep 
>>> getting bigger. [...]

RE: CAS serializationWithCompression

2016-01-13 Thread D. Heinze
Found the problem by serializing the CAS to JSON.  The CAS sofaText was
acting like a pushdown stack, accumulating the full text of each successive
document, because an input stream and buffer were not getting properly
closed/cleared between iterations.

Thanks / Dan
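
Generically, the fix pattern is to scope the stream and buffer to a single
document, e.g. with try-with-resources (a sketch; names are illustrative):

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class DocumentReader {
    static String readDocument(Path path) throws IOException {
        StringBuilder text = new StringBuilder();  // fresh buffer per document
        try (BufferedReader in = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                text.append(line).append('\n');
            }
        }  // stream is closed here, even on exceptions
        return text.toString();
    }
}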

-Original Message-
From: D. Heinze [mailto:dhei...@gnoetics.com] 
Sent: Tuesday, January 12, 2016 2:13 PM
To: user@uima.apache.org
Subject: RE: CAS serializationWithCompression

Thanks Marshall.  Will do.  I just completed upgrading from UIMA 2.6.0 to
2.8.1 just to make sure there were no issues there.  Will now get back to
the CAS serialization issue.  Yes, I've been trying to think of where there
could be retained junk that is getting added back into the CAS with each
iteration.

-Dan

-Original Message-
From: Marshall Schor [mailto:m...@schor.com]
Sent: Tuesday, January 12, 2016 11:56 AM
To: user@uima.apache.org
Subject: Re: CAS serializationWithCompression

hmmm, seems like unusual behavior.

It would help a lot to diagnose this if you could construct a small test
case - one which perhaps creates a cas, fills it with a bit of data, does
the compressed serialization, resets the cas, and loops and see if that
produces "expanding" serializations.

  -- if it does, please post the test case to a Jira and we'll diagnose /
fix this :-)

  -- if it doesn't, then you have to get closer to your actual use case and
iterate until you see what it is that you last added that starts making it
serialize ever-expanding instances.  That will be a big clue, I think.
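
Such a test case might be sketched like this (untested; an empty type system,
just enough to exercise the serialize/reset loop):

import java.io.ByteArrayOutputStream;
import org.apache.uima.UIMAFramework;
import org.apache.uima.cas.CAS;
import org.apache.uima.cas.impl.Serialization;
import org.apache.uima.resource.metadata.TypeSystemDescription;
import org.apache.uima.util.CasCreationUtils;

public class GrowthTest {
    public static void main(String[] args) throws Exception {
        TypeSystemDescription tsd =
                UIMAFramework.getResourceSpecifierFactory().createTypeSystemDescription();
        CAS cas = CasCreationUtils.createCas(tsd, null, null);
        for (int i = 0; i < 10; i++) {
            cas.setDocumentText("a bit of data, iteration " + i);
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            Serialization.serializeWithCompression(cas, bos);
            System.out.println("iteration " + i + ": " + bos.size() + " bytes");
            cas.reset();  // sizes should stay flat if reset fully clears state
        }
    }
}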

-Marshall

On 1/12/2016 10:54 AM, D. Heinze wrote:
> The CAS.size() starts as larger than the serializedWithCompression 
> version, but eventually the serializedWithCompression version grows to 
> be larger than the CAS.size().
> The overall process is:
> * Create a new CAS
> * Read in an xml document and store the structure and content in the cas.
> * Tokenize and parse the document and store that info in the cas.
> * Run a number of lexical engines and ConceptMapper engines on the 
> data and store that info in the cas
> * Produce an xml document with the content of the original input 
> document marked up with the analysis results and both write that out 
> to a file and also store it in the cas
> * serializeWithCompression to a FileOutputStream
> * cas.reset()
> * iterate on the next input document
> All the work other than creating and cas.reset() is done using the JCas.
> Even though the output CASes keep getting larger, they seem to 
> deserialize just fine and are usable.
> Thanks/Dan
>
> -Original Message-
> From: Richard Eckart de Castilho [mailto:r...@apache.org]
> Sent: Tuesday, January 12, 2016 2:45 AM
> To: user@uima.apache.org
> Subject: Re: CAS serializationWithCompression
>
> Is the CAS.size() larger than the serialized version or smaller?
> What are you actually doing to the CAS? Just serializing/deserializing 
> a couple of times in a row, or do you actually add feature structures?
> The sample code you show doesn't give any hint about where the CAS 
> comes from and what is being done with it.
>
> -- Richard
>
>> On 12.01.2016, at 03:06, D. Heinze  wrote:
>>
>> I'm having a problem with CAS serializationWithCompression.  I am 
>> processing a few million text documents on an IBM P8 with 16 physical 
>> SMT-8 CPUs, 200GB RAM, Ubuntu 14.04.3 LTS and IBM Java 1.8.
>>
>> I run 55 UIMA pipelines concurrently.  I'm using UIMA 2.6.0.
>>
>> I use serializeWithCompression to save the final state of the 
>> processing on each document to a file for later processing.
>>
>> However, the size of the serialized CAS just keeps growing.  The size 
>> of the CAS is stable, but the serialized CASes just keep getting 
>> bigger. I even went to creating a new CAS for each process instead of 
>> using cas.reset().  I have also tried writing the serialized CAS to a 
>> byte array output stream first and then to a file, but it is the 
>> serializeWithCompression that caused the size problem, not writing the
>> file.
>> Here's what the code looks like.  Flushing or not flushing does not 
>> make a difference.  Closing or not closing the file output stream does 
>> not make a difference (other than leaking memory).  I've also tried 
>> doing serializeWithCompression with type filtering.  Wanted to try 
>> using a Marker, but cannot see how to do that.  The problem exists 
>> regardless of doing 1 or
>> 55 pipelines concurrently.
>>
>>
>>
>>File fout = new File(documentPath);
>>
>>fos = new FileOutputStream(fout);
>>
>>
>> org.apache.uima.cas.impl.Serialization.serializeWithCompression(
>> cas, fos);
>>
>>fos.flush();
>>
>>fos.close();
>>
>>logger.info( "serializedCas size " + cas.size() + " ToFile " + 
>> documentPath);
>>
>>
>>
>> Suggestions will be appreciated.
>>
>>
>>
>> Thanks / Dan
>>
>>
>>
>



RE: CAS serializationWithCompression

2016-01-12 Thread D. Heinze
Thanks Marshall.  Will do.  I just completed upgrading from UIMA 2.6.0 to
2.8.1 just to make sure there were no issues there.  Will now get back to
the CAS serialization issue.  Yes, I've been trying to think of where there
could be retained junk that is getting added back into the CAS with each
iteration.

-Dan

-Original Message-
From: Marshall Schor [mailto:m...@schor.com] 
Sent: Tuesday, January 12, 2016 11:56 AM
To: user@uima.apache.org
Subject: Re: CAS serializationWithCompression

hmmm, seems like unusual behavior.

It would help a lot to diagnose this if you could construct a small test
case - one which perhaps creates a cas, fills it with a bit of data, does
the compressed serialization, resets the cas, and loops and see if that
produces "expanding" serializations.

  -- if it does, please post the test case to a Jira and we'll diagnose /
fix this :-)

  -- if it doesn't, then you have to get closer to your actual use case and
iterate until you see what it is that you last added that starts making it
serialize ever-expanding instances.  That will be a big clue, I think.

-Marshall

On 1/12/2016 10:54 AM, D. Heinze wrote:
> The CAS.size() starts as larger than the serializedWithCompression 
> version, but eventually the serializedWithCompression version grows to 
> be larger than the CAS.size().
> The overall process is:
> * Create a new CAS
> * Read in an xml document and store the structure and content in the cas.
> * Tokenize and parse the document and store that info in the cas.
> * Run a number of lexical engines and ConceptMapper engines on the 
> data and store that info in the cas
> * Produce an xml document with the content of the original input 
> document marked up with the analysis results and both write that out 
> to a file and also store it in the cas
> * serializeWithCompression to a FileOutputStream
> * cas.reset()
> * iterate on the next input document
> All the work other than creating and cas.reset() is done using the JCas.
> Even though the output CASes keep getting larger, they seem to 
> deserialize just fine and are usable.
> Thanks/Dan
>
> -Original Message-
> From: Richard Eckart de Castilho [mailto:r...@apache.org]
> Sent: Tuesday, January 12, 2016 2:45 AM
> To: user@uima.apache.org
> Subject: Re: CAS serializationWithCompression
>
> Is the CAS.size() larger than the serialized version or smaller?
> What are you actually doing to the CAS? Just serializing/deserializing 
> a couple of times in a row, or do you actually add feature structures?
> The sample code you show doesn't give any hint about where the CAS 
> comes from and what is being done with it.
>
> -- Richard
>
>> On 12.01.2016, at 03:06, D. Heinze  wrote:
>>
>> I'm having a problem with CAS serializationWithCompression.  I am 
>> processing a few million text documents on an IBM P8 with 16 physical 
>> SMT-8 CPUs, 200GB RAM, Ubuntu 14.04.3 LTS and IBM Java 1.8.
>>
>> I run 55 UIMA pipelines concurrently.  I'm using UIMA 2.6.0.
>>
>> I use serializeWithCompression to save the final state of the 
>> processing on each document to a file for later processing.
>>
>> However, the size of the serialized CAS just keeps growing.  The size 
>> of the CAS is stable, but the serialized CASes just keep getting 
>> bigger. I even went to creating a new CAS for each process instead of 
>> using cas.reset().  I have also tried writing the serialized CAS to a 
>> byte array output stream first and then to a file, but it is the 
>> serializeWithCompression that caused the size problem, not writing the
>> file.
>> Here's what the code looks like.  Flushing or not flushing does not 
>> make a difference.  Closing or not closing the file output stream does 
>> not make a difference (other than leaking memory).  I've also tried 
>> doing serializeWithCompression with type filtering.  Wanted to try 
>> using a Marker, but cannot see how to do that.  The problem exists 
>> regardless of doing 1 or
>> 55 pipelines concurrently.
>>
>>
>>
>>File fout = new File(documentPath);
>>
>>fos = new FileOutputStream(fout);
>>
>>
>> org.apache.uima.cas.impl.Serialization.serializeWithCompression(
>> cas, fos);
>>
>>fos.flush();
>>
>>fos.close();
>>
>>logger.info( "serializedCas size " + cas.size() + " ToFile " + 
>> documentPath);
>>
>>
>>
>> Suggestions will be appreciated.
>>
>>
>>
>> Thanks / Dan
>>
>>
>>
>



RE: CAS serializationWithCompression

2016-01-12 Thread D. Heinze
The CAS.size() starts as larger than the serializedWithCompression version,
but eventually the serializedWithCompression version grows to be larger than
the CAS.size().
The overall process is:
* Create a new CAS
* Read in an xml document and store the structure and content in the cas.
* Tokenize and parse the document and store that info in the cas.
* Run a number of lexical engines and ConceptMapper engines on the data and
store that info in the cas
* Produce an xml document with the content of the original input document
marked up with the analysis results and both write that out to a file and
also store it in the cas
* serializeWithCompression to a FileOutputStream
* cas.reset()
* iterate on the next input document
All the work other than creating and cas.reset() is done using the JCas.
Even though the output CASes keep getting larger, they seem to deserialize
just fine and are usable.
Thanks/Dan

-Original Message-
From: Richard Eckart de Castilho [mailto:r...@apache.org] 
Sent: Tuesday, January 12, 2016 2:45 AM
To: user@uima.apache.org
Subject: Re: CAS serializationWithCompression

Is the CAS.size() larger than the serialized version or smaller?
What are you actually doing to the CAS? Just serializing/deserializing a
couple of times in a row, or do you actually add feature structures?
The sample code you show doesn't give any hint about where the CAS comes
from and what is being done with it.

-- Richard

> On 12.01.2016, at 03:06, D. Heinze  wrote:
> 
> I'm having a problem with CAS serializationWithCompression.  I am 
> processing a few million text documents on an IBM P8 with 16 physical 
> SMT-8 CPUs, 200GB RAM, Ubuntu 14.04.3 LTS and IBM Java 1.8.
> 
> I run 55 UIMA pipelines concurrently.  I'm using UIMA 2.6.0.
> 
> I use serializeWithCompression to save the final state of the 
> processing on each document to a file for later processing.
> 
> However, the size of the serialized CAS just keeps growing.  The size 
> of the CAS is stable, but the serialized CASes just keep getting 
> bigger. I even went to creating a new CAS for each process instead of 
> using cas.reset().  I have also tried writing the serialized CAS to a 
> byte array output stream first and then to a file, but it is the 
> serializeWithCompression that caused the size problem, not writing the
> file.
> 
> Here's what the code looks like.  Flushing or not flushing does not 
> make a difference.  Closing or not closing the file output stream does 
> not make a difference (other than leaking memory).  I've also tried 
> doing serializeWithCompression with type filtering.  Wanted to try 
> using a Marker, but cannot see how to do that.  The problem exists 
> regardless of doing 1 or
> 55 pipelines concurrently.
> 
> 
> 
>File fout = new File(documentPath);
> 
>fos = new FileOutputStream(fout);
> 
>
> org.apache.uima.cas.impl.Serialization.serializeWithCompression(
> cas, fos);
> 
>fos.flush();
> 
>fos.close();
> 
>logger.info( "serializedCas size " + cas.size() + " ToFile " + 
> documentPath);
> 
> 
> 
> Suggestions will be appreciated.
> 
> 
> 
> Thanks / Dan
> 
> 
> 



CAS serializationWithCompression

2016-01-11 Thread D. Heinze
I'm having a problem with CAS serializationWithCompression.  I am processing
a few million text documents on an IBM P8 with 16 physical SMT-8 CPUs, 200GB
RAM, Ubuntu 14.04.3 LTS and IBM Java 1.8.

I run 55 UIMA pipelines concurrently.  I'm using UIMA 2.6.0.

I use serializeWithCompression to save the final state of the processing on
each document to a file for later processing.

However, the size of the serialized CAS just keeps growing.  The size of the
CAS is stable, but the serialized CASes just keep getting bigger.  I even
went to creating a new CAS for each process instead of using cas.reset().  I
have also tried writing the serialized CAS to a byte array output stream
first and then to a file, but it is the serializeWithCompression that caused
the size problem, not writing the file.

Here's what the code looks like.  Flushing or not flushing does not make a
difference.  Closing or not closing the file output stream does not make a
difference (other than leaking memory).  I've also tried doing
serializeWithCompression with type filtering.  I wanted to try using a
Marker, but cannot see how to do that.  The problem exists regardless of
doing 1 or 55 pipelines concurrently.

 

File fout = new File(documentPath);
fos = new FileOutputStream(fout);
org.apache.uima.cas.impl.Serialization.serializeWithCompression(cas, fos);
fos.flush();
fos.close();
logger.info("serializedCas size " + cas.size() + " ToFile " + documentPath);

 

Suggestions will be appreciated.

 

Thanks / Dan
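
On the Marker question, Serialization also has an overload that takes a
Marker for delta serialization; a rough sketch (untested):

import java.io.FileOutputStream;
import org.apache.uima.cas.CAS;
import org.apache.uima.cas.Marker;
import org.apache.uima.cas.impl.Serialization;

public class DeltaSerializer {
    static void serializeDelta(CAS cas, String documentPath) throws Exception {
        Marker marker = cas.createMarker();  // high-water mark before new work
        // ... annotators add new feature structures here ...
        try (FileOutputStream fos = new FileOutputStream(documentPath)) {
            // Only feature structures created after the marker are written.
            Serialization.serializeWithCompression(cas, fos, marker);
        }
    }
}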

 



UIMA-AS and ActiveMQ ports

2015-04-27 Thread D. Heinze
Does UIMA-AS have internal dependencies on ActiveMQ port 61616?  I can change 
my applications to use other ports, but it seems that 61616 still needs to be 
available for something in UIMA-AS.

Thanks / Dan




RE: DUCC web server interfacing

2014-11-21 Thread D. Heinze
Okay.  That makes sense.  From reading the DuccBook, I saw in chapter 5 on
services: "A service is one or more long-running processes that await requests
from UIMA pipeline components and return something in response."  That makes it
sound like DUCC services only serve DUCC jobs, but if I can call services
from components outside of DUCC control, that solves my problem.

Thanks / Dan

-Original Message-
From: Eddie Epstein [mailto:eaepst...@gmail.com] 
Sent: Friday, November 21, 2014 7:11 AM
To: user@uima.apache.org
Subject: Re: DUCC web server interfacing

On Thu, Nov 20, 2014 at 10:01 PM, D. Heinze  wrote:

> Eddie... thanks.  Yes, that sounds like I would not have the advantage 
> of DUCC managing the UIMA pipeline.
>

Depends on the definition of "managing". DUCC manages the lifecycle of analytic 
pipelines running as job processes and as services. There are differences in 
how DUCC decides how many instances of each are run. And you are right that 
only for jobs will DUCC send work items to the analytic pipeline.


>
> To break it down a little for the uninitiated (me),
>
>  1. how do I start a DUCC job that stays resident because it has high 
> startup cost (e.g. 2 minutes to load all the resources for the UIMA 
> pipeline vs. about 2 seconds to process each request)?
>

Run the pipeline as a service. A service can be configured to start 
automatically, as soon as DUCC starts. If the load on the service increases, 
DUCC can be told [manually or programmatically] to launch additional service 
instances.


> 2. once I have a resident job, how do I get the Job Driver to 
> iteratively feed references to each next document (as they are 
> received) to the resident Job Process?  Because all the input jobs 
> will be archived anyhow, I'm okay with passing them through the file system 
> if needed.
>

The easiest approach is to have an application driver, say a web service, 
directly feed input to the service. If using references as input, the same 
analytic pipeline could be used both for live processing as a service and for 
batch job processing.

DUCC jobs are designed for batch work, where the size of the input collection 
is known and the number of job processes will be replicated as much as 
possible, given available resources and the job's fair share when multiple jobs 
are running.

DUCC services are intended to support job pipelines, for example a large memory 
but low latency analytic that can be shared by many job process instances, or 
for interactive applications.

Have you looked at creating a UIMA-AS service from a UIMA pipeline?

Eddie
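
For reference, a driver feeding a UIMA-AS service via the client API might
look roughly like this (a sketch; broker URL and queue name are placeholders):

import java.util.HashMap;
import java.util.Map;
import org.apache.uima.aae.client.UimaAsynchronousEngine;
import org.apache.uima.adapter.jms.client.BaseUIMAAsynchronousEngine_impl;
import org.apache.uima.cas.CAS;

public class ServiceDriver {
    public static void main(String[] args) throws Exception {
        UimaAsynchronousEngine client = new BaseUIMAAsynchronousEngine_impl();
        Map<String, Object> ctx = new HashMap<>();
        ctx.put(UimaAsynchronousEngine.ServerUri, "tcp://localhost:61616"); // placeholder broker
        ctx.put(UimaAsynchronousEngine.ENDPOINT, "myAnalysisQueue");        // placeholder queue
        client.initialize(ctx);
        CAS cas = client.getCAS();          // CAS from the client's pool
        cas.setDocumentText("document arriving from the web front end");
        client.sendAndReceiveCAS(cas);      // blocks until the service replies
        cas.release();
        client.stop();
    }
}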




RE: DUCC web server interfacing

2014-11-20 Thread D. Heinze
Eddie... thanks.  Yes, that sounds like I would not have the advantage of DUCC 
managing the UIMA pipeline. 

To break it down a little for the uninitiated (me), 

 1. how do I start a DUCC job that stays resident because it has high startup 
cost (e.g. 2 minutes to load all the resources for the UIMA pipeline vs. about 2 
seconds to process each request)?

2. once I have a resident job, how do I get the Job Driver to iteratively feed 
references to each next document (as they are received) to the resident Job 
Process?  Because all the input jobs will be archived anyhow, I'm okay with 
passing them through the file system if needed.

Thanks / Dan

-Original Message-
From: Eddie Epstein [mailto:eaepst...@gmail.com] 
Sent: Thursday, November 20, 2014 6:06 PM
To: user@uima.apache.org
Subject: Re: DUCC web server interfacing

Oops, in this case the web server would be feeding the service directly.

On Thu, Nov 20, 2014 at 9:04 PM, Eddie Epstein  wrote:

> The preferred approach is to run the analytics as a DUCC service, and 
> have an application driver that feeds the service instances with incoming 
> data.
> This service would be a scalable UIMA-AS service, which could have as 
> many instances as are needed to keep up with the load. The driver 
> would use the uima-as client API to feed the service. The application 
> driver could itself be another DUCC service.
>
> DUCC manages the life cycle of its services, including restarting them 
> on failure.
>
> Eddie
>
>
> On Thu, Nov 20, 2014 at 6:45 PM, Daniel Heinze  wrote:
>
>> I just installed DUCC this week and can process batch jobs.  I would 
>> like DUCC to initiate/manage one or more copies of the same UIMA 
>> pipeline that has high startup overhead and keep it/them active and 
>> feed it/them with documents that arrive periodically over a web 
>> service.  Any suggestions on the preferred way (if any) to do this in DUCC.
>>
>>
>>
>> Thanks / Dan
>>
>>
>