Working with very large text documents

2013-10-18 Thread Armin.Wegner
Hi,

What do you do with very large text documents in a UIMA pipeline, for example
9 GB in size?

A. I expect that you split the large file before putting it into the pipeline.
Or do you use a CAS multiplier in the pipeline to split it? Either way, where do
you split the input file? You cannot just split it anywhere; there is a real
risk of breaking the content. Is there a preferred chunk size for UIMA?

B. Another possibility might be not to store the data in the CAS at all and to
use a URI reference instead. It is then up to the analysis engine how to load
the data. My first idea was to use java.util.Scanner, for example for regular
expressions. But I think that you need to have the whole text loaded to iterate
over annotations. Or is it just AnnotationFS.getCoveredText() that would not
work? Any suggestions here?
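
For illustration, a rough, untested sketch of what I mean by B -- the reader
only records a file URI via CAS.setSofaDataURI(), and an engine streams the
data itself via CAS.getSofaDataStream():

// Untested sketch: a reader that only records a URI instead of loading 9 GB of text.
import java.io.File;
import java.io.InputStream;
import org.apache.uima.cas.CAS;

public class UriOnlySketch {

    // In the collection reader: no setDocumentText(), the sofa just points at the file.
    public static void fillCas(CAS cas, File hugeLog) {
        cas.setSofaDataURI(hugeLog.toURI().toString(), "text/plain");
    }

    // In an analysis engine: stream the referenced data instead of getDocumentText().
    // Note: AnnotationFS.getCoveredText() cannot work here, because the CAS holds no
    // text for the offsets to point into.
    public static InputStream openData(CAS cas) {
        return cas.getSofaDataStream();
    }
}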

What are best practices for this problem?

Thanks,
Armin









Re: Working with very large text documents

2013-10-18 Thread Richard Eckart de Castilho
On 18.10.2013, at 10:06, armin.weg...@bka.bund.de wrote:

 Hi,
 
 What do you do with very large text documents in a UIMA pipeline, for example
 9 GB in size?

In that order of magnitude, I'd probably try to get a computer with more memory 
;) 

 A. I expect that you split the large file before putting it into the pipeline.
 Or do you use a CAS multiplier in the pipeline to split it? Either way, where
 do you split the input file? You cannot just split it anywhere; there is a real
 risk of breaking the content. Is there a preferred chunk size for UIMA?

The chunk size would likely not depend on UIMA, but rather on the machine you
are using. If you cannot split the data at defined locations, maybe you can use
a windowing approach where two splits have a certain overlap?
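
Just as a rough, untested sketch of what such an overlapping window splitter
could look like (plain Java, no UIMA specifics; in a real pipeline each window
would become its own CAS, e.g. emitted by a reader or CAS multiplier, instead
of being collected in a list):

import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

/**
 * Cuts a character stream into windows of windowSize chars that overlap by
 * overlapSize chars (assumes overlapSize < windowSize).
 */
public class OverlappingWindows {

    public static List<String> split(Reader in, int windowSize, int overlapSize)
            throws IOException {
        List<String> windows = new ArrayList<>();
        char[] buf = new char[windowSize];
        int filled = 0;
        int read;
        while ((read = in.read(buf, filled, windowSize - filled)) != -1) {
            filled += read;
            if (filled == windowSize) {
                windows.add(new String(buf, 0, filled));
                // carry the tail over so the next window overlaps the previous one
                System.arraycopy(buf, windowSize - overlapSize, buf, 0, overlapSize);
                filled = overlapSize;
            }
        }
        // emit the remainder unless it is only the already-emitted overlap
        if (filled > overlapSize || (windows.isEmpty() && filled > 0)) {
            windows.add(new String(buf, 0, filled));
        }
        return windows;
    }
}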

 B. Another possibility might be not to store the data in the CAS at all and
 to use a URI reference instead. It is then up to the analysis engine how to
 load the data. My first idea was to use java.util.Scanner, for example for
 regular expressions. But I think that you need to have the whole text loaded
 to iterate over annotations. Or is it just AnnotationFS.getCoveredText() that
 would not work? Any suggestions here?

No idea unfortunately, never used the stream so far.

-- Richard




Re: Working with very large text documents

2013-10-18 Thread Jens Grivolla

On 10/18/2013 10:06 AM, Armin Wegner wrote:


What are you doing with very large text documents in an UIMA Pipeline, for 
example 9 GB in size.


Just out of curiosity, how can you possibly have 9 GB of text that represents
one document? From a quick look at Project Gutenberg it seems that a full book
with HTML markup is about 500 kB to 1 MB, so that's about a complete public
library full of books.


Bye,
Jens



AW: Working with very large text documents

2013-10-18 Thread Armin.Wegner
Hi Jens,

It's a log file.

Cheers,
Armin 

-Original Message-
From: Jens Grivolla [mailto:j+...@grivolla.net]
Sent: Friday, 18 October 2013 11:05
To: user@uima.apache.org
Subject: Re: Working with very large text documents

On 10/18/2013 10:06 AM, Armin Wegner wrote:

 What do you do with very large text documents in a UIMA pipeline, for example
 9 GB in size?

Just out of curiosity, how can you possibly have 9 GB of text that represents
one document? From a quick look at Project Gutenberg it seems that a full book
with HTML markup is about 500 kB to 1 MB, so that's about a complete public
library full of books.

Bye,
Jens





Re: Working with very large text documents

2013-10-18 Thread Richard Eckart de Castilho
Hi Armin,

that's a good point. It's also an issue with UIMA then, because
the begin/end offsets are likewise int values.

If it is a log file, couldn't you split it into sections of e.g.
one CAS per day and analyze each one? If there are long-distance
relations that span days, you could add a second pass which
reads in all analyzed CASes for a rolling window of e.g. 7 days
and tries to find the long-distance relations in that window.
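
For the second pass, an untested sketch of the idea (the per-day .xmi file
layout is an assumption; deserialization and the relation search are left as
comments, since they depend on your type system):

import java.io.File;
import java.util.Arrays;
import java.util.List;

/**
 * Sketch of the second pass: slide a 7-day window over per-day CAS files.
 * Assumes the first pass wrote one XMI file per day, named by ISO date, so
 * lexical order equals chronological order.
 */
public class RollingWindowPass {

    public static void main(String[] args) {
        File[] dayFiles = new File(args[0]).listFiles((dir, name) -> name.endsWith(".xmi"));
        Arrays.sort(dayFiles);
        int window = 7;
        for (int start = 0; start + window <= dayFiles.length; start++) {
            List<File> days = Arrays.asList(dayFiles).subList(start, start + window);
            // Deserialize these seven CASes (e.g. with XmiCasDeserializer) and search
            // them for long-distance relations; the details depend on the type system.
            System.out.println("analyzing window starting at " + days.get(0).getName());
        }
    }
}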

-- Richard

On 18.10.2013, at 10:48, armin.weg...@bka.bund.de wrote:

 Hi Richard,
 
 As far as I know, Java strings cannot be longer than 2 GB on 64-bit VMs.
 
 Armin 
 
 -Original Message-
 From: Richard Eckart de Castilho [mailto:r...@apache.org]
 Sent: Friday, 18 October 2013 10:43
 To: user@uima.apache.org
 Subject: Re: Working with very large text documents
 
 On 18.10.2013, at 10:06, armin.weg...@bka.bund.de wrote:
 
 Hi,
 
 What do you do with very large text documents in a UIMA pipeline, for example
 9 GB in size?
 
 In that order of magnitude, I'd probably try to get a computer with more 
 memory ;) 
 
 A. I expect that you split the large file before putting it into the pipeline.
 Or do you use a CAS multiplier in the pipeline to split it? Either way, where
 do you split the input file? You cannot just split it anywhere; there is a
 real risk of breaking the content. Is there a preferred chunk size for UIMA?
 
 The chunk size would likely not depend on UIMA, but rather on the machine you
 are using. If you cannot split the data at defined locations, maybe you can
 use a windowing approach where two splits have a certain overlap?
 
 B. Another possibility might be not to store the data in the CAS at all and
 to use a URI reference instead. It is then up to the analysis engine how to
 load the data. My first idea was to use java.util.Scanner, for example for
 regular expressions. But I think that you need to have the whole text loaded
 to iterate over annotations. Or is it just AnnotationFS.getCoveredText() that
 would not work? Any suggestions here?
 
 No idea unfortunately, never used the stream so far.
 
 -- Richard
 
 



Re: AW: Working with very large text documents

2013-10-18 Thread Jens Grivolla
Ok, but then log files are usually very easy to split since they 
normally consist of independent lines. So you could just have one 
document per day or whatever gets it down to a reasonable size, without 
the risk of breaking grammatical or semantic relationships.
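
A quick, untested sketch of such a per-day splitter, assuming every line starts
with an ISO date (yyyy-MM-dd); the resulting files could then be fed to the
existing reader one by one:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

/**
 * Splits a huge log into one file per day. Lines are never cut, so no record
 * is broken.
 */
public class DailyLogSplitter {

    public static void main(String[] args) throws IOException {
        Path input = Paths.get(args[0]);
        Path outDir = Paths.get(args[1]);
        Files.createDirectories(outDir);
        Map<String, Writer> writers = new HashMap<>();
        try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                String day = line.length() >= 10 ? line.substring(0, 10) : "unknown";
                Writer out = writers.computeIfAbsent(day, d -> open(outDir, d));
                out.write(line);
                out.write('\n');
            }
        } finally {
            for (Writer out : writers.values()) {
                out.close();
            }
        }
    }

    private static Writer open(Path outDir, String day) {
        try {
            return Files.newBufferedWriter(outDir.resolve(day + ".log"), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}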


On 10/18/2013 12:25 PM, Armin Wegner wrote:

Hi Jens,

It's a log file.

Cheers,
Armin

-Original Message-
From: Jens Grivolla [mailto:j+...@grivolla.net]
Sent: Friday, 18 October 2013 11:05
To: user@uima.apache.org
Subject: Re: Working with very large text documents

On 10/18/2013 10:06 AM, Armin Wegner wrote:


What do you do with very large text documents in a UIMA pipeline, for example
9 GB in size?


Just out of curiosity, how can you possibly have 9 GB of text that represents
one document? From a quick look at Project Gutenberg it seems that a full book
with HTML markup is about 500 kB to 1 MB, so that's about a complete public
library full of books.

Bye,
Jens






Re: Working with very large text documents

2013-10-18 Thread Richard Eckart de Castilho
Well, assuming this would e.g. be a server log, you might want to notice that
some IP or set of IPs tried to log in with different user accounts across an
extended period of time. So even if there is no linguistic relationship here,
there is definitely a relationship that a security person would want to be able
to discover. But that may be a secondary step after parsing the individual log
lines.

-- Richard

On 18.10.2013, at 12:34, Jens Grivolla j+...@grivolla.net wrote:

 Ok, but then log files are usually very easy to split since they normally 
 consist of independent lines. So you could just have one document per day or 
 whatever gets it down to a reasonable size, without the risk of breaking 
 grammatical or semantic relationships.
 
 On 10/18/2013 12:25 PM, Armin Wegner wrote:
 Hi Jens,
 
 It's a log file.
 
 Cheers,
 Armin
 
 -Original Message-
 From: Jens Grivolla [mailto:j+...@grivolla.net]
 Sent: Friday, 18 October 2013 11:05
 To: user@uima.apache.org
 Subject: Re: Working with very large text documents
 
 On 10/18/2013 10:06 AM, Armin Wegner wrote:
 
 What do you do with very large text documents in a UIMA pipeline, for example
 9 GB in size?
 
 Just out of curiosity, how can you possibly have 9 GB of text that represents
 one document? From a quick look at Project Gutenberg it seems that a full
 book with HTML markup is about 500 kB to 1 MB, so that's about a complete
 public library full of books.
 
 Bye,
 Jens


AW: Working with very large text documents

2013-10-18 Thread Armin.Wegner
Dear Jens, dear Richard,

Looks like I have to use a log-file-specific pipeline. The problem was that I
did not know this before the process crashed. It would be so nice to have a
general approach.

Thanks,
Armin

-Original Message-
From: Richard Eckart de Castilho [mailto:r...@apache.org]
Sent: Friday, 18 October 2013 12:32
To: user@uima.apache.org
Subject: Re: Working with very large text documents

Hi Armin,

that's a good point. It's also an issue with UIMA then, because the begin/end 
offsets are likewise int values.

If it is a log file, couldn't you split it into sections of e.g.
one CAS per day and analyze each one? If there are long-distance relations that
span days, you could add a second pass which reads in all analyzed CASes for a
rolling window of e.g. 7 days and tries to find the long-distance relations in
that window.

-- Richard

On 18.10.2013, at 10:48, armin.weg...@bka.bund.de wrote:

 Hi Richard,
 
 As far as I know, Java strings cannot be longer than 2 GB on 64-bit VMs.
 
 Armin
 
 -Original Message-
 From: Richard Eckart de Castilho [mailto:r...@apache.org]
 Sent: Friday, 18 October 2013 10:43
 To: user@uima.apache.org
 Subject: Re: Working with very large text documents
 
 On 18.10.2013, at 10:06, armin.weg...@bka.bund.de wrote:
 
 Hi,
 
 What do you do with very large text documents in a UIMA pipeline, for example
 9 GB in size?
 
 In that order of magnitude, I'd probably try to get a computer with 
 more memory ;)
 
 A. I expect that you split the large file before putting it into the pipeline.
 Or do you use a CAS multiplier in the pipeline to split it? Either way, where
 do you split the input file? You cannot just split it anywhere; there is a
 real risk of breaking the content. Is there a preferred chunk size for UIMA?
 
 The chunk size would likely not depend on UIMA, but rather on the machine you
 are using. If you cannot split the data at defined locations, maybe you can
 use a windowing approach where two splits have a certain overlap?
 
 B. Another possibility might be not to store the data in the CAS at all and
 to use a URI reference instead. It is then up to the analysis engine how to
 load the data. My first idea was to use java.util.Scanner, for example for
 regular expressions. But I think that you need to have the whole text loaded
 to iterate over annotations. Or is it just AnnotationFS.getCoveredText() that
 would not work? Any suggestions here?
 
 No idea unfortunately, never used the stream so far.
 
 -- Richard
 
 





Re: Working with very large text documents

2013-10-18 Thread Thomas Ginter
Armin,

It would probably be more efficient to have a CollectionReader that splits the
log file, so you're not passing a gigantic file in RAM from the reader to the
annotators before splitting it. If it were me, I would split the log file by
days or hours, with a maximum size that automatically segments at line
boundaries. If you're using UIMA-AS you can further scale your processing
pipeline to increase throughput way beyond what CPE can provide. Also, with
UIMA-AS it is easy to create a listener that gathers the aggregate processed
data from the segments that are returned.
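
Roughly what such a reader could look like -- an untested sketch; the parameter
name "inputFile" and the chunk size are made up, and cutting always happens at
line boundaries:

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.uima.cas.CAS;
import org.apache.uima.collection.CollectionException;
import org.apache.uima.collection.CollectionReader_ImplBase;
import org.apache.uima.util.Progress;
import org.apache.uima.util.ProgressImpl;

/**
 * Streams a huge log file and emits one CAS per chunk of at most MAX_CHARS
 * characters, always cutting at a line boundary.
 */
public class ChunkingLogReader extends CollectionReader_ImplBase {

    private static final int MAX_CHARS = 5_000_000; // roughly 10 MB of text per CAS

    private BufferedReader in;
    private String pending; // read-ahead line that starts the next chunk
    private int chunksEmitted;

    @Override
    public void initialize() {
        try {
            String path = (String) getConfigParameterValue("inputFile");
            in = Files.newBufferedReader(Paths.get(path), StandardCharsets.UTF_8);
            pending = in.readLine();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public boolean hasNext() {
        return pending != null;
    }

    @Override
    public void getNext(CAS cas) throws CollectionException {
        StringBuilder chunk = new StringBuilder();
        try {
            do {
                chunk.append(pending).append('\n');
                pending = in.readLine();
            } while (pending != null && chunk.length() + pending.length() < MAX_CHARS);
        } catch (IOException e) {
            throw new CollectionException(e);
        }
        cas.setDocumentText(chunk.toString());
        chunksEmitted++;
    }

    @Override
    public Progress[] getProgress() {
        return new Progress[] { new ProgressImpl(chunksEmitted, -1, Progress.ENTITIES) };
    }

    @Override
    public void close() throws IOException {
        in.close();
    }
}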

Thanks,

Thomas Ginter
801-448-7676
thomas.gin...@utah.edu




On Oct 18, 2013, at 7:58 AM, armin.weg...@bka.bund.de wrote:

 Dear Jens, dear Richard,
 
 Looks like I have to use a log-file-specific pipeline. The problem was that I
 did not know this before the process crashed. It would be so nice to have a
 general approach.
 
 Thanks,
 Armin
 
 -Original Message-
 From: Richard Eckart de Castilho [mailto:r...@apache.org]
 Sent: Friday, 18 October 2013 12:32
 To: user@uima.apache.org
 Subject: Re: Working with very large text documents
 
 Hi Armin,
 
 that's a good point. It's also an issue with UIMA then, because the begin/end 
 offsets are likewise int values.
 
 If it is a log file, couldn't you split it into sections of e.g.
 one CAS per day and analyze each one? If there are long-distance relations
 that span days, you could add a second pass which reads in all analyzed CASes
 for a rolling window of e.g. 7 days and tries to find the long-distance
 relations in that window.
 
 -- Richard
 
 On 18.10.2013, at 10:48, armin.weg...@bka.bund.de wrote:
 
 Hi Richard,
 
 As far as I know, Java strings cannot be longer than 2 GB on 64-bit VMs.
 
 Armin
 
 -Original Message-
 From: Richard Eckart de Castilho [mailto:r...@apache.org]
 Sent: Friday, 18 October 2013 10:43
 To: user@uima.apache.org
 Subject: Re: Working with very large text documents
 
 On 18.10.2013, at 10:06, armin.weg...@bka.bund.de wrote:
 
 Hi,
 
 What do you do with very large text documents in a UIMA pipeline, for example
 9 GB in size?
 
 In that order of magnitude, I'd probably try to get a computer with 
 more memory ;)
 
 A. I expect that you split the large file before putting it into the pipeline.
 Or do you use a CAS multiplier in the pipeline to split it? Either way, where
 do you split the input file? You cannot just split it anywhere; there is a
 real risk of breaking the content. Is there a preferred chunk size for UIMA?
 
 The chunk size would likely not depend on UIMA, but rather on the machine you
 are using. If you cannot split the data at defined locations, maybe you can
 use a windowing approach where two splits have a certain overlap?
 
 B. Another possibility might be not to store the data in the CAS at all and
 to use a URI reference instead. It is then up to the analysis engine how to
 load the data. My first idea was to use java.util.Scanner, for example for
 regular expressions. But I think that you need to have the whole text loaded
 to iterate over annotations. Or is it just AnnotationFS.getCoveredText() that
 would not work? Any suggestions here?
 
 No idea unfortunately, never used the stream so far.
 
 -- Richard
 
 
 



Re: AW: Working with very large text documents

2013-10-18 Thread Thilo Goetz
Don't you have a Hadoop cluster you can use?  Hadoop would handle the file
splitting for you, and if your UIMA analysis is well-behaved, you can deploy it
as an M/R job, one record at a time.
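
A minimal, untested sketch of the mapper side, assuming the default
TextInputFormat (one log line per record); the actual UIMA analysis is only
hinted at in a comment:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * Receives one log line per call (TextInputFormat does the file splitting).
 * A real job would run the UIMA analysis here; this placeholder just keys
 * each line by its date prefix so a reducer can group a whole day.
 */
public class LogLineMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String text = line.toString();
        String day = text.length() >= 10 ? text.substring(0, 10) : "unknown";
        context.write(new Text(day), line);
    }
}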


--Thilo

On 10/18/2013 12:25 PM, armin.weg...@bka.bund.de wrote:

Hi Jens,

It's a log file.

Cheers,
Armin

-Original Message-
From: Jens Grivolla [mailto:j+...@grivolla.net]
Sent: Friday, 18 October 2013 11:05
To: user@uima.apache.org
Subject: Re: Working with very large text documents

On 10/18/2013 10:06 AM, Armin Wegner wrote:


What do you do with very large text documents in a UIMA pipeline, for example
9 GB in size?


Just out of curiosity, how can you possibly have 9 GB of text that represents
one document? From a quick look at Project Gutenberg it seems that a full book
with HTML markup is about 500 kB to 1 MB, so that's about a complete public
library full of books.

Bye,
Jens





Re: XmiCasSerializer error in UIMA-AS

2013-10-18 Thread Eddie Epstein
Is this a solid error that is easily reproduced?

The error occurs when UIMA-AS is returning the CAS from the service.
You could add XMI serialization to a file at the end of AE processing, for
both the good and the failing cases. If you are lucky enough to have that
serialization fail too, you could try inserting the serialization at earlier
points.
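
For example, a small debugging helper along these lines (a sketch;
XmiCasSerializer.serialize(CAS, OutputStream) is the standard API) could be
called from the AE's process() in both scenarios and the resulting XMI files
compared:

import java.io.FileOutputStream;
import java.io.OutputStream;

import org.apache.uima.cas.CAS;
import org.apache.uima.cas.impl.XmiCasSerializer;

/** Debug helper: dump a CAS to an XMI file, e.g. at the end of the AE's process(). */
public class CasDumper {

    public static void dump(CAS cas, String path) {
        try (OutputStream out = new FileOutputStream(path)) {
            XmiCasSerializer.serialize(cas, out);
        } catch (Exception e) {
            e.printStackTrace(); // debug-only; a real helper would log this properly
        }
    }
}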

Note that the UIMA-AS serialization only serializes the delta changes
relative to the input CAS, which is different from what you would be doing.

Eddie



On Thu, Oct 17, 2013 at 12:19 PM, Prokopis Prokopidis proko...@ilsp.gr wrote:

 Hi all,

 I have an AE that produces the error below when deployed as a UIMA-AS
 2.4.0 service. The same AE as part of a UIMA 2.4.2 CPE or a uimafit 2.*
 pipeline does not produce any errors and works as expected.

 Among other things, this AE uses Ruta rules to process the CAS. When the
 rules are not used, the AE works as expected in both UIMA and UIMA-AS.

 I have tried to log all annotations generated by the AE when the rules are
 used and just before the AE processing is finished. The annotations seem
 the same in both the UIMA and the UIMA-AS processing scenarios.

 Does anyone have hints on what the cause of this might be or how I should
 proceed in debugging?

 Many thanks in advance,

 Prokopis

 WARNING:
 java.lang.ArrayIndexOutOfBoundsException
 at org.apache.uima.internal.util.IntVector.remove(IntVector.java:207)
 at org.apache.uima.internal.util.IntSet.remove(IntSet.java:77)
 at org.apache.uima.cas.impl.FSIndexRepositoryImpl.processIndexUpdates(FSIndexRepositoryImpl.java:1756)
 at org.apache.uima.cas.impl.FSIndexRepositoryImpl.isModified(FSIndexRepositoryImpl.java:1800)
 at org.apache.uima.cas.impl.XmiCasSerializer$XmiCasDocSerializer.serialize(XmiCasSerializer.java:256)
 at org.apache.uima.cas.impl.XmiCasSerializer$XmiCasDocSerializer.access$700(XmiCasSerializer.java:108)
 at org.apache.uima.cas.impl.XmiCasSerializer.serialize(XmiCasSerializer.java:1566)
 at org.apache.uima.aae.UimaSerializer.serializeCasToXmi(UimaSerializer.java:160)
 at org.apache.uima.adapter.jms.activemq.JmsOutputChannel.serializeCAS(JmsOutputChannel.java:237)
 at org.apache.uima.adapter.jms.activemq.JmsOutputChannel.getSerializedCas(JmsOutputChannel.java:1223)
 at org.apache.uima.adapter.jms.activemq.JmsOutputChannel.sendReply(JmsOutputChannel.java:786)
 at org.apache.uima.aae.controller.PrimitiveAnalysisEngineController_impl.process(PrimitiveAnalysisEngineController_impl.java:1036)
 at org.apache.uima.aae.handler.HandlerBase.invokeProcess(HandlerBase.java:121)
 at org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handleProcessRequestFromRemoteClient(ProcessRequestHandler_impl.java:542)
 at org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handle(ProcessRequestHandler_impl.java:1041)
 at org.apache.uima.aae.handler.input.MetadataRequestHandler_impl.handle(MetadataRequestHandler_impl.java:78)
 at org.apache.uima.adapter.jms.activemq.JmsInputChannel.onMessage(JmsInputChannel.java:706)
 at org.springframework.jms.listener.AbstractMessageListenerContainer.doInvokeListener(AbstractMessageListenerContainer.java:535)
 at org.springframework.jms.listener.AbstractMessageListenerContainer.invokeListener(AbstractMessageListenerContainer.java:495)
 at org.springframework.jms.listener.AbstractMessageListenerContainer.doExecuteListener(AbstractMessageListenerContainer.java:467)
 at org.springframework.jms.listener.AbstractPollingMessageListenerContainer.doReceiveAndExecute(AbstractPollingMessageListenerContainer.java:325)
 at org.springframework.jms.listener.AbstractPollingMessageListenerContainer.receiveAndExecute(AbstractPollingMessageListenerContainer.java:263)
 at org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.invokeListener(DefaultMessageListenerContainer.java:1058)
 at org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.run(DefaultMessageListenerContainer.java:952)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at org.apache.uima.aae.UimaAsThreadFactory$1.run(UimaAsThreadFactory.java:118)
 at java.lang.Thread.run(Thread.java:724)