Hi Richard,

As far as I know, Java strings cannot be longer than 2^31 - 1 characters
(roughly 2 GB of text), even on 64-bit VMs.
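For reference, a tiny illustration of where that cap comes from (it is the
int indexing of the backing array, not the VM bitness):

public class StringCap {
    public static void main(String[] args) {
        // String.length() returns an int and the backing char array is
        // indexed by int, so no String can exceed Integer.MAX_VALUE
        // (2^31 - 1 = 2147483647) characters, even on a 64-bit VM.
        System.out.println("max chars per String: " + Integer.MAX_VALUE);
    }
}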

Armin 

-----Original Message-----
From: Richard Eckart de Castilho [mailto:r...@apache.org] 
Sent: Friday, October 18, 2013 10:43
To: user@uima.apache.org
Subject: Re: Working with very large text documents

On 18.10.2013, at 10:06, armin.weg...@bka.bund.de wrote:

> Hi,
> 
> How do you handle very large text documents, for example 9 GB in size, 
> in a UIMA pipeline?

In that order of magnitude, I'd probably try to get a computer with more memory 
;) 

> A. I expect that you split the large file before putting it into the 
> pipeline. Or do you use a multiplier in the pipeline to split it? Either 
> way, where do you split the input file? You cannot just split it 
> anywhere; there is a significant risk of breaking the content. Is there 
> a preferred chunk size for UIMA?

The chunk size would likely not depend on UIMA, but rather on the machine you 
are using. If you cannot split the data at defined locations, maybe you can use 
a windowing approach where consecutive splits have a certain overlap?
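For what it's worth, here is a minimal sketch of such an overlapping split
(the window and overlap sizes are illustrative choices, not UIMA defaults):

import java.util.ArrayList;
import java.util.List;

public class OverlappingSplitter {

    // Splits text into windows of at most `window` chars, where each
    // window shares its first `overlap` chars with the previous one,
    // so nothing shorter than the overlap can be lost at a boundary.
    public static List<String> split(String text, int window, int overlap) {
        if (overlap >= window) {
            throw new IllegalArgumentException("overlap must be smaller than window");
        }
        List<String> chunks = new ArrayList<String>();
        int step = window - overlap;
        for (int start = 0; ; start += step) {
            int end = Math.min(start + window, text.length());
            chunks.add(text.substring(start, end));
            if (end == text.length()) {
                break;
            }
        }
        return chunks;
    }
}

Each chunk could then go into a CAS of its own, e.g. emitted by a CAS
multiplier, and annotations found inside the overlap region would have to
be deduplicated afterwards.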

> B. Another possibility might be not to store the data in the CAS at all 
> and to use a URI reference instead. It would then be up to the analysis 
> engine how to load the data. My first idea was to use java.util.Scanner 
> for regular expressions, for example. But I think that you need to have 
> the whole text loaded to iterate over annotations. Or is it just 
> AnnotationFS.getCoveredText() that does not work? Any suggestions here?

No idea unfortunately, I have never used the sofa data stream so far.
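Judging from the CAS Javadoc, something along these lines should be
possible, though (untested sketch; the path is a placeholder and error
handling is omitted):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.uima.cas.CAS;

public class RemoteSofaExample {

    // When creating the CAS, reference the document by URI instead of
    // setting the document text:
    public static void setUp(CAS cas) {
        cas.setSofaDataURI("file:///data/huge-document.txt", "text/plain");
    }

    // Inside an analysis engine, read the data incrementally instead of
    // calling getDocumentText():
    public static void read(CAS cas) throws Exception {
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(cas.getSofaDataStream(), "UTF-8"));
        String line;
        while ((line = reader.readLine()) != null) {
            // apply regular expressions line by line, etc.
        }
        reader.close();
    }
}

Note that AnnotationFS.getCoveredText() relies on the local sofa string, so
with a URI-backed sofa you would have to extract covered text from the
stream yourself using the annotation offsets.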

-- Richard

