Dear Jens, dear Richard,

Looks like I have to use a log-file-specific pipeline. The problem was that I 
did not know that before the process crashed. It would be nice to have a 
general approach.

Thanks,
Armin

-----Original Message-----
From: Richard Eckart de Castilho [mailto:r...@apache.org] 
Sent: Friday, 18 October 2013 12:32
To: user@uima.apache.org
Subject: Re: Working with very large text documents

Hi Armin,

that's a good point. It's also an issue with UIMA then, because the begin/end 
offsets are likewise int values.

If it is a log file, couldn't you split it into sections of e.g.
one CAS per day and analyze each one? If there are long-distance relations 
that span days, you could add a second pass which reads the analyzed CASes 
back in over a rolling window of e.g. 7 days and tries to find the 
long-distance relations within that window.
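
A minimal sketch of such a day-wise split (plain Java, untested; it assumes
every log line starts with an ISO date such as "2013-10-18", and the output
file naming is made up for illustration):

  import java.io.BufferedReader;
  import java.io.BufferedWriter;
  import java.io.IOException;
  import java.nio.charset.StandardCharsets;
  import java.nio.file.Files;
  import java.nio.file.Paths;

  public class SplitLogByDay {
      public static void main(String[] args) throws IOException {
          String currentDay = null;
          BufferedWriter out = null;
          try (BufferedReader in = Files.newBufferedReader(
                  Paths.get(args[0]), StandardCharsets.UTF_8)) {
              String line;
              while ((line = in.readLine()) != null) {
                  // Assumption: the first 10 characters are the date.
                  String day = line.length() >= 10
                          ? line.substring(0, 10) : "unknown";
                  if (!day.equals(currentDay)) {
                      if (out != null) {
                          out.close();
                      }
                      // One output file per day, e.g. "app.log.2013-10-18".
                      out = Files.newBufferedWriter(
                              Paths.get(args[0] + "." + day),
                              StandardCharsets.UTF_8);
                      currentDay = day;
                  }
                  out.write(line);
                  out.newLine();
              }
          } finally {
              if (out != null) {
                  out.close();
              }
          }
      }
  }

Each resulting file could then be read into one CAS by a plain text
collection reader.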

-- Richard

On 18.10.2013, at 10:48, armin.weg...@bka.bund.de wrote:

> Hi Richard,
> 
> As far as I know, Java strings cannot hold more than Integer.MAX_VALUE 
> (about 2^31) characters, even on 64-bit VMs, because strings are indexed 
> by int.
> 
> Armin
> 
> -----Original Message-----
> From: Richard Eckart de Castilho [mailto:r...@apache.org]
> Sent: Friday, 18 October 2013 10:43
> To: user@uima.apache.org
> Subject: Re: Working with very large text documents
> 
> On 18.10.2013, at 10:06, armin.weg...@bka.bund.de wrote:
> 
>> Hi,
>> 
>> What do you do with very large text documents in a UIMA pipeline, for 
>> example ones that are 9 GB in size?
> 
> In that order of magnitude, I'd probably try to get a computer with 
> more memory ;)
> 
>> A. I expect that you split the large file before putting it into the 
>> pipeline. Or do you use a multiplier in the pipeline to split it? Either 
>> way, where do you split the input file? You cannot just split it anywhere; 
>> there is a real chance of breaking the content apart. Is there a preferred 
>> chunk size for UIMA?
> 
> The chunk size would likely not depend on UIMA, but rather on the machine 
> you are using. If you cannot split the data at defined locations, maybe you 
> can use a windowing approach where two adjacent splits have a certain 
> overlap?
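> 
> Roughly like this, maybe (untested sketch; shown on an in-memory String for
> brevity, although for a 9 GB file you would do the same arithmetic on file
> offsets instead; windowSize and overlap are made-up parameters):
> 
>   import java.util.ArrayList;
>   import java.util.List;
> 
>   public class OverlappingWindows {
>       // Adjacent windows share 'overlap' characters, so any relation
>       // shorter than the overlap is fully contained in at least one
>       // window even if it crosses a split boundary.
>       // Requires windowSize > overlap, otherwise the loop cannot advance.
>       static List<String> windows(String text, int windowSize, int overlap) {
>           List<String> result = new ArrayList<String>();
>           int step = windowSize - overlap;
>           for (int begin = 0; begin < text.length(); begin += step) {
>               int end = Math.min(begin + windowSize, text.length());
>               result.add(text.substring(begin, end));
>               if (end == text.length()) {
>                   break;
>               }
>           }
>           return result;
>       }
>   }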
> 
>> B. Another possibility might be not to store the text in the CAS at all 
>> and to use a URI reference instead. It would then be up to the analysis 
>> engine how to load the data. My first idea was to use java.util.Scanner, 
>> for example to run regular expressions over the stream (see the sketch 
>> below). But I think that you need to have the whole text loaded to iterate 
>> over annotations. Or would just AnnotationFS.getCoveredText() stop 
>> working? Any suggestions here?
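>> 
>> For the Scanner idea, something like this might work (untested sketch; the
>> IP-address pattern is only a placeholder):
>> 
>>   import java.io.IOException;
>>   import java.nio.charset.StandardCharsets;
>>   import java.nio.file.Files;
>>   import java.nio.file.Paths;
>>   import java.util.Scanner;
>>   import java.util.regex.Pattern;
>> 
>>   public class StreamRegexDemo {
>>       public static void main(String[] args) throws IOException {
>>           Pattern ip = Pattern.compile("\\b\\d{1,3}(?:\\.\\d{1,3}){3}\\b");
>>           try (Scanner scanner = new Scanner(Files.newBufferedReader(
>>                   Paths.get(args[0]), StandardCharsets.UTF_8))) {
>>               // A horizon of 0 means "no limit": the Scanner pulls input
>>               // from the reader on demand, so the whole file is never in
>>               // memory at once.
>>               while (scanner.findWithinHorizon(ip, 0) != null) {
>>                   System.out.println(scanner.match().group());
>>               }
>>           }
>>       }
>>   }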
> 
> No idea unfortunately, I have never used streams that way so far.
> 
> -- Richard
> 
> 
