Re: Working with very large text documents

Richard Eckart de Castilho Fri, 18 Oct 2013 03:32:28 -0700

Hi Armin,

that's a good point. It's also an issue with UIMA then, because
the begin/end offsets are likewise int values.
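
To put a number on that limit, here is a minimal sketch. The class and method names are made up for illustration and are not UIMA API, and it assumes one char per byte of input, which only holds for plain single-byte text:

```java
class OffsetLimit {
    // begin/end offsets are ints, so this is the largest addressable offset.
    static final long MAX_OFFSET = Integer.MAX_VALUE;

    // True if a text with this many chars can be addressed with int offsets.
    static boolean fitsIntOffsets(long lengthInChars) {
        return lengthInChars >= 0 && lengthInChars <= MAX_OFFSET;
    }

    // Smallest number of pieces such that each piece fits int offsets
    // (ceiling division).
    static long minChunks(long lengthInChars) {
        if (lengthInChars < 0) {
            throw new IllegalArgumentException("negative length");
        }
        return (lengthInChars + MAX_OFFSET - 1) / MAX_OFFSET;
    }

    public static void main(String[] args) {
        long nineGb = 9L * 1024 * 1024 * 1024; // assumes one char per byte
        System.out.println(fitsIntOffsets(nineGb)); // false
        System.out.println(minChunks(nineGb));      // 5
    }
}
```

This is only a lower bound on the number of pieces; available memory will most likely force much smaller ones.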


If it is a log file, couldn't you split it into sections, e.g.
one CAS per day, and analyze each one? If there are long-distance
relations that span days, you could add a second pass which
reads in all analyzed CASes for a rolling window of e.g. 7 days
and tries to find the long-distance relations in that window.
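
A rough sketch of that idea in plain Java, without any UIMA calls. The class and method names are hypothetical, and it assumes that every log line starts with an ISO date (yyyy-MM-dd); lines without such a prefix would need extra handling. Each per-day text would become the document text of one CAS, and the second pass would load the CASes of the days listed in each window:

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

class DailyLogSplitter {
    // First pass: group log lines by day.
    // Assumes every line starts with an ISO date such as 2013-10-18.
    static TreeMap<LocalDate, StringBuilder> splitByDay(List<String> lines) {
        TreeMap<LocalDate, StringBuilder> perDay = new TreeMap<>();
        for (String line : lines) {
            LocalDate day = LocalDate.parse(line.substring(0, 10));
            perDay.computeIfAbsent(day, d -> new StringBuilder())
                  .append(line).append('\n');
        }
        return perDay;
    }

    // Second pass: for each day that has data, list the days with data in
    // the window [day, day + windowDays - 1].
    static <V> List<List<LocalDate>> rollingWindows(
            NavigableMap<LocalDate, V> perDay, int windowDays) {
        List<List<LocalDate>> windows = new ArrayList<>();
        for (LocalDate start : perDay.keySet()) {
            LocalDate end = start.plusDays(windowDays - 1);
            windows.add(new ArrayList<>(
                    perDay.subMap(start, true, end, true).keySet()));
        }
        return windows;
    }
}
```

The windows overlap on purpose, so a relation spanning up to 7 days is seen completely in at least one window; the price is that most days are read several times.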

-- Richard

On 18.10.2013, at 10:48, armin.weg...@bka.bund.de wrote:

> Hi Richard,
> 
> As far as I know, Java strings cannot be longer than 2^31-1 characters (roughly 2 GB of single-byte text), even on 64-bit VMs.
> 
> Armin 
> 
> -----Original Message-----
> From: Richard Eckart de Castilho [mailto:r...@apache.org]
> Sent: Friday, 18 October 2013 10:43
> To: user@uima.apache.org
> Subject: Re: Working with very large text documents
> 
> On 18.10.2013, at 10:06, armin.weg...@bka.bund.de wrote:
> 
>> Hi,
>> 
>> What do you do with very large text documents in a UIMA pipeline, for 
>> example one that is 9 GB in size?
> 
> In that order of magnitude, I'd probably try to get a computer with more 
> memory ;) 
> 
>> A. I expect that you split the large file before putting it into the 
>> pipeline. Or do you use a CAS multiplier in the pipeline to split it? Either 
>> way, where do you split the input file? You cannot just split it anywhere; 
>> there is a real risk of breaking the content. Is there a preferred 
>> chunk size for UIMA?
> 
> The chunk size would likely not depend on UIMA, but rather on the machine you 
> are using. If you cannot split the data at defined locations, maybe you can 
> use a windowing approach where adjacent splits overlap by a certain amount?
> 
>> B. Another possibility might be not to store the data in the CAS at all and 
>> use a URI reference instead. It is then up to the analysis engine how to 
>> load the data. My first idea was to use java.util.Scanner for regular 
>> expressions, for example. But I think that you need to have the whole text 
>> loaded to iterate over annotations. Or is it just 
>> AnnotationFS.getCoveredText() that would not work? Any suggestions here?
> 
> No idea, unfortunately; I have never used the stream so far.
> 
> -- Richard
> 
>
