Armin,

It would probably be more efficient to have a CollectionReader that splits the 
log file, so you're not passing a gigantic file in RAM from the reader to the 
annotators before splitting it.  If it were me, I would split the log file by 
days or hours, with a maximum size that automatically segments along line 
boundaries.  If you're using UIMA-AS you can further scale your processing 
pipeline to increase throughput well beyond what the CPE can provide.  UIMA-AS 
also makes it easy to create a listener that gathers the aggregated processed 
data from the segments as they are returned.
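
For illustration, a minimal sketch of such a splitting reader, assuming each
log line starts with an ISO date (e.g. "2013-10-18 ...") and using made-up
"inputFile"/"maxChars" configuration parameters:

import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

import org.apache.uima.cas.CAS;
import org.apache.uima.collection.CollectionException;
import org.apache.uima.collection.CollectionReader_ImplBase;
import org.apache.uima.resource.ResourceInitializationException;
import org.apache.uima.util.Progress;
import org.apache.uima.util.ProgressImpl;

public class LogSplittingReader extends CollectionReader_ImplBase {

  private BufferedReader in;
  private String pendingLine;   // first line of the next segment
  private int segmentsRead = 0;
  private int maxChars;

  @Override
  public void initialize() throws ResourceInitializationException {
    try {
      File logFile = new File((String) getConfigParameterValue("inputFile"));
      maxChars = (Integer) getConfigParameterValue("maxChars");
      in = Files.newBufferedReader(logFile.toPath(), StandardCharsets.UTF_8);
      pendingLine = in.readLine();
    } catch (IOException e) {
      throw new ResourceInitializationException(e);
    }
  }

  @Override
  public boolean hasNext() {
    return pendingLine != null;
  }

  @Override
  public void getNext(CAS aCAS) throws IOException, CollectionException {
    StringBuilder segment = new StringBuilder();
    String day = pendingLine.substring(0, 10);  // "yyyy-MM-dd" prefix
    // Collect lines of the same day until the size limit is reached.
    do {
      segment.append(pendingLine).append('\n');
      pendingLine = in.readLine();
    } while (pendingLine != null
        && pendingLine.startsWith(day)
        && segment.length() + pendingLine.length() < maxChars);
    aCAS.setDocumentText(segment.toString());
    segmentsRead++;
  }

  @Override
  public Progress[] getProgress() {
    // The total number of segments is unknown up front.
    return new Progress[] { new ProgressImpl(segmentsRead, -1, Progress.ENTITIES) };
  }

  @Override
  public void close() throws IOException {
    in.close();
  }
}

On the UIMA-AS side, the gathering listener could be as small as this sketch
(register it via UimaAsynchronousEngine#addStatusCallbackListener; the merge
logic is left as a placeholder):

import org.apache.uima.aae.client.UimaAsBaseCallbackListener;
import org.apache.uima.cas.CAS;
import org.apache.uima.collection.EntityProcessStatus;

public class SegmentCollectingListener extends UimaAsBaseCallbackListener {

  @Override
  public void entityProcessComplete(CAS aCas, EntityProcessStatus aStatus) {
    if (aStatus != null && aStatus.isException()) {
      return;  // a segment failed; log it and move on
    }
    // Pull the annotations you care about out of the returned segment CAS
    // and merge them into your aggregate result structure here.
  }
}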

Thanks,

Thomas Ginter
801-448-7676
thomas.gin...@utah.edu




On Oct 18, 2013, at 7:58 AM, armin.weg...@bka.bund.de wrote:

> Dear Jens, dear Richard,
> 
> Looks like I have to use a log-file-specific pipeline. The problem was that 
> I did not know that before the process crashed. It would be so nice to have 
> a general approach.
> 
> Thanks,
> Armin
> 
> -----Original Message-----
> From: Richard Eckart de Castilho [mailto:r...@apache.org] 
> Sent: Friday, 18 October 2013 12:32
> To: user@uima.apache.org
> Subject: Re: Working with very large text documents
> 
> Hi Armin,
> 
> that's a good point. It's also an issue with UIMA then, because the begin/end 
> offsets are likewise int values.
> 
> If it is a log file, couldn't you split it into sections of e.g. one CAS per 
> day and analyze each one? If there are long-distance relations that span 
> days, you could add a second pass which reads in all analyzed CASes for a 
> rolling window of e.g. 7 days and tries to find the long-distance relations 
> in that window.
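> 
> A minimal sketch of that second-pass grouping, assuming you have one result 
> object per day, sorted by date (how a "day result" is represented is up to 
> you):
> 
> import java.util.ArrayList;
> import java.util.List;
> 
> public class RollingWindows {
> 
>   /** Groups per-day results into overlapping windows of windowSize days. */
>   public static <T> List<List<T>> windows(List<T> perDay, int windowSize) {
>     List<List<T>> result = new ArrayList<>();
>     for (int start = 0; start + windowSize <= perDay.size(); start++) {
>       result.add(new ArrayList<>(perDay.subList(start, start + windowSize)));
>     }
>     return result;
>   }
> }
> 
> Each window would then be fed to the second-pass engine that looks for the 
> long-distance relations.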
> 
> -- Richard
> 
> On 18.10.2013, at 10:48, armin.weg...@bka.bund.de wrote:
> 
>> Hi Richard,
>> 
>> As far as I know, Java strings cannot be longer than Integer.MAX_VALUE 
>> (roughly 2 billion) characters, even on 64-bit VMs.
>> 
>> Armin
>> 
>> -----Original Message-----
>> From: Richard Eckart de Castilho [mailto:r...@apache.org]
>> Sent: Friday, 18 October 2013 10:43
>> To: user@uima.apache.org
>> Subject: Re: Working with very large text documents
>> 
>> On 18.10.2013, at 10:06, armin.weg...@bka.bund.de wrote:
>> 
>>> Hi,
>>> 
>>> What are you doing with very large text documents in a UIMA pipeline, for 
>>> example files that are 9 GB in size?
>> 
>> In that order of magnitude, I'd probably try to get a computer with 
>> more memory ;)
>> 
>>> A. I expect that you split the large file before putting it into the 
>>> pipeline. Or do you use a multiplier in the pipeline to split it? Anyway, 
>>> where do you split the input file? You cannot just split it anywhere; 
>>> there is a not-so-slight chance of breaking the content. Is there a 
>>> preferred chunk size for UIMA?
>> 
>> The chunk size would likely not depend on UIMA, but rather on the machine 
>> you are using. If you cannot split the data at defined locations, maybe you 
>> can use a windowing approach where two consecutive splits have a certain 
>> overlap?
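>> 
>> A minimal sketch of such overlapping windowing over a character stream 
>> (window and overlap sizes are arbitrary; the overlap should stay well 
>> below the window size):
>> 
>> import java.io.IOException;
>> import java.io.Reader;
>> 
>> /** Reads overlapping windows from a (possibly huge) character stream. */
>> public class WindowedReader {
>> 
>>   private final Reader in;
>>   private final int windowSize;
>>   private final int overlap;
>>   private String carry = "";   // tail of the previous window
>>   private boolean eof = false;
>> 
>>   public WindowedReader(Reader in, int windowSize, int overlap) {
>>     this.in = in;
>>     this.windowSize = windowSize;
>>     this.overlap = overlap;
>>   }
>> 
>>   /** Returns the next window, or null when the stream is exhausted. */
>>   public String next() throws IOException {
>>     if (eof) {
>>       return null;
>>     }
>>     char[] buf = new char[windowSize - carry.length()];
>>     int total = 0;
>>     while (total < buf.length) {
>>       int n = in.read(buf, total, buf.length - total);
>>       if (n < 0) { eof = true; break; }
>>       total += n;
>>     }
>>     if (total == 0) {
>>       return null;  // nothing new; the carry was already in the last window
>>     }
>>     String window = carry + new String(buf, 0, total);
>>     carry = window.length() > overlap
>>         ? window.substring(window.length() - overlap) : window;
>>     return window;
>>   }
>> }
>> 
>> Each window would become one CAS; content cut off at the end of one window 
>> reappears at the start of the next, so matches no longer than the overlap 
>> are not lost.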
>> 
>>> B. Another possibility might be not to save the data in the CAS at all and 
>>> to use a URI reference instead. It would then be up to the analysis engine 
>>> how to load the data. My first idea was to use java.util.Scanner for 
>>> regular expressions, for example. But I think that you need to have the 
>>> whole text loaded to iterate over annotations. Or is it just 
>>> AnnotationFS.getCoveredText() that would not work? Any suggestions here?
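>>> 
>>> A rough sketch of the Scanner idea, streaming regex matches from the file 
>>> without loading it completely (the pattern and path are just placeholders):
>>> 
>>> import java.io.IOException;
>>> import java.nio.charset.StandardCharsets;
>>> import java.nio.file.Files;
>>> import java.nio.file.Paths;
>>> import java.util.Scanner;
>>> import java.util.regex.Pattern;
>>> 
>>> public class StreamingRegexExample {
>>>   public static void main(String[] args) throws IOException {
>>>     Pattern ip = Pattern.compile("\\b\\d{1,3}(\\.\\d{1,3}){3}\\b");
>>>     try (Scanner scanner = new Scanner(
>>>         Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.UTF_8))) {
>>>       String hit;
>>>       // horizon 0 = unbounded search; the scanner only buffers as much
>>>       // input as it needs to find the next match
>>>       while ((hit = scanner.findWithinHorizon(ip, 0)) != null) {
>>>         System.out.println(hit);
>>>       }
>>>     }
>>>   }
>>> }
>>> 
>>> Mapping such hits back to begin/end offsets for UIMA annotations would 
>>> still need extra bookkeeping, which is exactly the open question.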
>> 
>> No idea unfortunately, I have never used that streaming approach so far.
>> 
>> -- Richard
>> 
>> 
> 
