Re: AW: Working with very large text documents

Jens Grivolla Fri, 18 Oct 2013 03:35:45 -0700

Ok, but then log files are usually very easy to split since they normally consist of independent lines. So you could just have one document per day, or whatever gets it down to a reasonable size, without the risk of breaking grammatical or semantic relationships.
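
As a rough illustration of the per-day split, here is a minimal Python sketch. It assumes every log line starts with an ISO date such as "2013-10-18" and that the log is in chronological order; neither is known for the actual file, and the names `day_key` and `split_log_by_day` are made up for this example.

```python
from itertools import groupby
from typing import Iterable, Iterator, List, Tuple


def day_key(line: str) -> str:
    # Assumption: the first 10 characters of each line are an ISO date
    # (YYYY-MM-DD). Adjust this for the real log format.
    return line[:10]


def split_log_by_day(lines: Iterable[str]) -> Iterator[Tuple[str, List[str]]]:
    """Yield (day, lines) chunks while reading the input lazily."""
    # groupby only merges consecutive lines that share a key, which is
    # sufficient for a chronologically ordered log.
    for day, group in groupby(lines, key=day_key):
        yield day, list(group)


if __name__ == "__main__":
    sample = [
        "2013-10-17 23:59:58 INFO shutting down\n",
        "2013-10-18 00:00:01 INFO starting up\n",
        "2013-10-18 00:00:02 WARN slow response\n",
    ]
    for day, chunk in split_log_by_day(sample):
        print(day, len(chunk))
```

Passing an open file object instead of a list reads the file line by line, so only one day's worth of lines is held in memory at a time. Each chunk could then be handed to the pipeline as its own document.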


On 10/18/2013 12:25 PM, Armin Wegner wrote:

Hi Jens,


It's a log file.

Cheers,
Armin

-----Original Message-----
From: Jens Grivolla [mailto:j+...@grivolla.net]
Sent: Friday, 18 October 2013 11:05
To: user@uima.apache.org
Subject: Re: Working with very large text documents

On 10/18/2013 10:06 AM, Armin Wegner wrote:

What are you doing with very large text documents in a UIMA pipeline, for
example 9 GB in size?


Just out of curiosity, how can you possibly have 9 GB of text that represents one
document? From a quick look at Project Gutenberg it seems that a full book with
HTML markup is about 500 kB to 1 MB, so that's about a complete public library
full of books.

Bye,
Jens
