Re: Working with very large text documents

Richard Eckart de Castilho Fri, 18 Oct 2013 03:38:45 -0700

Well, assuming this is, e.g., a server log, you might want to notice that 
some IP or set of IPs tried to log in with different user accounts across an 
extended period of time. So even if there is no linguistic relationship here, 
there is definitely a relationship that a security person would want to be able 
to discover. But that may be a secondary step after parsing the individual log 
lines.
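
A minimal sketch of that two-step idea (parse each line on its own, then
aggregate across lines), written in plain Python rather than as a UIMA
component. The log line format, the regular expression, the function names
and the `min_users` threshold are all assumptions made up for illustration;
a real log will need its own pattern.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Set

# Hypothetical log line format, assumed for illustration only:
#   2013-10-18T03:38:45 LOGIN_FAILED user=alice ip=10.0.0.7
LINE_RE = re.compile(
    r"^(?P<ts>\S+)\s+(?P<event>LOGIN_\w+)\s+user=(?P<user>\S+)\s+ip=(?P<ip>\S+)$"
)


def users_per_ip(lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Parse each line independently, then collect the accounts seen per IP."""
    seen: Dict[str, Set[str]] = defaultdict(set)
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip lines that do not fit the assumed format
        seen[match.group("ip")].add(match.group("user"))
    return seen


def suspicious_ips(lines: Iterable[str], min_users: int = 3) -> Dict[str, Set[str]]:
    """Return the IPs that tried at least `min_users` distinct accounts."""
    return {
        ip: users
        for ip, users in users_per_ip(lines).items()
        if len(users) >= min_users
    }
```

Passing an open file object (for example `suspicious_ips(open("server.log"))`,
where the file name is a placeholder) reads the log line by line, so the whole
file never has to be held in memory at once; only the set of accounts seen per
IP is kept. The same aggregation would also work over per-day chunks if the
per-IP sets are merged afterwards.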


-- Richard

On 18.10.2013, at 12:34, Jens Grivolla <j+...@grivolla.net> wrote:

> Ok, but then log files are usually very easy to split since they normally 
> consist of independent lines. So you could just have one document per day or 
> whatever gets it down to a reasonable size, without the risk of breaking 
> grammatical or semantic relationships.
> 
> On 10/18/2013 12:25 PM, Armin Wegner wrote:
>> Hi Jens,
>> 
>> It's a log file.
>> 
>> Cheers,
>> Armin
>> 
>> -----Original Message-----
>> From: Jens Grivolla [mailto:j+...@grivolla.net]
>> Sent: Friday, 18 October 2013 11:05
>> To: user@uima.apache.org
>> Subject: Re: Working with very large text documents
>> 
>> On 10/18/2013 10:06 AM, Armin Wegner wrote:
>> 
>>> What are you doing with very large text documents in a UIMA pipeline, for 
>>> example 9 GB in size?
>> 
>> Just out of curiosity, how can you possibly have 9 GB of text that represents 
>> one document? From a quick look at Project Gutenberg it seems that a full 
>> book with HTML markup is about 500 kB to 1 MB, so that's about a complete 
>> public library full of books.
>> 
>> Bye,
>> Jens
