Working with very large text documents

2013-10-18 Thread Armin.Wegner
Hi, What are you doing with very large text documents in a UIMA pipeline, for example 9 GB in size? A. I expect that you split the large file before putting it into the pipeline. Or do you use a multiplier in the pipeline to split it? Anyway, where do you split the input file? You cannot …

Re: Working with very large text documents

2013-10-18 Thread Richard Eckart de Castilho
On 18.10.2013, at 10:06, armin.weg...@bka.bund.de wrote: Hi, What are you doing with very large text documents in a UIMA pipeline, for example 9 GB in size? At that order of magnitude, I'd probably try to get a computer with more memory ;) A. I expect that you split the large file …

Re: Working with very large text documents

2013-10-18 Thread Jens Grivolla
On 10/18/2013 10:06 AM, Armin Wegner wrote: What are you doing with very large text documents in a UIMA pipeline, for example 9 GB in size? Just out of curiosity, how can you possibly have 9 GB of text that represents one document? From a quick look at Project Gutenberg it seems that a full …

Re: Working with very large text documents

2013-10-18 Thread Armin.Wegner
Hi Jens, It's a log file. Cheers, Armin -----Original Message----- From: Jens Grivolla [mailto:j+...@grivolla.net] Sent: Friday, October 18, 2013 11:05 To: user@uima.apache.org Subject: Re: Working with very large text documents On 10/18/2013 10:06 AM, Armin Wegner wrote: What …

Re: Working with very large text documents

2013-10-18 Thread Richard Eckart de Castilho
Hi Armin, that's a good point. It's also an issue with UIMA then, because the begin/end offsets are likewise int values. If it is a log file, couldn't you split it into sections of e.g. one CAS per day and analyze each one? If there are long-distance relations that span days, you could add a …
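Richard's point about offsets is worth spelling out: UIMA's Annotation type stores begin and end as Java ints, so a single CAS can never address a 9 GB text. A minimal pre-flight guard, assuming a single-byte encoding so byte count approximates character count (the file name is illustrative, not from the thread):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class SizeGuard {
        public static void main(String[] args) throws IOException {
            // For ASCII/Latin-1 logs, byte count approximates character count.
            long chars = Files.size(Paths.get("big.log"));
            if (chars > Integer.MAX_VALUE) {
                throw new IllegalArgumentException("Too large for one CAS ("
                        + chars + " chars); split the file first.");
            }
        }
    }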

Re: Working with very large text documents

2013-10-18 Thread Jens Grivolla
OK, but then log files are usually very easy to split, since they normally consist of independent lines. So you could just have one document per day, or whatever gets it down to a reasonable size, without the risk of breaking grammatical or semantic relationships. On 10/18/2013 12:25 PM, Armin …
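A minimal sketch of the line-based split Jens suggests, assuming each line starts with an ISO date (yyyy-MM-dd) and the file is in chronological order; the class name and paths are made up for illustration:

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    public class LogSplitter {
        public static void main(String[] args) throws IOException {
            Path input = Paths.get(args[0]);   // the big log file
            Path outDir = Paths.get(args[1]);  // directory for per-day files
            Files.createDirectories(outDir);

            String currentDay = null;
            BufferedWriter out = null;
            try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
                String line;
                while ((line = in.readLine()) != null) {
                    // First 10 chars of an ISO timestamp: yyyy-MM-dd
                    String day = line.length() >= 10 ? line.substring(0, 10) : "unknown";
                    if (!day.equals(currentDay)) {
                        if (out != null) {
                            out.close();
                        }
                        out = Files.newBufferedWriter(outDir.resolve(day + ".log"), StandardCharsets.UTF_8);
                        currentDay = day;
                    }
                    out.write(line);
                    out.newLine();
                }
            } finally {
                if (out != null) {
                    out.close();
                }
            }
        }
    }

Each per-day file can then be read as an ordinary document by a standard collection reader, with no single CAS ever approaching the int offset limit.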

Re: Working with very large text documents

2013-10-18 Thread Richard Eckart de Castilho
Well, assuming this were e.g. a server log, you might want to notice that some IP or set of IPs tried to log in with different user accounts over an extended period of time. So even if there is no linguistic relationship here, there is definitely a relationship that a security person …

Re: Working with very large text documents

2013-10-18 Thread Armin.Wegner
Dear Jens, dear Richard, Looks like I have to use a log-file-specific pipeline. The problem was that I did not know this before the process crashed. It would be nice to have a general approach. Thanks, Armin -----Original Message----- From: Richard Eckart de Castilho …

Re: Working with very large text documents

2013-10-18 Thread Thomas Ginter
Armin, It would probably be more efficient to have a CollectionReader that splits the log file, so you're not passing a gigantic file in RAM from the reader to the annotators before splitting it. If it were me, I would split the log file by days or hours, with a max size that auto-segments on line boundaries.
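A hedged sketch of the kind of CollectionReader Thomas describes: it streams the log and emits one CAS per chunk of at most maxChunkChars characters, always breaking on line boundaries, so the full file is never held in memory. The "InputFile" parameter name and the chunk cap are assumptions, not from the thread:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    import org.apache.uima.cas.CAS;
    import org.apache.uima.collection.CollectionException;
    import org.apache.uima.collection.CollectionReader_ImplBase;
    import org.apache.uima.resource.ResourceInitializationException;
    import org.apache.uima.util.Progress;
    import org.apache.uima.util.ProgressImpl;

    public class ChunkingLogReader extends CollectionReader_ImplBase {

        private BufferedReader reader;
        private String nextLine;                       // one-line lookahead
        private int chunksEmitted = 0;
        private final int maxChunkChars = 10_000_000;  // assumed cap per CAS

        @Override
        public void initialize() throws ResourceInitializationException {
            try {
                // "InputFile" is an assumed parameter name, not a standard one.
                String path = (String) getConfigParameterValue("InputFile");
                reader = Files.newBufferedReader(Paths.get(path), StandardCharsets.UTF_8);
                nextLine = reader.readLine();
            } catch (IOException e) {
                throw new ResourceInitializationException(e);
            }
        }

        @Override
        public boolean hasNext() {
            return nextLine != null;
        }

        @Override
        public void getNext(CAS cas) throws IOException, CollectionException {
            StringBuilder chunk = new StringBuilder();
            // Accumulate whole lines until the cap would be exceeded.
            while (nextLine != null && chunk.length() + nextLine.length() + 1 <= maxChunkChars) {
                chunk.append(nextLine).append('\n');
                nextLine = reader.readLine();
            }
            if (chunk.length() == 0 && nextLine != null) {
                // A single oversized line: emit it alone rather than loop forever.
                chunk.append(nextLine).append('\n');
                nextLine = reader.readLine();
            }
            cas.setDocumentText(chunk.toString());
            chunksEmitted++;
        }

        @Override
        public Progress[] getProgress() {
            // Total is unknown up front, hence -1.
            return new Progress[] { new ProgressImpl(chunksEmitted, -1, Progress.ENTITIES) };
        }

        @Override
        public void close() throws IOException {
            reader.close();
        }
    }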

Re: Working with very large text documents

2013-10-18 Thread Thilo Goetz
Don't you have a Hadoop cluster you can use? Hadoop would handle the file splitting for you, and if your UIMA analysis is well-behaved, you can deploy it as an M/R job, one record at a time. --Thilo On 10/18/2013 12:25 PM, armin.weg...@bka.bund.de wrote: Hi Jens, It's a log file. Cheers, …
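A minimal sketch of the kind of M/R job Thilo means, assuming the analysis can be applied per log line: with the default TextInputFormat, Hadoop splits the file and delivers one line per map() call, so the 9 GB file never has to fit in one process. The analyze() method is a made-up stand-in for the real UIMA analysis:

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class LogLineMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // analyze() stands in for the real per-record UIMA analysis.
            String result = analyze(line.toString());
            context.write(new Text(result), NullWritable.get());
        }

        private String analyze(String line) {
            // Placeholder: a real job would wrap an AnalysisEngine here
            // and feed each record into a CAS.
            return line.length() + "\t" + line;
        }
    }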

Re: XmiCasSerializer error in UIMA-AS

2013-10-18 Thread Eddie Epstein
Is this a solid error that is easily reproduced? The error is occurring when UIMA-AS is returning the CAS from the service. You could add XMI serialization to file at the end of AE processing, for the good and failing cases. If you are lucky enough to have that serialization fail there too, you could try inserting the …
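A hedged sketch of the debugging step Eddie describes: serialize the CAS to an XMI file at the end of the AE's process() so a failing case can be reproduced outside the UIMA-AS service. The output path and counter are illustrative:

    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.OutputStream;

    import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
    import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
    import org.apache.uima.cas.impl.XmiCasSerializer;
    import org.apache.uima.jcas.JCas;
    import org.xml.sax.SAXException;

    public class XmiDumpingAnnotator extends JCasAnnotator_ImplBase {

        private int casCount = 0;

        @Override
        public void process(JCas jcas) throws AnalysisEngineProcessException {
            // ... the AE's normal annotation work would happen here ...

            // Dump the CAS as XMI so a failing serialization can be
            // reproduced outside the UIMA-AS service.
            String path = "/tmp/cas-" + (casCount++) + ".xmi";  // illustrative path
            try (OutputStream out = new FileOutputStream(path)) {
                XmiCasSerializer.serialize(jcas.getCas(), out);
            } catch (IOException | SAXException e) {
                throw new AnalysisEngineProcessException(e);
            }
        }
    }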