AW: Working with very large text documents

2013-10-18 Thread Armin.Wegner
Hi Jens, It's a log file. Cheers, Armin -Original Message- From: Jens Grivolla [mailto:j+...@grivolla.net] Sent: Friday, October 18, 2013 11:05 To: user@uima.apache.org Subject: Re: Working with very large text documents On 10/18/2013 10:06 AM, Armin Wegner wrote: What

Re: AW: Working with very large text documents

2013-10-18 Thread Jens Grivolla
OK, but then log files are usually very easy to split, since they normally consist of independent lines. So you could just have one document per day, or whatever gets it down to a reasonable size, without the risk of breaking grammatical or semantic relationships. On 10/18/2013 12:25 PM, Armin
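[A minimal sketch of the per-day splitting Jens describes, assuming each log line starts with an ISO date such as "2013-10-18"; the 10-character date prefix and the output file names are assumptions for illustration, not from the thread.]

import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class LogSplitter {
    public static void main(String[] args) throws IOException {
        Path input = Paths.get(args[0]);
        String currentDay = null;
        PrintWriter out = null;
        try (BufferedReader reader =
                Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Assumed convention: the first 10 characters are the date.
                String day = line.length() >= 10 ? line.substring(0, 10) : "undated";
                if (!day.equals(currentDay)) {
                    // Start a new per-day document (hypothetical naming scheme).
                    if (out != null) out.close();
                    out = new PrintWriter(
                            Files.newBufferedWriter(Paths.get("log-" + day + ".txt")));
                    currentDay = day;
                }
                out.println(line);
            }
        } finally {
            if (out != null) out.close();
        }
    }
}

[Each resulting per-day file can then be fed to the UIMA pipeline as an ordinary, reasonably sized document, since lines in a log file carry no cross-line grammatical structure to break.]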

AW: Working with very large text documents

2013-10-18 Thread Armin.Wegner
Dear Jens, dear Richard, Looks like I have to use a log-file-specific pipeline. The problem was that I did not know it before the process crashed. It would be so nice to have a general approach. Thanks, Armin -Original Message- From: Richard Eckart de Castilho

Re: AW: Working with very large text documents

2013-10-18 Thread Thilo Goetz
Don't you have a hadoop cluster you can use? Hadoop would handle the file splitting for you, and if your UIMA analysis is well-behaved, you can deploy it as a M/R job, one record at a time. --Thilo On 10/18/2013 12:25 PM, armin.weg...@bka.bund.de wrote: Hi Jens, It's a log file. Cheers,